from pyspark.sql import SparkSession  ★ Entry point for the DataFrame & SQL API.
from pyspark.sql import functions as F, types as T  ★ The two imports nearly every script needs.
spark = SparkSession.builder.appName('app').getOrCreate()  ★ Create (or reuse) the active session.
.master('local[*]').config('k','v')    Chain before getOrCreate() to tune the session.
spark.sparkContext / spark.version    Underlying SparkContext / installed version.
spark.stop()    Release cluster resources when done.

spark.conf.set('spark.sql.shuffle.partitions', 200)  ★ Set a runtime SQL/shuffle config.
spark.conf.get('spark.sql.adaptive.enabled')    Read a config value (AQE on by default in 3.2+).
spark.catalog.listTables('db')    List tables/views registered in the catalog.
spark.catalog.listDatabases()    List available databases.
spark.catalog.dropTempView('t')    Remove a temp view.
spark.catalog.clearCache()    Drop all cached tables from memory.

spark.read.csv(path, header=True, inferSchema=True)  ★ Read CSV; infer types or pass a schema.
spark.read.parquet(path)  ★ Read columnar Parquet, the Spark-native default.
spark.read.json(path)    Read line-delimited JSON; pass multiLine=True for multiline files.
spark.read.schema(schema).csv(path)  ★ Explicit schema: faster & safer than inferring.
spark.read.format('jdbc').options(**o).load()    Read from a relational database.
spark.read.load(path, format='orc')    Generic loader (orc, delta, avro, text…).

df.write.parquet(path)  ★ Write columnar Parquet: fast, typed, compressed.
df.write.mode('overwrite').save(path)  ★ mode: overwrite, append, ignore, error (the default; alias errorifexists).
df.write.partitionBy('col').parquet(path)  ★ Hive-style partitioned directory tree.
df.write.bucketBy(8, 'id').saveAsTable('t')    Bucketed managed table (can avoid join shuffles on the bucket column).
df.write.format('jdbc').options(**o).save()    Write to a relational database.

spark.createDataFrame(data, schema)  ★ From a list of Rows/tuples, plus a schema.
spark.createDataFrame(pandas_df)  ★ From a pandas DataFrame (needs pyarrow).
spark.range(0, 100, 2)    Single-column id DataFrame, handy for demos.
T.StructType([T.StructField('n', T.StringType(), True)])  ★ Explicit schema: column, type, nullable.
schema = 'name STRING, age INT'  ★ DDL-string schema, a concise alternative.

df.show(n, truncate=False)  ★ Print the first n rows (default 20).
df.printSchema()  ★ Tree of column names, types & nullability.
df.columns  df.dtypes  df.schema    Names / (name, type) pairs / full StructType.
df.count()  ★ Row count; an action that triggers a full pass.
df.describe() / df.summary()    Summary stats; summary adds percentiles.
df.explain('formatted')  ★ Physical plan outline with per-node details; look for shuffles and scans. Use 'extended' to see the logical plans too.
df.isEmpty()  df.rdd.getNumPartitions()    Emptiness check (3.3+) / partition count.

df.select('a', 'b')  ★ Project a subset of columns.
df.select(F.col('a').alias('x'), F.expr('b+1'))    Expressions, renaming & SQL strings.
df.selectExpr('a', 'b * 2 as b2')    Select using pure SQL expressions.
df.filter(df.age > 24) / df.where(...)  ★ filter and where are exact aliases.
df.filter((df.a>1)&(df.b<9))    Combine conditions with & | ~, each wrapped in parentheses.
df.filter(df.name.isin('Bob','Mike'))  ★ Match any value in a list.
df.filter(df.name.like('Al%') | df.n.rlike('^A'))    SQL wildcard / regex match.

df.withColumn('new', expr)  ★ Add or replace a column (returns new DataFrame).
df.withColumns({'a': e1, 'b': e2})    Add several columns in one call (3.3+).
df.withColumnRenamed('old', 'new')  ★ Rename a single column.
df.drop('c1', 'c2')  ★ Remove one or more columns.
df.withColumn('p', df.p.cast('double'))  ★ Cast a column to a new type.
F.when(c1, v1).when(c2, v2).otherwise(d)  ★ Vectorized if/elif/else across rows.
F.coalesce(c1, c2, F.lit('N/A'))    First non-null value across columns.

df.na.fill(0) / df.fillna({'age': 0})  ★ Replace nulls: scalar or per-column dict.
df.na.drop(how='any', subset=['a'])  ★ Drop rows with nulls (any/all, subset).
df.na.replace(['', 'NA'], None)    Swap specific values for null.
df.filter(df.c.isNull()) / .isNotNull()  ★ Boolean null checks.
F.isnan(col)  F.nanvl(col, alt)    NaN test / replace NaN (distinct from null).
df.dropDuplicates(['a','b']) / df.distinct()  ★ De-duplicate on a subset / whole row.