PySpark Cheat Sheet — Comprehensive Edition

01 · Setup & SparkSession (create)

from pyspark.sql import SparkSession★
Entry point for the DataFrame & SQL API.
from pyspark.sql import functions as F, types as T★
The two imports nearly every script needs.
spark = SparkSession.builder.appName('app').getOrCreate()★
Create (or reuse) the active session.
.master('local[*]').config('k','v')
Chain before getOrCreate() to tune the session.
spark.sparkContext / spark.version
Underlying SparkContext / installed version.
spark.stop()
Release cluster resources when done.

02 · Configuration & Catalog (inspect / config)

spark.conf.set('spark.sql.shuffle.partitions', 200)★
Set a runtime SQL/shuffle config.
spark.conf.get('spark.sql.adaptive.enabled')
Read a config value (AQE on by default in 3.2+).
spark.catalog.listTables('db')
List tables/views registered in the catalog.
spark.catalog.listDatabases()
List available databases.
spark.catalog.dropTempView('t')
Remove a temp view.
spark.catalog.clearCache()
Drop all cached tables from memory.

03 · Reading Data (create)

spark.read.csv(path, header=True, inferSchema=True)★
Read CSV; infer types or pass a schema.
spark.read.parquet(path)★
Read columnar Parquet — the Spark-native default.
spark.read.json(path)
Read line-delimited or multiline JSON.
spark.read.schema(schema).csv(path)★
Explicit schema — faster & safer than inferring.
spark.read.format('jdbc').options(**o).load()
Read from a relational database.
spark.read.load(path, format='orc')
Generic loader (orc, delta, avro, text…).

04 · Writing Data (create)

df.write.parquet(path)★
Write columnar Parquet — fast, typed, compressed.
df.write.mode('overwrite').save(path)★
mode: overwrite, append, ignore, error.
df.write.partitionBy('col').parquet(path)★
Hive-style partitioned directory tree.
df.write.bucketBy(8, 'id').saveAsTable('t')
Bucketed managed table (avoids join shuffles).
df.write.format('jdbc').options(**o).save()
Write to a relational database.

05 · Creating DataFrames & Schemas (create)

spark.createDataFrame(data, schema)★
From a list of Rows/tuples, plus a schema.
spark.createDataFrame(pandas_df)★
From a pandas DataFrame (pyarrow is optional; it speeds up conversion when Arrow is enabled).
spark.range(0, 100, 2)
Single-column id DataFrame — handy for demos.
T.StructType([T.StructField('n', T.StringType(), True)])★
Explicit schema — column, type, nullable.
schema = 'name STRING, age INT'★
DDL-string schema — concise alternative.

06 · Inspect & Explore (inspect)

df.show(n, truncate=False)★
Print the first n rows (default 20).
df.printSchema()★
Tree of column names, types & nullability.
df.columns df.dtypes df.schema
Names / (name,type) pairs / full StructType.
df.count()★
Row count — an action, triggers a full pass.
df.describe() / df.summary()
Summary stats; summary adds percentiles.
df.explain('formatted')★
Logical & physical plan — read shuffles/scans.
df.isEmpty() df.rdd.getNumPartitions()
Emptiness check / partition count.

07 · Selecting & Filtering (select & filter)

df.select('a', 'b')★
Project a subset of columns.
df.select(F.col('a').alias('x'), F.expr('b+1'))
Expressions, renaming & SQL strings.
df.selectExpr('a', 'b * 2 as b2')
Select using pure SQL expressions.
df.filter(df.age > 24) / df.where(...)★
filter and where are exact aliases.
df.filter((df.a>1)&(df.b<9))
Combine with & | ~, in parens.
df.filter(df.name.isin('Bob','Mike'))★
Match any value in a list.
df.filter(df.name.like('Al%') | df.n.rlike('^A'))
SQL wildcard / regex match.

08 · Column Operations & Conditionals (shape & combine)

df.withColumn('new', expr)★
Add or replace a column (returns new DataFrame).
df.withColumns({'a': e1, 'b': e2})
Add several columns in one call (3.3+).
df.withColumnRenamed('old', 'new')★
Rename a single column.
df.drop('c1', 'c2')★
Remove one or more columns.
df.withColumn('p', df.p.cast('double'))★
Cast a column to a new type.
F.when(c1, v1).when(c2, v2).otherwise(d)★
Vectorized if/elif/else across rows.
F.coalesce(c1, c2, F.lit('N/A'))
First non-null value across columns.

09 · Missing Values & Duplicates (clean · df.na)

df.na.fill(0) / df.fillna({'age': 0})★
Replace nulls — scalar or per-column dict.
df.na.drop(how='any', subset=['a'])★
Drop rows with nulls (any/all, subset).
df.na.replace(['', 'NA'], None)
Swap specific values for null.
df.filter(df.c.isNull()) / .isNotNull()★
Boolean null checks.
F.isnan(col) F.nanvl(col, alt)
NaN test / replace NaN (distinct from null).
df.dropDuplicates(['a','b']) / df.distinct()★
De-duplicate on a subset / whole row.

10 · String Functions (clean & transform)

F.concat_ws('-', 'a', 'b')★
Join columns with a separator.
F.upper(c) F.lower(c) F.initcap(c)★
Case conversion.
F.trim(c) F.lpad(c, n, '0')
Strip / pad strings.
F.substring(c, pos, len) F.length(c)
Substring / character length.
F.regexp_replace(c, pat, repl)★
Regex find-and-replace.
F.regexp_extract(c, pat, idx)
Pull out a regex capture group.
F.split(c, pat) F.instr(c, sub)
Split to array / index of a substring.

11 · Date & Timestamp Functions (clean & transform)

F.current_date() F.current_timestamp()
Today's date / the current instant.
F.to_date(c, 'yyyy-MM-dd') F.to_timestamp(c)★
Parse strings into date / timestamp.
F.year(c) F.month(c) F.dayofweek(c) F.hour(c)★
Extract calendar parts.
F.date_add(c, n) F.date_sub(c, n)
Shift a date forward / backward n days.
F.datediff(end, start) F.months_between(a,b)
Day / month gaps between dates.
F.date_format(c, 'yyyy-MM') F.date_trunc('month', c)
Format / truncate a timestamp.

12 · Numeric, Math & Hashing (clean & transform)

F.round(c, n) F.floor(c) F.ceil(c)
Rounding functions.
F.abs(c) F.sqrt(c) F.exp(c) F.log(c)
Elementwise math.
F.least(*cols) F.greatest(*cols)
Min / max across columns (row-wise).
F.lit(1) F.col('x') F.expr('a+b')★
Literal / column ref / SQL expression.
F.monotonically_increasing_id()
Unique (non-sequential) row id.
F.hash(*cols) F.md5(c) F.sha2(c, 256)
Hashing for keys / anonymization.

13 · Array, Map & Higher-Order (shape & combine)

F.explode(c) / F.posexplode(c)★
One row per element (+ index variant).
F.array(*c) F.size(c) F.array_contains(c, v)
Build array / length / membership.
F.array_distinct(c) F.sort_array(c) F.arrays_zip(a,b)
Dedup / sort / zip arrays.
F.transform(c, lambda x: x+1)★
Higher-order map over an array column.
F.filter(c, lambda x: x>0) F.aggregate(...)
Higher-order filter / fold over an array.
F.from_json(c, schema) F.to_json(c)★
Parse / serialize JSON string columns.

14 · Joins (shape & combine)

df.join(other, 'id', 'inner')★
Join on a shared column name.
df.join(other, df.id==other.pid, 'left')★
Join on an explicit condition.
df.join(other, ['k1','k2'], 'outer')
Join on multiple key columns.
df.join(F.broadcast(small), 'id')★
Broadcast hint — avoids shuffling the big side.
df.crossJoin(other)
Cartesian product — every row with every row.

15 · Union & Set Operations (shape & combine)

df.unionByName(df2)★
Stack rows, matching columns by name — safer.
df.union(df2)
Stack rows, matching columns by position.
df.unionByName(df2, allowMissingColumns=True)
Union frames with differing columns.
df.subtract(df2) / df.exceptAll(df2)★
Rows in df not in df2 (distinct / keep dupes).
df.intersect(df2) / df.intersectAll(df2)
Rows common to both frames.

16 · GroupBy, Aggregation & Pivot (aggregate / stats)

df.groupBy('c').count()★
Row count per group.
df.groupBy('c').agg(F.mean('x'), F.sum('y'))★
Multiple aggregates per group.
F.countDistinct(c) F.approx_count_distinct(c)★
Exact / fast-approximate distinct counts.
F.collect_list(c) F.collect_set(c)
Gather group values into an array / set.
F.first(c) F.last(c) F.stddev(c)
Positional & spread aggregates.
df.groupBy('c').pivot('k').agg(F.sum('v'))★
Long → wide, one column per key value.
df.rollup('a','b') df.cube('a','b')
Subtotal / all-combination aggregations.

17 · Window Functions (aggregate / stats)

from pyspark.sql import Window★
Import the window-spec builder.
w = Window.partitionBy('c').orderBy(F.desc('d'))★
Partitions + ordering for a window.
F.row_number().over(w)★
1,2,3... within each partition — no ties.
F.rank().over(w) / F.dense_rank().over(w)
Ranked position — with / without gaps.
F.lag(c,1).over(w) / F.lead(c,1).over(w)
Previous / next row's value.
w.rowsBetween(Window.unboundedPreceding, 0)
Frame bounds for running aggregates.

18 · Reshape, Sample & df.stat (shape · df.stat)

df.orderBy(F.desc('c')) df.limit(n)★
Sort / take first n rows.
df.unpivot('id', ['a','b'], 'var', 'val')
Wide → long (3.4+, alias df.melt); on older versions use F.expr('stack(...)').
df.sample(0.1, seed=42) df.randomSplit([0.8,0.2])★
Random sample / train-test split.
df.stat.approxQuantile('c', [.25,.5,.75], 0.01)
Approximate percentiles.
df.stat.corr('a','b') df.stat.crosstab('a','b')
Correlation / contingency table.

19 · Running SQL Queries (select & filter)

df.createOrReplaceTempView('t')★
Register a DataFrame as a SQL-queryable view.
spark.sql('SELECT * FROM t WHERE age > 24')★
Query with SQL — returns a DataFrame.
spark.sql('... {df}', df=df)
Parameterized SQL with DataFrame refs (3.4+).
df.createGlobalTempView('t')
Cross-session view — prefix global_temp.

20 · Python UDFs (handle with care)

@F.udf(T.IntegerType())
def double(x): return x*2
Row-at-a-time UDF via decorator.
my_udf = F.udf(lambda x: x*2, 'int')
Inline form with a DDL return type.
df.withColumn('c', my_udf(df.c))
Apply like any column expression.
prefer built-in F.* functions (faster)
Plain UDFs ship every row to a Python worker; often 10× or more slower than built-ins.

21 · Pandas UDFs: vectorized (Arrow-accelerated)

@F.pandas_udf('double')
def z(s: pd.Series) -> pd.Series: ...★
Series→Series scalar UDF — one row out per row in.
@F.pandas_udf('long') # Iterator[Series]
Iterator variant — reuse expensive setup per batch.
@F.pandas_udf('double') # GROUPED_AGG
Series→scalar — use inside groupBy().agg().
needs pyarrow; type hints drive the kind
Vectorized via Arrow — far faster than plain UDFs.

22 · Pandas Function APIs (split-apply-combine)

df.groupby('id').applyInPandas(fn, schema)★
Grouped map: fn(pdf)→pdf per group.
df.mapInPandas(fn, schema)★
Map: iterator of pdf→iterator of pdf, any row count.
df1.groupby('k').cogroup(df2.groupby('k')) .applyInPandas(fn, schema)
Cogrouped map — e.g. a pandas asof-join.
df.mapInArrow(fn, schema)
Arrow-batch variant — no pandas conversion.

23 · Streaming Sources (stream)

spark.readStream.format('kafka').option(...).load()★
Read an unbounded stream (Kafka, socket…).
spark.readStream.schema(s).json(path)★
File-source stream — schema is required.
spark.readStream.format('rate').load()
Synthetic rate source — great for testing.
df.isStreaming
True if this DataFrame is a streaming one.

24 · Streaming Transforms & Windows (stream)

select / filter / withColumn / groupBy★
The same DataFrame ops work on streams.
F.window('ts', '10 minutes', '5 minutes')★
Tumbling / sliding event-time windows.
df.withWatermark('ts', '10 minutes')★
Bound state & handle late data.
no sort / limit / distinct on raw streams (unsupported)
Some batch ops aren't allowed on streams.

25 · Streaming Sinks & Queries (stream)

df.writeStream.outputMode('append')★
Mode: append / update / complete.
.format('console').start()★
Start the query; sinks: console/kafka/parquet.
.trigger(availableNow=True)★
Trigger: processingTime / availableNow / once.
.option('checkpointLocation', path)★
Required for fault-tolerant recovery.
.foreachBatch(fn).start()
Run arbitrary batch logic per micro-batch.
query.awaitTermination() spark.streams.active
Block until done / list running queries.

26 · Feature Engineering (ml · pyspark.ml.feature)

VectorAssembler(inputCols=[...], outputCol='features')★
Bundle columns into one features vector; most estimators require it as input.
StringIndexer(inputCol='cat', outputCol='idx')★
Encode a categorical string as numeric indices.
OneHotEncoder(inputCol='idx', outputCol='ohe')
Expand indices into sparse one-hot vectors.
StandardScaler(inputCol='features', outputCol='scaled')★
Standardize features to zero mean / unit variance.
Tokenizer StopWordsRemover HashingTF IDF
Text-feature building blocks.

27 · Pipeline: fit & transform (ml · Estimator/Transformer)

from pyspark.ml import Pipeline★
Chain stages into one reproducible flow.
pipe = Pipeline(stages=[idx, asm, lr])★
Stages run in order; last is usually the model.
model = pipe.fit(train_df)★
Estimator — .fit() learns and returns a model.
preds = model.transform(test_df)★
Transformer — .transform() adds a prediction column.
model.write().overwrite().save(path)
Persist / reload a fitted pipeline model.

28 · Algorithms (ml · classification/regression/clustering)

LogisticRegression(featuresCol='features', labelCol='y')★
Classification baseline.
RandomForestClassifier GBTClassifier
Tree ensembles for classification.
LinearRegression GBTRegressor★
Regression estimators.
KMeans(k=3, featuresCol='features')
Unsupervised clustering.
ALS(userCol='user', itemCol='item', ratingCol='rating')
Collaborative-filtering recommender.

29 · Evaluation & Tuning (ml · tuning/evaluation)

BinaryClassificationEvaluator(labelCol='y')★
AUC / PR for binary models.
MulticlassClassificationEvaluator(metricName='f1')
Accuracy / F1 for multiclass.
RegressionEvaluator(metricName='rmse')
RMSE / MAE / R² for regression.
ParamGridBuilder().addGrid(lr.regParam,[.01,.1]).build()★
Hyperparameter grid.
CrossValidator(estimator=est, estimatorParamMaps=grid, evaluator=ev, numFolds=3)★
K-fold CV (or TrainValidationSplit).

30 · Pandas API on Spark (pyspark.pandas as ps)

import pyspark.pandas as ps★
pandas syntax, distributed on Spark (ex-Koalas).
psdf = ps.read_parquet(path)
Familiar readers return a pandas-on-Spark frame.
psdf.groupby('c').agg('mean') psdf['x'].fillna(0)★
Write pandas code; Spark runs it at scale.
psdf.to_spark() / df.pandas_api()★
Convert to/from a native Spark DataFrame.
good migration bridge from pandas
Not 100% of pandas is covered — check the docs.

31 · RDD & Spark Core (low-level · sparkContext)

sc = spark.sparkContext★
The lower-level Spark Core entry point.
rdd = sc.parallelize([1,2,3]) sc.textFile(path)
Create an RDD from a list / text files.
rdd.map(fn) rdd.flatMap(fn) rdd.filter(fn)
Lazy RDD transformations.
rdd.reduceByKey(add) rdd.groupByKey()
Key-value shuffles (prefer reduceByKey).
rdd.collect() rdd.take(n) rdd.count()
RDD actions that trigger execution.
prefer DataFrames
RDDs skip Catalyst — no automatic optimization.

32 · Broadcast & Accumulators (low-level · shared state)

bc = sc.broadcast(lookup_dict)★
Ship a read-only value to every executor once.
bc.value
Access the broadcast payload inside tasks.
acc = sc.accumulator(0)
Counter that tasks can only add to; read acc.value on the driver.
F.broadcast(df) ≠ sc.broadcast(x)
One hints a join; the other ships a Python value.

33 · Repartition, Cache & Performance (handle with care)

df.repartition(n, 'key') (full shuffle)
Shuffle to n partitions (up or down).
df.coalesce(n)★ (cheaper)
Merge partitions, no full shuffle — decrease only.
df.cache() / df.persist(StorageLevel.MEMORY_AND_DISK)★
Reuse a DataFrame across actions.
df.unpersist()
Free cached data when done.
spark.conf.set('spark.sql.adaptive.enabled', True)
AQE: runtime shuffle & skew optimization.

34 · Collect Output & Deploy (inspect · CLI)

df.collect()★ (memory)
Pull all rows to the driver — use with care.
df.take(n) df.head(n) df.first()
Pull just a few rows to the driver.
df.toPandas()★ (memory)
Convert to pandas — driver-memory bound.
df.foreach(fn) df.toLocalIterator()
Per-row side effects / stream rows to driver.
spark-submit --packages ... app.py
Submit a job; --packages pulls JARs (Kafka, Delta).

★ Common Data Types (pyspark.sql.types as T)

T.StringType() T.IntegerType() T.LongType()★
Text and whole-number types.
T.DoubleType() T.FloatType() T.DecimalType()
Floating-point & exact-decimal numerics.
T.BooleanType()
True / False values.
T.DateType() T.TimestampType()★
Calendar date / date + time.
T.ArrayType(t) T.MapType(k,v)
Collection column types.
T.StructType / T.StructField
Nested, schema-defined record columns.

★ Join how= Quick-Read (df.join(..., how=))

'inner'★
Only keys present in both DataFrames.
'left' / 'right' / 'outer'★
Keep all rows from left / right / both sides.
'left_semi'
Left rows that have a match — right cols dropped.
'left_anti'★
Left rows with no match — great for exclusions.
'cross'
Cartesian product, no key needed.

★ Transformations vs Actions (the lazy/eager split)

select filter withColumn join groupBy orderBy (lazy)
Transformations — build the plan, run nothing.
show collect count take write.save toPandas foreach (eager)
Actions — trigger the whole plan to execute.
narrow: map/filter · wide: join/groupBy/distinct
Wide ops shuffle across the network; narrow don't.

pyspark cheat sheet v2 · all modules

I · Core DataFrame & SQL pyspark.sql — the everyday API

II · The Functions Library pyspark.sql.functions as F — column expressions

III · Reshape, Combine & Aggregate joins · groupBy · window · reshape

IV · SQL, UDFs & pandas Function APIs custom logic — from slow to vectorized

V · Structured Streaming pyspark.sql.streaming — same DataFrame API, unbounded input

VI · MLlib — Machine Learning pyspark.ml — DataFrame-based, Pipeline-oriented

VII · pandas-on-Spark, RDD Core & Performance migration path · low-level · tuning

[Figures not reproduced: "Which custom-function API?" ★ (custom-logic decision map) and "The MLlib pipeline" ★ (ML pipeline diagram)]

Worth memorizing