Comprehensive Quick Reference · distributed data processing in Python

pyspark cheat sheet v2 · all modules

A single source of truth spanning the whole PySpark surface: the DataFrame & SQL core, the functions library, UDFs & pandas function APIs, Structured Streaming, MLlib, the pandas-on-Spark API, and the low-level RDD core. Everything still hinges on one split — lazy transformations build a plan, an action runs it across the cluster.

create / load · inspect / config · select & filter · shape & combine · clean & transform · aggregate / stats · MLlib · streaming · handle with care · most common

Validated against the official Apache Spark 4.x PySpark API reference (spark.apache.org) — Spark SQL, Functions, UDF/UDTF, Structured Streaming, MLlib (DataFrame-based), Pandas-API-on-Spark & Spark Core — cross-checked with DataCamp, kevinschaich/pyspark-cheatsheet, Palantir Foundry docs, AlmaBetter & GeeksforGeeks. v2 gap-analysis edition.

The PySpark module map — what lives where
SparkSession · the single entry point
pyspark.sql · the DataFrame & SQL core: DataFrame · Column · Row · GroupedData · Window · functions (F.*) · types (T.*) · Catalog · UDF / UDTF
pyspark.sql.streaming · Structured Streaming: readStream / writeStream
pyspark.ml · MLlib (DataFrame-based): Pipeline / Estimator / Transformer
pyspark.pandas · pandas API on Spark: import pyspark.pandas as ps
pyspark (core) · SparkContext · RDD: low-level, no Catalyst, reached via sparkContext

I · Core DataFrame & SQL pyspark.sql — the everyday API

01 Setup & SparkSession · create
02 Configuration & Catalog · inspect / config
03 Reading Data · create
04 Writing Data · create
05 Creating DataFrames & Schemas · create
06 Inspect & Explore · inspect
07 Selecting & Filtering · select & filter
08 Column Operations & Conditionals · shape & combine
09 Missing Values & Duplicates · clean · df.na

II · The Functions Library pyspark.sql.functions as F — column expressions

10 String Functions · clean & transform
11 Date & Timestamp Functions · clean & transform
12 Numeric, Math & Hashing · clean & transform
13 Array, Map & Higher-Order · shape & combine

III · Reshape, Combine & Aggregate joins · groupBy · window · reshape

14 Joins · shape & combine
15 Union & Set Operations · shape & combine
16 GroupBy, Aggregation & Pivot · aggregate / stats
17 Window Functions · aggregate / stats
18 Reshape, Sample & df.stat · shape · df.stat

IV · SQL, UDFs & pandas Function APIs custom logic — from slow to vectorized

19 Running SQL Queries · select & filter
20 Python UDFs · handle with care
21 Pandas UDFs (vectorized) · Arrow-accelerated
22 Pandas Function APIs · split-apply-combine

V · Structured Streaming pyspark.sql.streaming — same DataFrame API, unbounded input

23 Streaming Sources · stream
24 Streaming Transforms & Windows · stream
25 Streaming Sinks & Queries · stream

VI · MLlib — Machine Learning pyspark.ml — DataFrame-based, Pipeline-oriented

26 Feature Engineering · ml · pyspark.ml.feature
27 Pipeline: fit & transform · ml · Estimator/Transformer
28 Algorithms · ml · classification/regression/clustering
29 Evaluation & Tuning · ml · tuning/evaluation

VII · pandas-on-Spark, RDD Core & Performance migration path · low-level · tuning

30 Pandas API on Spark · pyspark.pandas as ps
31 RDD & Spark Core · low-level · sparkContext
32 Broadcast & Accumulators · low-level · shared state
33 Repartition, Cache & Performance · handle with care
34 Collect Output & Deploy · inspect · CLI
Common Data Types · pyspark.sql.types as T
Join how= Quick-Read · df.join(..., how=)
Transformations vs Actions · the lazy/eager split

Custom-logic decision map & the ML pipeline

Two things trip up newcomers most: choosing the right way to run Python logic on Spark, and the estimator/transformer split in MLlib. Based on the official PySpark UDF & MLlib guides.

Which custom-function API? ★

Pick by input→output shape. Native F.* is always fastest; drop to pandas/Python only when you must.

need custom logic?
can a built-in F.* do it? → use it
1 row → 1 row: pandas_udf (Series→Series)
grouped input: applyInPandas (per-group pdf)
many rows out: mapInPandas (iter pdf→iter pdf)
last resort: plain Python @F.udf (row-at-a-time, ~10× slower)

The MLlib pipeline ★

Estimators learn from data via .fit() and return a model, which is itself a transformer; transformers add columns via .transform(). A Pipeline chains both.

train_df → StringIndexer + Assembler → Classifier (estimator), chained in a Pipeline
pipe.fit(train_df) → PipelineModel
model.transform(test_df) → predictions

Worth memorizing

transformations vs actions: nothing runs until an action (show/collect/count/write)
wide vs narrow: groupBy/join/distinct shuffle; select/filter/withColumn don't
F.* > pandas_udf > udf: always prefer native functions; UDFs are the slow fallback
fit vs transform: estimators learn (fit→model); transformers add columns (transform)
streaming needs a checkpoint: no checkpointLocation = no fault tolerance
watermark bounds state: without it, streaming aggregations grow state forever
collect()/toPandas() on driver: pulls the whole result into driver memory and can crash it
repartition vs coalesce: repartition always shuffles; coalesce is cheaper but only shrinks
cache reused frames: otherwise Spark recomputes the full lineage on every action
AQE is your friend: spark.sql.adaptive.enabled (on by default since Spark 3.2) mitigates join skew and coalesces shuffle partitions at runtime
pandas-on-Spark: import pyspark.pandas as ps (pandas syntax, Spark scale)
DataFrames > RDDs: RDDs skip Catalyst; you lose automatic optimization