Quick Reference · distributed data processing in Python

PySpark cheat sheet

PySpark's DataFrame API looks like pandas, but every call is either a lazy transformation (builds a plan, runs nothing) or an action (triggers execution across a cluster). Understand that split — plus partitions and shuffles — and the rest of the API is just vocabulary.

create / load · inspect · select & filter · shape & combine · clean & transform · aggregate / stats · handle with care · most common

Distilled & cross-checked across: spark.apache.org/docs (PySpark 4.x) · DataCamp (RDD & DataFrame sheets) · kevinschaich/pyspark-cheatsheet · Palantir Foundry docs · AlmaBetter · GeeksforGeeks

Transformations are lazy — an action triggers a distributed job
Figure: transformations such as spark.read.csv(), .filter(...), .select(...) and .groupBy().agg() only build a plan, and nothing runs yet. An action such as .show(), .collect() or .write.save() triggers execution: the driver builds the DAG and schedules tasks on Executor 1 (P1, P2, P3), Executor 2 (P4, P5, P6) and Executor 3 (P7, P8, P9). Data is split into partitions (P1…P9), and tasks run on them in parallel.
01 · Setup & SparkSession · create
02 · Reading Data · create
03 · Writing Data · create
04 · Creating DataFrames & Schemas · create
05 · Inspect & Explore · inspect
06 · Selecting & Filtering · select & filter
07 · Column Operations & Conditionals · shape & combine
08 · Missing Values & Duplicates · clean & transform
09 · String Functions · clean & transform
10 · Date & Timestamp Functions · clean & transform
11 · Numeric Functions · clean & transform
12 · Array & Struct Operations · shape & combine
13 · Joins · shape & combine
14 · Combining: Union · shape & combine
15 · GroupBy, Aggregation & Pivot · aggregate / stats
16 · Window Functions · aggregate / stats
17 · Sorting & Top-N · select & filter
18 · Running SQL Queries · select & filter
19 · UDFs: User Defined Functions · handle with care
20 · Converting & Collecting Output · inspect
21 · Repartitioning, Caching & Performance · handle with care
Common Data Types · pyspark.sql.types
Join how= Quick-Read · df.join(..., how=)

Shuffles, broadcasts & partitions, visually

The difference between a fast Spark job and a slow one usually comes down to whether data has to move across the network. Based on the official Spark execution-model diagrams.

Narrow transformation ★

filter() / select() / withColumn() — each output partition depends on exactly one input partition. No shuffle.

Figure: P1 → P1', P2 → P2', P3 → P3'. Each input partition feeds exactly one output partition.

Wide transformation

groupBy() / join() / distinct() — rows with matching keys may live on any partition, so data must move across the network.

Figure: input partitions P1, P2, P3 each feed output partitions P1', P2', P3' through a shuffle across the network.

Broadcast join ★

F.broadcast(small_df) copies the small table to every executor, so the big table never has to shuffle.

Figure: small_df is copied to Executor 1 (big_df P1), Executor 2 (big_df P2) and Executor 3 (big_df P3).

coalesce() vs repartition()

coalesce(2) merges neighboring partitions locally — cheap. repartition() always does a full shuffle, even to grow partition count.

Figure: coalesce(2) merges P1 + P2 and P3 + P4 locally with no shuffle; repartition() is a full shuffle.

Worth memorizing

transformations vs actions · nothing runs until an action: show/collect/count/write
filter() == where() · pure aliases, use whichever reads better
wide vs narrow · groupBy/join/distinct shuffle over the network; filter/select don't
collect() on the driver · pulls the entire result into driver memory and can crash it
cache() reused frames · without it, Spark recomputes the whole lineage every action
repartition vs coalesce · repartition always shuffles; coalesce is cheaper but only shrinks
UDFs are a last resort · built-in pyspark.sql.functions beat Python UDFs by 10×+
small file problem · too many tiny output files hurt downstream reads, so coalesce first