PySpark Cheat Sheet

01Setup & SparkSessioncreate

from pyspark.sql import SparkSession★
Entry point for the DataFrame & SQL API.
spark = SparkSession.builder.appName('app').getOrCreate()★
Create (or reuse) the active session.
pip install pyspark
Install (or brew install apache-spark for the shell).
spark.sparkContext
Underlying SparkContext, for low-level RDD work.
spark.version
Check the installed Spark version.
spark.stop()
Release cluster resources when done.

02Reading Datacreate

spark.read.csv(path, header=True, inferSchema=True)★
Read CSV; infer types or pass an explicit schema.
spark.read.json(path)★
Read line-delimited or multiline JSON.
spark.read.parquet(path)★
Read columnar Parquet — the Spark-native default.
spark.read.format('jdbc').options(**opts).load()
Read from a relational database.
spark.read.load(path, format='orc')
Generic loader for any registered format.
spark.read.text(path)
Read each line as a single-column row.

03Writing Datacreate

df.write.csv(path, header=True)★
Write to CSV (one file per partition).
df.write.parquet(path)★
Write columnar Parquet — fast, typed, compressed.
df.write.mode('overwrite').save(path)★
mode: overwrite, append, ignore, or error.
df.write.partitionBy('col').parquet(path)
Write as a Hive-style partitioned directory tree.
df.write.format('jdbc').options(**opts).save()
Write to a relational database.

04Creating DataFrames & Schemascreate

spark.createDataFrame(data, schema)★
From a list of Rows/tuples, plus a schema.
from pyspark.sql.types import StructType, StructField, StringType
Import schema-building blocks.
StructType([StructField('name', StringType(), True)])★
Explicit schema — column, type, nullable.
spark.createDataFrame(rdd.map(lambda p: Row(name=p[0])))
Build a DataFrame from an existing RDD.

05Inspect & Exploreinspect

df.show(n, truncate=False)★
Print the first n rows (default 20).
df.printSchema()★
Tree view of column names, types & nullability.
df.dtypes / df.schema
Column type list / full StructType schema object.
df.columns
List of column names.
df.count()★
Number of rows — an action, triggers a full pass.
df.describe().show()
Summary stats for numeric & string columns.
df.explain()
Print the logical & physical execution plan.

06Selecting & Filteringselect & filter

df.select('a', 'b')★
Project a subset of columns.
df.select(F.col('a'), F.col('b').alias('bb'))
Select with expressions & renaming.
df.filter(df.age > 24) / df.where(...)★
filter and where are exact aliases.
df.filter((df.a>1)&(df.b<9))
Combine conditions with & | ~, in parens.
df.filter(df.name.isin('Bob','Mike'))★
Keep rows matching any value in a list.
df.filter(df.name.like('Al%'))
SQL-style wildcard match.
df.filter(df.name.rlike('[A-Z]+ice$'))
Regex match.

07Column Operations & Conditionalsshape & combine

df.withColumn('new', expr)★
Add or replace a column (returns a new DataFrame).
df.withColumnRenamed('old', 'new')★
Rename a single column.
df.drop('c1', 'c2')★
Remove one or more columns.
df.withColumn('p', df.p.cast(T.DoubleType()))★
Cast a column to a new type.
F.when(cond, val).otherwise(default)★
Vectorized if/elif/else across rows.
F.coalesce(col1, col2, F.lit('N/A'))
First non-null value across columns.

08Missing Values & Duplicatesclean & transform

df.na.fill(value) / df.fillna({'age': 0})★
Replace nulls — scalar or per-column dict.
df.na.drop() / df.dropna()★
Drop rows containing any null.
df.filter(df.col.isNull()) / .isNotNull()★
Boolean null checks.
df.dropDuplicates() / df.distinct()★
Remove fully duplicate rows.
df.dropDuplicates(['a', 'b'])
De-duplicate on a subset of columns.
df.replace({'': None}, subset=['name'])
Swap specific values for null (or another value).

09String Functionsclean & transform

F.concat_ws('-', 'fname', 'lname')★
Join columns with a separator.
F.upper(col) / F.lower(col)★
Case conversion.
F.trim(col) F.ltrim(col) F.rtrim(col)
Strip whitespace.
F.substring(col, pos, len)
Extract a fixed-position substring.
F.regexp_replace(col, pattern, repl)★
Regex find-and-replace.
F.regexp_extract(col, pattern, idx)
Pull out a regex capture group.
F.split(col, pattern)
Split into an array column.

10Date & Timestamp Functionsclean & transform

F.current_date() / F.current_timestamp()
Today's date / the current instant.
F.to_date(col, 'yyyy-MM-dd')★
Parse a string into a date column.
F.to_timestamp(col, fmt)★
Parse a string into a timestamp column.
F.year(col) F.month(col) F.dayofmonth(col)★
Extract date parts.
F.date_add(col, n) / F.date_sub(col, n)
Shift a date forward / backward n days.
F.datediff(end, start)
Whole days between two dates.

11Numeric Functionsclean & transform

F.round(col, scale)
Round to a given number of decimal places.
F.floor(col) / F.ceil(col)
Round down / up to the nearest integer.
F.abs(col)
Absolute value.
F.least(*cols) / F.greatest(*cols)
Smallest / largest value across columns.
F.pow(x, y)
x raised to the power y.

12Array & Struct Operationsshape & combine

F.explode(col)★
One output row per array/map element.
F.array(*cols) / F.size(col)
Build an array column / get its length.
col.getItem(0)
Element at index 0 of an array column.
F.array_distinct(col)
Unique elements within each array.
F.struct(*cols)
Bundle columns into a single struct column.
col.getField('x')
Pull one field out of a struct column.

13Joinsshape & combine

df.join(other, 'id', 'inner')★
Join on a shared column name.
df.join(other, df.id==other.pid, 'left')★
Join on an explicit condition.
df.join(other, ['first','last'], 'left')
Join on multiple key columns.
df.join(F.broadcast(small_df), 'id')★
Hint: replicate a small table instead of shuffling.
df.crossJoin(other)
Cartesian product — every row with every row.

14Combining: Unionshape & combine

df.union(df2)★
Stack rows; columns matched by position, not name.
df.unionByName(df2)★
Stack rows; columns matched by name — safer.
df.union(df2).distinct()
Union, then drop resulting duplicate rows.

15GroupBy, Aggregation & Pivotaggregate / stats

df.groupBy('col').count()★
Row count per group.
df.groupBy('c').agg(F.mean('x'), F.max('y'))★
Multiple aggregates per group.
df.groupBy('c').agg(F.collect_list('x'))
Gather each group's values into an array.
df.groupBy('c').pivot('k').agg(F.sum('v'))★
Long → wide, one column per key value.
df.agg(F.sum('x'), F.avg('x'))
Aggregate the whole DataFrame, no grouping.

16Window Functionsaggregate / stats

from pyspark.sql import Window★
Import the window-spec builder.
w = Window.partitionBy('c').orderBy(F.desc('date'))★
Define partitions + ordering for a window.
F.row_number().over(w)★
1, 2, 3... within each partition — no ties.
F.rank().over(w) / F.dense_rank().over(w)
Ranked position — with / without gaps on ties.
F.lag(col, 1).over(w) / F.lead(col, 1).over(w)
Previous / next row's value.
F.sum(col).over(w)
Running total within each partition.

17Sorting & Top-Nselect & filter

df.orderBy('col')★
Ascending sort by one or more columns.
df.orderBy(F.desc('col'))★
Descending sort.
df.orderBy(['a','b'], ascending=[0,1])
Mixed ascending/descending, per column.
df.limit(n)★
Keep only the first n rows.

18Running SQL Queriesselect & filter

df.createOrReplaceTempView('t')★
Register a DataFrame as a SQL-queryable view.
spark.sql("SELECT * FROM t WHERE age > 24")★
Query with plain SQL — returns a DataFrame.
df.createGlobalTempView('t')
View visible across sessions — prefix global_temp.

19UDFs — User Defined Functionshandle with care

F.udf(lambda x: x * 2, T.IntegerType())
Wrap a Python function as a Spark column function.
@F.udf(T.StringType()) def f(x): ...
Decorator form of the same thing.
df.withColumn('c', my_udf(df.c))
Apply a UDF like any other column expression.
prefer built-in F.* functionsfaster
UDFs run row-by-row in Python — often 10×+ slower.

20Converting & Collecting Outputinspect

df.collect()★memory
Pull all rows to the driver as a list — use with care.
df.take(n) / df.head(n)
Pull just the first n rows to the driver.
df.toPandas()★memory
Convert to a pandas DataFrame — driver-memory bound.
df.rdd
Drop down to the underlying RDD.
df.toJSON()
Convert each row to a JSON string.

21Repartitioning, Caching & Performancehandle with care

df.repartition(n)full shuffle
Shuffle to exactly n partitions — can increase or decrease.
df.coalesce(n)★cheaper
Merge partitions without a full shuffle — decrease only.
df.cache() / df.persist()★
Keep a reused DataFrame in memory across actions.
df.unpersist()
Free cached data once you're done with it.
df.rdd.getNumPartitions()
Check how many partitions a DataFrame has.

★Common Data Typespyspark.sql.types

StringType() IntegerType() LongType()★
Text and whole-number types.
DoubleType() / FloatType()
Floating-point numeric types.
BooleanType()
True / False values.
DateType() / TimestampType()★
Calendar date / date + time.
ArrayType(elementType)
A column whose values are lists.
StructType / StructField
Nested, schema-defined record columns.

★Join how= Quick-Readdf.join(..., how=)

'inner'★
Only keys present in both DataFrames.
'left' / 'left_outer'★
All rows from the left frame, matched where possible.
'right' / 'right_outer'
All rows from the right frame, matched where possible.
'outer' / 'full'
Every row from both sides; null where unmatched.
'left_semi'
Left rows that have a match — right columns dropped.
'left_anti'★
Left rows with no match in the right — great for exclusions.

pyspark cheat sheet

Shuffles, broadcasts & partitions, visually

Narrow transformation ★

Wide transformation

Broadcast join ★

coalesce() vs repartition()

Worth memorizing