from pyspark.sql import SparkSession★Entry point for the DataFrame & SQL API.spark = SparkSession.builder.appName('app').getOrCreate()★Create (or reuse) the active session.pip install pysparkInstall (orbrew install apache-sparkfor the shell).spark.sparkContextUnderlying SparkContext, for low-level RDD work.spark.versionCheck the installed Spark version.spark.stop()Release cluster resources when done.
spark.read.csv(path, header=True, inferSchema=True)★Read CSV; infer types or pass an explicit schema.spark.read.json(path)★Read line-delimited or multiline JSON.spark.read.parquet(path)★Read columnar Parquet — the Spark-native default.spark.read.format('jdbc').options(**opts).load()Read from a relational database.spark.read.load(path, format='orc')Generic loader for any registered format.spark.read.text(path)Read each line as a single-column row.
df.write.csv(path, header=True)★Write to CSV (one file per partition).df.write.parquet(path)★Write columnar Parquet — fast, typed, compressed.df.write.mode('overwrite').save(path)★mode: overwrite, append, ignore, or error.df.write.partitionBy('col').parquet(path)Write as a Hive-style partitioned directory tree.df.write.format('jdbc').options(**opts).save()Write to a relational database.
spark.createDataFrame(data, schema)★From a list of Rows/tuples, plus a schema.from pyspark.sql.types import StructType, StructField, StringTypeImport schema-building blocks.StructType([StructField('name', StringType(), True)])★Explicit schema — column, type, nullable.spark.createDataFrame(rdd.map(lambda p: Row(name=p[0])))Build a DataFrame from an existing RDD.
df.show(n, truncate=False)★Print the first n rows (default 20).df.printSchema()★Tree view of column names, types & nullability.df.dtypes / df.schemaColumn type list / full StructType schema object.df.columnsList of column names.df.count()★Number of rows — an action, triggers a full pass.df.describe().show()Summary stats for numeric & string columns.df.explain()Print the logical & physical execution plan.
df.select('a', 'b')★Project a subset of columns.df.select(F.col('a'), F.col('b').alias('bb'))Select with expressions & renaming.df.filter(df.age > 24) / df.where(...)★filterandwhereare exact aliases.df.filter((df.a>1)&(df.b<9))Combine conditions with& | ~, in parens.df.filter(df.name.isin('Bob','Mike'))★Keep rows matching any value in a list.df.filter(df.name.like('Al%'))SQL-style wildcard match.df.filter(df.name.rlike('[A-Z]+ice$'))Regex match.
df.withColumn('new', expr)★Add or replace a column (returns a new DataFrame).df.withColumnRenamed('old', 'new')★Rename a single column.df.drop('c1', 'c2')★Remove one or more columns.df.withColumn('p', df.p.cast(T.DoubleType()))★Cast a column to a new type.F.when(cond, val).otherwise(default)★Vectorized if/elif/else across rows.F.coalesce(col1, col2, F.lit('N/A'))First non-null value across columns.
df.na.fill(value) / df.fillna({'age': 0})★Replace nulls — scalar or per-column dict.df.na.drop() / df.dropna()★Drop rows containing any null.df.filter(df.col.isNull()) / .isNotNull()★Boolean null checks.df.dropDuplicates() / df.distinct()★Remove fully duplicate rows.df.dropDuplicates(['a', 'b'])De-duplicate on a subset of columns.df.replace({'': None}, subset=['name'])Swap specific values for null (or another value).
F.concat_ws('-', 'fname', 'lname')★Join columns with a separator.F.upper(col) / F.lower(col)★Case conversion.F.trim(col) F.ltrim(col) F.rtrim(col)Strip whitespace.F.substring(col, pos, len)Extract a fixed-position substring.F.regexp_replace(col, pattern, repl)★Regex find-and-replace.F.regexp_extract(col, pattern, idx)Pull out a regex capture group.F.split(col, pattern)Split into an array column.
F.current_date() / F.current_timestamp()Today's date / the current instant.F.to_date(col, 'yyyy-MM-dd')★Parse a string into a date column.F.to_timestamp(col, fmt)★Parse a string into a timestamp column.F.year(col) F.month(col) F.dayofmonth(col)★Extract date parts.F.date_add(col, n) / F.date_sub(col, n)Shift a date forward / backward n days.F.datediff(end, start)Whole days between two dates.
F.round(col, scale)Round to a given number of decimal places.F.floor(col) / F.ceil(col)Round down / up to the nearest integer.F.abs(col)Absolute value.F.least(*cols) / F.greatest(*cols)Smallest / largest value across columns.F.pow(x, y)x raised to the power y.
F.explode(col)★One output row per array/map element.F.array(*cols) / F.size(col)Build an array column / get its length.col.getItem(0)Element at index 0 of an array column.F.array_distinct(col)Unique elements within each array.F.struct(*cols)Bundle columns into a single struct column.col.getField('x')Pull one field out of a struct column.
df.join(other, 'id', 'inner')★Join on a shared column name.df.join(other, df.id==other.pid, 'left')★Join on an explicit condition.df.join(other, ['first','last'], 'left')Join on multiple key columns.df.join(F.broadcast(small_df), 'id')★Hint: replicate a small table instead of shuffling.df.crossJoin(other)Cartesian product — every row with every row.
df.union(df2)★Stack rows; columns matched by position, not name.df.unionByName(df2)★Stack rows; columns matched by name — safer.df.union(df2).distinct()Union, then drop resulting duplicate rows.
df.groupBy('col').count()★Row count per group.df.groupBy('c').agg(F.mean('x'), F.max('y'))★Multiple aggregates per group.df.groupBy('c').agg(F.collect_list('x'))Gather each group's values into an array.df.groupBy('c').pivot('k').agg(F.sum('v'))★Long → wide, one column per key value.df.agg(F.sum('x'), F.avg('x'))Aggregate the whole DataFrame, no grouping.
from pyspark.sql import Window★Import the window-spec builder.w = Window.partitionBy('c').orderBy(F.desc('date'))★Define partitions + ordering for a window.F.row_number().over(w)★1, 2, 3... within each partition — no ties.F.rank().over(w) / F.dense_rank().over(w)Ranked position — with / without gaps on ties.F.lag(col, 1).over(w) / F.lead(col, 1).over(w)Previous / next row's value.F.sum(col).over(w)Running total within each partition.
df.orderBy('col')★Ascending sort by one or more columns.df.orderBy(F.desc('col'))★Descending sort.df.orderBy(['a','b'], ascending=[0,1])Mixed ascending/descending, per column.df.limit(n)★Keep only the first n rows.
df.createOrReplaceTempView('t')★Register a DataFrame as a SQL-queryable view.spark.sql("SELECT * FROM t WHERE age > 24")★Query with plain SQL — returns a DataFrame.df.createGlobalTempView('t')View visible across sessions — prefixglobal_temp.
F.udf(lambda x: x * 2, T.IntegerType())Wrap a Python function as a Spark column function.@F.udf(T.StringType())
def f(x): ...Decorator form of the same thing.df.withColumn('c', my_udf(df.c))Apply a UDF like any other column expression.prefer built-in F.* functionsfasterUDFs run row-by-row in Python — often 10×+ slower.
df.collect()★memoryPull all rows to the driver as a list — use with care.df.take(n) / df.head(n)Pull just the first n rows to the driver.df.toPandas()★memoryConvert to a pandas DataFrame — driver-memory bound.df.rddDrop down to the underlying RDD.df.toJSON()Convert each row to a JSON string.
df.repartition(n)full shuffleShuffle to exactly n partitions — can increase or decrease.df.coalesce(n)★cheaperMerge partitions without a full shuffle — decrease only.df.cache() / df.persist()★Keep a reused DataFrame in memory across actions.df.unpersist()Free cached data once you're done with it.df.rdd.getNumPartitions()Check how many partitions a DataFrame has.
StringType() IntegerType() LongType()★Text and whole-number types.DoubleType() / FloatType()Floating-point numeric types.BooleanType()True / False values.DateType() / TimestampType()★Calendar date / date + time.ArrayType(elementType)A column whose values are lists.StructType / StructFieldNested, schema-defined record columns.
'inner'★Only keys present in both DataFrames.'left' / 'left_outer'★All rows from the left frame, matched where possible.'right' / 'right_outer'All rows from the right frame, matched where possible.'outer' / 'full'Every row from both sides; null where unmatched.'left_semi'Left rows that have a match — right columns dropped.'left_anti'★Left rows with no match in the right — great for exclusions.