PySpark Interview Questions: Fundamentals, Performance, and Coding

PySpark interviews test three things: whether you understand how Apache Spark actually works, whether you can tune jobs that fall over in production, and whether you can write clean, idiomatic PySpark code. This guide covers the questions that show up most in data engineering interviews — grouped by level so you can focus on where you need the most work. If you're also preparing for cloud or infrastructure rounds, the AWS interview questions guide and the Kafka interview questions guide cover adjacent topics you'll likely face in the same loop.

Here's what you'll find:

Fundamentals — RDDs vs DataFrames, lazy evaluation, transformations vs actions, SparkSession
Performance — partitioning, shuffles, broadcast joins, caching, data skew
Coding — groupBy/agg, window functions, null handling, UDFs vs built-ins

Quick Reference: Topics, Questions, and What They Test

Topic	Sample Question	What the Interviewer Tests
RDD vs DataFrame	"When would you use an RDD over a DataFrame?"	API tradeoffs, Catalyst Optimizer awareness
Lazy evaluation	"Why does Spark use lazy evaluation?"	DAG understanding, optimization intuition
Transformations vs actions	"Name three wide transformations and explain the shuffle cost."	Execution model knowledge
Partitioning	"What's the difference between `repartition` and `coalesce`?"	Data movement costs, when each is appropriate
Broadcast join	"How does a broadcast join reduce shuffle?"	Join strategy selection, memory awareness
Data skew	"A single executor is running 10x longer than the rest — what do you check?"	Real-world debugging experience
Window functions	"Write a query that ranks employees by salary within each department."	SQL-style PySpark, `Window` spec fluency
UDFs	"Why avoid Python UDFs when a built-in exists?"	Performance cost of serialization overhead
Caching	"When would you call `cache()` vs `persist(DISK_ONLY)`?"	Storage level tradeoffs
Null handling	"How does PySpark treat nulls in aggregations?"	Edge-case awareness

Fundamentals

RDD vs DataFrame vs Dataset

RDD (Resilient Distributed Dataset) is Spark's lowest-level abstraction — an immutable, partitioned collection of objects processed in parallel. Spark tracks lineage so it can recompute lost partitions. You get full control over data structure and transformations, but no automatic optimization.

DataFrame is a distributed table with a named, typed schema. The Spark SQL Catalyst Optimizer rewrites your query into an efficient physical plan. Tungsten handles low-level memory and CPU optimization. For most work, DataFrames are faster and easier to maintain.

Dataset is the typed, JVM-native counterpart to DataFrame — relevant in Scala/Java, but in Python every DataFrame is already a Dataset[Row], so the distinction rarely comes up in PySpark interviews.

When to reach for an RDD: custom partitioning logic, low-level control over how data is serialized, or operations that have no DataFrame equivalent.

Lazy Evaluation

PySpark doesn't execute transformations immediately. Instead, it builds a directed acyclic graph (DAG) of operations. Nothing runs until you call an action (like .count(), .show(), .write()).

Why this matters in interviews: lazy evaluation lets Spark optimize the entire pipeline before running it — pushing filters down, eliminating unnecessary shuffles, combining adjacent maps. A question like "why did my filter not reduce the data before the join?" usually has the answer: an action triggered execution before the filter could propagate.

# Nothing runs here — Spark builds a DAG
df_filtered = df.filter(df["status"] == "active").select("user_id", "amount")

# This triggers execution
df_filtered.count()

Transformations vs Actions

Type	Examples	Behavior
Narrow transformation	`filter`, `map`, `select`, `withColumn`	Each input partition maps to one output partition — no shuffle
Wide transformation	`groupBy`, `join`, `distinct`, `repartition`	Data moves across the network — triggers a shuffle
Action	`count`, `show`, `collect`, `write`	Triggers the DAG to execute

Wide transformations are expensive because of the shuffle. A common interview follow-up: "What's the cost of calling groupBy on a 1 TB dataset?" The answer involves partitions, network I/O, and how many downstream stages are created.

SparkSession

SparkSession is the single entry point to all Spark functionality since Spark 2.0 — it replaces SparkContext, SQLContext, and HiveContext.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyJob") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

An interviewer asking about SparkSession wants to know you understand the execution environment — driver vs executors, application vs session lifecycle.

Performance Questions

Partitioning: repartition vs coalesce

`repartition(n)` reshuffles all data to create exactly n balanced partitions. Use it before a join or aggregation on an unbalanced dataset, or when you need to increase the partition count.

`coalesce(n)` reduces partitions without a full shuffle by combining adjacent ones. Use it when writing output — fewer files, less overhead.

# Increase partitions before a join
df_large = df_large.repartition(400, "customer_id")

# Reduce partitions before writing
df_result.coalesce(10).write.parquet("s3://bucket/output/")

The Apache Spark RDD Programming Guide recommends 2–4 partitions per CPU core as a starting point for spark.default.parallelism.

Shuffles and How to Reduce Them

A shuffle writes data to disk, sends it across the network, and reads it back. They're necessary for wide operations, but you can reduce their cost:

Pre-partition on the join key before repeated joins on the same column
Use `broadcast` for small tables — Spark ships the whole table to every executor, eliminating the shuffle entirely
Tune `spark.sql.shuffle.partitions` — the default is 200, which is too high for small data and too low for large clusters

Broadcast Joins

When one side of a join is small (typically under ~10 MB by default, configurable via spark.sql.autoBroadcastJoinThreshold), Spark can broadcast it to all executors:

from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), on="product_id", how="inner")

The interviewer wants to know: you understand the threshold, you know when to force it manually, and you know the risk (OOM if the "small" table is actually large).

Caching and Persist

.cache() stores the DataFrame in memory (default: MEMORY_AND_DISK). .persist(storageLevel) lets you choose:

Storage Level	When to Use
`MEMORY_ONLY`	Fast recompute; fits easily in RAM
`MEMORY_AND_DISK`	Default; spills to disk if needed
`DISK_ONLY`	Very large DataFrames; reuse isn't frequent
`OFF_HEAP`	Reduces GC pressure in tuned environments

Cache a DataFrame when you use it more than once in the same job. Don't cache everything — it evicts other data and wastes memory.

# Cache a filtered subset used in multiple subsequent joins
active_users = df.filter(df["status"] == "active").cache()

Data Skew

Skew is when one partition holds far more data than the others — one executor runs for 10 minutes while the rest finish in 30 seconds. Signs: the Spark UI shows a single slow task while all others are done.

Common fixes:

Salting: add a random prefix to the skewed key, join on the salted key, then aggregate again
Broadcast the small side: if one side of a join is small, broadcasting eliminates the skew problem entirely
Adaptive Query Execution (AQE): Spark 3.x can split skewed partitions automatically — check spark.sql.adaptive.enabled

import pyspark.sql.functions as F

# Salt a skewed join key
df_large = df_large.withColumn("salted_key",
    F.concat(df_large["customer_id"], F.lit("_"), (F.rand() * 10).cast("int")))

Coding Questions

groupBy and Aggregation

from pyspark.sql import functions as F

# Total revenue and average order value per region
result = df.groupBy("region") \
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.avg("order_value").alias("avg_order_value"),
        F.count("order_id").alias("order_count")
    )

A follow-up interviewers like: "What's the difference between groupBy + sum and reduceByKey on an RDD?" The DataFrame version benefits from Catalyst and partial aggregation (less data shuffled); reduceByKey does the same on RDDs but requires manual optimization.

Window Functions

Window functions apply calculations across a set of rows related to the current row — without collapsing rows like a groupBy does.

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Rank employees by salary within each department
w = Window.partitionBy("department").orderBy(F.desc("salary"))

df_ranked = df.withColumn("rank", F.rank().over(w)) \
              .withColumn("dense_rank", F.dense_rank().over(w)) \
              .withColumn("running_total", F.sum("salary").over(
                  w.rowsBetween(Window.unboundedPreceding, Window.currentRow)
              ))

Know the difference between rank (gaps for ties), dense_rank (no gaps), and row_number (unique, arbitrary for ties). The PySpark DataFrame API reference covers the full Window spec.

Handling Nulls

PySpark propagates nulls through most operations. In aggregations, sum, avg, and count(col) ignore nulls; count(*) counts all rows including nulls.

# Fill nulls before aggregation
df_clean = df.fillna({"revenue": 0, "region": "unknown"})

# Drop rows where key columns are null
df_clean = df.dropna(subset=["user_id", "event_type"])

# Filter explicitly
df_clean = df.filter(df["user_id"].isNotNull())

A common interview trap: "What does df.filter(df['x'] != 5) return when x is null?" Answer: rows where x is null are dropped — null comparisons return null, not false. Use isNull() / isNotNull() explicitly.

UDFs vs Built-in Functions

Avoid Python UDFs when a built-in exists. Python UDFs serialize each row between the JVM and Python interpreter — they're 10–100x slower than built-in Spark functions, which run inside the JVM with Catalyst optimization.

# Bad: Python UDF for something built-in handles fine
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def upper_udf(s):
    return s.upper() if s else None

# Good: use the built-in
from pyspark.sql.functions import upper
df = df.withColumn("name_upper", upper(df["name"]))

When you genuinely need a UDF — custom business logic, third-party library calls — prefer Pandas UDFs (also called vectorized UDFs). They operate on Arrow-backed batches, avoiding the row-by-row serialization cost. Databricks' documentation on Delta Lake best practices also covers how to structure pipelines to minimize UDF usage.

For complex transformations involving Python-native libraries, check the Python UDF documentation for the current scalar and table function options.

Prepping for the Specific Interviewer

Clearing the technical bar gets you to the next round. What gets you the offer is showing up to that next conversation prepared — knowing what the interviewer actually cares about, what projects they've shipped, and what problems their team is solving.

If you're targeting data engineering roles at specific companies, Articuler can help you find the exact hiring manager or engineering lead behind the role, build a Playbook on that person before the interview, and write a personalized intro that gets a reply — rather than disappearing into a black-hole ATS. Semantic matching across 980M+ profiles, AI-drafted outreach with reply rates of 40–60%, and interview prep on the specific person you're meeting — not generic Glassdoor advice. For more interview prep on the technical side, the data engineer interview questions guide and the technical interview questions overview are good starting points.

FAQ

What is the difference between a transformation and an action in PySpark?

A transformation (like filter, select, groupBy) returns a new DataFrame and is lazy — it builds the DAG but doesn't execute. An action (like count, show, collect, write) triggers execution of the entire DAG up to that point. This distinction matters for debugging: if you're not seeing results, check whether you've called an action.

Why is shuffle expensive in Spark?

A shuffle involves writing intermediate data to disk on each executor, transferring it over the network to the correct target partitions, then reading it back. This disk I/O plus network transfer is the primary bottleneck in wide operations like groupBy and join. Reducing shuffle volume — via broadcast joins, pre-partitioning, or filtering before aggregating — is the single highest-leverage performance optimization in most PySpark jobs.

When should I use cache() in PySpark?

Cache a DataFrame when you reference it more than once in the same job — for example, a filtered base dataset used in two different joins. Don't cache speculatively. Each cached DataFrame occupies executor memory, which evicts other cached data and can cause spills. Always call df.unpersist() when you're done with a cached DataFrame to free the memory explicitly.

What is adaptive query execution (AQE) in Spark 3?

AQE (enabled by default in Spark 3.2+) allows Spark to re-optimize the physical plan at runtime using statistics collected during execution. It can automatically coalesce small shuffle partitions, switch join strategies based on actual data sizes, and split skewed partitions. For most teams, enabling AQE reduces the need to manually tune spark.sql.shuffle.partitions and handle skew with salt keys.

https://www.articuler.ai/resources/guides/data-engineer-interview-questions/
https://www.articuler.ai/resources/guides/technical-interview-questions/
https://www.articuler.ai/resources/guides/kafka-interview-questions/
https://www.articuler.ai/resources/guides/aws-interview-questions/
https://www.articuler.ai/product/ai-meeting-prep/