Apache Spark Interview Questions and Answers for Data Engineers

Spark interviews for data engineering roles rarely test whether you can recite definitions. They test whether you understand *why* a job is slow, *what* happens when a stage shuffles 200 GB, and *how* you'd fix it. The questions below are the ones that actually come up, grouped so you can study by topic instead of scrolling a random list.

Here's what most candidates get wrong: they memorize "RDD is low-level, DataFrame is high-level" and stop there. The interviewer's follow-up is always *why does that matter for performance* — and that's where the answer falls apart. Each question here includes the short correct answer plus the follow-up reasoning that separates a pass from a hire.

This guide covers:

Basics — what Spark is, lazy evaluation, transformations vs. actions
RDD vs. DataFrame vs. Dataset — the trade-offs and when each wins
Architecture — driver, executors, cluster manager, jobs and stages
Performance tuning — shuffles, partitioning, caching, skew, the Catalyst optimizer
Structured streaming — the unbounded-table model and exactly-once guarantees

Spark basics

What is Apache Spark, and how is it different from Hadoop MapReduce?

Spark is a distributed processing engine for large-scale data. It was developed at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation in 2013.

The key difference from MapReduce: Spark keeps intermediate data in memory between steps instead of writing it to disk after every stage. MapReduce reads from disk, processes, writes to disk, then repeats. For iterative workloads like machine learning, that disk I/O dominates runtime — which is why Spark can be many times faster on the same job.

Say "in-memory computation" in the interview, but be ready for the follow-up: *what happens when the data doesn't fit in memory?* Answer: Spark spills to disk, and you tune spark.memory.fraction and partition sizes to control how often that happens.

What is lazy evaluation in Spark?

Spark doesn't run your transformations when you write them. It builds a plan — a directed acyclic graph (DAG) of operations — and only executes when an action forces a result.

According to the official RDD programming guide, "all transformations in Spark are lazy... The transformations are only computed when an action requires a result to be returned to the driver program."

Why it matters: lazy evaluation lets Spark see the whole pipeline before running it, so it can collapse steps, push filters down, and avoid materializing data nobody asked for. A filter followed by a map becomes a single pass, not two.
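If it helps to see the idea concretely, here is a minimal sketch in plain Python, not the Spark API. The `LazyPipeline` class and `traced_double` function are hypothetical names invented for this illustration. Transformations only record a step; the `collect` action runs every recorded step in one pass over the data.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, steps=()):
        self.source = source        # any iterable of records
        self.steps = tuple(steps)   # recorded plan: ("map" | "filter", function)

    def map(self, fn):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, do no work yet.
        return LazyPipeline(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Action: apply all recorded steps in a single pass over the source.
        out = []
        for record in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "filter":
                    if not fn(record):
                        keep = False
                        break
                else:
                    record = fn(record)
            if keep:
                out.append(record)
        return out


double_calls = []


def traced_double(x):
    double_calls.append(x)   # lets us observe when work actually happens
    return x * 2


plan = LazyPipeline(range(10)).filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(double_calls)   # nothing has executed yet
result = plan.collect()                   # the action triggers execution
```

Building `plan` does no work at all, and `traced_double` only ever sees the records that survived the filter. That is the single-pass behavior described above, in miniature.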

What's the difference between a transformation and an action?

| Aspect | Transformation | Action |
|---|---|---|
| Returns | A new RDD/DataFrame | A value to the driver or writes to storage |
| Execution | Lazy (builds the plan) | Triggers the actual computation |
| Examples | `map`, `filter`, `groupBy`, `join` | `count`, `collect`, `take`, `write` |
| Network cost | Some cause shuffles (wide) | Pulls results back |

A common trap: collect() is an action that pulls the entire dataset into the driver's memory. On a large result it crashes the driver. Use take(n) or write to storage instead.

What is a narrow vs. a wide transformation?

A narrow transformation (map, filter) needs data from only one input partition to compute one output partition — no data moves across the network. A wide transformation (groupByKey, join, repartition) needs data from multiple partitions, which triggers a shuffle.

Shuffles are the single biggest performance concern in Spark, so interviewers love this distinction. Wide transformations are where stage boundaries form and where most jobs slow down.
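A toy model in plain Python can make the dependency difference visible. This is a sketch, not Spark code: partitions are modeled as lists, the function names are hypothetical, and the partitioner is simplified to `key % num_out` on integer keys.

```python
def narrow_map(partitions, fn):
    # Each output partition depends on exactly one input partition.
    return [[fn(rec) for rec in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    # Any output partition may need records from every input partition,
    # so records are routed by key. That routing is the shuffle.
    out = [dict() for _ in range(num_out)]
    moved = 0
    for src_idx, part in enumerate(partitions):
        for key, value in part:
            dst_idx = key % num_out      # simplified partitioner (integer keys)
            if dst_idx != src_idx:
                moved += 1               # this record would cross partitions
            out[dst_idx].setdefault(key, []).append(value)
    return out, moved


partitions = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]
mapped = narrow_map(partitions, lambda kv: (kv[0], kv[1].upper()))
grouped, moved = wide_group_by_key(partitions, num_out=2)
```

The map touches each partition in isolation, while the grouping has to move two of the four records to bring equal keys together.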

RDD vs. DataFrame vs. Dataset

What is an RDD?

A Resilient Distributed Dataset is Spark's original core abstraction — an immutable, fault-tolerant collection of objects partitioned across the cluster. Wikipedia describes it as "a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way."

The "resilient" part comes from lineage: Spark records how each RDD was built, so if a partition is lost to a node failure, it recomputes just that partition by replaying the recorded transformations on the source (or on a cached ancestor) rather than restarting the job.
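Lineage-based recovery can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: it handles narrow transformations only, `ToyRDD` is a hypothetical class, and its `store` dictionary stands in for partitions held on executors.

```python
square_calls = []


def square(x):
    square_calls.append(x)   # trace which records get (re)computed
    return x * x


class ToyRDD:
    """Toy lineage model for narrow transformations (not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # list of partitions, set only on the root
        self.parent = parent   # lineage pointer to the parent dataset
        self.fn = fn           # per-record function that derives this dataset
        self.store = {}        # partition index -> computed records

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def partition(self, idx):
        # Compute a partition on demand by walking the lineage upward.
        if idx not in self.store:
            if self.parent is None:
                self.store[idx] = list(self.source[idx])
            else:
                self.store[idx] = [self.fn(r) for r in self.parent.partition(idx)]
        return self.store[idx]


root = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
squared = root.map(square)
all_partitions = [squared.partition(i) for i in range(3)]
calls_before_loss = len(square_calls)

del squared.store[1]                           # simulate losing partition 1
recovered = squared.partition(1)               # rebuilt from lineage
recomputed = square_calls[calls_before_loss:]  # only partition 1's records
```

Only the two records of the lost partition are recomputed; the other partitions are left untouched.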

How do RDDs, DataFrames, and Datasets differ?

| Feature | RDD | DataFrame | Dataset |
|---|---|---|---|
| Abstraction level | Low (objects) | High (named columns) | High (typed objects) |
| Schema | None | Yes | Yes |
| Type safety | Compile-time | Runtime only | Compile-time |
| Catalyst optimizer | No | Yes | Yes |
| Languages | All | All | Scala/Java only |

The Spark SQL guide defines a DataFrame as "a Dataset organized into named columns" — in Scala and Java, a DataFrame is just a Dataset[Row].

When would you use an RDD over a DataFrame?

Rarely, and that's the honest answer interviewers want. Reach for RDDs when you need fine-grained control over physical data placement, you're working with unstructured data that has no schema, or you need a transformation the DataFrame API doesn't expose.

The cost is real: RDDs don't go through the Catalyst optimizer, so Spark can't rewrite your query plan. For the vast majority of data engineering work, DataFrames are the right default: they're faster *and* less code.

Why is the DataFrame API usually faster than raw RDDs?

Two engines do the heavy lifting. Catalyst optimizes the logical query plan (pushing filters down, pruning unused columns, reordering joins). Tungsten handles physical execution with off-heap memory management and code generation that produces tight JVM bytecode.

RDDs are opaque to both — Spark sees lambda functions it can't inspect. A DataFrame's structure lets Spark reason about the query and rewrite it for you.

Spark architecture

Walk me through Spark's runtime architecture.

Three components, per the cluster mode overview:

Driver — runs your main() function and holds the SparkContext. It builds the DAG, schedules tasks, and coordinates everything. If the driver dies, the application dies.
Cluster manager — an external service (Standalone, YARN, or Kubernetes) that allocates resources across applications.
Executors — processes on worker nodes that run the actual tasks and cache data. Each application gets its own executors for isolation; they live for the whole application.

The flow: driver requests resources from the cluster manager, executors launch, the driver ships code and tasks to them, executors run the work and report back.

What's the relationship between a job, a stage, and a task?

A job is triggered by one action. Spark splits each job into stages at shuffle boundaries — every wide transformation creates a new stage. Each stage is split into tasks, where one task processes one partition.

So the math is simple: number of tasks in a stage equals the number of partitions. If you have 200 partitions, that stage runs 200 tasks. This is the mental model you need before any tuning question.
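The arithmetic can be written down as a small helper. This is a rough sketch in plain Python for a linear chain of operations. The `tasks_per_stage` function and `WIDE_OPS` set are hypothetical, and the 200 passed as `shuffle_partitions` is just an example value for the post-shuffle partition count.

```python
# Operations this sketch treats as wide (shuffle-producing). Simplification:
# a real join may avoid the shuffle, for example when one side is broadcast.
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition"}


def tasks_per_stage(ops, input_partitions, shuffle_partitions):
    """Return the number of tasks in each stage of one job."""
    stages = [input_partitions]   # first stage: one task per input partition
    for op in ops:
        if op in WIDE_OPS:
            # A shuffle boundary closes the current stage and opens a new one.
            stages.append(shuffle_partitions)
    return stages


stages = tasks_per_stage(
    ["map", "filter", "reduceByKey", "map", "groupByKey"],
    input_partitions=8,
    shuffle_partitions=200,
)
total_tasks = sum(stages)
```

Two wide operations give three stages: 8 tasks to read the input, then 200 tasks after each shuffle.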

What does the Catalyst optimizer do?

Catalyst is Spark SQL's query optimizer. It transforms your query through four phases: analysis (resolving table and column references), logical optimization (rule-based rewrites like predicate pushdown and constant folding), physical planning (generating candidate execution plans and picking the cheapest), and code generation (compiling the plan to JVM bytecode).

The practical takeaway for an interview: you write *what* you want, Catalyst decides *how* to run it. That's why DataFrame code that looks naive can still run efficiently — the optimizer cleans it up.

Performance tuning

What is a shuffle, and why is it expensive?

A shuffle redistributes data across partitions so that all rows with the same key land in the same place — needed for joins, groupBy, and aggregations. It's expensive because it writes intermediate files to disk and moves data across the network between executors.

When you cite shuffles as your top tuning concern, you sound like someone who has actually debugged a slow job. The standard advice: minimize wide transformations, filter *before* you join, and avoid groupByKey in favor of reduceByKey, which combines values locally before shuffling.
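To put a number on the groupByKey versus reduceByKey difference, here is a toy count of how many records would cross the shuffle in each case. It is plain Python with hypothetical function names; it models record counts only and ignores serialization and record sizes.

```python
def shuffled_records_group_by_key(partitions):
    # groupByKey sends every (key, value) record through the shuffle.
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by_key(partitions):
    # reduceByKey combines values per key inside each partition first,
    # so each partition emits at most one record per distinct key.
    return sum(len({key for key, _ in part}) for part in partitions)


partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1)],
]
group_cost = shuffled_records_group_by_key(partitions)
reduce_cost = shuffled_records_reduce_by_key(partitions)
```

Here six records shrink to three before the shuffle. The more repeated keys a partition holds, the bigger the saving from combining locally.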

How does partitioning affect performance?

Partitions are the unit of parallelism — one task per partition. Too few partitions and you underuse the cluster; too many and the scheduling overhead per tiny task dominates. A common starting point is 2 to 4 partitions per CPU core.

Use repartition() to increase partitions (full shuffle) and coalesce() to decrease them (avoids a shuffle by merging existing partitions). Knowing that coalesce is the cheaper option for shrinking partition count is a frequent follow-up.
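The sketch below models both points in plain Python. The function names mirror the Spark methods but are toy reimplementations: `coalesce` merges whole partitions, `repartition` uses a simplified round-robin redistribution, and the `per_core=3` default is an assumed value inside the 2 to 4 range.

```python
def suggested_partitions(total_cores, per_core=3):
    # Rule-of-thumb starting point; per_core=3 is an assumed middle value.
    return total_cores * per_core


def coalesce(partitions, n):
    # Merge whole existing partitions into n groups. No record is
    # redistributed individually, so sizes can end up uneven.
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out


def repartition(partitions, n):
    # Redistribute every record (a full shuffle), giving even sizes.
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for rec in part:
            out[i % n].append(rec)
            i += 1
    return out


partitions = [[1, 2, 3, 4], [5], [6], [7]]
merged = coalesce(partitions, 2)
rebalanced = repartition(partitions, 2)
start_count = suggested_partitions(total_cores=16)
```

Merging is cheap but inherits the existing imbalance (5 records versus 2 here). The full redistribution costs a shuffle but evens the sizes out (4 versus 3).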

What is data skew, and how do you handle it?

Skew happens when one key holds far more rows than the others, so a single task does most of the work while the rest sit idle. One slow task drags out the whole stage.

Common fixes:

Salting: append a random suffix to the hot key to spread it across more partitions, then aggregate in two passes.
Broadcast join: if one side of a join is small, broadcast it to every executor so no shuffle happens on the skewed side.
Adaptive Query Execution (AQE): enabled by default since Spark 3.2, it detects and splits skewed join partitions at runtime.
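Salting is the fix interviewers most often ask you to walk through, so here is a minimal two-pass sketch in plain Python. The `salted_sum` function is hypothetical, and the salt is assigned round-robin to keep the example deterministic; real Spark code would typically use a random salt.

```python
from collections import Counter


def salted_sum(records, num_salts):
    # Pass 1: aggregate on (key, salt) so a hot key is split into
    # num_salts smaller groups that can be processed in parallel.
    partial = Counter()
    for i, (key, value) in enumerate(records):
        salt = i % num_salts   # deterministic stand-in for a random salt
        partial[(key, salt)] += value
    # Pass 2: drop the salt and combine the partial results per key.
    final = Counter()
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return partial, final


records = [("hot", 1)] * 9 + [("cold", 1)] * 3
partial, final = salted_sum(records, num_salts=3)
largest_group = max(partial.values())   # biggest unit of work after salting
```

Without salting, one task would handle all 9 "hot" rows. With three salts the largest group is 3 rows, and the second pass still produces the correct totals.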

When should you cache or persist a DataFrame?

Cache when you reuse the same dataset across multiple actions — without it, Spark recomputes the entire lineage each time. The classic case is an iterative algorithm or a dataset you query several times.

Don't cache blindly. Caching consumes executor memory, and if it forces other data to spill, it can make things slower. For DataFrames and Datasets, cache() uses the MEMORY_AND_DISK storage level by default (for RDDs it is MEMORY_ONLY); persist() lets you pick a level explicitly, such as DISK_ONLY or MEMORY_ONLY.
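The recomputation cost is easy to demonstrate with a toy model. This is plain Python, not Spark: `ToyDataset` is a hypothetical class whose `computations` counter tracks how often the full lineage is re-run.

```python
class ToyDataset:
    """Toy model of recompute-per-action versus cached reuse."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds all rows
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how many times the lineage was run

    def cache(self):
        # Only marks the dataset; rows are stored on the first action.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def expensive_rows():
    return [x * x for x in range(5)]


uncached = ToyDataset(expensive_rows)
uncached.count()
uncached.total()

cached = ToyDataset(expensive_rows).cache()
cached_count = cached.count()
cached_total = cached.total()
```

Two actions on the uncached dataset run the computation twice; on the cached one, the second action reuses the stored rows.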

What is a broadcast join?

When you join a large table with a small one, Spark can send a full copy of the small table to every executor. Each executor then joins locally with no shuffle of the large table — usually a massive speedup. Spark does this automatically when a table is under spark.sql.autoBroadcastJoinThreshold (10 MB by default), or you can force it with the broadcast() hint.
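The mechanics look roughly like the sketch below, written in plain Python rather than Spark. `broadcast_hash_join` is a hypothetical function that performs an inner join and assumes the small side has unique keys; the constant mirrors the 10 MB default mentioned above.

```python
# 10 MB expressed in bytes, matching the default threshold named above.
AUTO_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def broadcast_hash_join(large_partitions, small_rows):
    # Build the lookup table once; conceptually every executor gets a copy.
    lookup = dict(small_rows)
    # Each partition of the large side joins locally, so no large-side
    # record has to move between partitions.
    return [
        [(key, value, lookup[key]) for key, value in part if key in lookup]
        for part in large_partitions
    ]


large = [[(1, "x"), (2, "y")], [(3, "z"), (1, "w")]]
small = [(1, "one"), (2, "two")]
joined = broadcast_hash_join(large, small)
```

The output keeps the large side's partition layout exactly as it was, which is the reason no shuffle of the large table is needed.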

Structured streaming

What is Structured Streaming, and what's the core idea behind it?

Structured Streaming is Spark's stream-processing engine built on Spark SQL. The core abstraction: treat a live data stream as an unbounded table that keeps growing, and express your streaming logic with the same DataFrame/Dataset API you'd use for a batch job. New data is just new rows appended to the table.

This is the key insight to lead with — you don't learn a separate streaming API, you write batch-style code and Spark runs it incrementally.
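A small plain-Python model shows why the unbounded-table view works: running the query incrementally over micro-batches gives the same answer as running it once over everything received so far. The names here are hypothetical and this is not the Structured Streaming API.

```python
from collections import Counter


def batch_word_count(all_rows):
    # The "batch job": count over the whole table at once.
    return dict(Counter(all_rows))


class IncrementalWordCount:
    """Keeps running state and updates it as new rows are appended."""

    def __init__(self):
        self.state = Counter()

    def process_micro_batch(self, new_rows):
        self.state.update(new_rows)   # only the new rows are processed
        return dict(self.state)


micro_batches = [["a", "b"], ["a"], ["c", "a"]]
query = IncrementalWordCount()
latest = None
for rows in micro_batches:
    latest = query.process_micro_batch(rows)

everything = [row for rows in micro_batches for row in rows]
batch_result = batch_word_count(everything)
```

The incremental result after the last micro-batch matches the batch result over the full table, even though no micro-batch ever rescanned old rows.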

How does Structured Streaming actually process data?

By default it uses a micro-batch engine. The streaming guide describes it as processing "data streams as a series of small batch jobs," achieving latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees.

Spark also offers an experimental Continuous Processing mode (introduced in Spark 2.3) for ~1 ms latency with at-least-once guarantees, but micro-batch is the default and the one to discuss first.

How does Structured Streaming guarantee exactly-once processing?

Exactly-once is an end-to-end property that depends on three pieces working together: a replayable source (one that can re-read data by offset, such as Kafka), checkpointing with write-ahead logs that record the offset range of each batch, and idempotent sinks. On failure, Spark replays from the last committed offset, and the idempotent sink ensures that rewriting a batch does not produce duplicate rows. This is also a good moment to mention watermarking, which is how Spark bounds state for late-arriving data in windowed aggregations.
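The replay-plus-idempotence idea can be sketched in plain Python. This is a deliberately simplified model, not Spark's implementation: `IdempotentSink`, `run`, and the dictionary checkpoint are hypothetical, and the source is a list that can be re-read by offset.

```python
class IdempotentSink:
    """Writes are keyed by batch id, so replaying a batch overwrites it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def all_rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


def run(source, sink, checkpoint, batch_size, fail_before_commit_at=None):
    # Process batches starting from the last committed offset.
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        end = min(start + batch_size, len(source))
        batch_id = checkpoint["batch_id"]
        sink.write(batch_id, source[start:end])
        if batch_id == fail_before_commit_at:
            raise RuntimeError("crash after sink write, before commit")
        # Commit progress only after the write succeeded.
        checkpoint["offset"] = end
        checkpoint["batch_id"] = batch_id + 1


source = list(range(6))                      # replayable by offset
sink = IdempotentSink()
checkpoint = {"offset": 0, "batch_id": 0}    # survives the simulated crash

try:
    run(source, sink, checkpoint, batch_size=2, fail_before_commit_at=1)
except RuntimeError:
    pass                                     # simulated failure

checkpoint_after_crash = dict(checkpoint)
run(source, sink, checkpoint, batch_size=2)  # restart replays batch 1
```

Batch 1 is written twice, once before the crash and once on replay, but the sink ends up with each row exactly once because the second write overwrites the first.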

FAQ

How many Spark interview questions should I prepare?

Favor depth over breadth. Master the topics in this guide (architecture, the RDD/DataFrame distinction, shuffles, partitioning, and streaming basics) and you'll handle most of what a data engineering interview throws at you. Interviewers tend to probe one area deeply rather than skim twenty shallow definitions.

Do I need to know Scala for a Spark data engineering role?

Not usually. Most data engineering teams run Spark through PySpark (Python), and the DataFrame API is nearly identical across languages. Scala matters mainly if the role involves the typed Dataset API, which is available only in Scala and Java, or contributing to internal Spark libraries. Check the job description.

What's the most common Spark interview mistake?

Treating questions as definition recall. When asked about RDDs vs. DataFrames, weak candidates list features; strong ones explain the performance consequence — that DataFrames go through Catalyst and RDDs don't. Always tie your answer back to *why it matters for a real job*.

How is a Spark interview different from a general data engineering interview?

A Spark round goes deep on the engine: shuffles, the DAG, memory management, and tuning. A broader data engineer interview also covers pipelines, modeling, SQL, and tools like Kafka for streaming ingestion. Expect Spark questions inside a data engineering loop, plus a dedicated deep-dive if the role is Spark-heavy.

Prepare for the people, not just the questions

Knowing the answers gets you in the door. What gets you the offer is reading the room — understanding what *this* interviewer cares about, what the team is actually building, and where your experience lines up with their pain. Spark expertise is table stakes for a data engineering role; the differentiator is showing up to the conversation already understanding the person across the table.

That's where Articuler helps. Instead of applying through a job board and hoping, you can find the hiring manager or the engineering lead behind a role, build a Playbook on their background and what they work on, and send a personalized note that gets a reply — roughly 8x the rate of a generic cold message. The fastest path into a data engineering team is rarely the apply button; it's a 15-minute conversation with the person hiring.

For more interview prep, see our guides on PySpark interview questions and how to ace an interview.