
If you are interviewing for a data, backend, or platform engineering role that touches event streaming, expect Apache Kafka to come up. This guide collects 30+ real Kafka interview questions grouped by topic, each with a concise sample answer you can adapt. Questions scale from junior to senior: junior rounds test whether you can define a topic, a partition, and a consumer group correctly; mid-level rounds probe delivery semantics and rebalancing; senior rounds push into replication math, exactly-once design, and the trade-offs of running Kafka at scale. Read the answers, then practice saying them out loud in your own words. Interviewers reward engineers who explain *why* Kafka behaves a certain way, not just *what* a setting does.
A quick map of where interviewers focus by level:
| Level | What they test | Typical questions |
|---|---|---|
| Junior | Core vocabulary and the mental model | What is a topic? What is an offset? How do consumer groups work? |
| Mid | Operational behavior and client config | Delivery semantics, rebalancing, acks, consumer lag |
| Senior | Design trade-offs and failure modes | ISR and min.insync.replicas, exactly-once, KRaft vs ZooKeeper, capacity planning |
Kafka Fundamentals
These questions establish that you understand the storage model. Get the vocabulary exactly right, because interviewers use loose answers here as a signal to dig deeper.
What is Apache Kafka, in one or two sentences? Apache Kafka is a distributed, append-only commit log used as an event streaming platform. Producers write records to topics, brokers store them durably and in order within each partition, and consumers read them at their own pace. Unlike a traditional message queue, Kafka retains records after they are read, so multiple independent consumers can replay the same stream.
What is a topic, and what is a partition? A topic is a named logical stream of records, like orders or clicks. Each topic is split into one or more partitions, and a partition is the actual ordered, immutable log on disk. Partitions are the unit of parallelism and ordering: Kafka guarantees ordering *within* a partition, not across the whole topic. You pick the partition count to match your throughput and consumer-parallelism needs.
What is an offset? An offset is a monotonically increasing integer that uniquely identifies each record within a partition. Consumers track the offset of the next record they want to read, which acts as a bookmark. Committed offsets let a consumer resume from where it left off after a restart, and they are stored in an internal compacted topic called __consumer_offsets.
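To make the partition and offset vocabulary concrete, here is a minimal sketch of one partition as an in-memory append-only list. `PartitionLog` and its methods are hypothetical names invented for illustration; this is a toy model, not the Kafka client API.

```python
class PartitionLog:
    """Toy in-memory stand-in for a single partition: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset, max_records=10):
        """Return (offset, value) pairs starting at `offset`."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# The committed offset is the NEXT record the consumer wants to read.
committed = 0
batch = log.read_from(committed, max_records=2)
committed = batch[-1][0] + 1  # move the bookmark past what was processed

# Reading does not delete anything, so another consumer can replay from 0.
replay = log.read_from(0)
```

The second read from offset 0 still returns every record, which is the replay property that separates Kafka from a traditional queue.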
What is a broker, and what is a Kafka cluster? A broker is a single Kafka server that stores partition data and serves produce and fetch requests. A cluster is a group of brokers working together. Each partition has one broker acting as its leader (handling reads and writes) and zero or more brokers holding follower replicas for fault tolerance. Spreading partitions and their replicas across brokers is how Kafka scales and survives node failure.
How does Kafka use keys when producing a record? If a producer sets a key, Kafka hashes that key to choose a partition, so all records with the same key land in the same partition and stay ordered relative to each other. This is how you keep, say, all events for one user_id in order. If the key is null, records are distributed across partitions (round-robin or sticky batching), which maximizes throughput but gives up per-key ordering.
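The key-to-partition mapping can be sketched in a few lines. This assumes a hash-then-modulo scheme and uses `zlib.crc32` purely as a stand-in hash; Kafka's default partitioner uses a different hash function (murmur2), so the partition numbers here will not match a real cluster. `choose_partition` is a hypothetical helper.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing it (stand-in hash, not murmur2)."""
    return zlib.crc32(key) % num_partitions


keys = [b"user-1", b"user-2", b"user-1", b"user-3", b"user-1"]
assignments = [choose_partition(k, 6) for k in keys]

# Every record for user-1 lands in the same partition, so they stay ordered.
user_1_partitions = {p for k, p in zip(keys, assignments) if k == b"user-1"}
```

Because the result depends on `num_partitions`, adding partitions to a topic later can send an existing key to a different partition, which is worth mentioning when an interviewer asks about per-key ordering.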
What is log retention and log compaction? Retention controls how long records stay before deletion, set by time (retention.ms) or size (retention.bytes). Compaction is a different cleanup policy: instead of deleting by age, a compacted topic keeps at least the latest record for each key and removes older duplicates. Compaction is what makes __consumer_offsets and changelog topics work as a key-value snapshot.
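A rough sketch of what compaction leaves behind, assuming records are simple `(key, value)` pairs. The real log cleaner also preserves original offsets, handles delete markers (tombstones), and skips the active segment; none of that is modeled here. `compact` is a hypothetical helper.

```python
def compact(records):
    """Keep only the latest record per key, preserving log order of survivors."""
    latest_index = {}
    for i, (key, _value) in enumerate(records):
        latest_index[key] = i  # later records overwrite earlier positions
    return [rec for i, rec in enumerate(records) if latest_index[rec[0]] == i]


log = [
    ("user-1", "v1"),
    ("user-2", "v1"),
    ("user-1", "v2"),
    ("user-3", "v1"),
    ("user-2", "v2"),
]
compacted = compact(log)
# compacted == [("user-1", "v2"), ("user-3", "v1"), ("user-2", "v2")]
```

Reading the compacted log from the start rebuilds the current value for every key, which is the snapshot behavior described above.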
Producers and Consumers
This section is the heart of most mid-level interviews. Expect questions about how clients balance throughput, ordering, and delivery guarantees.
How do consumer groups distribute work? A consumer group is a set of consumers that cooperatively read a topic by dividing its partitions among themselves. Each partition is assigned to exactly one consumer in the group at a time, which prevents duplicate processing and lets you scale out by adding consumers. If you have more consumers than partitions, the extras sit idle, so partition count caps your group parallelism.
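The "one partition, one consumer per group" rule can be illustrated with a simplified round-robin assignment. This is only a sketch: `assign_round_robin` is a hypothetical function, and Kafka's built-in assignors (range, round-robin, sticky, cooperative sticky) apply more involved logic.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


balanced = assign_round_robin(range(4), ["c1", "c2"])
# balanced == {"c1": [0, 2], "c2": [1, 3]}

oversized_group = assign_round_robin(range(2), ["c1", "c2", "c3"])
# oversized_group == {"c1": [0], "c2": [1], "c3": []}
```

In the second call `c3` receives nothing, which shows why adding consumers beyond the partition count does not add throughput.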
What triggers a consumer group rebalance, and why does it matter? A rebalance reassigns partitions across the group. It is triggered when a consumer joins or leaves, when a consumer is considered dead after missing heartbeats, or when partitions are added to a subscribed topic. Rebalances matter because consumers pause processing during them; frequent rebalances ("rebalance storms") hurt throughput. Cooperative/incremental rebalancing and tuning session.timeout.ms and max.poll.interval.ms reduce the pain.
What is the `acks` setting and what do its values mean? acks controls how many replicas must acknowledge a write before the producer treats it as successful. acks=0 means fire-and-forget (fastest, can lose data). acks=1 means the leader acknowledges before replication (a leader failure can lose recent writes). acks=all means the leader waits for all in-sync replicas, which is the durable choice and the one you pair with min.insync.replicas for strong guarantees.
What is consumer lag, and how do you reason about it? Consumer lag is the difference between the latest offset produced to a partition (log-end offset) and the last offset the consumer has committed. Growing lag means consumers are falling behind producers. You diagnose it by checking whether the bottleneck is consumer processing time, too few partitions, an undersized consumer group, or slow downstream calls, then scale or optimize the right layer.
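The lag arithmetic is simple enough to show directly. The offsets below are made-up numbers for one consumer group reading a three-partition topic, and `partition_lag` is a hypothetical helper.

```python
def partition_lag(log_end_offset, committed_offset):
    """Lag for one partition: records written but not yet committed as read."""
    return max(log_end_offset - committed_offset, 0)


# Illustrative offsets per partition (assumed values, not real measurements).
log_end = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}

lag = {p: partition_lag(log_end[p], committed[p]) for p in log_end}
total_lag = sum(lag.values())
# lag == {0: 0, 1: 280, 2: 5}; total_lag == 285
```

Looking at lag per partition rather than only the total helps separate a slow group (lag everywhere) from a hot key or a stuck consumer (lag concentrated on one partition, as with partition 1 here).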
When would you manually commit offsets instead of using auto-commit? Auto-commit periodically commits offsets on a timer, which is simple but can commit records you have not finished processing, risking data loss on a crash. You commit manually (after successful processing) when you need at-least-once correctness, want to control commit boundaries around a batch, or are coordinating offsets with an external system or transaction.
The three delivery semantics come up constantly. Be ready to map each one to producer and consumer behavior.
| Delivery semantic | Guarantee | How you get it |
|---|---|---|
| At-most-once | Records may be lost, never duplicated | Commit offsets before processing; producer retries disabled (often with acks=0 or acks=1) |
| At-least-once | Records never lost, may be duplicated | Commit offsets after processing; acks=all, retries enabled |
| Exactly-once | Each record processed once, no loss or duplicates | Idempotent producer + transactions, or Kafka Streams EOS |
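The difference between the first two rows comes down to whether the consumer commits before or after processing. The simulation below is a sketch under simplifying assumptions: one partition, one record per step, and a crash that always lands between the two steps. `run` and `run_with_restart` are hypothetical helpers, not client API calls.

```python
def run(records, start, commit_first, crash_at=None):
    """Consume from `start`; return (processed_values, committed_offset)."""
    processed = []
    committed = start
    for offset in range(start, len(records)):
        if commit_first:
            committed = offset + 1             # step 1: commit
            if offset == crash_at:
                break                          # crash before processing
            processed.append(records[offset])  # step 2: process
        else:
            processed.append(records[offset])  # step 1: process
            if offset == crash_at:
                break                          # crash before committing
            committed = offset + 1             # step 2: commit
    return processed, committed


def run_with_restart(records, commit_first, crash_at):
    """Crash once at `crash_at`, then resume from the committed offset."""
    first, committed = run(records, 0, commit_first, crash_at)
    second, _ = run(records, committed, commit_first)
    return first + second


records = ["a", "b", "c"]
at_most_once = run_with_restart(records, commit_first=True, crash_at=1)
at_least_once = run_with_restart(records, commit_first=False, crash_at=1)
# at_most_once == ["a", "c"]             (record "b" is lost)
# at_least_once == ["a", "b", "b", "c"]  (record "b" is processed twice)
```

At-least-once is usually paired with idempotent processing downstream so that the repeated record is harmless.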
Reliability and Performance
Senior interviews live here. The questions test whether you understand Kafka's durability model deeply enough to make safe configuration calls.
What is replication and the in-sync replica (ISR) set? Each partition is replicated to a configurable number of brokers (replication.factor). One replica is the leader; the rest are followers that fetch from it. The in-sync replica set is the leader plus the followers that are sufficiently caught up to it, meaning they have not fallen behind for longer than replica.lag.time.max.ms. A write is only considered committed once all ISR members have it, and by default only an ISR member can be elected leader if the current leader fails, which is how Kafka avoids losing acknowledged data.
How do `replication.factor` and `min.insync.replicas` work together? replication.factor is how many copies of each partition exist. min.insync.replicas is the minimum number of in-sync replicas that must acknowledge an acks=all write for it to succeed. A common durable setup is replication factor 3 with min.insync.replicas=2: you can lose one broker and still accept writes, but if two replicas are down the partition rejects writes rather than risk data loss.
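The interaction can be reduced to one comparison. The function below is a simplified model of the broker-side check, with hypothetical names; on a real cluster, a rejected acks=all write surfaces to the producer as a not-enough-replicas error.

```python
def accepts_write(acks, isr_size, min_insync_replicas):
    """Simplified model: would the partition leader accept this produce request?"""
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all writes.
        return isr_size >= min_insync_replicas
    # acks=0 and acks=1 only need a live leader, which is itself in the ISR.
    return isr_size >= 1


# Replication factor 3, min.insync.replicas=2, as brokers drop out one by one.
results = {isr: accepts_write("all", isr, 2) for isr in (3, 2, 1)}
# results == {3: True, 2: True, 1: False}

weaker_guarantee = accepts_write("1", 1, 2)
# weaker_guarantee is True: acks=1 bypasses the min.insync.replicas check
```

The last line is a common follow-up question: setting min.insync.replicas has no effect unless producers also use acks=all.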
What happens during leader election when a broker fails? If a partition leader's broker goes down, the cluster controller elects a new leader from the ISR, and producers and consumers refresh their metadata and redirect to it. Because only in-sync replicas are eligible, the new leader already has all committed records. If no ISR member is available, the partition stays offline until one returns, unless unclean.leader.election.enable=true, in which case an out-of-sync replica may take over. That setting restores availability at the cost of possible data loss.
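The election rule can be sketched as a small function. This is a toy model with hypothetical names; the real controller also tracks leader epochs and preferred replicas, which are left out here.

```python
def elect_leader(replicas, isr, live_brokers, allow_unclean=False):
    """Return (new_leader, outcome) for one partition after a broker failure.

    replicas:     broker ids in assignment order
    isr:          set of broker ids that were in sync before the failure
    live_brokers: set of broker ids still running
    """
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker, "clean"
    if allow_unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker, "unclean"  # may be missing committed records
    return None, "offline"


replicas = [1, 2, 3]
clean = elect_leader(replicas, isr={1, 2}, live_brokers={2, 3})
stuck = elect_leader(replicas, isr={1}, live_brokers={2, 3})
risky = elect_leader(replicas, isr={1}, live_brokers={2, 3}, allow_unclean=True)
# clean == (2, "clean"); stuck == (None, "offline"); risky == (2, "unclean")
```

The `stuck` and `risky` cases are the same failure with the two possible settings, which makes the availability-versus-durability choice explicit.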
How does Kafka achieve exactly-once semantics? Two pieces combine. The idempotent producer assigns sequence numbers so the broker can deduplicate retried writes within a partition. Transactions then let a producer write to multiple partitions and commit consumer offsets atomically, so a read-process-write loop either fully commits or fully aborts. Kafka Streams packages this as exactly-once processing (EOS) with a single config, which is why interviewers like to ask about it.
Why is Kafka fast despite writing everything to disk? Kafka appends sequentially to log segments, and sequential disk I/O is far faster than random I/O. It relies on the OS page cache instead of a complex in-memory cache, uses zero-copy (sendfile) to move data from disk to network without passing through application memory, and batches and optionally compresses records to cut per-message overhead. Together these let a single broker push very high throughput on commodity hardware.
What does `enable.idempotence` do on the producer? It makes the producer safe to retry without creating duplicates, by tagging each batch with a producer ID and sequence number so the broker discards exact resends. In modern Kafka it is on by default and is a prerequisite for transactions. Mention that it guards against duplicates from retries but does not by itself give you cross-partition exactly-once, which needs transactions.
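The deduplication idea can be modeled with a per-producer sequence counter on the partition leader. This is a simplified sketch: `PartitionLeader` is a hypothetical class, sequences are tracked per record rather than per batch, and the real broker's bookkeeping is more detailed.

```python
class PartitionLeader:
    """Toy partition leader that rejects resends using sequence numbers."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last sequence number appended

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # already written; safe to acknowledge again
        if seq != last + 1:
            return "out_of_order"   # a gap means an earlier write is missing
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


leader = PartitionLeader()
first = leader.append(producer_id=7, seq=0, value="a")
second = leader.append(producer_id=7, seq=1, value="b")
retry = leader.append(producer_id=7, seq=1, value="b")  # ack was lost, resend
gap = leader.append(producer_id=7, seq=3, value="d")
# leader.log == ["a", "b"]; retry == "duplicate"; gap == "out_of_order"
```

The state lives on one partition leader, which is why this mechanism alone cannot coordinate writes across partitions; that is the part transactions add.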
Architecture, Scaling, and the Ecosystem
These questions check that you can place Kafka in a real system and reason about operating it. KRaft versus ZooKeeper is the single most common "have you kept up?" question in 2026.
What is KRaft, and how does it differ from ZooKeeper? Historically Kafka used Apache ZooKeeper to store cluster metadata such as broker membership, topic configs, and partition leadership. KRaft (Kafka Raft) replaces ZooKeeper with a built-in Raft-based metadata quorum running inside Kafka itself. KRaft removes the external dependency, scales to far more partitions, and recovers from controller failures faster because metadata is itself a replicated log. As of Kafka 4.0, ZooKeeper mode has been removed and KRaft is the only supported mode, so know KRaft as the present-day answer.
How do you decide how many partitions a topic needs? Partition count is driven by target throughput and the maximum consumer parallelism you want, since one partition maps to at most one consumer per group. More partitions raise parallelism but also increase open file handles, end-to-end latency, replication overhead, and rebalance and failover time. The practical approach is to estimate peak throughput per partition and size for headroom without over-partitioning. Changing the count later is painful in both directions: Kafka cannot reduce a topic's partition count in place, so shrinking means recreating the topic, and adding partitions changes which partition a given key maps to.
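A back-of-the-envelope version of that estimate is sketched below. The approach (take the larger of the producer-side and consumer-side requirement, then add headroom) is a common rule of thumb, and every number here is an assumed input you would measure on your own hardware. `estimate_partitions` is a hypothetical helper.

```python
import math


def estimate_partitions(target_mb_s, producer_mb_s_per_partition,
                        consumer_mb_s_per_partition, headroom=1.5):
    """Estimate partition count from measured per-partition throughput."""
    needed_for_producers = target_mb_s / producer_mb_s_per_partition
    needed_for_consumers = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(needed_for_producers, needed_for_consumers) * headroom)


# Assumed inputs: 100 MB/s peak, 10 MB/s per partition on the produce side,
# 5 MB/s per partition on the consume side, 50% headroom.
partitions = estimate_partitions(100, 10, 5)
# partitions == 30
```

The slower side dominates: here consumers need 20 partitions against the producers' 10, so consumer processing speed sets the count.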
What is Kafka Connect and when would you use it? Kafka Connect is a framework for streaming data between Kafka and external systems using reusable connectors instead of custom code. Source connectors pull data into Kafka (for example, change-data-capture from a database) and sink connectors push data out (to a warehouse, search index, or object store). It runs as a scalable, fault-tolerant cluster of workers and handles offset tracking and restarts, which is why teams prefer it over hand-written integration jobs.
What is Kafka Streams, and how does it differ from a consumer application? Kafka Streams is a client library for building stateful stream-processing applications directly on Kafka, with no separate processing cluster. It gives you abstractions like streams and tables, windowed aggregations, joins, and fault-tolerant local state backed by changelog topics. A plain consumer just reads records; Kafka Streams adds the processing topology, state management, and exactly-once support on top, so you reach for it when your logic is more than "consume and forward."
How would you size and monitor a Kafka cluster in production? Start from throughput (MB/s in and out), retention (how much data you must keep), and replication factor, which together drive disk and broker count. Monitor under-replicated partitions, ISR shrink/expand events, request and produce/fetch latency, consumer lag, and disk and network saturation. The signals interviewers want to hear are under-replicated partitions (a durability risk) and rising consumer lag (a throughput problem), plus a plan to scale brokers or partitions before they become incidents.
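The disk side of that sizing exercise is plain arithmetic. The sketch below assumes steady ingress, time-based retention, and a flat headroom factor; it ignores compression, indexes, and uneven partition placement. All inputs and both helper names are hypothetical.

```python
import math


def required_disk_gb(ingress_mb_s, retention_hours, replication_factor,
                     headroom=1.25):
    """Cluster-wide disk needed to hold the retention window, in GB."""
    raw_mb = ingress_mb_s * 3600 * retention_hours
    return raw_mb * replication_factor * headroom / 1024


def brokers_for_storage(total_gb, usable_gb_per_broker):
    """Minimum broker count by storage alone (network and CPU may need more)."""
    return math.ceil(total_gb / usable_gb_per_broker)


# Assumed inputs: 50 MB/s in, 72 hours of retention, replication factor 3.
total_gb = required_disk_gb(50, 72, 3)
brokers = brokers_for_storage(total_gb, usable_gb_per_broker=8000)
# total_gb == 47460.9375; brokers == 6
```

Storage is only one constraint; the same walk-through for network throughput (ingress times replication factor, plus consumer egress) often produces the larger broker count.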
Know Who Is Interviewing You
Strong Kafka answers get you to the offer, but they are only half the preparation. The other half is knowing the person across the table: a staff data engineer will grill you on exactly-once and partition sizing, while a platform lead may care more about operability and on-call burden. Tailoring your examples to their world is what makes you memorable.
Articuler helps you do exactly that. It searches 980M+ professional profiles with semantic matching to find your actual interviewers, the hiring manager, and the data-platform lead behind the role, then builds a Playbook on each of them: their background, the systems they have built, and what they likely value. From that it drafts AI-personalized outreach that earns a reply rate roughly 8x the 5-8% typical of cold messages. If you are a jobseeker, walking into a Kafka interview already knowing your interviewer's priorities is a real edge. See how it works at Articuler and on the find the right people page.
For broader preparation, pair this guide with our data engineer interview questions, system design interview questions, and AWS interview questions, and review the fundamentals in our technical interview questions overview.
Conclusion
Kafka interviews reward a clean mental model more than memorized configs. If you can explain the storage layer (topics, partitions, offsets, brokers), reason about consumer groups and delivery semantics, defend a durability setup using ISR and min.insync.replicas, and speak to KRaft, Connect, and Streams as a coherent ecosystem, you will handle the junior-through-senior progression that most interviewers follow. Practice answering out loud, ground each answer in *why* Kafka behaves that way, and you will sound like someone who has actually run it.
FAQ
How hard are Kafka interview questions? Difficulty scales with the role. Junior questions ask you to define core concepts like topics, partitions, and consumer groups. Senior questions test trade-offs: replication factor versus min.insync.replicas, how exactly-once is actually implemented, and how you would size and monitor a cluster. Solid fundamentals plus the ability to explain *why* will carry most rounds.
What is the most common Kafka interview question? Some version of "explain how consumer groups and partitions interact" appears in nearly every Kafka interview, because it tests ordering, parallelism, and rebalancing in one answer. Close behind are the three delivery semantics (at-most-once, at-least-once, exactly-once) and, in 2026, KRaft versus ZooKeeper.
Do I need to know ZooKeeper if Kafka now uses KRaft? You should understand what ZooKeeper did (stored cluster metadata) and why KRaft replaced it (no external dependency, more partitions, faster failover). Since Kafka 4.0 removed ZooKeeper mode and runs only on KRaft, KRaft is the current answer, but interviewers still ask about the migration and the historical role of ZooKeeper.
How do I prepare for a Kafka system design interview? Practice end-to-end designs: choose a partition count and key strategy for ordering, pick a replication factor and min.insync.replicas for durability, decide on a delivery semantic, and explain how you would handle consumer lag and broker failure. Pair this with our system design interview questions guide.