
Data analyst interviews test three things: can you write SQL under pressure, do you understand statistics well enough to interpret results honestly, and can you translate numbers into decisions a non-technical person will act on?
This guide covers 10 questions that come up regularly across entry and mid-level data analyst interviews — SQL JOINs and window functions, Python/pandas, hypothesis testing, A/B testing, data cleaning, and business scenario questions. Each answer is concise and ready to adapt.
SQL: JOINs and Window Functions
SQL shows up in roughly 85% of data analyst interviews. Expect at least one JOIN question and one window function question.
"Explain the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN."
SQL is a domain-specific language for managing relational data. JOINs are how you combine rows across tables — and interviewers use this question to check whether you understand what happens to unmatched rows, not just whether you know the syntax.
Sample answer:
> "An INNER JOIN returns only rows where there's a match in both tables — unmatched rows from either side are dropped. A LEFT JOIN keeps all rows from the left table, and fills in NULL for columns from the right table where there's no match. A FULL OUTER JOIN keeps all rows from both tables, using NULL to fill in missing values on whichever side has no match. In practice, I use INNER JOIN when I only care about complete records, and LEFT JOIN when I need to preserve a baseline table — for example, keeping all customers even if they haven't made a purchase yet."
"Write a query to find the top 3 salespeople by revenue in each region."
This is a window function question. Window functions let you perform calculations across a set of rows without collapsing them into a single output row — unlike GROUP BY.
Sample answer:
SELECT region, salesperson, total_revenue
FROM (
SELECT
region,
salesperson,
SUM(revenue) AS total_revenue,
RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rnk
FROM sales
GROUP BY region, salesperson
) ranked
WHERE rnk <= 3;> "I use RANK() OVER (PARTITION BY region ...) to assign a rank within each region. Wrapping it in a subquery lets me filter on rnk <= 3. I'd use DENSE_RANK() instead if ties shouldn't create gaps in the ranking."
Python and pandas
"How would you handle missing values in a dataset using Python?"
pandas is the standard Python library for data manipulation. Missing-value handling is one of the first things interviewers probe — the right answer depends on why the data is missing, not just what function to call.
Sample answer:
> "First I check the extent and pattern of missing data with df.isnull().sum(). Then I decide based on context: if a column is missing more than 30–40% of values and doesn't have a clear imputation strategy, I consider dropping it. For numeric columns with smaller gaps, I'll impute with the median (not mean) if the distribution is skewed. For categorical columns, I might impute with the mode or add a separate 'Unknown' category if the missingness itself carries signal. I always ask *why* the data is missing before touching it — sometimes a NULL in a 'churn_date' column means the customer is still active, not that the data is wrong."
"What does groupby() do in pandas, and when would you use it?"
Sample answer:
> "groupby() splits a DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. I use it constantly — for example, df.groupby('region')['revenue'].sum() gives me total revenue by region in one line. It's useful any time you want to compute an aggregate (sum, mean, count, max) across a categorical variable. The common mistake is forgetting that groupby() produces a grouped object, not a DataFrame — you need to chain an aggregation method like .sum() or .agg() to get something usable."
Statistics: Hypothesis Testing and Distributions
"What is a p-value, and how do you interpret it?"
This question separates candidates who memorized a definition from those who've actually run tests.
Sample answer:
> "A p-value is the probability of observing a result at least as extreme as what we got, assuming the null hypothesis is true — that is, assuming there's no real effect. A p-value of 0.03 means there's a 3% chance we'd see this result by chance alone if nothing were actually going on. It does *not* mean the probability that the null hypothesis is true. The conventional threshold is 0.05, but I always pair p-value interpretation with effect size — a statistically significant result can still be practically meaningless if the effect is tiny."
"When would you use the mean vs. the median to describe a dataset?"
Sample answer:
> "Use the mean when the distribution is roughly symmetric and free of extreme outliers — salaries in a narrow professional band, test scores in a controlled setting. Use the median when the distribution is skewed or has outliers that would distort the average — household incomes, app session lengths, or property prices. A classic example: in a room of 10 people earning $50K/year, adding one person earning $10M makes the mean jump to about $960K, while the median barely moves. For most real-world business datasets, I report both and note the difference if they diverge meaningfully."
A/B Testing
"Walk me through how you'd design and analyze an A/B test."
A/B testing is a randomized experiment comparing two variants to determine which performs better. This is one of the highest-signal questions in a data analyst interview — it touches statistics, experimental design, and business judgment simultaneously.
Sample answer:
> "Start with the hypothesis: what are we testing and why? For example, 'Changing the CTA button from blue to green will increase click-through rate.' Next, define the primary metric (CTR), any guardrail metrics (don't hurt revenue), and the minimum detectable effect — how small a lift is worth shipping? Then run a power calculation to determine sample size and how long to run the test. During the test, avoid peeking at results early — that inflates false-positive rates. After the test, check randomization worked (run an A/A test or check baseline balance), then analyze with a two-proportion z-test or t-test depending on the metric. Report the p-value, confidence interval, and effect size — not just whether it was 'significant.'"
Data Cleaning
"What's your process for cleaning a new dataset you've never seen before?"
Data analysis in practice starts with data that's never quite ready to use. Interviewers ask this to see if you approach cleaning methodically or just start fixing things at random.
Sample answer:
> "I follow a consistent sequence. First, I understand the schema — what does each column represent, what's the expected data type, what's the grain of the table? Second, I profile the data: row count, null rates, distinct values, min/max per column. Third, I look for anomalies: duplicates, impossible values (negative ages, dates in the future), and inconsistent string formats (same category spelled three different ways). Fourth, I document every transformation I make — especially dropping rows or imputing values — so someone else can reproduce my work. I treat cleaning as part of the analysis, not a preliminary chore, because the decisions you make here affect every conclusion downstream."
Business Scenario Questions
"Our key metric dropped 20% last week. How do you diagnose it?"
This is a structured thinking question. There's no single right answer — interviewers want to see that you don't jump to conclusions.
Sample answer:
> "First, verify the drop is real: check for tracking errors, dashboard bugs, or data pipeline failures. A surprising number of 'metric drops' turn out to be instrumentation issues. Second, check if the drop is across the board or concentrated — broken down by platform, region, user segment, or acquisition channel. Third, look at correlated metrics: if revenue dropped, did traffic also drop, or did conversion rate drop while traffic held steady? Fourth, check for external causes: did we push a product change, run a promotion that ended, or are there seasonal patterns? I'd build a hypothesis list in order of likelihood and work through it systematically before calling it a real problem."
"How do you explain a technical analysis to a non-technical stakeholder?"
Communication is where many technically strong candidates lose points — being right about the data but unclear about the implication is a real failure mode.
Sample answer:
> "I lead with the business question, not the method. Instead of 'we ran a logistic regression,' I say 'we wanted to understand which customers are most likely to churn in the next 30 days.' Then I give the answer in one sentence: 'Customers who haven't logged in for 14 days are three times more likely to cancel.' After that, I show one clear chart — not six. If they want to see the details, I have the technical appendix ready, but I don't lead with it. The goal is to make the decision easy, not to demonstrate that I did complicated work."
FAQ
What SQL topics come up most in data analyst interviews?
JOINs (INNER, LEFT, FULL OUTER) and window functions (RANK, DENSE_RANK, ROW_NUMBER, LAG/LEAD) appear in the majority of SQL interview questions. CTEs and subqueries are also common for structuring complex queries.
Do data analyst interviews test Python?
Around 40–50% of data analyst interviews include Python questions, typically focused on pandas — filtering, grouping, merging DataFrames, and handling missing values.
How important is statistics knowledge for data analyst roles?
Interviewers focus on applied judgment: knowing when to use mean vs. median, how to interpret a p-value, and how to design and analyze an A/B test. You don't need advanced ML knowledge for most entry-to-mid-level roles.
How should I prepare for business scenario questions?
Practice the structured diagnostic approach: verify the data first, then segment, then correlate, then hypothesize external causes. A clear thinking process matters more than arriving at the exact right answer.
Land the Interview Before You Ace It
Answering these questions well gets you through the technical screen. But data analyst roles — especially at fast-growing companies — fill largely through referrals and warm intros, not blind applications. If you want to reach the hiring data manager directly, Articuler uses semantic search across 980M+ profiles to surface the exact person behind a role. Once you've found them, build a Playbook on that interviewer before the call — their background, what they care about, and the questions they're likely to ask. It's the same prep work, done in a fraction of the time.
And if you're working on your broader interview narrative — how you talk about yourself, your work, and your career arc — tell-me-about-yourself answers are worth locking in before any technical screen.