Mid level · Data

Data Engineer
Interview Questions

Data Engineer interview questions covering pipelines, ETL, Spark, SQL, and data architecture. Free, no signup required.

10 questions ready

Q1
You're designing a data pipeline to ingest 500GB of daily data from multiple sources into a data warehouse. Walk us through your approach to partitioning, compression, and ensuring data quality at scale.
Why they ask this: They want to assess your understanding of scalability, performance optimization, and practical ETL/ELT design patterns that directly apply to production systems.
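One way to make an answer to Q1 concrete is a small sketch of date partitioning, compression, and a row-level quality gate working together. The schema (`event_id`, `event_date`, `amount`), the `dt=<date>` directory layout, and the function names below are assumptions chosen for illustration. Gzip-compressed JSON lines stand in for the columnar formats (for example Parquet) that a real 500GB/day pipeline would more likely use.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical schema for the illustration; a real pipeline would load this from a contract.
REQUIRED_FIELDS = ("event_id", "event_date", "amount")


def is_valid(record):
    """Row-level quality gate: required fields are non-null and amount is non-negative."""
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return False
    return record["amount"] >= 0


def write_partitioned(records, root):
    """Write valid records to root/dt=<event_date>/part-0000.json.gz.

    Returns (number of rows written, list of rejected records).
    """
    partitions = defaultdict(list)
    rejected = []
    for record in records:
        if is_valid(record):
            partitions[record["event_date"]].append(record)
        else:
            rejected.append(record)  # keep rejects so they can be quarantined and inspected

    for event_date, rows in partitions.items():
        partition_dir = Path(root) / f"dt={event_date}"
        partition_dir.mkdir(parents=True, exist_ok=True)
        # gzip in text mode writes one compressed JSON document per line
        with gzip.open(partition_dir / "part-0000.json.gz", "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")

    written = sum(len(rows) for rows in partitions.values())
    return written, rejected
```

Partitioning by event date lets downstream queries prune to the days they need, and returning the rejected rows instead of dropping them silently gives you something to alert on.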
Q2
Explain the difference between slowly changing dimensions (SCD) Type 1, Type 2, and Type 3 in a star schema. When would you use each, and how would you implement Type 2 in SQL or Spark?
Why they ask this: This tests your knowledge of dimensional modeling, a core competency for data engineers building analytics-ready systems, and your ability to handle real-world data scenarios with historical tracking requirements.
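For reference when preparing Q2: Type 1 overwrites the attribute in place, Type 2 adds a new row per change and keeps the old one, and Type 3 keeps the previous value in an extra column. Below is a minimal sketch of the Type 2 pattern in SQL, run through Python's built-in `sqlite3` so it is self-contained. The table `dim_customer`, its columns, the `9999-12-31` open-ended sentinel, and the helper names are assumptions for illustration. A warehouse implementation would more likely use a set-based `MERGE` over a staging table than row-by-row calls.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel meaning "this version is still current"


def create_dim(conn):
    """Create a hypothetical customer dimension with SCD Type 2 tracking columns."""
    conn.execute(
        """
        CREATE TABLE dim_customer (
            surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
            customer_id   TEXT NOT NULL,
            city          TEXT NOT NULL,
            valid_from    TEXT NOT NULL,
            valid_to      TEXT NOT NULL,
            is_current    INTEGER NOT NULL
        )
        """
    )


def apply_scd2(conn, customer_id, city, as_of):
    """Apply one incoming record: insert if new, add a version if changed, no-op if unchanged."""
    current = conn.execute(
        "SELECT surrogate_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == city:
        return  # tracked attribute unchanged, keep the current version

    with conn:  # expire the old version and insert the new one in a single transaction
        if current is not None:
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE surrogate_key = ?",
                (as_of, current[0]),
            )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, ?, 1)",
            (customer_id, city, as_of, OPEN_END),
        )
```

Applying `("c1", "Berlin")` and later `("c1", "Munich")` leaves two rows for the customer: the Berlin row closed at the date of the move, and the Munich row marked current.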
Q3
You notice a Spark job that processes 2TB of data is running 40% slower than last week despite no code changes. What are the first five things you'd investigate to diagnose the issue?
Why they ask this: They're evaluating your troubleshooting methodology, understanding of distributed computing bottlenecks, and ability to optimize performance in production environments.
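Data skew is one common suspect in a Q3-style slowdown, and it can be checked with simple arithmetic on per-task durations. The sketch below assumes you have already copied task durations per stage out of the Spark UI or event log into a plain dictionary. The function names and the threshold of 5.0 are hypothetical choices for illustration, not Spark settings.

```python
from statistics import median


def skew_ratio(task_seconds):
    """Ratio of the slowest task to the median task; values near 1 suggest even partitions."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    typical = median(task_seconds)
    if typical <= 0:
        raise ValueError("task durations must be positive")
    return max(task_seconds) / typical


def flag_skewed_stages(stages, threshold=5.0):
    """Return the names of stages whose slowest task is at least `threshold` times the median."""
    return [name for name, durations in stages.items() if skew_ratio(durations) >= threshold]
```

A stage where one task takes minutes while its siblings take seconds points toward a hot key or uneven partitioning. An evenly slow stage points elsewhere, for example toward data volume growth, cluster contention, or a changed input file layout.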
Q4
Compare batch processing versus stream processing architectures. When would you choose Kafka + Flink over Airflow + Spark, and what are the trade-offs in latency, cost, and complexity?
Q5
Tell me about a time when you discovered a critical data quality issue in a production pipeline. What was the situation, how did you identify it, what steps did you take to resolve it, and what did you implement to prevent it from happening again?
Q6
Describe a situation where you had to work with a data scientist or analytics team to understand their requirements for a new data source. How did you approach the conversation, what challenges arose, and how did you ensure the solution met their needs?
Q7
Share an example of when you had to refactor or optimize an existing data pipeline that was poorly designed or inefficient. What was your approach, how did you minimize disruption to stakeholders, and what was the measurable impact?
Q8
How would you handle a situation where a stakeholder requests a complex new data pipeline with a tight deadline, but you identify that the existing infrastructure cannot support it without significant refactoring? Walk me through how you'd communicate this and propose a solution.
Q9
What would you do if you inherited a critical data pipeline with no documentation, minimal tests, and frequent failures? How would you prioritize stabilizing it while adding new features?
Q10
Imagine you're proposing migrating a legacy on-premise data warehouse to a cloud platform. The team is resistant due to concerns about cost, data security, and migration risk. How would you structure your proposal and address their concerns?