Q1
You're designing a data pipeline to ingest 500GB of daily data from multiple sources into a data warehouse. Walk us through your approach to partitioning, compression, and ensuring data quality at scale.
Why they ask this: They want to assess your understanding of scalability, performance optimization, and practical ETL/ELT design patterns that directly apply to production systems.
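One way to make an answer concrete is a small sketch of the three concerns working together. The example below is only an illustration, not a production design: it uses the Python standard library, a hypothetical record schema (`event_id`, `event_date`, `source`), and hypothetical helper names (`is_valid`, `write_partitioned`). It applies a basic quality gate, groups records into Hive-style `event_date=...` partitions, and writes gzip-compressed JSON lines. At 500GB per day you would more likely use a distributed engine and a columnar format such as Parquet, but the shape of the logic is similar.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: fields every record must carry to be loadable.
REQUIRED_FIELDS = ("event_id", "event_date", "source")


def is_valid(record: dict) -> bool:
    """Basic quality gate: required fields must be present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def write_partitioned(records: Iterable[dict], root: Path) -> Tuple[int, int]:
    """Write valid records into date partitions as gzip-compressed JSON lines.

    Returns a (written, rejected) pair so the caller can alert on reject rates.
    Note: this sketch buffers the whole batch in memory, which a real pipeline
    at this volume would avoid.
    """
    by_partition: Dict[str, List[dict]] = defaultdict(list)
    rejected = 0
    for record in records:
        if is_valid(record):
            by_partition[record["event_date"]].append(record)
        else:
            rejected += 1

    written = 0
    for event_date, rows in by_partition.items():
        # Hive-style partition directory, e.g. event_date=2024-01-01
        part_dir = root / f"event_date={event_date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with gzip.open(part_dir / "part-0000.jsonl.gz", "wt", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
        written += len(rows)
    return written, rejected
```

Returning the reject count alongside the written count is a deliberate choice here: it gives the pipeline a cheap signal to compare against a threshold before the load is marked successful.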
Q2
Explain the difference between slowly changing dimensions (SCD) Type 1, Type 2, and Type 3 in a star schema. When would you use each, and how would you implement Type 2 in SQL or Spark?
Why they ask this: This tests your knowledge of dimensional modeling, a core competency for data engineers building analytics-ready systems, and your ability to handle real-world data scenarios with historical tracking requirements.
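The Type 2 mechanics are easier to explain with a small model in hand. The sketch below is a minimal in-memory illustration in plain Python rather than SQL or Spark, and every name in it (`CustomerRow`, `apply_scd2`, the `city` attribute, the `valid_from`/`valid_to`/`is_current` columns) is an assumption chosen for the example. It shows the core rule: when a tracked attribute changes, expire the current row and append a new current row, so history is preserved. Type 1 would instead overwrite `city` in place, and Type 3 would keep the prior value in an extra column such as `previous_city`.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    customer_id: int            # natural (business) key
    city: str                   # attribute whose history is tracked
    valid_from: date            # first day this version applies
    valid_to: Optional[date]    # None means the version is still open
    is_current: bool


def apply_scd2(dim: List[CustomerRow], customer_id: int, city: str,
               as_of: date) -> List[CustomerRow]:
    """Apply one incoming record to the dimension using SCD Type 2 rules."""
    current = next(
        (row for row in dim if row.customer_id == customer_id and row.is_current),
        None,
    )
    new_row = CustomerRow(customer_id, city, as_of, None, True)

    if current is None:
        # New key: insert the first version.
        return dim + [new_row]
    if current.city == city:
        # No change in the tracked attribute: leave the dimension as is.
        return list(dim)

    # Change detected: close the old version and append the new one.
    expired = replace(current, valid_to=as_of, is_current=False)
    return [expired if row is current else row for row in dim] + [new_row]


dim = apply_scd2([], 1, "Oslo", date(2024, 1, 1))
dim = apply_scd2(dim, 1, "Bergen", date(2024, 3, 1))
# dim now holds two versions for customer 1: an expired Oslo row
# and a current Bergen row.
```

In SQL or Spark the same two steps are typically expressed as an update of the matching current row plus an insert of the new version, often combined in a single merge statement where the engine supports one.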
Q3
You notice a Spark job that processes 2TB of data is running 40% slower than last week despite no code changes. What are the first five things you'd investigate to diagnose the issue?
Why they ask this: They're evaluating your troubleshooting methodology, understanding of distributed computing bottlenecks, and ability to optimize performance in production environments.
Q4
Compare batch processing versus stream processing architectures. When would you choose Kafka + Flink over Airflow + Spark, and what are the trade-offs in latency, cost, and complexity?