Data Lakes & Data Warehouses
Structured vs unstructured data storage: data warehouses (Redshift, BigQuery) vs data lakes (S3 + Spark). ETL, ELT, and the lakehouse pattern.
OLTP vs OLAP: Two Different Worlds
Production databases (PostgreSQL, MySQL, DynamoDB) are OLTP — Online Transaction Processing. They're optimized for low-latency, high-concurrency reads and writes of individual rows. Analytics systems are OLAP — Online Analytical Processing. They run complex queries scanning billions of rows to answer business questions like 'What was our revenue by region last quarter?' These two workloads have fundamentally incompatible access patterns and require separate systems.
| Dimension | OLTP | OLAP |
|---|---|---|
| Optimized for | Low-latency row reads/writes | High-throughput column scans |
| Query type | Simple lookups, inserts | Complex aggregations, JOINs |
| Data volume | GB to TB | TB to PB |
| Storage layout | Row-oriented | Column-oriented |
| Examples | PostgreSQL, MySQL, DynamoDB | Redshift, BigQuery, Snowflake |
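The difference in query shape is easy to see side by side. Here is a minimal sketch using an in-memory SQLite database as a stand-in for both systems (in practice the OLTP query would hit PostgreSQL/MySQL and the OLAP query would hit Redshift/BigQuery); the `orders` table and its columns are illustrative, not from any particular schema:

```python
import sqlite3

# SQLite stands in for both systems here; a real deployment would run the
# first query against the OLTP database and the second against a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region   TEXT,
        revenue  REAL
    )
""")
conn.executemany(
    "INSERT INTO orders (region, revenue) VALUES (?, ?)",
    [("us-east", 120.0), ("us-east", 80.0), ("eu-west", 50.0)],
)

# OLTP-style query: fetch one row by primary key (low latency, tiny result).
row = conn.execute(
    "SELECT region, revenue FROM orders WHERE order_id = ?", (1,)
).fetchone()

# OLAP-style query: scan the whole table and aggregate (high throughput).
by_region = conn.execute(
    "SELECT region, SUM(revenue) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(row)        # the single order
print(by_region)  # revenue per region
```

The OLTP query touches one row via an index; the OLAP query must read every row. On billions of rows, that full scan is exactly the workload you do not want sharing hardware with user traffic.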
Data Warehouses
A data warehouse stores structured, schema-on-write data from multiple OLTP sources. Data is cleaned, transformed, and loaded via ETL (Extract, Transform, Load) pipelines. Warehouses use columnar storage — a query scanning only the `revenue` column reads only that column's data blocks, skipping all other columns. This commonly yields 10-100x speedups on analytical queries versus row-oriented databases.
- Amazon Redshift: Columnar MPP (Massively Parallel Processing) database. Nodes distribute query execution. Sort keys control on-disk ordering for efficient range scans; distribution keys control which node holds each row, co-locating data that is frequently joined.
- Google BigQuery: Serverless; you pay per query by bytes scanned. Dremel query engine. Excellent for ad-hoc analytics on petabyte datasets.
- Snowflake: Cloud-agnostic, separates compute from storage. Scale compute up/down independently. Popular for cross-cloud and multi-region analytics.
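The columnar-storage advantage can be sketched in plain Python. This toy model only shows the access pattern — a real columnar engine also compresses each column and skips blocks using min/max metadata — and the data is invented for illustration:

```python
# Row-oriented layout: each record is stored together, so summing revenue
# still touches order_id and region for every row.
rows = [
    (1, "us-east", 120.0),
    (2, "us-east", 80.0),
    (3, "eu-west", 50.0),
]
total_row_layout = sum(r[2] for r in rows)

# Column-oriented layout: each column is stored contiguously, so the query
# reads only the revenue column and never touches the others.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["us-east", "us-east", "eu-west"],
    "revenue":  [120.0, 80.0, 50.0],
}
total_col_layout = sum(columns["revenue"])

assert total_row_layout == total_col_layout == 250.0
```

Both layouts give the same answer; the columnar one simply reads a fraction of the bytes, which is where the 10-100x speedup comes from at scale.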
Data Lakes
A data lake stores raw, unprocessed data in its native format (JSON logs, Parquet files, CSV exports, images, videos) in cheap object storage (S3, GCS, Azure Data Lake). Unlike warehouses, data lakes use schema-on-read — structure is applied when the data is queried, not when it's stored. This enables storing data you don't know how to use yet.
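Schema-on-read means the structure lives in the query, not the storage. A minimal sketch, assuming hypothetical JSON event lines sitting raw in object storage (the field names are invented for illustration):

```python
import json

# Raw events land in the lake as-is; no schema was enforced at write time.
raw_events = [
    '{"event": "purchase", "amount": "19.99", "region": "us-east"}',
    '{"event": "page_view", "region": "eu-west"}',           # no amount field
    '{"event": "purchase", "amount": "5.00", "region": "eu-west"}',
]

def read_purchases(lines):
    """Apply structure at read time: parse, filter, and coerce types."""
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "purchase":
            # The amount's type (a string here) is only enforced when read.
            yield record["region"], float(record["amount"])

purchases = list(read_purchases(raw_events))
print(purchases)  # [('us-east', 19.99), ('eu-west', 5.0)]
```

A different consumer could read the same raw files with a different schema (say, counting page views), which is the point: the data was stored before anyone decided how to use it.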
ETL vs ELT
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when warehouses were expensive and storage was limited — you only loaded clean, useful data. ELT (Extract, Load, Transform) loads raw data first, then transforms it using the warehouse's own compute. Modern cloud warehouses are powerful and cheap enough that ELT is now preferred — you retain raw data for future reprocessing.
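The ELT flow can be sketched end to end, again with SQLite standing in for a cloud warehouse; table and column names are illustrative:

```python
import sqlite3

# Step 1 (Extract + Load): raw rows land in the warehouse untransformed —
# messy casing, stray whitespace, revenue stored as text, a NULL region.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, revenue TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " US-East ", "120.0"), (2, "eu-west", "50.0"), (3, None, "30.0")],
)

# Step 2 (Transform): the warehouse's own SQL engine cleans the raw data
# into an analytics-ready table. raw_orders is kept for future reprocessing.
wh.execute("""
    CREATE TABLE clean_orders AS
    SELECT order_id,
           LOWER(TRIM(region))   AS region,
           CAST(revenue AS REAL) AS revenue
    FROM raw_orders
    WHERE region IS NOT NULL
""")
clean = wh.execute(
    "SELECT region, revenue FROM clean_orders ORDER BY order_id"
).fetchall()
print(clean)  # [('us-east', 120.0), ('eu-west', 50.0)]
```

Note that the transform is just SQL running inside the warehouse — which is exactly the niche dbt fills at scale.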
dbt: The ELT Standard
dbt (Data Build Tool) is the industry-standard ELT transformation layer. Data engineers write SQL transformation models that dbt compiles and runs inside the data warehouse. It handles dependency ordering, testing, documentation, and incremental builds. If someone mentions 'the modern data stack', dbt is almost always part of it.
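A dbt model is a SQL `SELECT` plus Jinja; dbt resolves `ref()` to the upstream table's fully qualified name and builds models in dependency order. A minimal sketch — the model name, source table, and `loaded_at` column are hypothetical:

```sql
-- models/clean_orders.sql (hypothetical model and source names)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    lower(trim(region)) as region,
    cast(revenue as numeric) as revenue
from {{ ref('raw_orders') }}
{% if is_incremental() %}
-- On incremental runs, only process rows newer than the last build.
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```

The `incremental` materialization is what makes large ELT pipelines affordable: each run transforms only new data instead of rebuilding the whole table.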
The Lakehouse Pattern
Data lakes and warehouses each have weaknesses: lakes lack ACID transactions and query optimization; warehouses are expensive and rigid. The lakehouse pattern combines both — open table formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID transactions, schema evolution, and time travel to data lake storage. Compute engines like Databricks or Spark query these tables with warehouse-like performance.
| Pattern | Strength | Weakness |
|---|---|---|
| Data Warehouse | Fast SQL queries, governed schema, BI-friendly | Expensive at scale, rigid schema, poor for ML raw data |
| Data Lake | Cheap storage, raw data, ML-friendly | No ACID, slow queries without optimization, schema chaos |
| Lakehouse | ACID on cheap storage, unified for BI + ML | Newer ecosystem, more operational complexity |
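The core trick behind open table formats is that data files are immutable and an append-only log of commits defines which files belong to each table version — that log is what provides atomic commits and time travel. A toy sketch of the idea (the file layout here is invented for illustration and does not match Delta Lake's or Iceberg's actual spec):

```python
import json
import os
import tempfile

# Data files (e.g. Parquet) are never modified in place; each commit appends
# one line to the log naming the files it adds to the table.
table_dir = tempfile.mkdtemp()
log_path = os.path.join(table_dir, "_commit_log.jsonl")

def commit(files_added):
    """Atomically record a new table version listing the files it adds."""
    with open(log_path, "a") as log:
        log.write(json.dumps({"add": files_added}) + "\n")

def snapshot(version=None):
    """Time travel: replay the log up to `version` to list active files."""
    files = []
    with open(log_path) as log:
        for v, line in enumerate(log):
            if version is not None and v > version:
                break
            files.extend(json.loads(line)["add"])
    return files

commit(["part-000.parquet"])                      # version 0
commit(["part-001.parquet", "part-002.parquet"])  # version 1

print(snapshot(version=0))  # ['part-000.parquet']
print(snapshot())           # all three files at the latest version
```

Real formats add much more (schema evolution, deletes via tombstones, optimistic concurrency on the log), but readers agreeing on a table version by reading a shared log is the mechanism that brings ACID semantics to plain object storage.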
Interview Tip
Analytics architecture questions come up in senior system design interviews, especially for data platform or analytics roles. The key point to communicate: 'I would not run analytics on the production OLTP database — that would contend with user traffic. I'd set up a data warehouse fed by nightly ETL or real-time CDC (Change Data Capture) from the primary database, so analysts query the warehouse without impacting production.' This is the answer interviewers want to hear.