25  Introduction to Data Warehousing and Data Lakes

25.1 Why Warehouses and Lakes Matter

Operational systems run the business. Warehouses and lakes let the business understand itself.

Operational databases — transactional systems that capture orders, payments, customer interactions, sensor readings — are optimised for capturing one record at a time, very fast. They are not optimised for answering questions such as "how have customer cohorts evolved over the last five years across product lines and regions?" A query that returns in milliseconds for a single operational record can take hours when it spans millions of records, and it competes for resources with the live business.

A separate analytical layer is therefore the standard pattern in any organisation of meaningful scale. Almost every dashboard the firm relies on — Tableau workbooks, Power BI reports, Looker dashboards, Qlik apps — is built on top of this analytical layer rather than on the operational systems directly. A board’s quarterly Balanced-Scorecard view, a regional manager’s weekly performance dashboard, a public-health choropleth — all of them depend on a warehouse, a lake, or a lakehouse to bring the right data into the right shape at the right speed for the chart to refresh in seconds rather than minutes.

The two dominant forms of this analytical layer — the data warehouse and the data lake — and their convergence into the lakehouse are the subject of this chapter. The foundational text on the warehouse remains William H. Inmon (2005), and the framing paper for the lakehouse architecture is Michael Armbrust et al. (2021).

25.2 The Data Warehouse

A Data Warehouse is a centralised repository of integrated, historical, subject-oriented data, optimised for analytical query rather than for transactional capture. The standard definition due to Inmon characterises a warehouse as subject-oriented, integrated, non-volatile, and time-variant.

  • Subject-oriented: Organised around business subjects (customer, product, sales) rather than around source systems.
  • Integrated: Inconsistencies in naming, units, and formats are resolved during loading.
  • Non-volatile: Data is added rather than modified; historical state is preserved.
  • Time-variant: Every fact carries a temporal anchor so trends can be analysed.

A data warehouse is the centre of gravity of traditional business intelligence: BI tools, dashboards, and reports query the warehouse, not the operational systems.

25.2.1 Two Foundational Schools

Tip: Inmon versus Kimball

| Dimension | Inmon (Top-Down) | Kimball (Bottom-Up) |
|---|---|---|
| Starting point | Enterprise-wide data model in third normal form | Business-process-aligned dimensional model |
| Build pattern | Build the central warehouse first, then derive data marts | Build subject-area data marts first, integrate later |
| Modelling | Normalised relational schema | Star schema with facts and dimensions |
| Time to first value | Slower; large up-front design | Faster; iterative |
| Maintenance | Easier as scope grows | Risk of integration debt across marts |
| Best for | Long-lived enterprise programmes with strong governance | Iterative business-led analytics with quick wins |

The two schools are usually combined in practice. Most modern programmes adopt the warehouse-as-source-of-truth idea from Inmon and the dimensional-modelling discipline from Kimball.

25.3 Dimensional Modelling

Figure: the star schema in outline, one central fact table with four dimension tables (Mermaid source for the diagram):

flowchart TD
    F["Fact Table<br>Sales (date, store,<br>product, customer,<br>quantity, amount)"]
    D1["Date<br>Dimension"] --> F
    D2["Store<br>Dimension"] --> F
    D3["Product<br>Dimension"] --> F
    D4["Customer<br>Dimension"] --> F
    style F fill:#e3f2fd,stroke:#1976D2
    style D1 fill:#e8f5e9,stroke:#388E3C
    style D2 fill:#fff8e1,stroke:#F9A825
    style D3 fill:#fff3e0,stroke:#EF6C00
    style D4 fill:#fce4ec,stroke:#AD1457

Kimball’s dimensional model organises analytical data into two kinds of table:

  • Fact tables: Numerical measurements of business events (sales, orders, support tickets, transactions). One row per event. Contain measures (quantity, amount, duration) and foreign keys to the dimensions.

  • Dimension tables: Descriptive context for the facts (date, customer, product, store, employee). Wide tables with one row per entity, containing attributes used to filter, group, and label.

Two arrangements are common:

  • Star Schema: Fact in the centre, denormalised dimensions as points of the star. Optimised for query speed.
  • Snowflake Schema: Dimensions normalised into multiple tables. More space-efficient, slower to query.

The star schema is the default for analytical workloads on modern columnar databases.
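
The shape is easiest to see in SQL. The sketch below is illustrative only; the table and column names are invented for this example, and the DDL is generic rather than tied to any one platform:

-- A minimal star: one fact table and two of its dimensions
CREATE TABLE dim_date (
    date_key   INT PRIMARY KEY,   -- surrogate key, e.g. 20240115
    full_date  DATE,
    year       INT,
    month_no   INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE fct_sales (
    date_key    INT REFERENCES dim_date (date_key),
    product_key INT REFERENCES dim_product (product_key),
    quantity    INT,
    amount      DECIMAL(12,2)     -- one row per order line
);

-- The characteristic analytical query: aggregate the fact,
-- slice and label with the dimensions
SELECT d.year, p.category, SUM(f.amount) AS total_sales
FROM fct_sales f
JOIN dim_date d    ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;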

25.3.1 Slowly Changing Dimensions

Dimensions change over time — a customer moves city, a product is reclassified, an employee changes department. Slowly Changing Dimensions (SCDs) are the patterns for handling those changes:

  • Type 0 — Retain Original: The original value is preserved; the new value is ignored.
  • Type 1 — Overwrite: The new value replaces the old; history is lost.
  • Type 2 — Add Row: A new row is added with effective-from and effective-to timestamps; full history is preserved.
  • Type 3 — Add Column: Add columns for current and previous value; only one prior version is preserved.
  • Type 4 — History Table: Current dimension table plus a separate history table.
  • Type 6 — Hybrid (1+2+3): Combines overwrite, history rows, and current-value columns.

The most common in practice are Type 1 (overwrite) and Type 2 (full history). The choice depends on whether the analyst needs to reproduce the dimension’s state at any historical point.
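
As a concrete illustration of Type 2, the following PostgreSQL-flavoured sketch applies a batch of customer changes. The staging.customer table and the column names are hypothetical, and real implementations usually add a surrogate key and detect changes across more attributes than city:

-- Step 1: close the current row for customers whose city changed
UPDATE dim_customer d
SET    effective_to = CURRENT_DATE
FROM   staging.customer s
WHERE  d.customer_id = s.customer_id
  AND  d.effective_to IS NULL          -- current row only
  AND  d.city <> s.city;

-- Step 2: insert a new current row carrying the new value
INSERT INTO dim_customer (customer_id, city, effective_from, effective_to)
SELECT s.customer_id, s.city, CURRENT_DATE, NULL
FROM   staging.customer s
JOIN   dim_customer d
  ON   d.customer_id = s.customer_id
 AND   d.effective_to = CURRENT_DATE   -- the row closed in step 1
 AND   d.city <> s.city;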

25.4 The Data Lake

A Data Lake is a centralised repository for raw data, stored in its native format, on cheap object storage. Where a warehouse imposes structure at write time (schema-on-write), a lake captures everything as-is and imposes structure when the data is read (schema-on-read).

The lake answers a different problem: how to land all of the firm’s data, including unstructured and semi-structured forms, at a cost that allows it to be retained for years even before its analytical use is known.

The principal characteristics of a data lake:

  • Storage on object stores: S3, Azure Data Lake Storage, Google Cloud Storage. Cheap, durable, and scalable to practically any volume.
  • Schema-on-read: Data is parsed when queried, not when written (see the sketch after this list).
  • All formats: Structured (CSV, Parquet, ORC), semi-structured (JSON, XML, Avro), unstructured (text, image, audio, video).
  • Open file formats: Parquet and ORC for columnar analytical workloads, Avro for streaming, raw formats for unstructured data.
  • Variable governance: Lakes range from carefully governed zoned lakes to ungoverned data swamps.
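
Schema-on-read in concrete terms: a query engine such as Hive, Athena, or Trino projects a schema over files already sitting in object storage only at query time. A Hive-style sketch; the bucket path and columns are hypothetical:

-- No schema was enforced when these Parquet files landed;
-- this statement imposes one at read time
CREATE EXTERNAL TABLE raw_events (
    event_id STRING,
    event_ts TIMESTAMP,
    payload  STRING          -- semi-structured JSON kept as text for now
)
STORED AS PARQUET
LOCATION 's3://acme-lake/raw/events/';

-- The same files can be re-projected tomorrow with a different schema;
-- nothing about them changes.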

25.4.1 From Lake to Swamp

The defining failure mode of a data lake is the data swamp: a vast, ungoverned, unindexed pile of raw files that no one trusts and no one can find anything in. Avoiding the swamp requires:

  • Zoning: A documented progression from raw (immutable landing zone) through standardised (cleaned, deduplicated) to curated (analytical-grade) layers. Often called bronze, silver, gold in the medallion architecture.
  • Catalogue and metadata: A searchable inventory of every dataset, with schema, lineage, and ownership.
  • Quality at curation: Quality controls move data from the standardised to the curated zone; nothing reaches curated without passing them.
  • Lifecycle policies: Retention, archival, and deletion policies that prevent the lake growing without bound.

Without these, even a well-funded lake degrades into a swamp within a year or two.
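
To make the zoning concrete, here is a hedged sketch of a bronze-to-silver promotion in Snowflake/Databricks-flavoured SQL. The bronze and silver schemas and the columns are illustrative; the point is that data is deduplicated and passes a minimal quality gate before it moves up a zone:

-- Promote raw events into the standardised (silver) zone
CREATE OR REPLACE TABLE silver.events AS
SELECT event_id, event_ts, customer_id, amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY event_id       -- one row per event
               ORDER BY ingest_ts DESC     -- keep the latest arrival
           ) AS rn
    FROM bronze.events
) t
WHERE rn = 1                   -- deduplicated
  AND amount IS NOT NULL;      -- minimal quality gate before promotion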

25.5 The Data Lakehouse

The Data Lakehouse is a recent architectural convergence that places warehouse-class transactional, schema, and SQL features on top of a data-lake storage layer. The name and most influential framing come from Michael Armbrust et al. (2021) in their CIDR paper, which argues that the historical separation of warehouse and lake produced unnecessary duplication of data, governance overhead, and complexity, and that a unified platform combining the strengths of both is now technically feasible.

The defining capabilities of a lakehouse:

  • ACID transactions on lake-format files: Concurrent reads and writes with consistency guarantees.
  • Schema enforcement and evolution: Tables have schemas; changes are versioned.
  • Time travel: Queries can access historical snapshots of any table.
  • Unified governance: Access control, lineage, and audit across analytical, machine-learning, and BI workloads.
  • Open table formats: Delta Lake, Apache Iceberg, and Apache Hudi add transactional layers on top of Parquet files.

The lakehouse is not the right answer for every workload — high-frequency single-row operational reads still belong in a transactional database — but for the analytical workloads that previously required separate warehouse and lake platforms, it is increasingly the default.
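
Two of these capabilities are easiest to grasp in SQL. In Databricks-flavoured Delta SQL (syntax differs for Iceberg and Hudi, and the sales table is illustrative), time travel and schema evolution look roughly like this:

-- Time travel: query the table as it stood earlier
SELECT * FROM sales VERSION AS OF 42;
SELECT * FROM sales TIMESTAMP AS OF '2025-01-01';

-- Inspect the transaction log that makes the snapshots possible
DESCRIBE HISTORY sales;

-- Schema evolution: the change is recorded and versioned
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(5,2));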

25.6 Architectural Patterns

Tip: Common Architectural Patterns

| Pattern | Origin | Idea |
|---|---|---|
| Inmon Top-Down | Bill Inmon, 1990s | Build a normalised enterprise warehouse first, derive data marts from it |
| Kimball Bottom-Up | Ralph Kimball, 1990s | Build dimensional data marts for each business process, integrate via conformed dimensions |
| Data Vault | Dan Linstedt, 2000s | Hub-link-satellite model emphasising historisation and source traceability |
| Medallion (Bronze-Silver-Gold) | Databricks, 2020s | Lakehouse zoning from raw through cleaned to curated |
| Lambda Architecture | Nathan Marz, early 2010s | Parallel batch and streaming pipelines reconciled at query time |
| Kappa Architecture | Jay Kreps, 2010s | Streaming-only pipeline with no separate batch layer |
| Data Mesh | Zhamak Dehghani, late 2010s | Decentralised, domain-owned data products with federated governance |

The right pattern depends on scale, maturity, and organisation. A small or mid-size firm rarely needs more than a Kimball-style dimensional model on a cloud warehouse. A global enterprise with hundreds of domains may run a federated mesh on a lakehouse foundation.

25.7 Cloud Platforms

Tip: Modern Cloud Analytical Platforms

| Platform | Vendor | Distinctive Feature |
|---|---|---|
| Snowflake | Snowflake Inc. | Multi-cloud cloud-native warehouse; clear separation of storage and compute |
| BigQuery | Google Cloud | Serverless warehouse; on-demand and pay-per-query |
| Redshift | AWS | AWS-native warehouse; strong integration with S3 and AWS ML |
| Synapse Analytics | Microsoft Azure | Unified warehouse, lake, and pipeline platform |
| Databricks | Databricks | Lakehouse on top of cloud object storage; Delta Lake; strong ML and Spark integration |
| Fabric | Microsoft | Successor to Synapse; lakehouse plus BI integrated with Power BI |
| Apache Iceberg | Open-source | Open table format used across many platforms (Snowflake, Trino, Spark, Flink) |
| Apache Hudi | Open-source | Open table format optimised for incremental and streaming updates |

The choice of platform is rarely about query speed or feature parity any more; most platforms cover the bulk of analytical workloads adequately. The decision increasingly turns on ecosystem fit, cost model, governance integration, and the team's existing skills.

25.8 Choosing Warehouse versus Lake versus Lakehouse

Tip: When to Choose Which

| Property | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Data shape | Structured | Mixed structured, semi-structured, unstructured | All forms |
| Schema enforcement | Strict (schema-on-write) | Loose (schema-on-read) | Configurable |
| Storage cost | Higher | Very low | Low |
| Query speed (SQL) | Very fast | Slower | Fast (with optimisation) |
| ML and unstructured | Limited | Native | Native |
| Governance maturity | Very mature | Less mature | Maturing rapidly |
| Best for | BI, reporting, dashboards | Data science, ML, archival | Mixed workloads, modern unified platform |

A pragmatic decision rule:

  • For pure structured BI workloads in a small to mid-size firm, a cloud warehouse (Snowflake, BigQuery, Redshift, Synapse) is often the simplest answer.
  • For mixed structured and unstructured analytics at scale, a lakehouse (Databricks, Snowflake with Iceberg, Microsoft Fabric) is increasingly the default.
  • For pure unstructured archival and exploration, a data lake on object storage, with a catalogue and quality controls, remains the right fit.

Many organisations start with a warehouse and add lake or lakehouse capability as their data and analytical needs grow.

25.9 Warehouses, Lakes, and the BI Layer

The analytical layer is not the dashboard, but every dashboard depends on it. Choices about warehouse, lake, and lakehouse design propagate directly into what BI tools can render, how dashboards refresh, and whether reports across the firm agree.

  • Star schema and BI performance: Tableau, Power BI, Looker, and Qlik all perform best against star-schema warehouses. Dashboards that join half a dozen normalised tables on every refresh are slow; dashboards that read from a denormalised fact-and-dimension model render in seconds.

  • Conformed dimensions and consistent dashboards: A single conformed customer, product, date, and region dimension is what allows a Tableau workbook in marketing and a Power BI report in finance to agree on basic facts. Without conformance, the same KPI on two dashboards will show two numbers and trust collapses.

  • Slowly changing dimensions and historical charts: A trend line that reads Q3 sales by region is correct only if the dimension carries the regional definition that applied during Q3. Type 2 SCDs are what make point-in-time visualisations honest.

  • Aggregate awareness in the BI tool: Modern BI tools (Looker’s aggregate awareness, Tableau’s aggregate extracts, Power BI’s aggregations) automatically route a dashboard query to a pre-aggregated rollup when one exists. Combined with a star-schema warehouse, this is what makes interactive drill-down dashboards on billion-row tables feel instant (a sketch of such a rollup follows this list).

  • Live versus extract: Tableau’s live-vs-extract and Power BI’s DirectQuery-vs-Import are explicitly warehouse-engagement decisions. Modern cloud warehouses (Snowflake, BigQuery, Databricks SQL) have made live connections far more practical for interactive dashboards.

  • Lakehouses and unstructured visualisation: A pure warehouse cannot serve image, audio, or document content to BI dashboards. Lakehouses bring quality images, supplier PDFs, and call-centre transcripts into the same governed layer where structured KPIs live, enabling dashboards that mix structured numbers with unstructured artefacts.

  • Semantic layers across BI tools: dbt’s semantic layer, Cube, and similar tools express metric definitions once and serve them to any BI tool that connects. A dashboard built in Tableau and another built in Power BI both read the same revenue calculation, eliminating the dashboard-disagreement problem at its root.

  • Cost discipline and dashboard design: Pay-per-query warehouses charge for every dashboard refresh. A single high-traffic Tableau extract refresh against BigQuery can produce a meaningful monthly bill. Dashboard designers should be aware of the cost shape of the warehouse behind them.
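
The rollup behind aggregate awareness, mentioned above, is ordinary warehouse SQL; the BI tool simply knows to prefer it when the dashboard query is at or above its grain. A sketch with illustrative names:

-- Monthly, per-region rollup that a BI tool can route queries to
CREATE TABLE fct_sales_monthly AS
SELECT d.year,
       d.month_no,
       s.region,
       SUM(f.amount) AS total_amount,
       COUNT(*)      AS order_lines
FROM fct_sales f
JOIN dim_date  d ON f.date_key  = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY d.year, d.month_no, s.region;

-- A dashboard query at monthly grain now reads a few thousand rows
-- instead of scanning the full fact table.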

25.10 Common Pitfalls

  • Treating the Warehouse as the Production System: Pointing operational applications at the analytical store and watching their performance degrade. The warehouse is for analysis, not for live transactions.

  • Lake Without Governance: Building a data lake without zoning, cataloguing, and quality controls. The lake becomes a swamp within twelve months.

  • Star Schema for Highly Volatile Dimensions: Using a Type 1 dimension where Type 2 history is needed for trend analysis. The history is silently lost.

  • One-Size-Fits-All Modelling: Forcing every analytical question into a single rigid star schema. Some questions are better served by a snowflake or a normalised structure.

  • Underestimating Slowly Changing Dimensions: Designing dimensions without considering how change will be handled, then patching SCDs in later at cost.

  • No Conformed Dimensions: Building data marts in isolation without a shared customer, product, or date dimension. The marts cannot be joined consistently.

  • Ignoring Lineage: A warehouse where consumers cannot trace a number back to its source field and the transformations that produced it.

  • Schema Sprawl in the Lake: Hundreds of poorly documented schemas in the lake; the same logical entity defined differently in three places.

  • Lakehouse Without Format Discipline: Mixing Parquet, Delta, Iceberg, and Hudi tables in one platform without a stewardship strategy; tools that work with one format but not another cause friction.

  • Migrating Without Strangler Pattern: Replacing the entire warehouse in a single big-bang cut-over. The risk and disruption almost always exceed expectations.

  • No Cost Controls: A pay-per-query warehouse with no budget alerts. A single misconfigured pipeline produces a five-figure bill.

  • Treating Lake or Warehouse as a Project: A warehouse or lake is a continuous capability, not a project that finishes.

  • Forgetting the Reverse Path: Investing in moving data into the analytical layer and ignoring how analytical results return to operational systems where decisions are taken.

25.11 Illustrative Cases

The following short cases illustrate warehouse, lake, and lakehouse choices in practice. They describe common situations and the architectural reasoning.

A Mid-Size Bank’s Cloud Warehouse

A mid-size bank consolidates analytical reporting across deposits, loans, cards, and customer support. The team builds a Kimball-style dimensional model on Snowflake, with conformed customer, product, branch, and date dimensions, fact tables for each business event (account opening, transaction, loan application, support ticket), and Type 2 SCDs for customer and branch dimensions. BI tools query the warehouse directly. The architecture is conventional, well-understood, and well-suited to the bank’s predominantly structured workloads.

A Telecom Operator’s Data Lake for Network Analytics

A telecom operator captures network telemetry — call detail records, signalling messages, base-station logs — at very high volume and variety. The team builds a data lake on object storage, with bronze (raw landing), silver (parsed and standardised), and gold (curated analytical tables) zones. The lake holds data that the team would never have been able to keep in a warehouse at acceptable cost. ML models for network optimisation, fraud detection, and capacity planning train on the lake; reporting workloads pull aggregates from the gold zone.

A Manufacturing Group’s Lakehouse

A diversified manufacturing group operates dozens of plants with mixed data — ERP transactions, IoT sensor streams, quality images, supplier documents. A traditional warehouse cannot hold the unstructured data; a traditional lake cannot serve BI workloads at adequate speed. The team builds a Databricks lakehouse on Delta Lake, with the medallion zoning, ACID guarantees on key tables, and SQL access for BI alongside Spark and ML access for data science. The architecture serves both worlds without duplicating data.

Migration from Legacy Warehouse to Cloud Warehouse

A retailer running a decade-old on-premise warehouse migrates to BigQuery. The team uses the strangler pattern: new data marts are built in BigQuery first, the existing warehouse continues to operate, and consumers are migrated one by one over eighteen months. At cut-over, the legacy platform is decommissioned and only an archival snapshot remains. The migration completes on schedule and with no major outage.

A Lake That Became a Swamp

A consumer-goods firm sets up a data lake without zoning, cataloguing, or quality controls. After eighteen months, the lake holds tens of thousands of files, no one knows which are current, and no one trusts any of them. The remediation programme zones the lake retrospectively, deploys a catalogue (DataHub), implements quality contracts at the silver-gold transition, and decommissions everything in the raw zone older than the new retention policy. Two years later, the lake is functional. The remediation cost considerably more than building it correctly the first time.


25.12 Hands-On Exercise: Building a Star-Schema Model in Power BI

Aim: Build a Kimball-style star-schema model in Power BI — one fact table, four dimension tables — and exercise it with DAX measures that demonstrate the time-intelligence and cross-table aggregation a star schema is designed to support.

Scenario: Yuvijen Stores Pvt Ltd wants its first proper analytical model. The team has the operational tables (orders, customers, products, stores) but no integrated dimensional model. Build the star schema in Power BI, applying the principles from this chapter.

Deliverable: A Power BI file with a visible star-schema canvas (one fact, four dimensions), a date table marked as such, two illustrative slowly-changing-dimension treatments, and five DAX measures that exercise the model.

25.12.1 Step 1 — The Source Tables

Tip: The Five Source Files

| File | Role | Grain |
|---|---|---|
| sales.csv | Fact | One row per order line |
| dim_customer.csv | Dimension | One row per customer |
| dim_product.csv | Dimension | One row per product |
| dim_store.csv | Dimension | One row per store |
| dim_date.csv | Dimension | One row per calendar date for 2024–2026 |

A small extract of the column layouts:

| File | Columns | Description |
|---|---|---|
| sales.csv | order_id, customer_id, product_id, store_id, order_date, quantity, amount | Fact-table foreign keys plus measures |
| dim_date.csv | date, year, quarter, month_no, month_name, day_of_week, is_weekend, fiscal_year | Standard date attributes for analysis |

25.12.2 Step 2 — Load and Disable Auto-Detect Relationships

In Power BI Desktop:

  1. File → Options → Current File → Data Load, untick Autodetect new relationships after data is loaded. Auto-detect produces inconsistent results and obscures the explicit modelling step.
  2. Home → Get Data → Text/CSV for each of the five files. Click Transform Data for each to verify types — particularly that date fields land as Date type and quantity/amount as numeric.
  3. Close & Apply to load.

The five tables now appear in the Fields pane with no relationships defined.

25.12.3 Step 3 — Build the Star Schema in the Model View

Switch to the Model view (the third icon on the left ribbon).

  1. Drag the four dimension tables into a circle around the Sales fact table.
  2. Click and drag from Sales[customer_id] to dim_customer[customer_id] — Power BI creates the relationship automatically.
  3. Repeat for Sales[product_id] ↔︎ dim_product[product_id], Sales[store_id] ↔︎ dim_store[store_id], and Sales[order_date] ↔︎ dim_date[date].
  4. Verify each relationship is many-to-one (* to 1) with single-direction filtering from the dimension to the fact.
  5. The canvas now shows a clean star: one central fact, four spokes outward.

If any auto-relationship persists despite the setting in Step 2, right-click the line and delete it before drawing the explicit one.

25.12.4 Step 4 — Mark the Date Table as a Date Table

Power BI’s time-intelligence functions (TOTALYTD, SAMEPERIODLASTYEAR, DATESINPERIOD) require a properly marked date table:

  1. Click dim_date in the Fields pane.
  2. Modeling → Mark as date table.
  3. Choose dim_date[date] as the date column.
  4. Confirm — Power BI verifies the column is contiguous and unique.

Without this step, time-intelligence DAX returns blank silently and the analyst spends an hour debugging.

25.12.5 Step 5 — Demonstrate Slowly Changing Dimension Patterns

Build two illustrations of SCD treatment using dim_customer:

Type 1 — Overwrite (history lost)

The default Power BI behaviour. When a customer changes city, the dimension is updated in place and the city field reflects only the current city.

-- Under Type 1 the model holds only the current city, so any
-- aggregation sliced by dim_customer[city] attributes all
-- historical sales to each customer's current city
Sales by Current City =
CALCULATE (
    [Total Sales],
    ALLEXCEPT ( dim_customer, dim_customer[city] )
)

Type 2 — Add Row (history preserved)

Add effective_from and effective_to columns to a duplicate of the customer dimension (dim_customer_history). Each historical state of the customer becomes a separate row, with effective_to left NULL on the current row. Join the fact table on a surrogate key that respects time, as sketched below.
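
Upstream of Power BI, the time-respecting lookup that assigns each fact row the customer state in force on the order date could be a point-in-time join in SQL. A sketch only; customer_sk is a hypothetical surrogate key and the other names follow the exercise:

-- Resolve each order line to the customer row valid on the order date
SELECT f.order_id,
       f.amount,
       c.customer_sk,   -- hypothetical surrogate key for the Type 2 row
       c.city           -- the city as of the order date, not today's
FROM   sales f
JOIN   dim_customer_history c
  ON   c.customer_id = f.customer_id
 AND   f.order_date >= c.effective_from
 AND ( f.order_date <  c.effective_to OR c.effective_to IS NULL );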

The Type 2 model lets a chart of Sales by city last year show the city the customer lived in last year, not the city they live in today — the difference between honest historical visualisation and silently misleading visuals.

25.12.6 Step 6 — Write Measures That Exercise the Star

Add five DAX measures in the Modeling tab:

Total Sales =
SUM(Sales[amount])

Sales YTD =
TOTALYTD([Total Sales], dim_date[date])

Sales SPLY = -- Same Period Last Year
CALCULATE([Total Sales], SAMEPERIODLASTYEAR(dim_date[date]))

YoY Growth % =
DIVIDE([Total Sales] - [Sales SPLY], [Sales SPLY])

Top 5 Products by Sales =
CONCATENATEX(
    TOPN(5, VALUES(dim_product[product_name]), [Total Sales]),
    dim_product[product_name],
    ", "
)

The five measures together demonstrate that the star schema enables time-intelligence (YTD, SPLY), ranking (TOPN), and cross-table aggregation — without any explicit joins in the DAX code.

25.12.7 Step 7 — Build the Demonstration Report

Switch to the Report view and build a single page that exercises every element of the model:

  • A line chart of Total Sales by month, with Sales SPLY overlaid.
  • A bar chart of YoY Growth % by product category.
  • A KPI card showing Sales YTD with the prior YTD as comparison.
  • A table of Top 5 Products by Sales by region.
  • Slicers for dim_date[year], dim_store[region], and dim_customer[segment].

Every visual respects the model’s filter propagation. Selecting a region in the slicer simultaneously filters all four dimensions through to the fact table — that is the power of a clean star schema.

25.12.8 Step 8 — Verify Performance and Storage

In Power BI Desktop, open View → Performance Analyzer to see how long each visual takes to render. A well-designed star schema with appropriately marked date table and conformed dimensions should render every visual in under 200 milliseconds for tens of millions of rows.

Compare the star-schema design with a flat-table alternative (everything denormalised into one wide table). Power BI’s column-store engine compresses the star schema substantially better than the flat alternative — typically 40–60 % smaller for the same data — because dimension values repeat far less than they would in a single denormalised table.

25.12.9 Step 9 — Connect to the Visualisation Layer

Several visualisation patterns from earlier in the book now make sense as features of the star schema rather than as features of the chart:

  • Conformed dimensions are why the Customer Region in the marketing dashboard and the Customer Region in the finance dashboard agree.
  • Type 2 SCDs are why a historical trend chart can show the firm’s revenue by region as the regions stood in each historical year, instead of silently re-applying today’s regional definitions to old data.
  • Time-intelligence DAX is what lets every chart in the report carry an automatic Same Period Last Year comparison without writing a separate query for each.
  • The Top 5 measure drives a ranking chart on the dashboard with no custom SQL.

The hands-on closes the loop on the chapter: the star schema is the discipline that makes the visualisation layer above it both fast and correct.

Tip: Files and Screen Recordings

Power BI file (yuvijen-star-schema.pbix), the five source CSVs, and screen recordings of the Model view, the date-table marking, the SCD Type 2 illustration, and the time-intelligence measures will be embedded here.


Summary

| Concept | Description |
|---|---|
| **Foundations** | |
| Why Warehouses and Lakes Matter | Operational systems run the business; warehouses and lakes let the business understand itself |
| **Data Warehouse Definition** | |
| Data Warehouse | Centralised repository of integrated historical subject-oriented data optimised for analytical query |
| Subject-Oriented | Organised around business subjects rather than around source systems |
| Integrated | Inconsistencies in naming, units, and formats are resolved during loading |
| Non-Volatile | Data is added rather than modified; historical state is preserved |
| Time-Variant | Every fact carries a temporal anchor so trends can be analysed |
| **Inmon and Kimball Schools** | |
| Inmon Top-Down | Build a normalised enterprise warehouse first, derive data marts from it |
| Kimball Bottom-Up | Build dimensional data marts for each business process, integrate via conformed dimensions |
| **Dimensional Modelling** | |
| Fact Table | Numerical measurements of business events with foreign keys to dimensions |
| Dimension Table | Descriptive context for the facts, used to filter, group, and label |
| Star Schema | Fact in the centre with denormalised dimensions as points; query-optimised |
| Snowflake Schema | Dimensions normalised into multiple tables; space-efficient, slower |
| **Slowly Changing Dimensions** | |
| SCD Type 0 | Original value preserved; new value ignored |
| SCD Type 1 | Overwrite the old value; history lost |
| SCD Type 2 | Add a new row with effective-from and effective-to; full history preserved |
| SCD Type 3 | Add columns for current and previous; only one prior version preserved |
| SCD Type 4 | Current dimension table plus a separate history table |
| SCD Type 6 | Hybrid combining overwrite, history rows, and current-value columns |
| **Data Lake** | |
| Data Lake | Centralised repository for raw data on cheap object storage; schema-on-read |
| Schema-on-Write vs Schema-on-Read | Warehouse imposes structure at write; lake imposes it at read |
| Object Storage | S3, Azure Data Lake Storage, Google Cloud Storage; cheap, durable, scalable |
| Open File Formats | Parquet and ORC columnar; Avro for streaming; raw for unstructured |
| **Avoiding the Swamp** | |
| Data Swamp | Failure mode of an ungoverned lake: vast pile of raw files no one trusts |
| Lake Zoning | Documented progression from raw to standardised to curated; bronze-silver-gold |
| Catalogue and Metadata | Searchable inventory of every dataset with schema, lineage, and ownership |
| Quality at Curation | Quality controls move data from standardised to curated zone |
| Lifecycle Policies | Retention, archival, and deletion policies that prevent the lake growing without bound |
| **Data Lakehouse** | |
| Data Lakehouse | Architectural convergence placing warehouse-class features on top of lake storage |
| ACID on Lake Files | Concurrent reads and writes with consistency guarantees on lake-format files |
| Schema Enforcement and Evolution | Tables have schemas; changes are versioned and managed |
| Time Travel | Queries can access historical snapshots of any table |
| Unified Governance | Access control, lineage, and audit across analytical, ML, and BI workloads |
| Delta Lake | Open table format on Parquet adding transactions; Databricks-led |
| Apache Iceberg | Open table format adopted across many platforms including Snowflake and Trino |
| Apache Hudi | Open table format optimised for incremental and streaming updates |
| **Architectural Patterns** | |
| Data Vault | Hub-link-satellite model emphasising historisation and source traceability |
| Medallion Architecture | Lakehouse zoning from raw bronze through cleaned silver to curated gold |
| Lambda Architecture | Parallel batch and streaming pipelines reconciled at query time |
| Kappa Architecture | Streaming-only pipeline with no separate batch layer |
| Data Mesh | Decentralised domain-owned data products with federated governance |
| **Cloud Platforms** | |
| Snowflake | Multi-cloud cloud-native warehouse; clear separation of storage and compute |
| BigQuery | Serverless warehouse with on-demand pay-per-query pricing |
| Redshift | AWS-native warehouse with strong S3 and ML integration |
| Synapse Analytics | Microsoft Azure unified warehouse, lake, and pipeline platform |
| Databricks | Lakehouse on cloud object storage; Delta Lake; strong ML and Spark integration |
| Microsoft Fabric | Successor to Synapse; lakehouse plus BI integrated with Power BI |
| **Choosing Warehouse vs Lake vs Lakehouse** | |
| Warehouse Best Use | BI, reporting, and dashboards over predominantly structured data |
| Lake Best Use | Data science, ML, and archival across mixed structured and unstructured data |
| Lakehouse Best Use | Mixed workloads on a unified platform combining strengths of warehouse and lake |
| **Common Pitfalls** | |
| Warehouse as Production | Pointing operational applications at the analytical store and degrading their performance |
| Lake Without Governance | Building a lake without zoning, cataloguing, and quality controls; it becomes a swamp |
| Star Schema for Volatile Dimensions | Using Type 1 dimensions where Type 2 history is needed for trend analysis |
| One-Size-Fits-All Modelling | Forcing every analytical question into a single rigid star schema |
| Underestimating SCDs | Designing dimensions without considering how change will be handled |
| No Conformed Dimensions | Building data marts in isolation without shared customer, product, or date dimensions |
| Ignoring Lineage | A warehouse where consumers cannot trace a number back to its source |
| Schema Sprawl in the Lake | Hundreds of poorly documented schemas with the same entity defined differently |
| Lakehouse Format Mix | Mixing Parquet, Delta, Iceberg, and Hudi without a stewardship strategy |
| Big-Bang Migration | Replacing the entire warehouse in a single big-bang cut-over |
| No Cost Controls | Pay-per-query platforms with no budget alerts; a single misconfigured pipeline can be costly |
| Lake or Warehouse as Project | Treating the analytical layer as a project that finishes rather than a continuous capability |
| Forgetting the Reverse Path | Investing in moving data into analytics and ignoring how results return to operations |