45  Data Warehousing Concepts for BI

45.1 Why Warehouse Design Matters for BI

The dashboards you can build, the speed they refresh at, and the truths they tell are all governed by the warehouse design under them.

Chapter 25 introduced the data warehouse and the data lake as analytical-layer concepts. This chapter narrows in on the warehouse decisions that specifically shape what BI tools — Power BI, Tableau, Excel — can do well.

A poorly designed warehouse forces every dashboard to do its own joins, write its own measures, and reconcile its own definitions. A well-designed warehouse hands the BI tool clean star schemas, conformed dimensions, materialised aggregates, and a stable semantic layer — and the dashboards built on top become fast, consistent, and easy to maintain.

The standard references for dimensional warehouse design are The Data Warehouse Toolkit by Ralph Kimball & Margy Ross (2013) (cited in Chapter 25) and Star Schema: The Complete Reference by Christopher Adamson (2010), which together codify the patterns BI tools were designed around.

In a visualisation-focused book, this is the chapter where the analyst's upstream choices (warehouse design, indexes, semantic layer) determine what the downstream visualisation is allowed to be.

45.2 The Star Schema, Revisited for BI

The single most consequential warehouse decision for BI is the star schema: one fact table at the centre, dimension tables radiating outward, related by single-key foreign-key relationships.

Power BI, Tableau, Looker, and Qlik are all engineered around this shape. Their query engines are fastest against a star, their relationships are cleanest against a star, and their measures aggregate most reliably against a star. Any deviation — snowflakes, flattened denormalised tables, many-to-many relationships — adds friction to every dashboard built on top.

Tip: Star versus Snowflake versus Flat for BI

| Pattern | What It Looks Like | BI Performance | Maintainability |
| Star Schema | Fact + denormalised dimensions | Fastest | Best |
| Snowflake Schema | Fact + dimensions normalised across multiple tables | Slower joins | Slightly more flexible |
| Flat / OBT (one big table) | Fact and all dimensions denormalised into one wide table | Variable | Worst — definitions drift |
| Galaxy / Constellation | Multiple fact tables sharing dimensions | Same as star, scaled | Best for multi-process |

For most BI workloads, star is the answer. Snowflake makes sense only when a dimension is genuinely huge and rarely queried at full grain. Flat-table modelling is convenient for prototypes but accumulates definitional drift quickly.
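The star's one-hop property can be shown in miniature. The sketch below uses SQLite via Python's standard sqlite3 module; the table and column names are illustrative, not the chapter's full Yuvijen schema. Every dimension attribute is a single join away from the fact, which is exactly the shape BI query engines optimise for.

```python
import sqlite3

# Minimal star: one fact table, two denormalised dimensions (illustrative names).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_sk  INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE f_sales (
    product_sk INTEGER REFERENCES dim_product(product_sk),
    region_sk  INTEGER REFERENCES dim_region(region_sk),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Desk', 'Furniture'), (2, 'Pen', 'Office');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO f_sales VALUES (1, 1, 500.0), (2, 1, 20.0), (1, 2, 300.0);
""")

# A typical BI query: aggregate the fact, slice by a dimension attribute.
# Each dimension is exactly one hop from the fact -- the star's defining property.
rows = con.execute("""
    SELECT r.region_name, SUM(s.amount) AS total
    FROM f_sales s
    JOIN dim_region r ON r.region_sk = s.region_sk
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(rows)  # [('North', 520.0), ('South', 300.0)]
```

In a snowflaked version the same question would need a chain of joins (fact to dimension to sub-dimension), which is the friction the table above summarises.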

45.3 Conformed Dimensions

A Conformed Dimension is a dimension table that has the same structure, the same keys, and the same meaning across every fact table that references it. The standard conformed dimensions in any analytics programme are Date, Customer, Product, Region, and Employee.

Conformed dimensions are what allow the marketing dashboard, the finance dashboard, and the operations dashboard to agree on basic facts — same customer count, same regional definition, same date hierarchy. Without conformance, every dashboard reproduces its own slightly different version of the same dimension and the firm’s reports diverge.

Practical disciplines for conformed dimensions:

  • Single source of truth: One canonical Date table, one canonical Customer dimension, used everywhere.
  • Stable surrogate keys: Each row carries a never-reused integer surrogate key independent of the natural-key value (which can change).
  • Hierarchies modelled explicitly: Date carries Year, Quarter, Month, Week, Day; Geography carries Country, State, City, Branch; Product carries Category, Subcategory, Product.
  • Sort-by columns: For text values that need a non-alphabetical order (e.g., month names sorted by month number).
  • Hide technical columns: Foreign keys, surrogate keys, audit columns hidden from BI tools so users see only business-meaningful attributes.
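The sort-by-column discipline is easy to demonstrate. A minimal SQLite sketch, assuming a cut-down dim_date with only the columns needed: sorting the text column alphabetically scrambles the calendar, while the companion numeric column restores it — which is exactly what Power BI's "Sort by Column" and Tableau's manual sort rely on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, month INTEGER, month_name TEXT);
INSERT INTO dim_date VALUES (1, 1, 'January'), (2, 4, 'April'), (3, 8, 'August');
""")

# Alphabetical order puts April before August before January -- wrong for time.
alpha = [r[0] for r in con.execute(
    "SELECT month_name FROM dim_date ORDER BY month_name")]

# The numeric sort-by column restores calendar order.
by_number = [r[0] for r in con.execute(
    "SELECT month_name FROM dim_date ORDER BY month")]

print(alpha)      # ['April', 'August', 'January']
print(by_number)  # ['January', 'April', 'August']
```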

45.4 Slowly Changing Dimensions for Historical Reporting

Dimensions change over time. The way the warehouse handles change determines whether historical dashboards report the current state of every entity or its state as it was at the historical date.

The classification (introduced in Chapter 25) applies directly to BI:

  • Type 1 (Overwrite): The current state replaces the historical. Old dashboards retroactively show today’s values. Suitable when historical accuracy of an attribute does not matter.
  • Type 2 (Add Row): Each change adds a new row with EffectiveFrom and EffectiveTo dates. The fact table foreign-key carries the historical surrogate, so dashboards faithfully reproduce historical state.
  • Type 3 (Add Column): Two columns — Current Value and Previous Value. Useful when the audience wants to compare both.

The single most consequential choice: whether to use Type 2 on the customer and product dimensions. Without it, a dashboard of Sales by Region last year may silently apply this year’s regional definitions, misleading the audience.
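A minimal sketch of why Type 2 keeps history honest, using SQLite and invented sample rows: because each fact row stores the surrogate key that was current at sale time, aggregation lands in the historical region even after the customer moves.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY, customer_id TEXT, region TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER
);
-- Customer C1 moved from North to South on 2024-01-01: two Type 2 rows.
INSERT INTO dim_customer VALUES
    (1, 'C1', 'North', '2020-01-01', '2023-12-31', 0),
    (2, 'C1', 'South', '2024-01-01', NULL,         1);

CREATE TABLE f_sales (customer_sk INTEGER, amount REAL);
-- Each sale carries the surrogate key that was current when it happened.
INSERT INTO f_sales VALUES (1, 100.0), (2, 250.0);
""")

rows = con.execute("""
    SELECT c.region, SUM(s.amount)
    FROM f_sales s
    JOIN dim_customer c ON c.customer_sk = s.customer_sk
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('North', 100.0), ('South', 250.0)] -- history preserved
```

Under Type 1 there would be a single C1 row saying South, and all 350.0 of sales would silently report under the new region.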

45.5 Aggregation Tables and Materialised Views

Most BI dashboards summarise transactions into months, quarters, or years. Forcing the BI tool to aggregate billions of detail rows on every dashboard refresh is wasteful. Aggregation tables and materialised views precompute these summaries.

Two patterns:

  • Aggregation Tables: Hand-built tables holding pre-aggregated data at the level the dashboard needs (e.g., monthly_sales_by_region). The BI tool either queries the aggregation directly or — better — uses aggregate awareness to route queries to the appropriate level automatically.

  • Materialised Views: Database-managed precomputed query results. Snowflake’s Materialized Views, BigQuery’s Materialized Views, and Databricks’ Delta Live Tables all serve this pattern. The database refreshes the view on a schedule or incrementally.

Power BI supports Aggregations as a native feature: define an aggregated table, link it to the detail table, mark it as the aggregation source. Power BI’s query engine automatically uses the aggregation when the visual’s grain matches.
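The aggregation-table pattern in miniature (SQLite, illustrative names): build the summary once at the grain the dashboard needs, then answer monthly questions from thousandths of the data. The sanity check at the end is the invariant any aggregate-aware routing depends on — the aggregate and the detail must always agree.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE f_sales (sale_date TEXT, region TEXT, amount REAL);
INSERT INTO f_sales VALUES
    ('2024-01-05', 'North', 100.0),
    ('2024-01-20', 'North', 150.0),
    ('2024-02-03', 'South', 200.0);

-- Precompute the summary once, at the grain monthly dashboards query.
CREATE TABLE monthly_sales_by_region AS
SELECT strftime('%Y-%m', sale_date) AS month, region, SUM(amount) AS amount
FROM f_sales
GROUP BY month, region;
""")

agg = con.execute(
    "SELECT month, region, amount FROM monthly_sales_by_region ORDER BY month"
).fetchall()
print(agg)  # [('2024-01', 'North', 250.0), ('2024-02', 'South', 200.0)]

# The routing invariant: aggregate and detail must agree, or the aggregate lies.
detail_total = con.execute("SELECT SUM(amount) FROM f_sales").fetchone()[0]
agg_total = con.execute("SELECT SUM(amount) FROM monthly_sales_by_region").fetchone()[0]
assert detail_total == agg_total
```

In production the summary is refreshed by the pipeline or managed as a materialised view; the BI tool then queries three rows instead of millions.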

45.6 Indexing and Partitioning for BI Workloads

Database-level performance features that materially affect BI:

  • Clustered columnstore indexes (SQL Server, Synapse, Redshift, Snowflake’s micro-partitions): Compress and reorder data column by column. Optimal for BI’s typical query pattern (aggregating few columns over many rows).

  • Partitioning by date: Storing fact tables partitioned by year or month. Queries filtering by date scan only the relevant partitions. Reduces query time substantially for time-bounded BI views.

  • Clustering keys: Snowflake-style ordering of micro-partitions by frequently-filtered columns; reduces the data scanned.

  • Statistics: The database’s understanding of data distribution drives the optimiser; stale statistics produce bad query plans.

For BI workloads, the dominant pattern is clustered columnstore on the fact table, partitioned by date, with statistics refreshed regularly.
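SQLite has no partitions or micro-partitions, but the underlying scan-reduction idea can be illustrated with an ordinary index on the date key (a stand-in, not an equivalent of columnstore or clustering). EXPLAIN QUERY PLAN shows the date-filtered query turning from a full scan into an index search:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE f_sales (date_sk INTEGER, amount REAL)")
con.executemany("INSERT INTO f_sales VALUES (?, ?)",
                [(20240000 + d, 1.0) for d in range(1, 1001)])

query = ("SELECT SUM(amount) FROM f_sales "
         "WHERE date_sk BETWEEN 20240101 AND 20240131")

# Without any ordering structure, the date filter must scan every row.
plan_before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

con.execute("CREATE INDEX ix_sales_date ON f_sales (date_sk)")

# With the index, the engine seeks straight to the relevant date range --
# the same effect date partitioning or clustering keys give a warehouse.
plan_after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[-1][-1])  # a full SCAN of f_sales
print(plan_after[-1][-1])   # a SEARCH using ix_sales_date
```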

45.7 The Semantic Layer

A Semantic Layer is a curated metadata layer between the warehouse and the BI tool. It encodes business definitions — what Active Customer means, how Revenue is computed, what the Region hierarchy looks like — so that every dashboard built on top inherits the same definitions.

Tip: Semantic Layer Implementations

| Tool | Form |
| Power BI | Published Datasets with measures, hierarchies, calculated columns; consumed by reports |
| Tableau | Published Data Sources with calculated fields, hierarchies, sets, and parameters |
| Looker | LookML — code-first semantic layer with version control and reusable views |
| dbt Semantic Layer | dbt-defined metrics consumed by BI tools via a query API |
| Cube.dev / Cube.js | Open-source headless semantic layer for any BI tool |
| AtScale, Kyligence | Enterprise semantic-layer platforms |

The semantic layer is the operational form of the certified data source idea from Chapter 30. Multiple dashboards consuming the same semantic layer agree by construction; dashboards built bypassing it diverge.
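As a toy illustration (not any vendor's API), a semantic layer can be reduced to its essence: metric definitions live in one shared place, and every consumer renders its queries from them, so two dashboards can never disagree on what Active Customer means. All names here are invented.

```python
import sqlite3

# The 'semantic layer': one shared registry of business definitions.
METRICS = {
    "revenue": "SUM(amount)",
    "active_customers": "COUNT(DISTINCT CASE WHEN amount > 0 THEN customer_id END)",
}

def metric_query(metric: str, group_by: str) -> str:
    """Render SQL for a metric; consumers never restate the formula."""
    return f"SELECT {group_by}, {METRICS[metric]} FROM f_sales GROUP BY {group_by}"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE f_sales (customer_id TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO f_sales VALUES (?, ?, ?)", [
    ("C1", "North", 100.0), ("C2", "North", 0.0), ("C3", "South", 50.0),
])

# Any dashboard asking about active customers inherits the same definition,
# so C2 (no positive sales) is excluded everywhere, by construction.
by_region = sorted(con.execute(metric_query("active_customers", "region")).fetchall())
print(by_region)  # [('North', 1), ('South', 1)]
```

Real implementations (LookML, dbt metrics, Power BI datasets) add typing, version control, and access control around exactly this idea.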

45.8 Data Marts versus the Enterprise Warehouse

A long-running tension in warehouse design:

  • Inmon’s Enterprise Data Warehouse: A normalised central warehouse holding all corporate data, with downstream data marts derived from it for specific subjects (Marketing, Finance, Operations).
  • Kimball’s Bus Architecture: Multiple subject-area data marts integrated through conformed dimensions; no central normalised warehouse required.

For BI work, Kimball’s bus architecture is usually the more practical pattern: build a Marketing Mart, Finance Mart, Operations Mart, each as a star schema, sharing conformed Date, Customer, and Product dimensions. The BI tool consumes the marts; the marts agree by virtue of their shared dimensions.

Modern cloud warehouses (Snowflake, BigQuery, Synapse) blur the distinction — the entire warehouse can be one logical container with marts realised as schemas or views — but the conformed-dimension discipline remains essential.

45.9 Cloud Warehouses for BI

Tip: Cloud Warehouses Common in BI Work

| Warehouse | BI-Relevant Strengths |
| Snowflake | Multi-cloud, separation of storage and compute, automatic micro-partitioning, native semi-structured support |
| Google BigQuery | Serverless, massive scale, pay-per-query, BigQuery BI Engine for sub-second BI |
| Amazon Redshift | Tight AWS integration, materialized views, RA3 instances with managed storage |
| Azure Synapse Analytics | Tight Power BI integration, dedicated SQL pools, serverless SQL pools |
| Databricks SQL | Lakehouse architecture, Delta Lake, Photon engine for BI workloads |
| Microsoft Fabric Warehouse | Native Power BI integration via Direct Lake; OneLake-backed |

The choice often follows the firm’s existing cloud commitments. For Microsoft-heavy enterprises, Synapse / Fabric is the natural fit; for AWS, Redshift; for multi-cloud or independent stacks, Snowflake; for ML-heavy environments, Databricks.

45.10 Performance Tuning for BI Workloads

A short list of optimisations that move BI dashboards from sluggish to instant:

  • Star schema, single-direction relationships: Already covered.
  • Aggregation tables / materialised views: For queries that summarise large fact tables.
  • Date-partitioned fact tables: Filter pushdown reduces scan size.
  • Cluster keys on commonly filtered columns: Snowflake-style clustering.
  • Statistics refresh: Schedule it; stale stats produce bad plans.
  • Avoid SELECT *: Both in BI tools and in warehouse views; always project explicitly.
  • Pre-compute hierarchy attributes: Year, Quarter, Month columns on the date table beat date functions in measures.
  • Cache layers: Power BI’s Aggregations + DirectQuery, Tableau’s extracts, Looker’s caching tier all reduce warehouse load.

For specific tools:

  • Power BI: Marked date table, Aggregations feature, single-direction relationships, explicit measures, Performance Analyzer.
  • Tableau: Extracts (Hyper engine), data source filters, context filters before LOD, custom SQL kept minimal.
  • Excel + Power Pivot: Star schema in the Data Model, marked date table, hidden technical columns.

45.11 Common Pitfalls

  • One Big Flat Table: Denormalised everything into a single wide table; loses governance and grows uncontrollably.
  • No Conformed Dimensions: Every BI tool builds its own definition of Customer; reports diverge silently.
  • Type 1 Where Type 2 Is Needed: Historical dashboards retroactively show today’s regional definitions.
  • No Aggregation Tables on Large Facts: Every dashboard scans 500 million rows to compute monthly totals; performance suffers.
  • Bidirectional Relationships in BI Models: Slow queries, ambiguity, hard to debug.
  • Stale Statistics: Database optimiser picks bad plans; queries that ran in seconds now take minutes.
  • No Semantic Layer: Every dashboard re-implements the same metric definitions; the firm’s numbers never quite agree.
  • No Date Table Discipline: Time-intelligence in Power BI / Tableau breaks because the date table is missing or unmarked.
  • Over-Snowflaking: Normalising dimensions into many small tables; BI joins multiply, performance degrades.
  • Pipeline Without Warehouse Discipline: Pipelines deliver clean data but into a flat-table model; the BI tool inherits the weakness.

45.12 Illustrative Cases

A Conformed-Dimension Save

A retail bank has separate customer dimensions in its marketing, risk, and finance marts. Customer counts disagree across dashboards by 3–5 %. The data team builds a conformed customer dimension, populates it with surrogate keys, and migrates each mart’s fact tables to reference it. After the migration, dashboards across the three functions agree on every basic fact for the first time in years.

A Power BI Aggregation That Made Reports Instant

A retailer’s Power BI dashboards on a 2-billion-row fact table take 8–15 seconds per visual. The team builds a monthly aggregation table at one-thousandth the size, marks it as a Power BI Aggregation, and links it to the fact. Subsequent dashboard interactions render in 200 ms. The user experience improves dramatically with no change to the visualisations.

A Type 2 SCD for Honest Historical Reporting

A telecom operator’s regional restructuring renames several regions. Type 1 SCDs would silently overwrite the old region names; historical dashboards would retroactively claim sales attributed to regions that did not exist at the time. The team migrates the customer and store dimensions to Type 2 with effective-dating; historical reports remain accurate and the new structure flows forward correctly.


45.13 Hands-On Exercise: Designing a BI-Ready Warehouse Schema

Aim: Design a star-schema warehouse for Yuvijen Stores in a cloud warehouse (Snowflake, BigQuery, Synapse, or even SQLite for offline practice), populate it from CSV sources, then connect Power BI to it and verify performance.

Deliverable: SQL DDL for the warehouse tables, populated tables, and a Power BI report consuming the warehouse with two DAX measures.

45.13.1 Step 1 — Design the Schema

Plan the star schema:

  • Fact table f_sales: order_id, customer_sk, product_sk, store_sk, date_sk, quantity, amount, profit.
  • Dimension dim_customer: customer_sk (surrogate), customer_natural_id, name, segment, region_sk, effective_from, effective_to, is_current.
  • Dimension dim_product: product_sk, product_id, name, subcategory, category.
  • Dimension dim_store: store_sk, store_id, name, region_sk.
  • Dimension dim_region: region_sk, region_code, region_name, country.
  • Dimension dim_date: date_sk, date, year, quarter, month, day, day_name, is_weekend, fiscal_year.

Note Type 2 SCD on dim_customer (effective_from, effective_to, is_current). The customer can change region over time; historical sales must match the historical region.

45.13.2 Step 2 — Write the DDL

CREATE TABLE dim_date (
    date_sk      INT PRIMARY KEY,
    date         DATE NOT NULL UNIQUE,
    year         INT,
    quarter      INT,
    month        INT,
    day          INT,
    day_name     VARCHAR(10),
    is_weekend   BOOLEAN,
    fiscal_year  INT
);

CREATE TABLE dim_region (
    region_sk    INT PRIMARY KEY,
    region_code  VARCHAR(10) UNIQUE,
    region_name  VARCHAR(50),
    country      VARCHAR(50)
);

CREATE TABLE dim_customer (
    customer_sk      INT PRIMARY KEY,
    customer_id      VARCHAR(20),
    name             VARCHAR(200),
    segment          VARCHAR(20),
    region_sk        INT REFERENCES dim_region(region_sk),
    effective_from   DATE,
    effective_to     DATE,
    is_current       BOOLEAN
);

CREATE TABLE dim_product (
    product_sk    INT PRIMARY KEY,
    product_id    VARCHAR(20),
    name          VARCHAR(200),
    subcategory   VARCHAR(50),
    category      VARCHAR(50)
);

CREATE TABLE dim_store (
    store_sk    INT PRIMARY KEY,
    store_id    VARCHAR(20),
    name        VARCHAR(200),
    region_sk   INT REFERENCES dim_region(region_sk)
);

CREATE TABLE f_sales (
    order_id      VARCHAR(20),
    customer_sk   INT REFERENCES dim_customer(customer_sk),
    product_sk    INT REFERENCES dim_product(product_sk),
    store_sk      INT REFERENCES dim_store(store_sk),
    date_sk       INT REFERENCES dim_date(date_sk),
    quantity      INT,
    amount        DECIMAL(18, 2),
    profit        DECIMAL(18, 2)
);

Save the script as yuvijen_warehouse_ddl.sql. Execute against your chosen warehouse.

45.13.3 Step 3 — Populate the Date Dimension

-- Sample for SQLite (Snowflake / BigQuery / Synapse have native date-generation functions)
WITH RECURSIVE calendar(date) AS (
    SELECT DATE('2023-04-01')
    UNION ALL
    SELECT DATE(date, '+1 day') FROM calendar WHERE date < '2026-12-31'
)
INSERT INTO dim_date (date_sk, date, year, quarter, month, day, day_name, is_weekend, fiscal_year)
SELECT
    CAST(strftime('%Y%m%d', date) AS INT) AS date_sk,
    date,
    CAST(strftime('%Y', date) AS INT) AS year,
    ((CAST(strftime('%m', date) AS INT) - 1) / 3) + 1 AS quarter,
    CAST(strftime('%m', date) AS INT) AS month,
    CAST(strftime('%d', date) AS INT) AS day,
    CASE strftime('%w', date)
        WHEN '0' THEN 'Sunday'
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
    END AS day_name,
    strftime('%w', date) IN ('0', '6') AS is_weekend,
    CASE WHEN CAST(strftime('%m', date) AS INT) >= 4
         THEN CAST(strftime('%Y', date) AS INT)
         ELSE CAST(strftime('%Y', date) AS INT) - 1
    END AS fiscal_year
FROM calendar;

Most cloud warehouses have native date-table generators. The principle is the same: one row per date, with all the calendar attributes pre-computed.
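The same build can be done in plain Python, for practice away from any warehouse. The attribute logic mirrors the SQL step above: an April-start fiscal year and weekend defined as Saturday/Sunday (both assumptions of this exercise). Note that `strftime("%A")` day names depend on the process locale.

```python
from datetime import date, timedelta

def date_rows(start: date, end: date):
    """Yield one dim_date row per day, all calendar attributes precomputed."""
    d = start
    while d <= end:
        yield (
            int(d.strftime("%Y%m%d")),             # date_sk (smart integer key)
            d.isoformat(),                          # date
            d.year,                                 # year
            (d.month - 1) // 3 + 1,                 # quarter
            d.month,                                # month
            d.day,                                  # day
            d.strftime("%A"),                       # day_name (locale-dependent)
            1 if d.weekday() >= 5 else 0,           # is_weekend: Sat/Sun
            d.year if d.month >= 4 else d.year - 1  # fiscal_year: April start
        )
        d += timedelta(days=1)

rows = list(date_rows(date(2024, 1, 1), date(2024, 1, 7)))
print(rows[0])  # (20240101, '2024-01-01', 2024, 1, 1, 1, 'Monday', 0, 2023)
```

Feed the rows to `executemany` (or a dataframe load) to populate dim_date in any target.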

45.13.4 Step 4 — Populate the Other Dimensions and the Fact

Use Power Query, dbt, or warehouse-native COPY to load the source CSVs into the dimensions and the fact table. The pattern from Chapter 35 (Power Query pipelines) and Chapter 44 (data pipelines) applies.

For Type 2 customer dimension, the ETL must:

  1. Detect changed rows (region update for an existing customer).
  2. Set effective_to = current_date and is_current = false on the existing row.
  3. Insert a new row with effective_from = current_date, effective_to = NULL, is_current = true.

dbt’s snapshot feature automates this pattern.
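The three steps can be sketched as a small function against SQLite. The schema here is simplified to the SCD columns, and change detection (step 1, comparing a staged row against the current row) is assumed to have happened upstream; this shows only the expire-and-insert transition.

```python
import sqlite3

def apply_scd2(con, customer_id, new_region, today):
    """Apply one Type 2 transition: close the current row, insert the new one."""
    # Step 2: expire the existing current row.
    con.execute("""
        UPDATE dim_customer SET effective_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (today, customer_id))
    # Step 3: insert a new current row with a fresh, never-reused surrogate key.
    next_sk = con.execute(
        "SELECT COALESCE(MAX(customer_sk), 0) + 1 FROM dim_customer").fetchone()[0]
    con.execute("INSERT INTO dim_customer VALUES (?, ?, ?, ?, NULL, 1)",
                (next_sk, customer_id, new_region, today))
    return next_sk

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY, customer_id TEXT, region TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER)""")
con.execute(
    "INSERT INTO dim_customer VALUES (1, 'C1', 'North', '2020-01-01', NULL, 1)")

apply_scd2(con, 'C1', 'South', '2024-06-01')
rows = con.execute(
    "SELECT customer_sk, region, is_current FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(rows)  # [(1, 'North', 0), (2, 'South', 1)]
```

New facts now reference surrogate key 2; old facts keep referencing 1, which is what the Step 7 honesty test below verifies from the Power BI side.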

45.13.5 Step 5 — Connect Power BI

  1. In Power BI Desktop, Get Data → SQL Server / Snowflake / BigQuery / etc., point at the warehouse.
  2. Select the six tables.
  3. Choose Import mode (or DirectQuery for very large data with a fast warehouse).
  4. In the Model view, verify the relationships match the foreign keys; cardinality many-to-one, single-direction.
  5. Mark dim_date as a Date Table (Modeling → Mark as Date Table).
  6. Hide foreign-key and surrogate-key columns from the Fields pane.

45.13.6 Step 6 — Build Two DAX Measures

In a Measures table:

Total Sales = SUM(f_sales[amount])

Sales LY = CALCULATE([Total Sales], SAMEPERIODLASTYEAR('dim_date'[date]))

Build a small report exercising the model: a line chart of Total Sales vs Sales LY by month, a bar chart by dim_region[region_name], a slicer on dim_customer[segment].

45.13.7 Step 7 — Test SCD Type 2 Honesty

In the warehouse, manually update one customer’s region (a Type 2 transition):

  1. Set the existing customer row’s effective_to to today and is_current to false.
  2. Insert a new row for the same customer_id with the new region_sk, effective_from = today, is_current = true.
  3. Refresh the Power BI dataset.

Build a historical view by year. The customer’s sales last year should show under the old region; this year under the new region. If both years show under the same region, the SCD wiring is broken.

45.13.8 Step 8 — Performance Check

In Power BI Desktop, View → Performance Analyzer:

  1. Start Recording.
  2. Refresh the report. Click each visual.
  3. Stop Recording.
  4. Inspect the recorded DAX query times.

Visuals on a properly designed star schema should render in well under 500 ms even on million-row data. If any visual is slow, the most common causes are bidirectional relationships, missing date-table mark, or measure DAX that scans the fact table multiple times.

45.13.9 Step 9 — Connect to the Visualisation Layer

The hands-on demonstrates the foundational truth of BI architecture:

  • A well-designed warehouse hands the BI tool clean fuel: star schema, conformed dimensions, marked date table, materialised aggregates.
  • The dashboard inherits the warehouse’s strengths and weaknesses: a flat-table source produces drift; a star-schema source produces consistency.
  • Performance, correctness, and maintainability all rest on the warehouse layer the audience never sees.

The visualisation principles from Module 2, the BI-tool techniques from Modules 3–4, and the warehouse design from this chapter together form the chain that produces a trustworthy dashboard. Break any link and the whole chain weakens.

Tip: Files and Screen Recordings

The DDL script (yuvijen_warehouse_ddl.sql), sample CSV data, the Power BI file (yuvijen-warehouse-report.pbix), and screen recordings of the schema setup, Type 2 SCD test, and Performance Analyzer review will be embedded here.


Summary

Foundations
  • Why Warehouse Design Matters for BI: The dashboards possible, their refresh speed, and the truths they tell rest on warehouse design

Schema Patterns for BI
  • Star Schema for BI: Fact at centre with denormalised dimensions; fastest for BI tools
  • Snowflake Schema for BI: Dimensions normalised across multiple tables; slower joins, slightly more flexible
  • Flat OBT: Fact and all dimensions denormalised into one wide table; convenient but drift-prone
  • Galaxy Schema: Multiple fact tables sharing dimensions; the multi-process scaling of star

Conformed Dimensions
  • Conformed Dimension: Same structure, keys, and meaning across every fact table that references it
  • Single Source of Truth Dimension: One canonical Date, Customer, Product dimension used everywhere
  • Stable Surrogate Keys: Never-reused integer surrogate independent of natural-key changes
  • Explicit Hierarchies: Year, Quarter, Month explicit columns on date; Country, State, City on geography
  • Sort-By Columns: Numeric column controlling sort order of a text column
  • Hide Technical Columns: Foreign keys, surrogates, audit columns hidden from BI tools

Slowly Changing Dimensions
  • SCD Type 1: Overwrite; old dashboards retroactively show today's values
  • SCD Type 2: Add row with effective dates; faithful historical reporting
  • SCD Type 3: Add column for current and previous; useful for limited-history comparisons

Aggregation Tables and Materialised Views
  • Aggregation Tables: Hand-built tables holding pre-aggregated data at the level dashboards need
  • Materialised Views: Database-managed precomputed query results refreshed on schedule or incrementally
  • Power BI Aggregations Feature: Native Power BI feature: define aggregation table, link to detail, mark as aggregation source

Indexing and Partitioning
  • Clustered Columnstore Indexes: Compress and reorder data column by column; optimal for BI's typical query pattern
  • Date Partitioning: Storing fact tables partitioned by year or month for filter pushdown
  • Clustering Keys: Snowflake-style ordering of micro-partitions by frequently filtered columns
  • Statistics Refresh: Stale stats produce bad query plans; schedule the refresh

The Semantic Layer
  • Semantic Layer: Curated metadata layer between warehouse and BI tool encoding business definitions
  • Power BI Dataset: Published dataset with measures, hierarchies, calculated columns; consumed by reports
  • Tableau Published Data Source: Published Tableau Data Source with calculated fields, hierarchies, sets, parameters
  • Looker LookML: Code-first semantic layer with version control and reusable views
  • dbt Semantic Layer: dbt-defined metrics consumed by BI tools via a query API
  • Cube Headless Semantic Layer: Open-source headless semantic layer for any BI tool

Data Marts versus Enterprise Warehouse
  • Inmon Enterprise Warehouse: Normalised central warehouse with subject-area data marts derived from it
  • Kimball Bus Architecture: Multiple subject-area data marts integrated through conformed dimensions

Cloud Warehouses for BI
  • Snowflake: Multi-cloud, separation of storage and compute, automatic micro-partitioning
  • Google BigQuery: Serverless, massive scale, pay-per-query, BI Engine for sub-second BI
  • Amazon Redshift: Tight AWS integration, materialized views, RA3 instances
  • Azure Synapse: Tight Power BI integration, dedicated and serverless SQL pools
  • Databricks SQL: Lakehouse architecture, Delta Lake, Photon engine for BI workloads
  • Microsoft Fabric Warehouse: Native Power BI integration via Direct Lake; OneLake-backed

Performance Tuning for BI
  • Star Schema Performance Rule: Star schema with single-direction relationships outperforms alternatives
  • Aggregations and Materialised Views: Pre-aggregate large facts; rely on aggregate-aware query routing
  • Date-Partitioned Facts: Filter pushdown reduces scan size dramatically
  • Cluster Keys on Filters: Cluster on commonly filtered columns to reduce scanned bytes
  • Avoid SELECT Star: Always project columns explicitly in BI tools and warehouse views
  • Pre-Compute Hierarchy Attributes: Pre-computed Year, Quarter, Month columns beat date functions in measures
  • Cache Layers: Power BI Aggregations + DirectQuery, Tableau extracts, Looker caching tier
  • Power BI Performance Tools: Marked date table, Aggregations, single-direction relationships, Performance Analyzer
  • Tableau Performance Tools: Extracts (Hyper), data source filters, context filters before LOD, minimal Custom SQL
  • Excel Power Pivot Performance: Star schema in Data Model, marked date table, hidden technical columns

Common Pitfalls
  • One Big Flat Table: Denormalising everything into a single wide table; loses governance and grows uncontrollably
  • No Conformed Dimensions: Every BI tool builds its own dimension; reports diverge silently
  • Type 1 Where Type 2 Needed: Historical dashboards retroactively show today's regional definitions
  • No Aggregations on Large Facts: Every dashboard scans hundreds of millions of rows for monthly totals
  • Bidirectional Relationships: Bidirectional filters cause slow queries and ambiguity
  • Stale Statistics: Stale optimiser statistics produce bad query plans
  • No Semantic Layer: Every dashboard re-implements metric definitions; numbers never agree
  • No Date Table Discipline: Missing or unmarked date table breaks time-intelligence DAX or table calcs
  • Over-Snowflaking: Normalising dimensions into many small tables; BI joins multiply, performance degrades
  • Pipeline Without Warehouse Discipline: Clean pipeline delivering data into a flat-table model; BI inherits the weakness