| Concept | Description |
|---|---|
| Foundations | |
| Why Warehouse Design Matters for BI | The dashboards you can build, their refresh speed, and the truths they tell all rest on warehouse design |
| Schema Patterns for BI | |
| Star Schema for BI | Fact at centre with denormalised dimensions; fastest for BI tools |
| Snowflake Schema for BI | Dimensions normalised across multiple tables; slower joins, slightly more flexible |
| Flat OBT | Fact and all dimensions denormalised into one wide table; convenient but drift-prone |
| Galaxy Schema | Multiple fact tables sharing dimensions; the star pattern scaled to multiple business processes |
| Conformed Dimensions | |
| Conformed Dimension | Same structure, keys, and meaning across every fact table that references it |
| Single Source of Truth Dimension | One canonical Date, Customer, Product dimension used everywhere |
| Stable Surrogate Keys | Never-reused integer surrogate independent of natural-key changes |
| Explicit Hierarchies | Year, Quarter, Month explicit columns on date; Country, State, City on geography |
| Sort-By Columns | Numeric column controlling sort order of a text column |
| Hide Technical Columns | Foreign keys, surrogates, audit columns hidden from BI tools |
| Slowly Changing Dimensions | |
| SCD Type 1 | Overwrite; old dashboards retroactively show today's values |
| SCD Type 2 | Add row with effective dates; faithful historical reporting |
| SCD Type 3 | Add column for current and previous; useful for limited-history comparisons |
| Aggregation Tables and Materialised Views | |
| Aggregation Tables | Hand-built tables holding pre-aggregated data at the level dashboards need |
| Materialised Views | Database-managed precomputed query results refreshed on schedule or incrementally |
| Power BI Aggregations Feature | Native Power BI feature: define aggregation table, link to detail, mark as aggregation source |
| Indexing and Partitioning | |
| Clustered Columnstore Indexes | Compress and reorder data column by column; optimal for BI's typical query pattern |
| Date Partitioning | Storing fact tables partitioned by year or month for filter pushdown |
| Clustering Keys | Snowflake-style ordering of micro-partitions by frequently filtered columns |
| Statistics Refresh | Stale stats produce bad query plans; schedule the refresh |
| The Semantic Layer | |
| Semantic Layer | Curated metadata layer between warehouse and BI tool encoding business definitions |
| Power BI Dataset | Published dataset with measures, hierarchies, calculated columns; consumed by reports |
| Tableau Published Data Source | Published Tableau Data Source with calculated fields, hierarchies, sets, parameters |
| Looker LookML | Code-first semantic layer with version control and reusable views |
| dbt Semantic Layer | dbt-defined metrics consumed by BI tools via a query API |
| Cube Headless Semantic Layer | Open-source headless semantic layer for any BI tool |
| Data Marts versus Enterprise Warehouse | |
| Inmon Enterprise Warehouse | Normalised central warehouse with subject-area data marts derived from it |
| Kimball Bus Architecture | Multiple subject-area data marts integrated through conformed dimensions |
| Cloud Warehouses for BI | |
| Snowflake Cloud Warehouse | Multi-cloud, separation of storage and compute, automatic micro-partitioning |
| Google BigQuery | Serverless, massive scale, pay-per-query, BI Engine for sub-second BI |
| Amazon Redshift | Tight AWS integration, materialized views, RA3 instances |
| Azure Synapse | Tight Power BI integration, dedicated and serverless SQL pools |
| Databricks SQL | Lakehouse architecture, Delta Lake, Photon engine for BI workloads |
| Microsoft Fabric Warehouse | Native Power BI integration via Direct Lake; OneLake-backed |
| Performance Tuning for BI | |
| Star Schema Performance Rule | Star schema with single-direction relationships outperforms alternatives |
| Aggregations and Materialised Views | Pre-aggregate large facts; rely on aggregate-aware query routing |
| Date-Partitioned Facts | Filter pushdown reduces scan size dramatically |
| Cluster Keys on Filters | Cluster on commonly filtered columns to reduce scanned bytes |
| Avoid SELECT Star | Always project columns explicitly in BI tools and warehouse views |
| Pre-Compute Hierarchy Attributes | Pre-computed Year, Quarter, Month columns beat date functions in measures |
| Cache Layers | Power BI Aggregations + DirectQuery, Tableau extracts, Looker caching tier |
| Power BI Performance Tools | Marked date table, Aggregations, single-direction relationships, Performance Analyzer |
| Tableau Performance Tools | Extracts (Hyper), data source filters, context filters before LOD, minimal Custom SQL |
| Excel Power Pivot Performance | Star schema in Data Model, marked date table, hidden technical columns |
| Common Pitfalls | |
| One Big Flat Table | Pitfall of denormalising into a single wide table; loses governance and grows uncontrollably |
| No Conformed Dimensions | Pitfall of every BI tool building its own dimension; reports diverge silently |
| Type 1 Where Type 2 Needed | Pitfall of historical dashboards retroactively showing today's regional definitions |
| No Aggregations on Large Facts | Pitfall of every dashboard scanning hundreds of millions of rows for monthly totals |
| Bidirectional Relationships | Pitfall of bidirectional filters causing slow queries and ambiguity |
| Stale Statistics | Pitfall of stale optimiser statistics producing bad query plans |
| No Semantic Layer | Pitfall of every dashboard re-implementing metric definitions; numbers never agree |
| No Date Table Discipline | Pitfall of missing or unmarked date table breaking time-intelligence DAX or table calcs |
| Over-Snowflaking | Pitfall of normalising dimensions into many small tables; BI joins multiply, performance degrades |
| Pipeline Without Warehouse Discipline | Pitfall of clean pipeline delivering data into a flat-table model; BI inherits the weakness |
45 Data Warehousing Concepts for BI
45.1 Why Warehouse Design Matters for BI
The dashboards you can build, the speed they refresh at, and the truths they tell are all governed by the warehouse design under them.
Chapter 25 introduced the data warehouse and the data lake as analytical-layer concepts. This chapter narrows in on the warehouse decisions that specifically shape what BI tools — Power BI, Tableau, Excel — can do well.
A poorly designed warehouse forces every dashboard to do its own joins, write its own measures, and reconcile its own definitions. A well-designed warehouse hands the BI tool clean star schemas, conformed dimensions, materialised aggregates, and a stable semantic layer — and the dashboards built on top become fast, consistent, and easy to maintain.
The standard references for dimensional warehouse design are The Data Warehouse Toolkit by Ralph Kimball & Margy Ross (2013) (cited in Chapter 25) and Star Schema: The Complete Reference by Christopher Adamson (2010), which together codify the patterns BI tools were designed around.
For a visualisation-focused book, this chapter is where the analyst’s choices upstream (warehouse design, indexes, semantic layer) determine what the visualisation downstream is allowed to be.
45.2 The Star Schema, Revisited for BI
The single most consequential warehouse decision for BI is the star schema: one fact table at the centre, dimension tables radiating outward, related by single-key foreign-key relationships.
Power BI, Tableau, Looker, and Qlik are all engineered around this shape. Their query engines are fastest against a star, their relationships are cleanest against a star, and their measures are most reliable against a star. Any deviation (snowflakes, flattened denormalised tables, many-to-many relationships) adds friction to every dashboard built on top.
| Pattern | What It Looks Like | BI Performance | Maintainability |
|---|---|---|---|
| Star Schema | Fact + denormalised dimensions | Fastest | Best |
| Snowflake Schema | Fact + dimensions normalised across multiple tables | Slower joins | Slightly more flexible |
| Flat / OBT (one big table) | Fact and all dimensions denormalised into one wide table | Variable | Worst — definitions drift |
| Galaxy / Constellation | Multiple fact tables sharing dimensions | Same as star, scaled | Best for multi-process |
For most BI workloads, star is the answer. Snowflake makes sense only when a dimension is genuinely huge and rarely queried at full grain. Flat-table modelling is convenient for prototypes but accumulates definitional drift quickly.
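The star's query pattern can be sketched in miniature. The sketch below uses SQLite via Python; the table names and sample rows are illustrative, not part of any schema used elsewhere in the chapter. Every BI visual ultimately compiles to this shape: join from the fact outward to each dimension, then group by dimension attributes.

```python
import sqlite3

# In-memory warehouse with one fact and two dimensions (the star shape).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_sk INT PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_sk INT PRIMARY KEY, category TEXT);
CREATE TABLE f_sales     (date_sk INT, product_sk INT, amount REAL);

INSERT INTO dim_date    VALUES (20240115, 2024, 1), (20240220, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Furniture'), (2, 'Technology');
INSERT INTO f_sales     VALUES (20240115, 1, 100.0), (20240115, 2, 250.0),
                               (20240220, 1, 75.0);
""")

# The canonical BI query: fact joined to dimensions, grouped by attributes.
rows = con.execute("""
    SELECT d.year, d.month, p.category, SUM(f.amount) AS total
    FROM f_sales f
    JOIN dim_date d    ON f.date_sk = d.date_sk
    JOIN dim_product p ON f.product_sk = p.product_sk
    GROUP BY d.year, d.month, p.category
    ORDER BY d.year, d.month, p.category
""").fetchall()
for r in rows:
    print(r)
```

Each join is single-column, many-to-one, and unambiguous, which is exactly why BI engines optimise for this shape.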
45.3 Conformed Dimensions
A Conformed Dimension is a dimension table that has the same structure, the same keys, and the same meaning across every fact table that references it. The standard conformed dimensions in any analytics programme: Date, Customer, Product, Region, Employee.
Conformed dimensions are what allow the marketing dashboard, the finance dashboard, and the operations dashboard to agree on basic facts — same customer count, same regional definition, same date hierarchy. Without conformance, every dashboard reproduces its own slightly different version of the same dimension and the firm’s reports diverge.
Practical disciplines for conformed dimensions:
- Single source of truth: One canonical Date table, one canonical Customer dimension, used everywhere.
- Stable surrogate keys: Each row carries a never-reused integer surrogate key independent of the natural-key value (which can change).
- Hierarchies modelled explicitly: Date carries Year, Quarter, Month, Week, Day; Geography carries Country, State, City, Branch; Product carries Category, Subcategory, Product.
- Sort-by columns: For text values that need a non-alphabetical order (e.g., month names sorted by month number).
- Hide technical columns: Foreign keys, surrogate keys, audit columns hidden from BI tools so users see only business-meaningful attributes.
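The sort-by discipline is easy to see in miniature. A minimal sketch, using SQLite via Python and an illustrative `dim_date` with a `month_name` text column paired with a numeric `month` sort-by column:

```python
import sqlite3

# A sort-by column in miniature: month_name is the label users see,
# month is the hidden numeric column that controls its order.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_date (date_sk INT, month INT, month_name TEXT)")
con.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, 1, 'January'), (2, 4, 'April'),
                 (3, 8, 'August'), (4, 12, 'December')])

# Alphabetical order puts April before January -- wrong for a report axis.
alpha = [r[0] for r in con.execute(
    "SELECT month_name FROM dim_date ORDER BY month_name")]

# Sorting by the numeric column restores calendar order.
by_number = [r[0] for r in con.execute(
    "SELECT month_name FROM dim_date ORDER BY month")]

print(alpha)
print(by_number)
```

In Power BI the same pairing is wired up with Sort by Column; in the warehouse it simply means shipping both columns in the dimension.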
45.4 Slowly Changing Dimensions for Historical Reporting
Dimensions change over time. The way the warehouse handles change determines whether historical dashboards report the current state of every entity or its state as it was at the historical date.
The classification (introduced in Chapter 25) applies directly to BI:
- Type 1 (Overwrite): The current state replaces the historical. Old dashboards retroactively show today’s values. Suitable when historical accuracy of an attribute does not matter.
- Type 2 (Add Row): Each change adds a new row with `EffectiveFrom` and `EffectiveTo` dates. The fact table's foreign key carries the historical surrogate, so dashboards faithfully reproduce historical state.
- Type 3 (Add Column): Two columns, `Current Value` and `Previous Value`. Useful when the audience wants to compare both.
The single most consequential choice: whether to use Type 2 on the customer and product dimensions. Without it, a dashboard of Sales by Region last year may silently apply this year’s regional definitions, misleading the audience.
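How Type 2 effective-dating resolves the right surrogate key at fact-load time can be sketched with SQLite via Python; the table, columns, and dates below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_sk INT, customer_id TEXT, region TEXT,
    effective_from TEXT, effective_to TEXT, is_current INT
);
-- Customer C1 moved from North to South on 2024-07-01: two Type 2 rows.
INSERT INTO dim_customer VALUES
    (1, 'C1', 'North', '2020-01-01', '2024-06-30', 0),
    (2, 'C1', 'South', '2024-07-01', NULL, 1);
""")

def surrogate_for(customer_id, txn_date):
    # Pick the dimension row whose effective window covers the transaction
    # date (ISO date strings compare correctly as text).
    row = con.execute("""
        SELECT customer_sk FROM dim_customer
        WHERE customer_id = ?
          AND effective_from <= ?
          AND (effective_to IS NULL OR ? <= effective_to)
    """, (customer_id, txn_date, txn_date)).fetchone()
    return row[0]

print(surrogate_for('C1', '2024-03-15'))  # resolves to the historical row
print(surrogate_for('C1', '2024-09-01'))  # resolves to the current row
```

Because the fact stores the surrogate that was current at transaction time, last year's sales stay attached to last year's region with no extra logic in the dashboard.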
45.5 Aggregation Tables and Materialised Views
Most BI dashboards summarise transactions into months, quarters, or years. Forcing the BI tool to aggregate billions of detail rows on every dashboard refresh is wasteful. Aggregation tables and materialised views precompute these summaries.
Two patterns:
- Aggregation Tables: Hand-built tables holding pre-aggregated data at the level the dashboard needs (e.g., `monthly_sales_by_region`). The BI tool either queries the aggregation directly or, better, uses aggregate awareness to route queries to the appropriate level automatically.
- Materialised Views: Database-managed precomputed query results. Snowflake's Materialized Views, BigQuery's Materialized Views, and Databricks' Delta Live Tables all serve this pattern. The database refreshes the view on a schedule or incrementally.
Power BI supports Aggregations as a native feature: define an aggregated table, link it to the detail table, mark it as the aggregation source. Power BI’s query engine automatically uses the aggregation when the visual’s grain matches.
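The payoff of an aggregation table can be demonstrated in miniature. A sketch using SQLite via Python, with illustrative table names and randomly generated sample data; the point is that the pre-aggregated table answers the same question from a handful of rows:

```python
import sqlite3
import random

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE f_sales (date_sk INT, region TEXT, amount REAL)")

# 1,000 detail rows across two months and two regions.
random.seed(0)
date_sks = [20240100 + d for d in range(1, 29)] + [20240200 + d for d in range(1, 29)]
rows = [(random.choice(date_sks), random.choice(['North', 'South']),
         round(random.uniform(10, 100), 2)) for _ in range(1000)]
con.executemany("INSERT INTO f_sales VALUES (?, ?, ?)", rows)

# Pre-aggregate once, at the grain the dashboards actually query.
con.execute("""
    CREATE TABLE monthly_sales_by_region AS
    SELECT date_sk / 100 AS month_sk, region, SUM(amount) AS amount
    FROM f_sales
    GROUP BY date_sk / 100, region
""")

detail_total = con.execute("SELECT SUM(amount) FROM f_sales").fetchone()[0]
agg_total = con.execute("SELECT SUM(amount) FROM monthly_sales_by_region").fetchone()[0]
agg_rows = con.execute("SELECT COUNT(*) FROM monthly_sales_by_region").fetchone()[0]

# Same answer from 4 pre-aggregated rows instead of 1,000 detail rows.
print(agg_rows, abs(detail_total - agg_total) < 1e-6)
```

Aggregate-aware tools make this routing automatic; hand-built aggregations require the dashboard (or the semantic layer) to pick the right table.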
45.6 Indexing and Partitioning for BI Workloads
Database-level performance features that materially affect BI:
Clustered columnstore indexes (SQL Server, Synapse, Redshift, Snowflake’s micro-partitions): Compress and reorder data column by column. Optimal for BI’s typical query pattern (aggregating few columns over many rows).
Partitioning by date: Storing fact tables partitioned by year or month. Queries filtering by date scan only the relevant partitions. Reduces query time substantially for time-bounded BI views.
Clustering keys: Snowflake-style ordering of micro-partitions by frequently-filtered columns; reduces the data scanned.
Statistics: The database’s understanding of data distribution drives the optimiser; stale statistics produce bad query plans.
For BI workloads, the dominant pattern is clustered columnstore on the fact table, partitioned by date, with statistics refreshed regularly.
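Partition pruning can be illustrated conceptually, outside any particular warehouse. The sketch below is a hypothetical model, not a real engine: the fact is bucketed by a year-month partition key, and a month filter touches only the matching bucket:

```python
from collections import defaultdict

# A hypothetical date-partitioned fact: rows bucketed by year-month key.
partitions = defaultdict(list)
facts = [("2024-01-05", 100.0), ("2024-01-20", 50.0),
         ("2024-02-11", 75.0), ("2024-03-02", 30.0)]
for day, amount in facts:
    partitions[day[:7]].append((day, amount))  # partition key = 'YYYY-MM'

def total_for_month(month):
    # Filter pushdown in miniature: only the matching partition is read;
    # the other partitions are never scanned at all.
    scanned = partitions[month]
    print(f"scanned {len(scanned)} of {len(facts)} rows")
    return sum(amount for _, amount in scanned)

result = total_for_month("2024-01")
print(result)
```

Real warehouses do the same thing with partition metadata (or Snowflake micro-partition min/max statistics): a date predicate eliminates whole storage units before any data is read.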
45.7 The Semantic Layer
A Semantic Layer is a curated metadata layer between the warehouse and the BI tool. It encodes business definitions — what Active Customer means, how Revenue is computed, what the Region hierarchy looks like — so that every dashboard built on top inherits the same definitions.
| Tool | Form |
|---|---|
| Power BI | Published Datasets with measures, hierarchies, calculated columns; consumed by reports |
| Tableau | Published Data Sources with calculated fields, hierarchies, sets, and parameters |
| Looker | LookML — code-first semantic layer with version control and reusable views |
| dbt Semantic Layer | dbt-defined metrics consumed by BI tools via a query API |
| Cube.dev / Cube.js | Open-source headless semantic layer for any BI tool |
| AtScale, Kyligence | Enterprise semantic-layer platforms |
The semantic layer is the operational form of the certified data source idea from Chapter 30. Multiple dashboards consuming the same semantic layer agree by construction; dashboards built bypassing it diverge.
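The core idea can be reduced to a toy illustration, which is not any vendor's API: metric formulas defined once as data, with every consumer generating its query from the shared definition rather than re-implementing it:

```python
import sqlite3

# A toy semantic layer: each metric's formula is defined exactly once,
# and every consumer generates its SQL from the shared definition.
METRICS = {
    "revenue": "SUM(amount)",
    "order_count": "COUNT(DISTINCT order_id)",
}

def metric_query(metric, group_by):
    # All dashboards asking for 'revenue' get the same formula by construction.
    return (f"SELECT {group_by}, {METRICS[metric]} AS {metric} "
            f"FROM f_sales GROUP BY {group_by} ORDER BY {group_by}")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE f_sales (order_id TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO f_sales VALUES (?, ?, ?)",
                [("o1", "North", 100.0), ("o1", "North", 50.0),
                 ("o2", "South", 75.0)])

revenue = con.execute(metric_query("revenue", "region")).fetchall()
orders = con.execute(metric_query("order_count", "region")).fetchall()
print(revenue)
print(orders)
```

LookML, the dbt Semantic Layer, and Cube are industrial versions of the same move: the definition lives in one governed place and the SQL is generated, not hand-written per dashboard.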
45.8 Data Marts versus the Enterprise Warehouse
A long-running tension in warehouse design:
- Inmon’s Enterprise Data Warehouse: A normalised central warehouse holding all corporate data, with downstream data marts derived from it for specific subjects (Marketing, Finance, Operations).
- Kimball’s Bus Architecture: Multiple subject-area data marts integrated through conformed dimensions; no central normalised warehouse required.
For BI work, Kimball’s bus architecture is usually the more practical pattern: build a Marketing Mart, Finance Mart, Operations Mart, each as a star schema, sharing conformed Date, Customer, and Product dimensions. The BI tool consumes the marts; the marts agree by virtue of their shared dimensions.
Modern cloud warehouses (Snowflake, BigQuery, Synapse) blur the distinction — the entire warehouse can be one logical container with marts realised as schemas or views — but the conformed-dimension discipline remains essential.
45.9 Cloud Warehouses for BI
| Warehouse | BI-Relevant Strengths |
|---|---|
| Snowflake | Multi-cloud, separation of storage and compute, automatic micro-partitioning, native semi-structured support |
| Google BigQuery | Serverless, massive scale, pay-per-query, BigQuery BI Engine for sub-second BI |
| Amazon Redshift | Tight AWS integration, materialized views, RA3 instances with managed storage |
| Azure Synapse Analytics | Tight Power BI integration, dedicated SQL pools, serverless SQL pools |
| Databricks SQL | Lakehouse architecture, Delta Lake, Photon engine for BI workloads |
| Microsoft Fabric Warehouse | Native Power BI integration via Direct Lake; OneLake-backed |
The choice often follows the firm’s existing cloud commitments. For Microsoft-heavy enterprises, Synapse / Fabric is the natural fit; for AWS, Redshift; for multi-cloud or independent stacks, Snowflake; for ML-heavy environments, Databricks.
45.10 Performance Tuning for BI Workloads
A short list of optimisations that move BI dashboards from sluggish to instant:
- Star schema, single-direction relationships: Already covered.
- Aggregation tables / materialised views: For queries that summarise large fact tables.
- Date-partitioned fact tables: Filter pushdown reduces scan size.
- Cluster keys on commonly filtered columns: Snowflake-style clustering.
- Statistics refresh: Schedule it; stale stats produce bad plans.
- Avoid SELECT *: Both in BI tools and in warehouse views; always project explicitly.
- Pre-compute hierarchy attributes: Year, Quarter, Month columns on the date table beat date functions in measures.
- Cache layers: Power BI’s Aggregations + DirectQuery, Tableau’s extracts, Looker’s caching tier all reduce warehouse load.
For specific tools:
- Power BI: Marked date table, Aggregations feature, single-direction relationships, explicit measures, Performance Analyzer.
- Tableau: Extracts (Hyper engine), data source filters, context filters before LOD, custom SQL kept minimal.
- Excel + Power Pivot: Star schema in the Data Model, marked date table, hidden technical columns.
45.11 Common Pitfalls
- One Big Flat Table: Denormalised everything into a single wide table; loses governance and grows uncontrollably.
- No Conformed Dimensions: Every BI tool builds its own definition of Customer; reports diverge silently.
- Type 1 Where Type 2 Is Needed: Historical dashboards retroactively show today’s regional definitions.
- No Aggregation Tables on Large Facts: Every dashboard scans 500 million rows to compute monthly totals; performance suffers.
- Bidirectional Relationships in BI Models: Slow queries, ambiguity, hard to debug.
- Stale Statistics: Database optimiser picks bad plans; queries that ran in seconds now take minutes.
- No Semantic Layer: Every dashboard re-implements the same metric definitions; the firm’s numbers never quite agree.
- No Date Table Discipline: Time-intelligence in Power BI / Tableau breaks because the date table is missing or unmarked.
- Over-Snowflaking: Normalising dimensions into many small tables; BI joins multiply, performance degrades.
- Pipeline Without Warehouse Discipline: Pipelines deliver clean data but into a flat-table model; the BI tool inherits the weakness.
45.12 Illustrative Cases
A Conformed-Dimension Save
A retail bank has separate customer dimensions in its marketing, risk, and finance marts. Customer counts disagree across dashboards by 3–5 %. The data team builds a conformed customer dimension, populates it with surrogate keys, and migrates each mart’s fact tables to reference it. After the migration, dashboards across the three functions agree on every basic fact for the first time in years.
A Power BI Aggregation That Made Reports Instant
A retailer’s Power BI dashboards on a 2-billion-row fact table take 8–15 seconds per visual. The team builds a monthly aggregation table at one-thousandth the size, marks it as a Power BI Aggregation, and links it to the fact. Subsequent dashboard interactions render in 200 ms. The user experience improves dramatically with no change to the visualisations.
A Type 2 SCD for Honest Historical Reporting
A telecom operator’s regional restructuring renames several regions. Type 1 SCDs would silently overwrite the old region names; historical dashboards would retroactively claim sales attributed to regions that did not exist at the time. The team migrates the customer and store dimensions to Type 2 with effective-dating; historical reports remain accurate and the new structure flows forward correctly.
45.13 Hands-On Exercise: Designing a BI-Ready Warehouse Schema
Aim: Design a star-schema warehouse for Yuvijen Stores in a cloud warehouse (Snowflake, BigQuery, Synapse, or even SQLite for offline practice), populate it from CSV sources, then connect Power BI to it and verify performance.
Deliverable: SQL DDL for the warehouse tables, populated tables, and a Power BI report consuming the warehouse with two DAX measures.
45.13.1 Step 1 — Design the Schema
Plan the star schema:
- Fact table `f_sales`: order_id, customer_sk, product_sk, store_sk, date_sk, quantity, amount, profit.
- Dimension `dim_customer`: customer_sk (surrogate), customer_id (natural key), name, segment, region_sk, effective_from, effective_to, is_current.
- Dimension `dim_product`: product_sk, product_id, name, subcategory, category.
- Dimension `dim_store`: store_sk, store_id, name, region_sk.
- Dimension `dim_region`: region_sk, region_code, region_name, country.
- Dimension `dim_date`: date_sk, date, year, quarter, month, day, day_name, is_weekend, fiscal_year.
Note the Type 2 SCD columns on `dim_customer` (effective_from, effective_to, is_current). A customer can change region over time; historical sales must match the historical region.
45.13.2 Step 2 — Write the DDL
```sql
CREATE TABLE dim_date (
    date_sk INT PRIMARY KEY,
    date DATE NOT NULL UNIQUE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_name VARCHAR(10),
    is_weekend BOOLEAN,
    fiscal_year INT
);

CREATE TABLE dim_region (
    region_sk INT PRIMARY KEY,
    region_code VARCHAR(10) UNIQUE,
    region_name VARCHAR(50),
    country VARCHAR(50)
);

CREATE TABLE dim_customer (
    customer_sk INT PRIMARY KEY,
    customer_id VARCHAR(20),
    name VARCHAR(200),
    segment VARCHAR(20),
    region_sk INT REFERENCES dim_region(region_sk),
    effective_from DATE,
    effective_to DATE,
    is_current BOOLEAN
);

CREATE TABLE dim_product (
    product_sk INT PRIMARY KEY,
    product_id VARCHAR(20),
    name VARCHAR(200),
    subcategory VARCHAR(50),
    category VARCHAR(50)
);

CREATE TABLE dim_store (
    store_sk INT PRIMARY KEY,
    store_id VARCHAR(20),
    name VARCHAR(200),
    region_sk INT REFERENCES dim_region(region_sk)
);

CREATE TABLE f_sales (
    order_id VARCHAR(20),
    customer_sk INT REFERENCES dim_customer(customer_sk),
    product_sk INT REFERENCES dim_product(product_sk),
    store_sk INT REFERENCES dim_store(store_sk),
    date_sk INT REFERENCES dim_date(date_sk),
    quantity INT,
    amount DECIMAL(18, 2),
    profit DECIMAL(18, 2)
);
```

Save the script as `yuvijen_warehouse_ddl.sql`. Execute it against your chosen warehouse.
45.13.3 Step 3 — Populate the Date Dimension
```sql
-- Sample for SQLite (Snowflake / BigQuery / Synapse have their own
-- sequence / date-generation alternatives).
WITH RECURSIVE dates(date) AS (
    SELECT DATE('2023-04-01')
    UNION ALL
    SELECT DATE(date, '+1 day') FROM dates WHERE date < '2026-12-31'
)
INSERT INTO dim_date (date_sk, date, year, quarter, month, day, day_name, is_weekend, fiscal_year)
SELECT
    CAST(strftime('%Y%m%d', date) AS INT) AS date_sk,
    date,
    CAST(strftime('%Y', date) AS INT) AS year,
    ((CAST(strftime('%m', date) AS INT) - 1) / 3) + 1 AS quarter,
    CAST(strftime('%m', date) AS INT) AS month,
    CAST(strftime('%d', date) AS INT) AS day,
    CASE strftime('%w', date)
        WHEN '0' THEN 'Sunday'
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
    END AS day_name,
    strftime('%w', date) IN ('0', '6') AS is_weekend,
    CASE WHEN CAST(strftime('%m', date) AS INT) >= 4
         THEN CAST(strftime('%Y', date) AS INT)
         ELSE CAST(strftime('%Y', date) AS INT) - 1
    END AS fiscal_year
FROM dates;
```

Most cloud warehouses have native date-table generators. The principle is the same: one row per date, with all the calendar attributes pre-computed.
45.13.4 Step 4 — Populate the Other Dimensions and the Fact
Use Power Query, dbt, or warehouse-native COPY to load the source CSVs into the dimensions and the fact table. The pattern from Chapter 35 (Power Query pipelines) and Chapter 44 (data pipelines) applies.
For Type 2 customer dimension, the ETL must:
- Detect changed rows (region update for an existing customer).
- Set `effective_to = current_date` and `is_current = false` on the existing row.
- Insert a new row with `effective_from = current_date`, `effective_to = NULL`, `is_current = true`.
dbt’s snapshot feature automates this pattern.
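The Type 2 transition itself is two statements: close the current row, insert the new version. A sketch using SQLite via Python, with the schema simplified to just the SCD columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY, customer_id TEXT, region_sk INT,
    effective_from TEXT, effective_to TEXT, is_current INT
);
INSERT INTO dim_customer VALUES (1, 'C1', 10, '2020-01-01', NULL, 1);
""")

def scd2_update(customer_id, new_region_sk, change_date):
    # Step 1: close out the current row.
    con.execute("""
        UPDATE dim_customer SET effective_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (change_date, customer_id))
    # Step 2: insert the new version with an open-ended window
    # (customer_sk is auto-assigned, so the surrogate is never reused).
    con.execute("""
        INSERT INTO dim_customer
            (customer_id, region_sk, effective_from, effective_to, is_current)
        VALUES (?, ?, ?, NULL, 1)
    """, (customer_id, new_region_sk, change_date))

scd2_update('C1', 20, '2024-07-01')
rows = con.execute("SELECT * FROM dim_customer ORDER BY customer_sk").fetchall()
for row in rows:
    print(row)
```

Production ETL adds change detection on top (compare incoming attributes against the current row) and wraps both statements in one transaction; dbt snapshots generate exactly this pattern.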
45.13.5 Step 5 — Connect Power BI
- In Power BI Desktop, Get Data → SQL Server / Snowflake / BigQuery / etc., point at the warehouse.
- Select the six tables.
- Choose Import mode (or DirectQuery for very large data with a fast warehouse).
- In the Model view, verify the relationships match the foreign keys; cardinality many-to-one, single-direction.
- Mark `dim_date` as a Date Table (Modeling → Mark as Date Table).
- Hide foreign-key and surrogate-key columns from the Fields pane.
45.13.6 Step 6 — Build Two DAX Measures
In a Measures table:
```dax
Total Sales = SUM(f_sales[amount])
Sales LY = CALCULATE([Total Sales], SAMEPERIODLASTYEAR('dim_date'[date]))
```
Build a small report exercising the model: a line chart of Total Sales vs Sales LY by month, a bar chart by `dim_region[region_name]`, a slicer on `dim_customer[segment]`.
45.13.7 Step 7 — Test SCD Type 2 Honesty
In the warehouse, manually update one customer’s region (a Type 2 transition):
- Set the existing customer row's `effective_to` to today and `is_current` to false.
- Insert a new row for the same `customer_id` with the new `region_sk`, `effective_from = today`, `is_current = true`.
- Refresh the Power BI dataset.
Build a historical view by year. The customer’s sales last year should show under the old region; this year under the new region. If both years show under the same region, the SCD wiring is broken.
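The expected result of a healthy Type 2 setup can be checked with a query sketch (SQLite via Python; two illustrative dimension rows stand in for the pre- and post-change versions of the customer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_sk INT, customer_id TEXT, region TEXT);
CREATE TABLE f_sales (customer_sk INT, year INT, amount REAL);
-- Two Type 2 versions of the same customer; each fact references the
-- version that was current when the sale happened.
INSERT INTO dim_customer VALUES (1, 'C1', 'Old Region'), (2, 'C1', 'New Region');
INSERT INTO f_sales VALUES (1, 2023, 500.0), (2, 2024, 800.0);
""")

rows = con.execute("""
    SELECT f.year, c.region, SUM(f.amount)
    FROM f_sales f
    JOIN dim_customer c ON f.customer_sk = c.customer_sk
    GROUP BY f.year, c.region
    ORDER BY f.year
""").fetchall()
print(rows)  # last year under the old region, this year under the new
```

If both years land under the same region, the fact rows are referencing only the current dimension version, i.e. the SCD wiring is broken.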
45.13.8 Step 8 — Performance Check
In Power BI Desktop, View → Performance Analyzer:
- Start Recording.
- Refresh the report. Click each visual.
- Stop Recording.
- Inspect the recorded DAX query times.
Visuals on a properly designed star schema should render in well under 500 ms even on million-row data. If any visual is slow, the most common causes are bidirectional relationships, missing date-table mark, or measure DAX that scans the fact table multiple times.
45.13.9 Step 9 — Connect to the Visualisation Layer
The hands-on demonstrates the foundational truth of BI architecture:
- A well-designed warehouse hands the BI tool clean fuel: star schema, conformed dimensions, marked date table, materialised aggregates.
- The dashboard inherits the warehouse’s strengths and weaknesses: a flat-table source produces drift; a star-schema source produces consistency.
- Performance, correctness, and maintainability all rest on the warehouse layer the audience never sees.
The visualisation principles from Module 2, the BI-tool techniques from Modules 3–4, and the warehouse design from this chapter together form the chain that produces a trustworthy dashboard. Break any link and the whole chain weakens.
The DDL script (yuvijen_warehouse_ddl.sql), sample CSV data, the Power BI file (yuvijen-warehouse-report.pbix), and screen recordings of the schema setup, Type 2 SCD test, and Performance Analyzer review will be embedded here.