```mermaid
flowchart LR
    subgraph ETL["ETL"]
        S["Sources"] --> E["Extract"]
        E --> T1["Transform<br>(in staging)"]
        T1 --> L1["Load<br>(into warehouse)"]
        L1 --> A["Analytical<br>Store"]
    end
    subgraph ELT["ELT"]
        S2["Sources"] --> E2["Extract"]
        E2 --> L2["Load<br>(raw to warehouse)"]
        L2 --> T2["Transform<br>(in warehouse)"]
        T2 --> A2["Analytical<br>Store"]
    end
    style S fill:#fce4ec,stroke:#AD1457
    style E fill:#fff3e0,stroke:#EF6C00
    style T1 fill:#fff8e1,stroke:#F9A825
    style L1 fill:#e3f2fd,stroke:#1976D2
    style A fill:#e8f5e9,stroke:#388E3C
    style S2 fill:#fce4ec,stroke:#AD1457
    style E2 fill:#fff3e0,stroke:#EF6C00
    style L2 fill:#e3f2fd,stroke:#1976D2
    style T2 fill:#fff8e1,stroke:#F9A825
    style A2 fill:#e8f5e9,stroke:#388E3C
```
44 ETL Processes and Data Pipelines
44.1 Why ETL and Data Pipelines Matter
Every dashboard rests on a pipeline; the dashboard is only as fresh, accurate, and reliable as the pipeline that feeds it.
A Power BI report or a Tableau dashboard is the visible tip of an iceberg. Underneath sits a data pipeline — the orchestrated sequence of steps that extracts data from sources, transforms it into analytical shape, and loads it to the analytical store the dashboard reads from. When the pipeline runs reliably, the dashboard stays trustworthy. When the pipeline fails — silently or loudly — the dashboard misleads.
The classical reference is The Data Warehouse ETL Toolkit by Ralph Kimball & Joe Caserta (2004), which codified the patterns most modern ETL tools still implement. The contemporary reference is Fundamentals of Data Engineering by Joe Reis & Matt Housley (2022), which sets ETL in the broader data-engineering lifecycle.
For a visualisation-focused book, this chapter is the bridge between the visualisation tools (Power BI, Tableau, Excel) and the source systems they consume. The pipeline is invisible to the audience; the dashboard’s quality is not.
44.2 ETL versus ELT versus ETLT
| Pattern | Idea | When to Use |
|---|---|---|
| ETL (Extract → Transform → Load) | Transform in a staging environment before loading the warehouse | Traditional warehouses with limited compute; on-prem SQL Server, Oracle |
| ELT (Extract → Load → Transform) | Load raw to the warehouse, transform inside | Modern cloud warehouses with elastic compute (Snowflake, BigQuery, Redshift, Databricks) |
| ETLT (Extract → Light Transform → Load → More Transform) | Lightweight cleaning before load, deeper modelling after | Hybrid pattern when source-side cleaning improves load speed |
The shift from ETL to ELT is the defining architectural trend of the past decade. Cheap cloud storage and elastic warehouse compute moved the transformation work from custom ETL servers into the warehouse itself, expressed in SQL — typically through tools like dbt.
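The difference is easiest to see in miniature. The sketch below is a hedged illustration of the ELT order of operations, using `sqlite3` as a stand-in for a cloud warehouse; the table and column names (`raw_orders`, `stg_orders`) are invented for the example, not taken from any real system:

```python
# Minimal ELT sketch: load the raw extract first, then transform
# inside the "warehouse" with SQL (the dbt-style step).
# sqlite3 stands in for a cloud warehouse; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")

# L: load raw rows exactly as extracted -- no cleaning yet
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("A1", " 100.50 "), ("A2", "75"), ("A2", "75")],  # padding and a duplicate
)

# T: transform in-warehouse with SQL -- trim, cast, deduplicate
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT DISTINCT TRIM(order_id) AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
""")

rows = conn.execute(
    "SELECT order_id, amount FROM stg_orders ORDER BY order_id"
).fetchall()
print(rows)  # deduplicated, typed rows: [('A1', 100.5), ('A2', 75.0)]
```

In classic ETL the `TRIM`/`CAST`/`DISTINCT` work would happen on a staging server before the warehouse ever saw the data; in ELT the raw table is kept and the cleaned table is derived from it inside the warehouse, which also preserves an auditable copy of the source.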
44.3 Components of a Data Pipeline
A typical pipeline has seven components:
- Sources — where the data starts: operational databases, SaaS APIs, file shares, event streams.
- Extraction Layer — connectors that pull data from sources, with credential management and rate-limit handling.
- Staging Area — a holding ground for raw extracted data, often a data lake zone (the “bronze” layer).
- Transformation Engine — the code or visual tool that cleans, joins, and shapes the data into analytical form.
- Target Store — the warehouse, lake, or lakehouse the dashboard reads from.
- Orchestrator — the scheduler and dependency manager that runs the pipeline reliably.
- Monitoring and Alerting — observability for failures, latency, data-quality regressions.
A complete pipeline carries all seven. Skipping any one of them is among the most common causes of pipeline incidents.
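The orchestrator's core job is to run these components in dependency order. A toy sketch of that idea, using Python's standard-library `graphlib` (the step names are illustrative, not from any real orchestrator):

```python
# Toy orchestrator: resolve pipeline steps into a dependency-ordered
# run list, the way Airflow or Dagster schedules a DAG.
from graphlib import TopologicalSorter

# each step maps to the set of steps it depends on
dag = {
    "extract": set(),
    "stage": {"extract"},
    "transform": {"stage"},
    "load": {"transform"},
    "notify": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'stage', 'transform', 'load', 'notify']
```

Real orchestrators add scheduling, retries, and alerting on top of this ordering, but the dependency graph is the heart of the component.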
44.4 Batch versus Streaming
Pipelines run in one of two modes (or a hybrid):
Batch: Runs on a schedule (hourly, daily, weekly). Processes accumulated data since the last run. Simpler to build, debug, and retry. Suitable for most analytical workloads.
Streaming: Processes events as they arrive, often within seconds. Required for operational dashboards (logistics control towers, fraud detection), IoT telemetry, real-time personalisation. More complex infrastructure (Kafka, Flink, Spark Streaming, Kinesis).
Lambda Architecture: Parallel batch and streaming pipelines feeding the same analytical store; queries reconcile both layers.
Kappa Architecture: Streaming-only pipeline, with batch jobs treated as a special case of streams. Simpler than Lambda but demands a mature streaming platform.
For most BI dashboards (executive review, weekly performance, monthly board pack), batch is sufficient. Streaming is reserved for the small fraction of dashboards that genuinely need sub-minute freshness.
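Batch's "accumulated data since the last run" is usually implemented as a high-water mark. A minimal sketch, with invented sample rows and timestamps:

```python
# High-water-mark batching: each run processes only rows that arrived
# after the previous successful run, then records a new watermark.
# Data and timestamps are illustrative.
from datetime import datetime

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 6, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 6, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 6, 0)},
]

# watermark persisted by the previous successful run
last_watermark = datetime(2024, 1, 1, 12, 0)

batch = [r for r in rows if r["updated_at"] > last_watermark]
new_watermark = max(r["updated_at"] for r in batch)

print([r["id"] for r in batch])  # only rows 2 and 3 are reprocessed
```

A streaming pipeline dispenses with the watermark loop entirely and reacts to each event as it lands, which is what buys the sub-minute freshness at the cost of heavier infrastructure.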
44.5 Common ETL and Pipeline Tools
| Category | Tools |
|---|---|
| Traditional ETL | Informatica PowerCenter, IBM DataStage, Microsoft SSIS, Pentaho, Talend Open Studio |
| Cloud-Native ELT | Fivetran, Airbyte, Stitch, Hevo, Matillion, Azure Data Factory, AWS Glue |
| Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce |
| Streaming and CDC | Apache Kafka, Confluent Cloud, Debezium, AWS DMS, Apache Flink, Spark Structured Streaming |
| Workflow Orchestration | Apache Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions, Argo Workflows |
| Light ETL for BI Tools | Power BI Dataflows, Power Query, Tableau Prep |
| Data Catalogue and Lineage | Microsoft Purview, Collibra, Alation, Atlan, OpenMetadata, DataHub |
The right toolkit depends on scale, data residency, governance, and team skills. Most modern analytics programmes combine three or four tools — for example: Fivetran for SaaS-source extraction, dbt for transformation, Snowflake as the analytical store, Airflow for orchestration, and Atlan for lineage.
44.6 Data Pipeline Best Practices
A small set of disciplines separates fragile pipelines from durable ones:
- Idempotency: Running the pipeline twice produces the same result as running it once. Achieved by deterministic transformations and upsert-or-merge writes.
- Incremental loading: Where possible, process only new or changed rows since the last successful run. Reduces cost and runtime dramatically.
- Schema evolution: Source schemas change; the pipeline detects and accommodates new columns, dropped columns, and type changes without breaking.
- Testing: Run automated checks on every pipeline output — uniqueness of primary keys, expected row counts, value ranges, foreign-key integrity. Tools like dbt tests, Great Expectations, and Soda automate this.
- Lineage: Every analytical column traces back to its source columns through the pipeline. Modern data catalogues capture lineage automatically.
- Error handling and retries: Transient failures (network, rate limits) should retry with exponential back-off; permanent failures should alert the owner.
- Monitoring and SLAs: Define a service-level expectation for each pipeline (“daily refresh complete by 7:00 AM IST, 99% of the time”) and monitor against it.
- Version-controlled code: Pipeline definitions in Git; reviewed via pull requests; deployed through CI/CD. The era of clicking through GUI ETL tools without version control is ending.
- Documentation alongside code: Each pipeline carries an owner, a purpose, a schedule, and the dashboards or downstream tables it feeds.
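Idempotency, the first discipline on the list, is concrete enough to demonstrate. A hedged sketch using an upsert-style write, with `sqlite3` standing in for the warehouse and invented table names:

```python
# Idempotency sketch: an upsert (merge) write means re-running the
# load yields the same table as running it once -- no duplicates,
# no failure. sqlite3 and the names are illustrative stand-ins.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")

def load(batch):
    conn.executemany(
        # ON CONFLICT turns a blind append into an upsert
        "INSERT INTO customers VALUES (?, ?) "
        "ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name",
        batch,
    )
    conn.commit()

batch = [("C1", "Asha"), ("C2", "Ravi")]
load(batch)
load(batch)  # accidental re-run: same end state

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # still 2
```

With a plain `INSERT`, the second run would either fail on the primary key or double the row count; the merge write makes retries and backfills safe by construction.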
44.7 Light ETL in Power BI and Excel
For smaller analytical contexts, BI tools include built-in light-ETL capability:
Power BI Dataflows: Cloud-native Power Query pipelines that run on a schedule and serve their results to multiple Power BI workbooks. Equivalent in spirit to a small dbt project, but with a no-code interface.
Power Query in Excel and Power BI Desktop: Workbook-level transformation pipelines (Chapter 35). Sufficient for individual analyst workflows, but they do not scale to enterprise-wide data engineering.
Tableau Prep: Tableau’s visual data-preparation tool. Builds reusable flows that can be scheduled on Tableau Server / Cloud and consumed by Tableau workbooks.
For analytical scope under a few hundred million rows and a moderate number of sources, these tools cover the data-engineering needs of most teams without a separate ETL platform. Beyond that scale, dedicated tools (Azure Data Factory, dbt + Snowflake, Airflow + Spark) become necessary.
44.8 Common Pitfalls
- Manual Pipelines: A pipeline that requires the analyst to click “Refresh” by hand each morning; eventually the analyst forgets and the dashboard goes stale.
- No Idempotency: Re-running fails or duplicates rows; recovery from partial failures is painful.
- Full Loads Only: A pipeline that fully reloads 200 million rows daily when an incremental load would process 200 thousand; cost and runtime balloon.
- No Tests: Pipeline succeeds technically but produces wrong data — duplicate keys, broken foreign-key relationships, suddenly missing values; the dashboard misleads silently.
- No Monitoring: A failure overnight is discovered when leadership opens the dashboard the next morning.
- Hard-Coded Credentials: Source passwords in scripts or workflow files; security incident waiting to happen. Use a secrets manager.
- No Schema Evolution Handling: Source adds a new column and the pipeline fails; or removes a column and the pipeline silently drops the analytical metric.
- Lineage Gap: Analytical columns whose origin no one can trace; debugging incidents takes hours.
- Pipeline Sprawl: Hundreds of small pipelines built by individuals over years, with no shared catalogue or ownership.
- No SLAs: Refresh time is “around morning” rather than a defined target; nobody is accountable for misses.
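Several of these pitfalls trace back to missing retry and alerting logic around transient failures. A minimal sketch of retry-with-exponential-back-off; the `flaky_extract` function is a hypothetical stand-in for a rate-limited source connector:

```python
# Retry sketch: transient failures retry with exponential back-off;
# a persistent failure propagates so alerting can notify the owner.
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # permanent: surface to monitoring/alerting
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s ...

calls = {"n": 0}

def flaky_extract():
    """Hypothetical connector that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")  # simulated transient failure
    return "payload"

result = with_retries(flaky_extract)
print(result)  # succeeds on the third attempt
```

In production the delay base would be seconds rather than milliseconds, usually with jitter added so many retrying workers do not hammer the source in lockstep.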
44.9 Illustrative Cases
A Modern dbt-on-Snowflake Pipeline
A retail firm migrates from SSIS to a modern stack: Fivetran extracts from twenty SaaS sources to Snowflake, dbt transforms in Snowflake, Airflow orchestrates the daily run, Power BI consumes the certified marts. Pipeline code lives in Git; tests run on every commit. The total runtime drops from six hours to forty minutes; the team’s productivity increases markedly.
A Failed Manual Pipeline
A marketing analyst’s weekly report depends on a CSV she manually exports from a SaaS tool every Monday morning. She is on holiday for two weeks; the report goes stale; leadership notices only when a competitor cites stale numbers in a meeting. The remediation: a Power BI Dataflow that pulls from the SaaS API on a schedule, eliminating the manual step entirely.
A Streaming Logistics Control Tower
A logistics firm’s control-tower dashboard requires sub-minute freshness on vehicle locations and route status. The team builds a streaming pipeline: vehicle telemetry into Kafka, Flink processes events and writes to a Snowflake real-time table, Power BI live-connects to the table. The streaming infrastructure costs more than a batch equivalent but is justified by the operational use case.
44.10 Hands-On Exercise: Building a Light ETL Pipeline with Power BI Dataflows
Aim: Build a Power BI Dataflow that ingests, cleans, and conforms data from two sources for Yuvijen Stores, then consume the dataflow from a Power BI dataset and a downstream report.
Deliverable: A Power BI Service workspace with a published Dataflow, a dataset that consumes it, and a report that reads from the dataset.
44.10.1 Step 1 — Create a Workspace and Open the Dataflow Editor
- In app.powerbi.com, navigate to a workspace (or create one for `Yuvijen Sales Analytics` if not already done in Chapter 38).
- Click + New → Dataflow.
- Choose Define new tables to start from scratch.
The Dataflow editor opens — it is essentially Power Query in the cloud.
44.10.2 Step 2 — Connect to the First Source
- Get data → choose Text/CSV (or whichever source applies).
- Provide the connection details (URL, credentials).
- The data preview appears. Apply transformations:
- Use First Row as Headers.
- Change Type with Locale on each column.
- Trim and Clean text columns.
- Name the entity `customers_clean`.
44.10.3 Step 3 — Connect to the Second Source
- Repeat for the orders source.
- Apply the same cleaning discipline.
- Add a Conditional Column for an order size band.
- Name the entity `orders_clean`.
44.10.4 Step 4 — Conform Dimensions
- Add a third entity `dim_date_clean` — either generate a date table from a query or import one.
- Mark the date column with the appropriate type and locale.
The dataflow now has three conformed entities: customers, orders, date.
44.10.5 Step 5 — Configure Refresh Schedule
- Save and name the dataflow `Yuvijen Sales — Cleansed Mart`.
- Set a refresh schedule (e.g., Daily 5:00 AM IST).
- Configure failure notifications — the dataflow owner receives an email if any refresh fails.
44.10.6 Step 6 — Consume the Dataflow from a Power BI Desktop Workbook
- Open Power BI Desktop.
- Get Data → Power Platform → Dataflows.
- Sign in; navigate to the workspace and select the dataflow.
- Pick the three entities and load.
- Build the model — relationships, measures, visuals — exactly as in Chapter 36.
The Power BI workbook now consumes the cleaned, conformed dataflow rather than raw sources. Multiple workbooks across the firm can consume the same dataflow.
44.10.7 Step 7 — Add Quality Checks
In the dataflow editor, add a small monitoring entity:
- New entity called `dq_checks`.
- Use Power Query to count rows per source, count nulls per critical column, and detect duplicates on the customer ID.
- Each row of `dq_checks` is one quality metric.
- Build a small Power BI report on `dq_checks` to display refresh-time quality scorecards.
The pipeline is now self-monitoring: a quality regression flags itself in a dedicated dashboard.
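Under the hood these checks are simple aggregations. The same three metrics sketched in plain Python, on hypothetical sample data (in the exercise itself they would be Power Query steps):

```python
# Sketch of the dq_checks metrics: row count, nulls on a critical
# column, duplicates on the customer key. Data is illustrative.
from collections import Counter

customers = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": None},
    {"customer_id": "C2", "email": "b@example.com"},  # duplicate key
]

key_counts = Counter(r["customer_id"] for r in customers)

dq_checks = [
    {"metric": "row_count",
     "value": len(customers)},
    {"metric": "null_emails",
     "value": sum(1 for r in customers if r["email"] is None)},
    {"metric": "duplicate_ids",  # extra rows beyond the first per key
     "value": sum(c - 1 for c in key_counts.values() if c > 1)},
]

for check in dq_checks:
    print(check["metric"], check["value"])
```

Each dictionary corresponds to one row of the `dq_checks` entity, which is exactly the shape the quality scorecard report consumes.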
44.10.8 Step 8 — Document the Pipeline
In the workspace, add a description to the dataflow:
- Owner: name and contact.
- Source systems: documented.
- Refresh schedule: 5 AM IST daily.
- Downstream consumers: list of workbooks.
- SLA: refresh complete by 7 AM IST.
Documentation lives alongside the pipeline; the next analyst can find it without asking around.
44.10.9 Step 9 — Connect to the Visualisation Layer
The hands-on illustrates the pipeline-to-dashboard chain:
- Sources stream into the dataflow.
- The dataflow runs on a schedule, transforms, and stores cleaned tables in OneLake-backed storage.
- Multiple Power BI datasets consume the same dataflow.
- Multiple reports consume the datasets.
- The visualisation layer the audience sees is the final, refreshed, governed output.
A dashboard’s freshness, accuracy, and trustworthiness are entirely a function of the pipeline behind it. Without a reliable pipeline, even the best-designed dashboard misleads. The pipeline is invisible to the audience; that is exactly why it must be built with the most discipline.
Screenshots of the dataflow definition, refresh history, the consuming Power BI workbook (yuvijen-from-dataflow.pbix), and the quality scorecard report will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why ETL Matters | Every dashboard rests on a pipeline; the dashboard is only as fresh and reliable as the pipeline |
| ETL versus ELT versus ETLT | |
| ETL | Extract Transform Load; transform in staging before loading the warehouse |
| ELT | Extract Load Transform; load raw to the warehouse and transform inside |
| ETLT | Hybrid pattern with light cleaning before load and deeper modelling after |
| Modern Shift to ELT | Cheap cloud storage and elastic warehouse compute moved transformation into the warehouse via SQL |
| Pipeline Components | |
| Sources | Operational databases, SaaS APIs, file shares, event streams |
| Extraction Layer | Connectors that pull data from sources with credential management and rate-limit handling |
| Staging Area | Holding ground for raw extracted data, often a data lake bronze layer |
| Transformation Engine | Code or visual tool that cleans, joins, and shapes the data into analytical form |
| Target Store | Warehouse, lake, or lakehouse the dashboard reads from |
| Orchestrator | Scheduler and dependency manager that runs the pipeline reliably |
| Monitoring and Alerting | Observability for failures, latency, data-quality regressions |
| Batch versus Streaming | |
| Batch Pipeline | Runs on a schedule processing accumulated data since the last run |
| Streaming Pipeline | Processes events as they arrive within seconds; required for operational dashboards |
| Lambda Architecture | Parallel batch and streaming pipelines feeding the same analytical store |
| Kappa Architecture | Streaming-only pipeline with batch jobs treated as a special case |
| Tool Categories | |
| Traditional ETL Tools | Informatica, DataStage, SSIS, Pentaho, Talend |
| Cloud-Native ELT Tools | Fivetran, Airbyte, Stitch, Hevo, Matillion, Azure Data Factory, AWS Glue |
| Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce |
| Streaming and CDC Tools | Kafka, Confluent Cloud, Debezium, AWS DMS, Flink, Spark Structured Streaming |
| Workflow Orchestration | Apache Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions |
| Light ETL for BI | Power BI Dataflows, Power Query, Tableau Prep for moderate-scale analytical work |
| Data Catalogue Tools | Microsoft Purview, Collibra, Alation, Atlan, OpenMetadata, DataHub |
| Best Practices | |
| Idempotency | Running the pipeline twice produces the same result as running once |
| Incremental Loading | Process only new or changed rows since the last successful run |
| Schema Evolution | Pipeline detects and accommodates schema changes without breaking |
| Pipeline Testing | Automated checks on every output: uniqueness, row counts, value ranges, foreign keys |
| Lineage | Every analytical column traces back to its source columns through the pipeline |
| Error Handling and Retries | Transient failures retry with back-off; permanent failures alert the owner |
| Monitoring and SLAs | Service-level expectation per pipeline measured and monitored |
| Version-Controlled Code | Pipeline definitions in Git; reviewed via pull requests; deployed through CI/CD |
| Documentation Alongside Code | Each pipeline carries owner, purpose, schedule, downstream consumers |
| Light ETL for BI Tools | |
| Power BI Dataflows | Cloud-native Power Query pipelines that run on a schedule and serve multiple workbooks |
| Power Query Light ETL | Workbook-level transformation pipelines sufficient for individual analyst workflows |
| Tableau Prep | Tableau's visual data-preparation tool with reusable scheduled flows |
| Common Pitfalls | |
| Manual Pipelines Pitfall | Pitfall of pipeline requiring manual click; eventually forgotten and dashboard goes stale |
| No Idempotency | Pitfall of re-runs failing or duplicating rows; painful recovery from partial failures |
| Full Loads Only | Pitfall of fully reloading hundreds of millions of rows when incremental would process thousands |
| No Tests | Pitfall of pipeline succeeding technically but producing wrong data silently |
| No Monitoring | Pitfall of overnight failure discovered only when leadership opens the dashboard |
| Hard-Coded Credentials | Pitfall of source passwords in scripts; security incident waiting to happen |
| No Schema Evolution Handling | Pitfall of source schema change breaking the pipeline or silently dropping a metric |
| Lineage Gap | Pitfall of analytical columns whose origin no one can trace; debugging takes hours |
| Pipeline Sprawl | Pitfall of hundreds of small pipelines built individually with no catalogue or ownership |
| No SLAs | Pitfall of refresh time being around morning rather than a defined target |