44  ETL Processes and Data Pipelines

44.1 Why ETL and Data Pipelines Matter

Every dashboard rests on a pipeline; the dashboard is only as fresh, accurate, and reliable as the pipeline that feeds it.

A Power BI report or a Tableau dashboard is the visible tip of an iceberg. Underneath sits a data pipeline — the orchestrated sequence of steps that extracts data from sources, transforms it into analytical shape, and loads it to the analytical store the dashboard reads from. When the pipeline runs reliably, the dashboard stays trustworthy. When the pipeline fails — silently or loudly — the dashboard misleads.

The classical reference is The Data Warehouse ETL Toolkit by Ralph Kimball & Joe Caserta (2004), which codified the patterns most modern ETL tools still implement. The contemporary reference is Fundamentals of Data Engineering by Joe Reis & Matt Housley (2022), which sets ETL in the broader data-engineering lifecycle.

For a visualisation-focused book, this chapter is the bridge between the visualisation tools (Power BI, Tableau, Excel) and the source systems they consume. The pipeline is invisible to the audience; the dashboard’s quality is not.

44.2 ETL versus ELT versus ETLT

```mermaid
flowchart LR
    S["Sources"] --> E["Extract"]
    E --> T1["Transform<br>(in staging)"]
    T1 --> L1["Load<br>(into warehouse)"]
    L1 --> A["Analytical<br>Store"]
    S2["Sources"] --> E2["Extract"]
    E2 --> L2["Load<br>(raw to warehouse)"]
    L2 --> T2["Transform<br>(in warehouse)"]
    T2 --> A2["Analytical<br>Store"]
    style S fill:#fce4ec,stroke:#AD1457
    style E fill:#fff3e0,stroke:#EF6C00
    style T1 fill:#fff8e1,stroke:#F9A825
    style L1 fill:#e3f2fd,stroke:#1976D2
    style A fill:#e8f5e9,stroke:#388E3C
    style S2 fill:#fce4ec,stroke:#AD1457
    style E2 fill:#fff3e0,stroke:#EF6C00
    style L2 fill:#e3f2fd,stroke:#1976D2
    style T2 fill:#fff8e1,stroke:#F9A825
    style A2 fill:#e8f5e9,stroke:#388E3C
```

Tip: The Three Patterns

| Pattern | Idea | When to Use |
|---|---|---|
| ETL (Extract → Transform → Load) | Transform in a staging environment before loading the warehouse | Traditional warehouses with limited compute; on-prem SQL Server, Oracle |
| ELT (Extract → Load → Transform) | Load raw to the warehouse, transform inside | Modern cloud warehouses with elastic compute (Snowflake, BigQuery, Redshift, Databricks) |
| ETLT (Extract → Light Transform → Load → More Transform) | Lightweight cleaning before load, deeper modelling after | Hybrid pattern when source-side cleaning improves load speed |

The shift from ETL to ELT is the defining architectural trend of the past decade. Cheap cloud storage and elastic warehouse compute moved the transformation work from custom ETL servers into the warehouse itself, expressed in SQL — typically through tools like dbt.
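To make the ELT flow concrete, here is a minimal Python sketch, not tied to any particular vendor: rows are extracted from a source, loaded raw into a staging table, and only then transformed by SQL running inside the warehouse (the role dbt plays in a production stack). SQLite stands in for the cloud warehouse, and the table and column names are invented for the example.

```python
import sqlite3

# SQLite stands in for the cloud warehouse in this sketch.
warehouse = sqlite3.connect("warehouse.db")

# --- Extract: pull rows from the source system (hard-coded here for brevity).
source_rows = [
    ("1001", "2024-07-01", "Chennai", 2499.00),
    ("1002", "2024-07-01", "Mumbai",  1150.50),
    ("1003", "2024-07-02", "Chennai",  320.00),
]

# --- Load: land the data raw, with no reshaping, into a staging table.
warehouse.execute("""
    CREATE TABLE IF NOT EXISTS raw_orders (
        order_id TEXT, order_date TEXT, city TEXT, amount REAL
    )
""")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", source_rows)

# --- Transform: the modelling happens inside the warehouse, expressed in SQL.
warehouse.executescript("""
    DROP TABLE IF EXISTS mart_daily_sales;
    CREATE TABLE mart_daily_sales AS
    SELECT order_date, city, SUM(amount) AS total_sales, COUNT(*) AS orders
    FROM raw_orders
    GROUP BY order_date, city;
""")
warehouse.commit()

print(warehouse.execute("SELECT * FROM mart_daily_sales").fetchall())
```

The point of the pattern is the division of labour: the extraction tool only lands raw tables, and every analytical shape is a SQL model computed by the warehouse's own engine.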

44.3 Components of a Data Pipeline

A typical pipeline has seven components:

  • Sources — where the data starts: operational databases, SaaS APIs, file shares, event streams.
  • Extraction Layer — connectors that pull data from sources, with credential management and rate-limit handling.
  • Staging Area — a holding ground for raw extracted data, often a data lake zone (the “bronze” layer).
  • Transformation Engine — the code or visual tool that cleans, joins, and shapes the data into analytical form.
  • Target Store — the warehouse, lake, or lakehouse the dashboard reads from.
  • Orchestrator — the scheduler and dependency manager that runs the pipeline reliably.
  • Monitoring and Alerting — observability for failures, latency, data-quality regressions.

A complete pipeline carries all seven. Skipping any one of them is among the most common causes of pipeline incidents.
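As a rough illustration rather than a production design, the skeleton below maps these components onto a single Python script: each component is a named function, the orchestrator is reduced to a fixed run order, and monitoring is a log line plus an alerting hook. All names are invented for the sketch; a real pipeline would hand scheduling, retries, and alerting to a dedicated orchestrator such as Airflow.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("yuvijen_sales_pipeline")  # hypothetical pipeline name

def extract():        # Extraction layer: pull from sources (APIs, databases, files)
    return [{"order_id": "1001", "amount": 2499.0}]

def stage(rows):      # Staging area: persist the raw extract before any reshaping
    log.info("staged %d raw rows", len(rows))
    return rows

def transform(rows):  # Transformation engine: clean, join, shape
    return [r for r in rows if r["amount"] > 0]

def load(rows):       # Target store: write the table the dashboard reads from
    log.info("loaded %d rows into the target store", len(rows))

def run():            # Orchestrator: fixed dependency order, one step after another
    try:
        load(transform(stage(extract())))
        log.info("pipeline succeeded")      # Monitoring: success heartbeat
    except Exception:
        log.exception("pipeline failed")    # Alerting hook: notify the owner here
        raise

if __name__ == "__main__":
    run()
```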

44.4 Batch versus Streaming

Pipelines run in one of two modes, batch or streaming, or in a hybrid architecture that combines them:

  • Batch: Runs on a schedule (hourly, daily, weekly). Processes accumulated data since the last run. Simpler to build, debug, and retry. Suitable for most analytical workloads.

  • Streaming: Processes events as they arrive, often within seconds. Required for operational dashboards (logistics control towers, fraud detection), IoT telemetry, real-time personalisation. More complex infrastructure (Kafka, Flink, Spark Streaming, Kinesis).

  • Lambda Architecture: Parallel batch and streaming pipelines feeding the same analytical store; queries reconcile both layers.

  • Kappa Architecture: Streaming-only pipeline, with batch jobs treated as a special case of streams. Simpler than Lambda but demands a mature streaming platform.

For most BI dashboards (executive review, weekly performance, monthly board pack), batch is sufficient. Streaming is reserved for the small fraction of dashboards that genuinely need sub-minute freshness.
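For readers who have only built batch refreshes, the sketch below shows the basic shape of a streaming consumer using the kafka-python client: each event is handled as it arrives rather than accumulated for a scheduled run. The topic name, broker address, and event fields are assumptions invented for the example.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic and broker; in practice these come from configuration.
consumer = KafkaConsumer(
    "vehicle-telemetry",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
    group_id="control-tower-loader",
)

# Each event is handled within seconds of arrival; there is no "last run" to catch up on.
for message in consumer:
    event = message.value  # e.g. {"vehicle_id": "TN-09-1234", "lat": 13.08, "lon": 80.27}
    print("update position for", event.get("vehicle_id"))
    # A real pipeline would upsert the latest position into the analytical store here.
```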

44.5 Common ETL and Pipeline Tools

Tip: ETL and Pipeline Tools by Category

| Category | Tools |
|---|---|
| Traditional ETL | Informatica PowerCenter, IBM DataStage, Microsoft SSIS, Pentaho, Talend Open Studio |
| Cloud-Native ELT | Fivetran, Airbyte, Stitch, Hevo, Matillion, Azure Data Factory, AWS Glue |
| Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce |
| Streaming and CDC | Apache Kafka, Confluent Cloud, Debezium, AWS DMS, Apache Flink, Spark Structured Streaming |
| Workflow Orchestration | Apache Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions, Argo Workflows |
| Light ETL for BI Tools | Power BI Dataflows, Power Query, Tableau Prep |
| Data Catalogue and Lineage | Microsoft Purview, Collibra, Alation, Atlan, OpenMetadata, DataHub |

The right toolkit depends on scale, data residency, governance, and team skills. Most modern analytics programmes combine three or four tools — for example: Fivetran for SaaS-source extraction, dbt for transformation, Snowflake as the analytical store, Airflow for orchestration, and Atlan for lineage.

44.6 Data Pipeline Best Practices

A small set of disciplines separates fragile pipelines from durable ones:

  • Idempotency: Running the pipeline twice produces the same result as running it once. Achieved by deterministic transformations and upsert-or-merge writes (see the sketch after this list).
  • Incremental loading: Where possible, process only new or changed rows since the last successful run. Reduces cost and runtime dramatically.
  • Schema evolution: Source schemas change; the pipeline detects and accommodates new columns, dropped columns, and type changes without breaking.
  • Testing: Run automated checks on every pipeline output — uniqueness of primary keys, expected row counts, value ranges, foreign-key integrity. Tools like dbt tests, Great Expectations, and Soda automate this.
  • Lineage: Every analytical column traces back to its source columns through the pipeline. Modern data catalogues capture lineage automatically.
  • Error handling and retries: Transient failures (network, rate limits) should retry with exponential back-off; permanent failures should alert the owner.
  • Monitoring and SLAs: Define a service-level expectation for each pipeline (“daily refresh complete by 7:00 AM IST 99 % of the time”) and monitor against it.
  • Version-controlled code: Pipeline definitions in Git; reviewed via pull requests; deployed through CI/CD. The era of clicking through GUI ETL tools without version control is ending.
  • Documentation alongside code: Each pipeline carries an owner, a purpose, a schedule, and the dashboards or downstream tables it feeds.
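The sketch below (referenced from the idempotency bullet above) combines four of these disciplines in one minimal Python example: an incremental extract driven by a stored watermark, an idempotent upsert write, retries with exponential back-off, and two simple output tests. SQLite stands in for the warehouse, and the table names, watermark column, and stubbed extract function are assumptions made for illustration.

```python
import sqlite3
import time

warehouse = sqlite3.connect("warehouse.db")
warehouse.executescript("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id TEXT PRIMARY KEY, updated_at TEXT, amount REAL
    );
    CREATE TABLE IF NOT EXISTS pipeline_state (
        pipeline TEXT PRIMARY KEY, watermark TEXT
    );
""")

def extract_since(watermark):
    """Stand-in for the real source query: only rows changed after the watermark."""
    source = [
        {"order_id": "1001", "updated_at": "2024-07-01T10:00:00", "amount": 2499.0},
        {"order_id": "1002", "updated_at": "2024-07-02T09:30:00", "amount": 1150.5},
    ]
    return [r for r in source if r["updated_at"] > watermark]

def extract_with_retry(watermark, attempts=4):
    """Retry transient failures with exponential back-off; re-raise permanent ones."""
    for attempt in range(attempts):
        try:
            return extract_since(watermark)
        except ConnectionError:
            if attempt == attempts - 1:
                raise                    # permanent: let the orchestrator alert the owner
            time.sleep(2 ** attempt)     # 1s, 2s, 4s, ...

def run():
    row = warehouse.execute(
        "SELECT watermark FROM pipeline_state WHERE pipeline = 'orders'"
    ).fetchone()
    watermark = row[0] if row else "1970-01-01T00:00:00"

    rows = extract_with_retry(watermark)

    # Idempotent write: the same row loaded twice simply overwrites itself.
    warehouse.executemany(
        """INSERT INTO orders (order_id, updated_at, amount)
           VALUES (:order_id, :updated_at, :amount)
           ON CONFLICT(order_id) DO UPDATE SET
               updated_at = excluded.updated_at, amount = excluded.amount""",
        rows,
    )

    # Simple output tests: primary-key uniqueness and a sane row count.
    dup = warehouse.execute(
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
    ).fetchone()[0]
    total = warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    assert dup == 0, "duplicate order_id values in target table"
    assert total > 0, "target table unexpectedly empty"

    # Advance the watermark only after a fully successful run.
    if rows:
        warehouse.execute(
            """INSERT INTO pipeline_state (pipeline, watermark) VALUES ('orders', ?)
               ON CONFLICT(pipeline) DO UPDATE SET watermark = excluded.watermark""",
            (max(r["updated_at"] for r in rows),),
        )
    warehouse.commit()

if __name__ == "__main__":
    run()
```

Because the watermark advances only after a successful run and the write is an upsert, a failed run can simply be re-executed from the top without duplicating rows.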

44.7 Light ETL in Power BI and Excel

For smaller analytical contexts, BI tools include built-in light-ETL capability:

  • Power BI Dataflows: Cloud-native Power Query pipelines that run on a schedule and serve their results to multiple Power BI workbooks. Equivalent in spirit to a small dbt project, but with a no-code interface.

  • Power Query in Excel and Power BI Desktop: Workbook-level transformation pipelines (Chapter 35). Sufficient for individual analyst workflows, but they do not scale to enterprise-wide data engineering.

  • Tableau Prep: Tableau’s visual data-preparation tool. Builds reusable flows that can be scheduled on Tableau Server / Cloud and consumed by Tableau workbooks.

For analytical scope under a few hundred million rows and a moderate number of sources, these tools cover the data-engineering needs of most teams without a separate ETL platform. Beyond that scale, dedicated tools (Azure Data Factory, dbt + Snowflake, Airflow + Spark) become necessary.

44.8 Common Pitfalls

  • Manual Pipelines: A pipeline that requires the analyst to click “Refresh” by hand each morning; eventually the analyst forgets and the dashboard goes stale.
  • No Idempotency: Re-running fails or duplicates rows; recovery from partial failures is painful.
  • Full Loads Only: A pipeline that fully reloads 200 million rows daily when an incremental load would process 200 thousand; cost and runtime balloon.
  • No Tests: Pipeline succeeds technically but produces wrong data — duplicate keys, broken foreign-key relationships, suddenly missing values; the dashboard misleads silently.
  • No Monitoring: A failure overnight is discovered when leadership opens the dashboard the next morning.
  • Hard-Coded Credentials: Source passwords in scripts or workflow files; a security incident waiting to happen. Use a secrets manager (see the sketch after this list).
  • No Schema Evolution Handling: Source adds a new column and the pipeline fails; or removes a column and the pipeline silently drops the analytical metric.
  • Lineage Gap: Analytical columns whose origin no one can trace; debugging incidents takes hours.
  • Pipeline Sprawl: Hundreds of small pipelines built by individuals over years, with no shared catalogue or ownership.
  • No SLAs: Refresh time is “around morning” rather than a defined target; nobody is accountable for misses.
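As a sketch of the alternative to hard-coded credentials (referenced in the list above), the snippet below reads a source password from an environment variable or, where AWS is in use, from Secrets Manager via boto3. The environment-variable name and secret name are assumptions for the example; Azure Key Vault and HashiCorp Vault fill the same role in other stacks.

```python
import os

import boto3  # pip install boto3; only needed for the Secrets Manager variant

# Option 1: an environment variable injected by the scheduler or CI/CD system.
db_password = os.environ["SOURCE_DB_PASSWORD"]  # hypothetical variable name

# Option 2: a managed secrets store (here AWS Secrets Manager).
secrets = boto3.client("secretsmanager")
db_password = secrets.get_secret_value(
    SecretId="prod/source-db/password"  # hypothetical secret name
)["SecretString"]
```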

44.9 Illustrative Cases

A Modern dbt-on-Snowflake Pipeline

A retail firm migrates from SSIS to a modern stack: Fivetran extracts from twenty SaaS sources to Snowflake, dbt transforms in Snowflake, Airflow orchestrates the daily run, Power BI consumes the certified marts. Pipeline code lives in Git; tests run on every commit. The total runtime drops from six hours to forty minutes; the team’s productivity increases markedly.

A Failed Manual Pipeline

A marketing analyst’s weekly report depends on a CSV she manually exports from a SaaS tool every Monday morning. She is on holiday for two weeks; the report goes stale; leadership notices only when a competitor cites stale numbers in a meeting. The remediation: a Power BI Dataflow that pulls from the SaaS API on a schedule, eliminating the manual step entirely.

A Streaming Logistics Control Tower

A logistics firm’s control-tower dashboard requires sub-minute freshness on vehicle locations and route status. The team builds a streaming pipeline: vehicle telemetry into Kafka, Flink processes events and writes to a Snowflake real-time table, Power BI live-connects to the table. The streaming infrastructure costs more than a batch equivalent but is justified by the operational use case.


44.10 Hands-On Exercise: Building a Light ETL Pipeline with Power BI Dataflows

Aim: Build a Power BI Dataflow that ingests, cleans, and conforms data from two sources for Yuvijen Stores, then consume the dataflow from a Power BI dataset and a downstream report.

Deliverable: A Power BI Service workspace with a published Dataflow, a dataset that consumes it, and a report that reads from the dataset.

44.10.1 Step 1 — Create a Workspace and Open the Dataflow Editor

  1. In app.powerbi.com, navigate to a workspace (or create one for Yuvijen Sales Analytics if not already done in Chapter 38).
  2. Click + New → Dataflow.
  3. Choose Define new tables to start from scratch.

The Dataflow editor opens — it is essentially Power Query in the cloud.

44.10.2 Step 2 — Connect to the First Source

  1. Get data → choose Text/CSV (or whichever source applies).
  2. Provide the connection details (URL, credentials).
  3. The data preview appears. Apply transformations:
    • Use First Row as Headers.
    • Change Type with Locale on each column.
    • Trim and Clean text columns.
  4. Name the entity customers_clean.

44.10.3 Step 3 — Connect to the Second Source

  1. Repeat for the orders source.
  2. Apply the same cleaning discipline.
  3. Add a Conditional Column for an order size band.
  4. Name the entity orders_clean.

44.10.4 Step 4 — Conform Dimensions

  1. Add a third entity dim_date_clean — either generate a date table from a query or import one.
  2. Mark the date column with the appropriate type and locale.

The dataflow now has three conformed entities: customers, orders, date.

44.10.5 Step 5 — Configure Refresh Schedule

  1. Save and name the dataflow Yuvijen Sales — Cleansed Mart.
  2. Set a refresh schedule (e.g., Daily 5:00 AM IST).
  3. Configure failure notifications — the dataflow owner receives an email if any refresh fails.

44.10.6 Step 6 — Consume the Dataflow from a Power BI Desktop Workbook

  1. Open Power BI Desktop.
  2. Get Data → Power Platform → Dataflows.
  3. Sign in; navigate to the workspace and select the dataflow.
  4. Pick the three entities and load.
  5. Build the model — relationships, measures, visuals — exactly as in Chapter 36.

The Power BI workbook now consumes the cleaned, conformed dataflow rather than raw sources. Multiple workbooks across the firm can consume the same dataflow.

44.10.7 Step 7 — Add Quality Checks

In the dataflow editor, add a small monitoring entity:

  1. New entity called dq_checks.
  2. Use Power Query to count rows per source, count nulls per critical column, and detect duplicates on the customer ID.
  3. Each row of dq_checks is one quality metric.
  4. Build a small Power BI report on dq_checks to display refresh-time quality scorecards.

The pipeline is now self-monitoring: a quality regression flags itself in a dedicated dashboard.
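The checks themselves live in Power Query inside the dataflow, but their logic is easier to see in script form, so here is a rough pandas equivalent showing the shape of the dq_checks output, one row per quality metric. The file names and column names are assumptions standing in for the Yuvijen entities.

```python
import pandas as pd

# Hypothetical extracts standing in for the customers_clean and orders_clean entities.
customers = pd.read_csv("customers_clean.csv")
orders = pd.read_csv("orders_clean.csv")

dq_checks = pd.DataFrame(
    [
        {"check": "customers_row_count",        "value": len(customers)},
        {"check": "orders_row_count",           "value": len(orders)},
        {"check": "customers_null_customer_id", "value": int(customers["customer_id"].isna().sum())},
        {"check": "orders_null_order_date",     "value": int(orders["order_date"].isna().sum())},
        {"check": "duplicate_customer_ids",     "value": int(customers["customer_id"].duplicated().sum())},
    ]
)

print(dq_checks)  # each row is one quality metric, ready for a scorecard visual
```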

44.10.8 Step 8 — Document the Pipeline

In the workspace, add a description to the dataflow:

  • Owner: name and contact.
  • Source systems: documented.
  • Refresh schedule: 5 AM IST daily.
  • Downstream consumers: list of workbooks.
  • SLA: refresh complete by 7 AM IST.

Documentation lives alongside the pipeline; the next analyst can find it without asking around.

44.10.9 Step 9 — Connect to the Visualisation Layer

The hands-on illustrates the pipeline-to-dashboard chain:

  • Source systems feed the dataflow.
  • The dataflow runs on a schedule, transforms, and stores cleaned tables in OneLake-backed storage.
  • Multiple Power BI datasets consume the same dataflow.
  • Multiple reports consume the datasets.
  • The visualisation layer the audience sees is the final, refreshed, governed output.

A dashboard’s freshness, accuracy, and trustworthiness are entirely a function of the pipeline behind it. Without a reliable pipeline, even the best-designed dashboard misleads. The pipeline is invisible to the audience; that is exactly why it must be built with the most discipline.

Tip: Files and Screen Recordings

Screenshots of the dataflow definition, refresh history, the consuming Power BI workbook (yuvijen-from-dataflow.pbix), and the quality scorecard report will be embedded here.


Summary

| Concept | Description |
|---|---|
| **Foundations** | |
| Why ETL Matters | Every dashboard rests on a pipeline; the dashboard is only as fresh and reliable as the pipeline |
| **ETL versus ELT versus ETLT** | |
| ETL | Extract Transform Load; transform in staging before loading the warehouse |
| ELT | Extract Load Transform; load raw to the warehouse and transform inside |
| ETLT | Hybrid pattern with light cleaning before load and deeper modelling after |
| Modern Shift to ELT | Cheap cloud storage and elastic warehouse compute moved transformation into the warehouse via SQL |
| **Pipeline Components** | |
| Sources | Operational databases, SaaS APIs, file shares, event streams |
| Extraction Layer | Connectors that pull data from sources with credential management and rate-limit handling |
| Staging Area | Holding ground for raw extracted data, often a data lake bronze layer |
| Transformation Engine | Code or visual tool that cleans, joins, and shapes the data into analytical form |
| Target Store | Warehouse, lake, or lakehouse the dashboard reads from |
| Orchestrator | Scheduler and dependency manager that runs the pipeline reliably |
| Monitoring and Alerting | Observability for failures, latency, data-quality regressions |
| **Batch versus Streaming** | |
| Batch Pipeline | Runs on a schedule, processing accumulated data since the last run |
| Streaming Pipeline | Processes events as they arrive within seconds; required for operational dashboards |
| Lambda Architecture | Parallel batch and streaming pipelines feeding the same analytical store |
| Kappa Architecture | Streaming-only pipeline with batch jobs treated as a special case |
| **Tool Categories** | |
| Traditional ETL Tools | Informatica, DataStage, SSIS, Pentaho, Talend |
| Cloud-Native ELT Tools | Fivetran, Airbyte, Stitch, Hevo, Matillion, Azure Data Factory, AWS Glue |
| Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce |
| Streaming and CDC Tools | Kafka, Confluent Cloud, Debezium, AWS DMS, Flink, Spark Structured Streaming |
| Workflow Orchestration | Apache Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions |
| Light ETL for BI | Power BI Dataflows, Power Query, Tableau Prep for moderate-scale analytical work |
| Data Catalogue Tools | Microsoft Purview, Collibra, Alation, Atlan, OpenMetadata, DataHub |
| **Best Practices** | |
| Idempotency | Running the pipeline twice produces the same result as running once |
| Incremental Loading | Process only new or changed rows since the last successful run |
| Schema Evolution | Pipeline detects and accommodates schema changes without breaking |
| Pipeline Testing | Automated checks on every output: uniqueness, row counts, value ranges, foreign keys |
| Lineage | Every analytical column traces back to its source columns through the pipeline |
| Error Handling and Retries | Transient failures retry with back-off; permanent failures alert the owner |
| Monitoring and SLAs | Service-level expectation per pipeline, measured and monitored |
| Version-Controlled Code | Pipeline definitions in Git; reviewed via pull requests; deployed through CI/CD |
| Documentation Alongside Code | Each pipeline carries owner, purpose, schedule, downstream consumers |
| **Light ETL for BI Tools** | |
| Power BI Dataflows | Cloud-native Power Query pipelines that run on a schedule and serve multiple workbooks |
| Power Query Light ETL | Workbook-level transformation pipelines sufficient for individual analyst workflows |
| Tableau Prep | Tableau's visual data-preparation tool with reusable scheduled flows |
| **Common Pitfalls** | |
| Manual Pipelines | Pipeline requires a manual click; eventually forgotten and the dashboard goes stale |
| No Idempotency | Re-runs fail or duplicate rows; painful recovery from partial failures |
| Full Loads Only | Fully reloading hundreds of millions of rows when incremental would process thousands |
| No Tests | Pipeline succeeds technically but produces wrong data silently |
| No Monitoring | Overnight failure discovered only when leadership opens the dashboard |
| Hard-Coded Credentials | Source passwords in scripts; security incident waiting to happen |
| No Schema Evolution Handling | Source schema change breaks the pipeline or silently drops a metric |
| Lineage Gap | Analytical columns whose origin no one can trace; debugging takes hours |
| Pipeline Sprawl | Hundreds of small pipelines built individually with no catalogue or ownership |
| No SLAs | Refresh time is "around morning" rather than a defined target |