```mermaid
flowchart LR
B["Batch<br>Hours to days<br>Daily reports,<br>monthly board pack"] --> N["Near-Real-Time<br>Seconds to minutes<br>Operational dashboards,<br>monitoring screens"]
N --> R["Real-Time<br>Sub-second<br>Fraud scoring,<br>algorithmic trading"]
style B fill:#e8f5e9,stroke:#388E3C
style N fill:#fff8e1,stroke:#F9A825
style R fill:#fce4ec,stroke:#AD1457
```
46 Real-Time Analytics and Streaming Data
46.1 Why Real-Time Analytics Matters
Some business decisions cannot wait for tomorrow’s batch refresh.
A logistics control tower needs to see vehicle positions within seconds. A fraud-detection system has to score a transaction before it clears. A factory’s predictive-maintenance dashboard cannot wait for an overnight refresh; the bearing fails in three minutes. Real-time analytics is the architecture that handles these cases — data is processed and visualised as it arrives, not as a daily batch.
The standard reference on streaming systems is Streaming Systems by Tyler Akidau et al. (2018), which codifies the dataflow model behind modern stream processors. The broader treatment of streaming as part of data systems is Designing Data-Intensive Applications by Martin Kleppmann (2017), an indispensable reference for any analyst working at the intersection of analytics and data engineering.
For a visualisation-focused book, this chapter covers the small but consequential set of dashboards that genuinely need sub-minute freshness. Most dashboards do not; the few that do often matter disproportionately.
46.2 Real-Time, Near-Real-Time, and Batch
| Tier | Latency | Use Cases | Cost Profile |
|---|---|---|---|
| Batch | Hours to days | Daily reports, monthly board pack, financial close | Lowest — runs once per period |
| Near-Real-Time | Seconds to minutes | Operational dashboards, control towers, monitoring screens | Medium — streaming infrastructure but bounded freshness |
| Real-Time | Sub-second | Fraud scoring, algorithmic trading, real-time personalisation | Highest — full streaming pipelines, hot caches, ML inference |
For most analytical workloads, batch is the right answer; freshness beyond what batch provides is rarely worth the operational cost. The honest discipline is to ask: what decision becomes wrong if the data is an hour old? If none, batch suffices.
46.3 Streaming Architecture
A real-time analytics pipeline has four distinct layers:
| Layer | Role | Examples |
|---|---|---|
| Event Sources | Produce events as they happen | Operational databases (via CDC), IoT sensors, web/app clickstreams, transaction systems |
| Stream Brokers | Buffer and route events durably | Apache Kafka, Apache Pulsar, AWS Kinesis, Azure Event Hubs, Google Pub/Sub |
| Stream Processors | Transform events as they flow | Apache Flink, Spark Structured Streaming, Kafka Streams, ksqlDB |
| Real-Time Stores | Serve aggregated state to dashboards | Materialize, RisingWave, Apache Druid, ClickHouse, Pinot, time-series DBs (InfluxDB, TimescaleDB) |
A simplified end-to-end flow: an event is produced (a customer pays at a POS terminal); it is published to a Kafka topic; Flink processes the event in a windowed aggregation (running per-store sales by minute); the result is written to a real-time store; the BI dashboard reads from the store and refreshes within seconds.
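The windowed-aggregation step in the middle of this flow can be sketched in plain Python as a minimal, single-process stand-in for what Flink does at scale. The event shape, field names, and one-minute tumbling window here are illustrative assumptions, not a Flink API:

```python
from collections import defaultdict

def window_key(event, window_seconds=60):
    """Assign an event to a tumbling window by flooring its timestamp."""
    return event["ts"] - (event["ts"] % window_seconds)

def aggregate(events, window_seconds=60):
    """Running per-store revenue by one-minute window, as in the Flink step above."""
    totals = defaultdict(float)  # (window_start, store_id) -> revenue
    for e in events:
        totals[(window_key(e, window_seconds), e["store_id"])] += e["revenue"]
    return dict(totals)

events = [
    {"ts": 0,  "store_id": "S01", "revenue": 100.0},
    {"ts": 30, "store_id": "S01", "revenue": 50.0},
    {"ts": 70, "store_id": "S01", "revenue": 25.0},  # falls into the next window
]
print(aggregate(events))
# {(0, 'S01'): 150.0, (60, 'S01'): 25.0}
```

A real stream processor does the same grouping continuously over an unbounded stream, with state stored durably and results emitted to the real-time store as each window fires.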
46.4 Change Data Capture
Change Data Capture (CDC) is the technique of detecting and propagating changes (inserts, updates, deletes) from a source database as they happen. CDC is the most common bridge between operational systems and streaming pipelines.
Common CDC tools:
- Debezium — open-source CDC for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB; emits to Kafka.
- AWS DMS — managed CDC service in AWS.
- Oracle GoldenGate — enterprise CDC for Oracle ecosystems.
- Striim, HVR, Qlik Replicate — commercial alternatives.
- Native database features — SQL Server's Change Tracking, Oracle's LogMiner.
CDC complements batch ETL: a CDC stream pushes incremental changes; a periodic batch full-load reconciles for any drift. Together they form the backbone of most modern hybrid data pipelines.
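A Debezium change event carries the row state before and after the change plus an operation code. The field names below (`payload`, `before`, `after`, `op`) follow Debezium's documented JSON envelope (with schema envelopes enabled); the replay logic itself is an illustrative sketch, not production code:

```python
import json

def apply_change(table, raw_event, key_field="id"):
    """Apply one Debezium-style change event to an in-memory table keyed by id."""
    payload = json.loads(raw_event)["payload"]
    op = payload["op"]  # 'c' = create, 'u' = update, 'd' = delete, 'r' = snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        table[row[key_field]] = row                     # upsert the new row state
    elif op == "d":
        table.pop(payload["before"][key_field], None)   # remove the deleted row
    return table

table = {}
insert = json.dumps({"payload": {"op": "c", "before": None,
                                 "after": {"id": 1, "status": "open"}}})
update = json.dumps({"payload": {"op": "u", "before": {"id": 1, "status": "open"},
                                 "after": {"id": 1, "status": "paid"}}})
apply_change(table, insert)
apply_change(table, update)
print(table)  # {1: {'id': 1, 'status': 'paid'}}
```

Replaying the full change stream from the beginning reconstructs the current state of the source table, which is why CDC pairs naturally with a durable, replayable broker such as Kafka.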
46.5 Lambda and Kappa Architectures
Two architectural patterns reconcile streaming with batch:
Lambda Architecture (Nathan Marz): Parallel batch and streaming pipelines feeding the same analytical store. The batch layer provides correctness over the long horizon; the streaming layer provides freshness for the recent window. The serving layer reconciles both.
Kappa Architecture (Jay Kreps): Streaming-only. Batch jobs are treated as a special case of streaming on bounded data. Simpler operationally but demands a mature streaming platform.
Most modern Microsoft Fabric / Snowflake / Databricks deployments lean toward Kappa-style patterns: a single pipeline that scales from streaming to batch through the same primitives. Lambda persists in older or hybrid environments.
46.6 Real-Time Visualisation in BI Tools
The streaming infrastructure is upstream; the dashboard is the consumption surface. Different BI tools support real-time differently.
| Tool | Real-Time Mechanism | Latency Achievable |
|---|---|---|
| Power BI Streaming Datasets | API-pushed events that visuals subscribe to | Sub-second |
| Power BI DirectQuery | Live SQL against the source on every visual refresh | Seconds (depends on source) |
| Power BI Direct Lake (Fabric) | Native parquet read from OneLake without extract | Seconds |
| Tableau Live Connection | Live SQL against the source | Seconds |
| Grafana | Native time-series visualisation, auto-refresh | Sub-second to seconds |
| Kibana | Elasticsearch-backed; auto-refreshing visuals | Sub-second to seconds |
| Apache Superset | Open-source BI with auto-refresh and SQL Lab | Seconds |
| Tableau Server Webhooks | Subscribe-to-change notifications pushed from Tableau Server | Seconds |
The right tool depends on the use case:
- Power BI Streaming Datasets for executive monitoring and KPI tickers.
- Power BI DirectQuery or Direct Lake for sub-minute analytical dashboards on cloud warehouses.
- Grafana or Kibana for operational and infrastructure monitoring (logs, metrics, traces).
- Tableau Live for analyst-facing dashboards on fast warehouses.
46.7 Use Cases for Real-Time Analytics
| Use Case | Why Real-Time Matters |
|---|---|
| Operations Control Tower | Vehicle, asset, route status updates within seconds |
| Fraud Detection | Score a transaction before it clears |
| IoT and Predictive Maintenance | Detect equipment failure within seconds of sensor anomaly |
| Real-Time Personalisation | Adjust pricing or recommendations during the user’s session |
| Algorithmic Trading | Sub-millisecond market-data response |
| Live Sports and Media | Game scores, viewer counts, ad metrics |
| Cybersecurity | Detect intrusions while they happen |
| Customer Support Triage | Surface a customer’s full context as the call begins |
| Network Operations Centres | Telecom and infrastructure monitoring |
For most other dashboards — quarterly reviews, monthly board packs, weekly performance — batch is sufficient and far cheaper. The discipline is to apply real-time only where it genuinely changes the decision.
46.8 Performance and Cost Considerations
Real-time architectures cost more than batch in three ways:
- Infrastructure: Always-on streaming platforms (Kafka clusters, Flink jobs, real-time stores) cost more than scheduled batch jobs.
- Engineering complexity: Stream processing handles late events, exactly-once semantics, schema evolution, and back-pressure — concerns batch jobs do not face.
- Operational overhead: A streaming pipeline running 24/7 needs monitoring, on-call rotation, and runbooks.
Cost-optimisation patterns:
- Use streaming only where it earns the latency: For dashboards that refresh hourly, use batch.
- Materialise aggregates: A real-time store with one-minute aggregates is cheaper than full-event storage.
- Tiered storage: Hot recent data in fast stores, older data in cheaper batch warehouses.
- Right-size the streaming infrastructure: Kafka cluster scaled to actual throughput; Flink parallelism tuned to the workload.
46.9 Best Practices
- Validate the latency requirement: Ask precisely what decision changes if the data is an hour old. If none, batch is correct.
- Decouple producer and consumer through a broker: Direct database-to-dashboard connections are fragile under load; Kafka or Event Hubs provides durability and replay.
- Plan for late events: Watermarks and triggers in Flink/Beam handle late arrivals; ignore them and aggregations are wrong.
- Idempotent consumers: Replays should not double-count; use deduplication keys or upsert semantics.
- Schema registry: Avro or Protobuf with a schema registry prevents the producer-consumer schema drift that breaks streaming pipelines silently.
- Backfill plan: A streaming pipeline that has been running for six months will still need to handle historical loads — schema migrations, recomputed aggregates, model retraining on past data.
- Real-time governance: The same data-quality, lineage, and access-control discipline applies to streaming as to batch; do not relax governance because the data is faster.
- Document the latency SLA: for example, "the dashboard reflects events within 90 seconds, 99% of the time". Without an explicit SLA, expectations drift.
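The idempotent-consumer practice above can be sketched with a deduplication key. In production the seen-set would live in a durable store (or be replaced by upsert semantics in the sink); this in-memory version, with an assumed `event_id` field assigned by the producer, just shows the mechanism:

```python
class IdempotentConsumer:
    """Skip events already processed, so a broker replay cannot double-count."""

    def __init__(self):
        self.seen = set()   # deduplication keys already processed
        self.total = 0.0    # running aggregate

    def process(self, event):
        key = event["event_id"]   # unique key assigned by the producer
        if key in self.seen:
            return False          # duplicate from a replay; ignore it
        self.seen.add(key)
        self.total += event["revenue"]
        return True

c = IdempotentConsumer()
c.process({"event_id": "e1", "revenue": 100.0})
c.process({"event_id": "e1", "revenue": 100.0})  # replayed duplicate
print(c.total)  # 100.0, not 200.0
```

The same guarantee can often be had for free by writing results with upsert semantics keyed on the event id, which is why many real-time stores expose exactly that.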
46.10 Common Pitfalls
- Real-Time Where Batch Suffices: Building expensive streaming infrastructure for a quarterly board pack.
- No Late-Event Handling: Stream processor produces hourly aggregates that lose late-arriving events; numbers under-count silently.
- Producer-Consumer Schema Drift: Producer adds a field; downstream consumer breaks because no schema registry enforces compatibility.
- No Replay Path: Streaming pipeline fails for two hours; no way to backfill the lost period.
- Direct Database Hits at Scale: Every dashboard query hits the live operational database; OLTP performance degrades.
- Hot-Cold Mismatch: Real-time store full of years of events when only the last hour matters; cost balloons.
- Streaming Without Monitoring: Pipeline fails silently; the dashboard freezes at last-known values; users discover hours later.
- No Idempotency: Replay double-counts events; numbers diverge from batch reconciliation.
- Over-Engineering: Adopting Lambda or Kappa when a single batch ETL would meet the actual freshness need.
- Streaming for Status Updates Only: Building real-time pipelines for operational status when a 60-second polling refresh on a smaller batch would suffice.
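The late-event pitfall above can be made concrete. An aggregator that closes each window the moment the clock passes silently drops late arrivals; one that keeps windows open for an allowed-lateness period counts them. The toy event-time sketch below simplifies the watermark logic far beyond what Flink or Beam actually do, but shows the trade-off:

```python
def aggregate_with_lateness(events, window=60, allowed_lateness=30):
    """Event-time tumbling windows that accept events arriving up to
    allowed_lateness seconds after the window closes (simplified watermark)."""
    totals, dropped, max_seen = {}, [], 0
    for e in events:                       # events arrive in processing order
        max_seen = max(max_seen, e["event_ts"])
        watermark = max_seen - allowed_lateness
        start = e["event_ts"] - e["event_ts"] % window
        if start + window <= watermark:    # window closed beyond the lateness bound
            dropped.append(e)              # a stricter pipeline would lose this silently
            continue
        totals[start] = totals.get(start, 0) + e["count"]
    return totals, dropped

events = [
    {"event_ts": 10,  "count": 1},
    {"event_ts": 70,  "count": 1},   # advances the watermark past window [0, 60)
    {"event_ts": 50,  "count": 1},   # late, but within allowed lateness: counted
    {"event_ts": 200, "count": 1},   # watermark jumps forward
    {"event_ts": 55,  "count": 1},   # too late: window [0, 60) is closed; dropped
]
totals, dropped = aggregate_with_lateness(events)
print(totals)   # {0: 2, 60: 1, 180: 1}
print(dropped)  # [{'event_ts': 55, 'count': 1}]
```

With allowed_lateness set to zero, the event at timestamp 50 would also be dropped, which is exactly the silent under-count the pitfall describes.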
46.11 Illustrative Cases
A Logistics Control Tower
A logistics firm operates a fleet of 5,000 vehicles across India. Driver telemetry — GPS, speed, route progress, exception flags — streams from each vehicle to Azure Event Hubs every 30 seconds. Azure Stream Analytics processes the events and writes per-route summaries to Azure Synapse. A Power BI dashboard with DirectQuery shows the live fleet on a map with colour-coded route status. The control room responds to exceptions within 60 seconds.
A Bank’s Fraud-Scoring Pipeline
A retail bank scores every card-not-present transaction in under 200 ms. The transaction event flows from the payment switch to a Kafka topic; a Flink job applies a pre-trained ML model and writes the score to a real-time decision service. A Grafana dashboard for the fraud-operations team shows the rate of high-risk scores per minute; spikes prompt analyst review. The dashboard is the visualisation layer; the streaming infrastructure is the engine.
A Manufacturing Sensor Stream
A manufacturing plant instruments a critical line with 200 vibration and temperature sensors. Telemetry streams into a TimescaleDB cluster at 10,000 events per second. A Grafana dashboard shows real-time vibration spectra and triggered alerts. When an anomaly fires, the maintenance team has 15–30 minutes of advance warning before the equipment fails — orders of magnitude better than the daily PM cycle that preceded it.
46.12 Hands-On Exercise: Building a Power BI Streaming Dashboard
Aim: Build a real-time Power BI dashboard for Yuvijen Stores using a Streaming Dataset, producing events from a small simulator and consuming them in a tile that refreshes within seconds.
Deliverable: A streaming dataset, a Power BI dashboard with one streaming tile, and a Python (or Power Automate) producer that pushes events to the dataset.
46.12.1 Step 1 — Create the Streaming Dataset
- In app.powerbi.com, navigate to a workspace.
- + New → Streaming dataset.
- Choose API as the source.
- Define fields:
  - timestamp (DateTime)
  - store_id (Text)
  - transaction_count (Number)
  - revenue (Number)
- Toggle Historic data analysis ON if you want to keep history (uses a small underlying dataset behind the streaming dataset).
- Click Create.
Power BI returns:
- A Push URL for the API (a long URL).
- A code snippet for posting events.
46.12.2 Step 2 — Create the Dashboard and Add a Streaming Tile
- + New → Dashboard, name it Yuvijen Live Operations.
- Click Edit → Add a tile → Custom Streaming Data.
- Select the streaming dataset created above.
- Choose visualisation type — Card for total revenue, Line chart for revenue over time.
- Configure the time window (e.g., last 1 minute for a live ticker).
- Save the tile.
The tile appears on the dashboard, ready to receive data.
46.12.3 Step 3 — Build a Producer Script
A simple Python producer that posts random events to the Push URL:
```python
import requests, json, random, time
from datetime import datetime

URL = "https://api.powerbi.com/beta/.../rows?key=..."  # paste the Push URL from Step 1

stores = ["S01", "S02", "S03", "S04"]

while True:
    # One synthetic POS event per second
    event = {
        "timestamp": datetime.utcnow().isoformat() + "Z",
        "store_id": random.choice(stores),
        "transaction_count": random.randint(1, 5),
        "revenue": round(random.uniform(200, 2500), 2)
    }
    # The push API expects a JSON array of rows
    response = requests.post(URL, data=json.dumps([event]))
    print(response.status_code, event)
    time.sleep(1)
```

Save as producer.py and run with python producer.py. Each event posts to Power BI within milliseconds.
If Python is unavailable, the same effect can be achieved with Power Automate: create a flow with a Recurrence trigger (every minute) and a Power BI - Add rows to streaming dataset action.
46.12.4 Step 4 — Watch the Dashboard Update
With the producer running, open the dashboard. The streaming tile updates within seconds as events arrive. The line chart accumulates points; the card refreshes its current value.
This is the production-equivalent pattern: a real source (POS terminals, IoT sensors, application events) replaces the producer, the rest is the same.
46.12.5 Step 5 — Add a Threshold Alert
- Click the streaming tile → Manage Alerts.
- + Add alert rule.
- Configure: When revenue tile drops below 1000, notify me.
- Save.
When the streaming data falls below the threshold, Power BI sends an alert email and a notification on the mobile app.
46.12.6 Step 6 — Connect to a Push Path with Power Automate
For broader integration, route the alert through Power Automate:
- The alert rule above has a built-in option Use Microsoft Power Automate to trigger additional actions.
- Click → opens the Power Automate flow editor.
- Add actions: Post message in a Teams channel with the alert details, or Send an email.
- Save.
The streaming dataset, the dashboard, the alert, and the Teams notification together form a complete real-time operational loop.
46.12.7 Step 7 — Decide When This Is the Right Pattern
The hands-on illustrates a near-real-time pattern that costs little to build and run. It is the right choice when:
- The data source naturally produces events (POS, IoT, application events).
- The audience needs sub-minute freshness on a small set of metrics.
- The volume is moderate (tens of events per second per dataset).
For higher volumes, broader analytical scope, or sub-second latency requirements, dedicated streaming infrastructure (Kafka + Flink + a real-time store + Grafana) is the right answer instead.
46.12.8 Step 8 — Connect to the Visualisation Layer
The hands-on closes the loop on the BI Architecture block:
- Chapter 44 showed batch ETL pipelines.
- Chapter 45 showed warehouse design that those pipelines feed.
- This chapter shows the streaming alternative for the small set of dashboards that genuinely need it.
The dashboard the audience sees is identical in spirit to a batch dashboard; what differs is the freshness of the underlying data and the architectural cost behind it. Real-time is a tool; the chart-selection, dashboard-design, and visual-perception principles from Module 2 apply just the same.
The producer script (producer.py), screenshots of the streaming dataset definition and the dashboard updating, and a short screen recording showing live updates and the threshold alert firing will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Real-Time Analytics Matters | Some business decisions cannot wait for tomorrow's batch refresh |
| Latency Spectrum | |
| Batch | Hours-to-days latency; daily reports, monthly board pack, financial close |
| Near-Real-Time | Seconds-to-minutes latency; operational dashboards, control towers, monitoring screens |
| Real-Time | Sub-second latency; fraud scoring, algorithmic trading, real-time personalisation |
| Streaming Architecture Layers | |
| Event Sources | Operational databases via CDC, IoT sensors, web and app clickstreams, transaction systems |
| Stream Brokers | Buffer and route events durably between producers and consumers |
| Stream Processors | Transform events as they flow through windowed aggregations and joins |
| Real-Time Stores | Serve aggregated state to dashboards with sub-second response |
| Stream Brokers | |
| Apache Kafka | Most widely deployed open-source distributed event broker |
| Apache Pulsar | Apache stream broker with multi-tenancy and tiered storage |
| AWS Kinesis | AWS managed event streaming service |
| Azure Event Hubs | Azure managed event streaming service |
| Google Pub/Sub | Google Cloud managed event streaming service |
| Stream Processors | |
| Apache Flink | Stateful stream processor with strong correctness guarantees |
| Spark Structured Streaming | Spark's streaming API on top of the same DataFrame engine as batch |
| Kafka Streams | Library for stream processing applications running on Kafka |
| ksqlDB | SQL streaming engine on top of Kafka |
| Real-Time Stores | |
| Materialize and RisingWave | Streaming SQL platforms with materialised views over events |
| Apache Druid | Open-source real-time analytics database with sub-second OLAP |
| ClickHouse and Pinot | Open-source columnar databases for fast analytics on streaming data |
| Time-Series Databases | InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry |
| Change Data Capture | |
| Change Data Capture | Detect and propagate database changes as they happen |
| Debezium | Open-source CDC for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB |
| AWS DMS | AWS managed CDC service |
| Oracle GoldenGate | Enterprise CDC for Oracle ecosystems |
| Architecture Patterns | |
| Lambda Architecture | Parallel batch and streaming pipelines feeding the same analytical store |
| Kappa Architecture | Streaming-only pipeline with batch jobs treated as bounded streams |
| Real-Time Visualisation in BI | |
| Power BI Streaming Datasets | API-pushed events that visuals subscribe to with sub-second updates |
| Power BI DirectQuery | Live SQL against the source on every visual refresh |
| Power BI Direct Lake | Native parquet read from OneLake without extract; Fabric feature |
| Tableau Live Connection | Live SQL against the source for analyst-facing dashboards on fast warehouses |
| Grafana | Native time-series visualisation tool with auto-refresh; the operations standard |
| Kibana | Elasticsearch-backed visualisation with auto-refresh; standard for log analytics |
| Apache Superset | Open-source BI with auto-refresh and SQL Lab |
| Use Cases for Real-Time | |
| Operations Control Tower | Vehicle, asset, route status updates within seconds for logistics and operations |
| Fraud Detection | Score a transaction before it clears for retail banking and payments |
| IoT and Predictive Maintenance | Detect equipment failure within seconds of sensor anomaly for factories |
| Real-Time Personalisation | Adjust pricing or recommendations during the user's session for digital products |
| Algorithmic Trading | Sub-millisecond market-data response for capital markets |
| Live Sports and Media | Game scores, viewer counts, ad metrics for live broadcasting |
| Cybersecurity | Detect intrusions while they happen for security operations |
| Customer Support Triage | Surface a customer's full context as the call begins for service desks |
| Network Operations Centres | Telecom and infrastructure monitoring at very high event rates |
| Cost Considerations | |
| Infrastructure Cost | Always-on streaming platforms cost more than scheduled batch |
| Engineering Complexity Cost | Stream processing handles late events, exactly-once semantics, back-pressure |
| Operational Overhead | A streaming pipeline running 24/7 needs monitoring and on-call rotation |
| Use Streaming Only Where It Earns | Use streaming only where it earns the latency cost; batch otherwise |
| Materialise Aggregates | A real-time store with one-minute aggregates is cheaper than full-event storage |
| Tiered Storage | Hot recent data in fast stores, older data in cheaper batch warehouses |
| Right-Size Infrastructure | Kafka cluster scaled to actual throughput; Flink parallelism tuned to workload |
| Best Practices | |
| Validate Latency Requirement | Ask precisely what decision changes if the data is an hour old |
| Decouple via Broker | Direct database-to-dashboard connections are fragile; brokers provide durability and replay |
| Plan for Late Events | Watermarks and triggers handle late arrivals; ignore them and aggregations are wrong |
| Idempotent Consumers | Replays should not double-count; use deduplication keys or upsert semantics |
| Schema Registry | Avro or Protobuf with a registry prevents producer-consumer schema drift |
| Backfill Plan | A streaming pipeline still needs to handle historical loads, recomputed aggregates |
| Real-Time Governance | Same data-quality, lineage, access-control discipline applies to streaming as batch |
| Latency SLA | Document the latency SLA explicitly so expectations do not drift |
| Common Pitfalls | |
| Real-Time Where Batch Suffices | Pitfall of building expensive streaming infrastructure for a quarterly board pack |
| No Late-Event Handling | Pitfall of stream processor producing aggregates that lose late-arriving events |
| Schema Drift | Pitfall of producer adding a field and consumer breaking with no schema registry |
| No Replay Path | Pitfall of streaming pipeline failing with no way to backfill the lost period |
| Direct Database Hits | Pitfall of every dashboard hitting the live OLTP database and degrading it |
| Hot-Cold Mismatch | Pitfall of real-time store full of years of events when only the last hour matters |
| Streaming Without Monitoring | Pitfall of pipeline failing silently; dashboard freezes; users discover hours later |
| No Idempotency | Pitfall of replays double-counting and diverging from batch reconciliation |
| Over-Engineering | Pitfall of adopting Lambda or Kappa when a single batch ETL would meet freshness need |
| Streaming for Trivial Status | Pitfall of building real-time for status when a polling refresh would suffice |