```mermaid
flowchart LR
B["Batch<br>Hours to days<br>Daily reports,<br>monthly board pack"] --> N["Near-Real-Time<br>Seconds to minutes<br>Operational dashboards,<br>monitoring screens"]
N --> R["Real-Time<br>Sub-second<br>Fraud scoring,<br>algorithmic trading"]
style B fill:#e8f5e9,stroke:#388E3C
style N fill:#fff8e1,stroke:#F9A825
style R fill:#fce4ec,stroke:#AD1457
```
46 Real-Time Analytics and Streaming Data
46.1 Why Real-Time Analytics Matters
Some business decisions cannot wait for tomorrow’s batch refresh.
A logistics control tower needs to see vehicle positions within seconds. A fraud-detection system has to score a transaction before it clears. A factory’s predictive-maintenance dashboard cannot wait for an overnight refresh; the bearing fails in three minutes. Real-time analytics is the architecture that handles these cases — data is processed and visualised as it arrives, not as a daily batch.
The standard reference on streaming systems is Streaming Systems by Tyler Akidau et al. (2018), which codifies the dataflow model behind modern stream processors. The broader treatment of streaming as part of data systems is Designing Data-Intensive Applications by Martin Kleppmann (2017), an indispensable reference for any analyst working at the intersection of analytics and data engineering.
For a visualisation-focused book, this chapter covers the small but consequential set of dashboards that genuinely need sub-minute freshness. Most dashboards do not; the few that do often matter disproportionately.
46.2 Real-Time, Near-Real-Time, and Batch
| Tier | Latency | Use Cases | Cost Profile |
|---|---|---|---|
| Batch | Hours to days | Daily reports, monthly board pack, financial close | Lowest — runs once per period |
| Near-Real-Time | Seconds to minutes | Operational dashboards, control towers, monitoring screens | Medium — streaming infrastructure but bounded freshness |
| Real-Time | Sub-second | Fraud scoring, algorithmic trading, real-time personalisation | Highest — full streaming pipelines, hot caches, ML inference |
For most analytical workloads, batch is the right answer; freshness beyond what batch provides is rarely worth the operational cost. The honest discipline is to ask: what decision becomes wrong if the data is an hour old? If none, batch suffices.
46.3 Streaming Architecture
A real-time analytics pipeline has four distinct layers:
| Layer | Role | Examples |
|---|---|---|
| Event Sources | Produce events as they happen | Operational databases (via CDC), IoT sensors, web/app clickstreams, transaction systems |
| Stream Brokers | Buffer and route events durably | Apache Kafka, Apache Pulsar, AWS Kinesis, Azure Event Hubs, Google Pub/Sub |
| Stream Processors | Transform events as they flow | Apache Flink, Spark Structured Streaming, Kafka Streams, ksqlDB |
| Real-Time Stores | Serve aggregated state to dashboards | Materialize, RisingWave, Apache Druid, ClickHouse, Pinot, time-series DBs (InfluxDB, TimescaleDB) |
A simplified end-to-end flow: an event is produced (a customer pays at a POS terminal); it is published to a Kafka topic; Flink processes the event in a windowed aggregation (running per-store sales by minute); the result is written to a real-time store; the BI dashboard reads from the store and refreshes within seconds.
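The windowed-aggregation step in the middle of this flow can be sketched in plain Python as a minimal, single-process stand-in for what Flink does at scale. The event shape, field names, and one-minute tumbling window here are illustrative assumptions, not a Flink API:

```python
from collections import defaultdict

def window_key(event, window_seconds=60):
    """Assign an event to a tumbling window by flooring its timestamp."""
    return event["ts"] - (event["ts"] % window_seconds)

def aggregate(events, window_seconds=60):
    """Running per-store revenue by one-minute window, as in the Flink step above."""
    totals = defaultdict(float)  # (window_start, store_id) -> revenue
    for e in events:
        totals[(window_key(e, window_seconds), e["store_id"])] += e["revenue"]
    return dict(totals)

events = [
    {"ts": 0,  "store_id": "S01", "revenue": 100.0},
    {"ts": 30, "store_id": "S01", "revenue": 50.0},
    {"ts": 70, "store_id": "S01", "revenue": 25.0},  # falls into the next window
]
print(aggregate(events))
# {(0, 'S01'): 150.0, (60, 'S01'): 25.0}
```

A real stream processor does the same grouping continuously over an unbounded stream, with state stored durably and results emitted to the real-time store as each window fires.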
46.4 Change Data Capture
Change Data Capture (CDC) is the technique of detecting and propagating changes (inserts, updates, deletes) from a source database as they happen. CDC is the most common bridge between operational systems and streaming pipelines.
Common CDC tools:
- Debezium — open-source CDC for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB; emits to Kafka.
- AWS DMS — managed CDC service in AWS.
- Oracle GoldenGate — enterprise CDC for Oracle ecosystems.
- Striim, HVR, Qlik Replicate — commercial alternatives.
- Native database features — SQL Server's Change Tracking, Oracle's LogMiner.
CDC complements batch ETL: a CDC stream pushes incremental changes; a periodic batch full-load reconciles for any drift. Together they form the backbone of most modern hybrid data pipelines.
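A Debezium change event carries the row state before and after the change plus an operation code. The field names below (`payload`, `before`, `after`, `op`) follow Debezium's documented JSON envelope (with schema envelopes enabled); the replay logic itself is an illustrative sketch, not production code:

```python
import json

def apply_change(table, raw_event, key_field="id"):
    """Apply one Debezium-style change event to an in-memory table keyed by id."""
    payload = json.loads(raw_event)["payload"]
    op = payload["op"]  # 'c' = create, 'u' = update, 'd' = delete, 'r' = snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        table[row[key_field]] = row                     # upsert the new row state
    elif op == "d":
        table.pop(payload["before"][key_field], None)   # remove the deleted row
    return table

table = {}
insert = json.dumps({"payload": {"op": "c", "before": None,
                                 "after": {"id": 1, "status": "open"}}})
update = json.dumps({"payload": {"op": "u", "before": {"id": 1, "status": "open"},
                                 "after": {"id": 1, "status": "paid"}}})
apply_change(table, insert)
apply_change(table, update)
print(table)  # {1: {'id': 1, 'status': 'paid'}}
```

Replaying the full change stream from the beginning reconstructs the current state of the source table, which is why CDC pairs naturally with a durable, replayable broker such as Kafka.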
46.5 Lambda and Kappa Architectures
Two architectural patterns reconcile streaming with batch:
Lambda Architecture (Nathan Marz): Parallel batch and streaming pipelines feeding the same analytical store. The batch layer provides correctness over the long horizon; the streaming layer provides freshness for the recent window. The serving layer reconciles both.
Kappa Architecture (Jay Kreps): Streaming-only. Batch jobs are treated as a special case of streaming on bounded data. Simpler operationally but demands a mature streaming platform.
Most modern Microsoft Fabric / Snowflake / Databricks deployments lean toward Kappa-style patterns: a single pipeline that scales from streaming to batch through the same primitives. Lambda persists in older or hybrid environments.
46.6 Real-Time Visualisation in BI Tools
The streaming infrastructure is upstream; the dashboard is the consumption surface. Different BI tools support real-time differently.
| Tool | Real-Time Mechanism | Latency Achievable |
|---|---|---|
| Power BI Streaming Datasets | API-pushed events that visuals subscribe to | Sub-second |
| Power BI DirectQuery | Live SQL against the source on every visual refresh | Seconds (depends on source) |
| Power BI Direct Lake (Fabric) | Native parquet read from OneLake without extract | Seconds |
| Tableau Live Connection | Live SQL against the source | Seconds |
| Grafana | Native time-series visualisation, auto-refresh | Sub-second to seconds |
| Kibana | Elasticsearch-backed; auto-refreshing visuals | Sub-second to seconds |
| Apache Superset | Open-source BI with auto-refresh and SQL Lab | Seconds |
| Tableau Server Webhooks | Subscribe-to-change notifications pushed from Tableau Server | Seconds |
The right tool depends on the use case:
- Power BI Streaming Datasets for executive monitoring and KPI tickers.
- Power BI DirectQuery or Direct Lake for sub-minute analytical dashboards on cloud warehouses.
- Grafana or Kibana for operational and infrastructure monitoring (logs, metrics, traces).
- Tableau Live for analyst-facing dashboards on fast warehouses.
46.7 Use Cases for Real-Time Analytics
| Use Case | Why Real-Time Matters |
|---|---|
| Operations Control Tower | Vehicle, asset, route status updates within seconds |
| Fraud Detection | Score a transaction before it clears |
| IoT and Predictive Maintenance | Detect equipment failure within seconds of sensor anomaly |
| Real-Time Personalisation | Adjust pricing or recommendations during the user’s session |
| Algorithmic Trading | Sub-millisecond market-data response |
| Live Sports and Media | Game scores, viewer counts, ad metrics |
| Cybersecurity | Detect intrusions while they happen |
| Customer Support Triage | Surface a customer’s full context as the call begins |
| Network Operations Centres | Telecom and infrastructure monitoring |
For most other dashboards — quarterly reviews, monthly board packs, weekly performance — batch is sufficient and far cheaper. The discipline is to apply real-time only where it genuinely changes the decision.
46.8 Performance and Cost Considerations
Real-time architectures cost more than batch in three ways:
- Infrastructure: Always-on streaming platforms (Kafka clusters, Flink jobs, real-time stores) cost more than scheduled batch jobs.
- Engineering complexity: Stream processing handles late events, exactly-once semantics, schema evolution, and back-pressure — concerns batch jobs do not face.
- Operational overhead: A streaming pipeline running 24/7 needs monitoring, on-call rotation, and runbooks.
Cost-optimisation patterns:
- Use streaming only where it earns the latency: For dashboards that refresh hourly, use batch.
- Materialise aggregates: A real-time store with one-minute aggregates is cheaper than full-event storage.
- Tiered storage: Hot recent data in fast stores, older data in cheaper batch warehouses.
- Right-size the streaming infrastructure: Kafka cluster scaled to actual throughput; Flink parallelism tuned to the workload.
46.9 Best Practices
- Validate the latency requirement: Ask precisely what decision changes if the data is an hour old. If none, batch is correct.
- Decouple producer and consumer through a broker: Direct database-to-dashboard connections are fragile under load; Kafka or Event Hubs provides durability and replay.
- Plan for late events: Watermarks and triggers in Flink/Beam handle late arrivals; ignore them and aggregations are wrong.
- Idempotent consumers: Replays should not double-count; use deduplication keys or upsert semantics.
- Schema registry: Avro or Protobuf with a schema registry prevents the producer-consumer schema drift that breaks streaming pipelines silently.
- Backfill plan: A streaming pipeline that has been running for six months will still need to handle historical loads — schema migrations, recomputed aggregates, model retraining on past data.
- Real-time governance: The same data-quality, lineage, and access-control discipline applies to streaming as to batch; do not relax governance because the data is faster.
- Document the latency SLA: for example, "the dashboard reflects events within 90 seconds, 99% of the time". Without an explicit SLA, expectations drift.
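The idempotent-consumer practice above can be sketched with a deduplication key. In production the seen-set would live in a durable store (or be replaced by upsert semantics in the sink); this in-memory version, with an assumed `event_id` field assigned by the producer, just shows the mechanism:

```python
class IdempotentConsumer:
    """Skip events already processed, so a broker replay cannot double-count."""

    def __init__(self):
        self.seen = set()   # deduplication keys already processed
        self.total = 0.0    # running aggregate

    def process(self, event):
        key = event["event_id"]   # unique key assigned by the producer
        if key in self.seen:
            return False          # duplicate from a replay; ignore it
        self.seen.add(key)
        self.total += event["revenue"]
        return True

c = IdempotentConsumer()
c.process({"event_id": "e1", "revenue": 100.0})
c.process({"event_id": "e1", "revenue": 100.0})  # replayed duplicate
print(c.total)  # 100.0, not 200.0
```

The same guarantee can often be had for free by writing results with upsert semantics keyed on the event id, which is why many real-time stores expose exactly that.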
46.10 Common Pitfalls
- Real-Time Where Batch Suffices: Building expensive streaming infrastructure for a quarterly board pack.
- No Late-Event Handling: Stream processor produces hourly aggregates that lose late-arriving events; numbers under-count silently.
- Producer-Consumer Schema Drift: Producer adds a field; downstream consumer breaks because no schema registry enforces compatibility.
- No Replay Path: Streaming pipeline fails for two hours; no way to backfill the lost period.
- Direct Database Hits at Scale: Every dashboard query hits the live operational database; OLTP performance degrades.
- Hot-Cold Mismatch: Real-time store full of years of events when only the last hour matters; cost balloons.
- Streaming Without Monitoring: Pipeline fails silently; the dashboard freezes at last-known values; users discover hours later.
- No Idempotency: Replay double-counts events; numbers diverge from batch reconciliation.
- Over-Engineering: Adopting Lambda or Kappa when a single batch ETL would meet the actual freshness need.
- Streaming for Status Updates Only: Building real-time pipelines for operational status when a 60-second polling refresh on a smaller batch would suffice.
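The late-event pitfall above can be made concrete. An aggregator that closes each window the moment the clock passes silently drops late arrivals; one that keeps windows open for an allowed-lateness period counts them. The toy event-time sketch below simplifies the watermark logic far beyond what Flink or Beam actually do, but shows the trade-off:

```python
def aggregate_with_lateness(events, window=60, allowed_lateness=30):
    """Event-time tumbling windows that accept events arriving up to
    allowed_lateness seconds after the window closes (simplified watermark)."""
    totals, dropped, max_seen = {}, [], 0
    for e in events:                       # events arrive in processing order
        max_seen = max(max_seen, e["event_ts"])
        watermark = max_seen - allowed_lateness
        start = e["event_ts"] - e["event_ts"] % window
        if start + window <= watermark:    # window closed beyond the lateness bound
            dropped.append(e)              # a stricter pipeline would lose this silently
            continue
        totals[start] = totals.get(start, 0) + e["count"]
    return totals, dropped

events = [
    {"event_ts": 10,  "count": 1},
    {"event_ts": 70,  "count": 1},   # advances the watermark past window [0, 60)
    {"event_ts": 50,  "count": 1},   # late, but within allowed lateness: counted
    {"event_ts": 200, "count": 1},   # watermark jumps forward
    {"event_ts": 55,  "count": 1},   # too late: window [0, 60) is closed; dropped
]
totals, dropped = aggregate_with_lateness(events)
print(totals)   # {0: 2, 60: 1, 180: 1}
print(dropped)  # [{'event_ts': 55, 'count': 1}]
```

With allowed_lateness set to zero, the event at timestamp 50 would also be dropped, which is exactly the silent under-count the pitfall describes.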
46.11 Illustrative Cases
A Logistics Control Tower
A logistics firm operates a fleet of 5,000 vehicles across India. Driver telemetry — GPS, speed, route progress, exception flags — streams from each vehicle to Azure Event Hubs every 30 seconds. Azure Stream Analytics processes the events and writes per-route summaries to Azure Synapse. A Power BI dashboard with DirectQuery shows the live fleet on a map with colour-coded route status. The control room responds to exceptions within 60 seconds.
A Bank’s Fraud-Scoring Pipeline
A retail bank scores every card-not-present transaction in under 200 ms. The transaction event flows from the payment switch to a Kafka topic; a Flink job applies a pre-trained ML model and writes the score to a real-time decision service. A Grafana dashboard for the fraud-operations team shows the rate of high-risk scores per minute; spikes prompt analyst review. The dashboard is the visualisation layer; the streaming infrastructure is the engine.
A Manufacturing Sensor Stream
A manufacturing plant instruments a critical line with 200 vibration and temperature sensors. Telemetry streams into a TimescaleDB cluster at 10,000 events per second. A Grafana dashboard shows real-time vibration spectra and triggered alerts. When an anomaly fires, the maintenance team has 15–30 minutes of advance warning before the equipment fails — orders of magnitude better than the daily PM cycle that preceded it.
46.12 Hands-On Exercise: Building a Power BI Streaming Dashboard
Aim: Build a real-time Power BI dashboard for Yuvijen Stores using a Streaming Dataset, producing events from a small simulator and consuming them in a tile that refreshes within seconds.
Deliverable: A streaming dataset, a Power BI dashboard with one streaming tile, and a Python (or Power Automate) producer that pushes events to the dataset.
46.12.1 Step 1 — Create the Streaming Dataset
- In app.powerbi.com, navigate to a workspace.
- + New → Streaming dataset.
- Choose API as the source.
- Define fields:
  - timestamp (DateTime)
  - store_id (Text)
  - transaction_count (Number)
  - revenue (Number)
- Toggle Historic data analysis ON if you want to keep history (uses a small underlying dataset behind the streaming dataset).
- Click Create.
Power BI returns:
- A Push URL for the API (a long URL).
- A code snippet for posting events.
46.12.2 Step 2 — Create the Dashboard and Add a Streaming Tile
- + New → Dashboard, name it Yuvijen Live Operations.
- Click Edit → Add a tile → Custom Streaming Data.
- Select the streaming dataset created above.
- Choose visualisation type — Card for total revenue, Line chart for revenue over time.
- Configure the time window (e.g., last 1 minute for a live ticker).
- Save the tile.
The tile appears on the dashboard, ready to receive data.
46.12.3 Step 3 — Build a Producer Script
A simple Python producer that posts random events to the Push URL:
```python
import requests, json, random, time
from datetime import datetime

URL = "https://api.powerbi.com/beta/.../rows?key=..."  # paste the Push URL from Step 1

stores = ["S01", "S02", "S03", "S04"]

while True:
    # One synthetic POS event per second
    event = {
        "timestamp": datetime.utcnow().isoformat() + "Z",
        "store_id": random.choice(stores),
        "transaction_count": random.randint(1, 5),
        "revenue": round(random.uniform(200, 2500), 2)
    }
    # The push API expects a JSON array of rows
    response = requests.post(URL, data=json.dumps([event]))
    print(response.status_code, event)
    time.sleep(1)
```

Save as producer.py and run with python producer.py. Each event posts to Power BI within milliseconds.
If Python is unavailable, the same effect can be achieved with Power Automate: create a flow with a Recurrence trigger (every minute) and a Power BI - Add rows to streaming dataset action.
46.12.4 Step 4 — Watch the Dashboard Update
With the producer running, open the dashboard. The streaming tile updates within seconds as events arrive. The line chart accumulates points; the card refreshes its current value.
This is the production-equivalent pattern: a real source (POS terminals, IoT sensors, application events) replaces the producer, the rest is the same.
46.12.5 Step 5 — Add a Threshold Alert
- Click the streaming tile → Manage Alerts.
- + Add alert rule.
- Configure: When revenue tile drops below 1000, notify me.
- Save.
When the streaming data falls below the threshold, Power BI sends an alert email and a notification on the mobile app.
46.12.6 Step 6 — Connect to a Push Path with Power Automate
For broader integration, route the alert through Power Automate:
- The alert rule above has a built-in option Use Microsoft Power Automate to trigger additional actions.
- Click → opens the Power Automate flow editor.
- Add actions: Post message in a Teams channel with the alert details, or Send an email.
- Save.
The streaming dataset, the dashboard, the alert, and the Teams notification together form a complete real-time operational loop.
46.12.7 Step 7 — Decide When This Is the Right Pattern
The hands-on illustrates a near-real-time pattern that costs little to build and run. It is the right choice when:
- The data source naturally produces events (POS, IoT, application events).
- The audience needs sub-minute freshness on a small set of metrics.
- The volume is moderate (tens of events per second per dataset).
For higher volumes, broader analytical scope, or sub-second latency requirements, dedicated streaming infrastructure (Kafka + Flink + a real-time store + Grafana) is the right answer instead.
46.12.8 Step 8 — Connect to the Visualisation Layer
The hands-on closes the loop on the BI Architecture block:
- Chapter 44 showed batch ETL pipelines.
- Chapter 45 showed warehouse design that those pipelines feed.
- This chapter shows the streaming alternative for the small set of dashboards that genuinely need it.
The dashboard the audience sees is identical in spirit to a batch dashboard; what differs is the freshness of the underlying data and the architectural cost behind it. Real-time is a tool; the chart-selection, dashboard-design, and visual-perception principles from Module 2 apply just the same.
The producer script (producer.py), screenshots of the streaming dataset definition and the dashboard updating, and a short screen recording showing live updates and the threshold alert firing will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Real-Time Analytics Matters | Some business decisions cannot wait for tomorrow's batch refresh |
| Latency Spectrum | |
| Batch | Hours-to-days latency; daily reports, monthly board pack, financial close |
| Near-Real-Time | Seconds-to-minutes latency; operational dashboards, control towers, monitoring screens |
| Real-Time | Sub-second latency; fraud scoring, algorithmic trading, real-time personalisation |
| Streaming Architecture Layers | |
| Event Sources | Operational databases via CDC, IoT sensors, web and app clickstreams, transaction systems |
| Stream Brokers | Buffer and route events durably between producers and consumers |
| Stream Processors | Transform events as they flow through windowed aggregations and joins |
| Real-Time Stores | Serve aggregated state to dashboards with sub-second response |
| Stream Brokers | |
| Apache Kafka | Most widely deployed open-source distributed event broker |
| Apache Pulsar | Apache stream broker with multi-tenancy and tiered storage |
| AWS Kinesis | AWS managed event streaming service |
| Azure Event Hubs | Azure managed event streaming service |
| Google Pub/Sub | Google Cloud managed event streaming service |
| Stream Processors | |
| Apache Flink | Stateful stream processor with strong correctness guarantees |
| Spark Structured Streaming | Spark's streaming API on top of the same DataFrame engine as batch |
| Kafka Streams | Library for stream processing applications running on Kafka |
| ksqlDB | SQL streaming engine on top of Kafka |
| Real-Time Stores | |
| Materialize and RisingWave | Streaming SQL platforms with materialised views over events |
| Apache Druid | Open-source real-time analytics database with sub-second OLAP |
| ClickHouse and Pinot | Open-source columnar databases for fast analytics on streaming data |
| Time-Series Databases | InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry |
| Change Data Capture | |
| Change Data Capture | Detect and propagate database changes as they happen |
| Debezium | Open-source CDC for PostgreSQL, MySQL, Oracle, SQL Server, MongoDB |
| AWS DMS | AWS managed CDC service |
| Oracle GoldenGate | Enterprise CDC for Oracle ecosystems |
| Architecture Patterns | |
| Lambda Architecture | Parallel batch and streaming pipelines feeding the same analytical store |
| Kappa Architecture | Streaming-only pipeline with batch jobs treated as bounded streams |
| Real-Time Visualisation in BI | |
| Power BI Streaming Datasets | API-pushed events that visuals subscribe to with sub-second updates |
| Power BI DirectQuery | Live SQL against the source on every visual refresh |
| Power BI Direct Lake | Native parquet read from OneLake without extract; Fabric feature |
| Tableau Live Connection | Live SQL against the source for analyst-facing dashboards on fast warehouses |
| Grafana | Native time-series visualisation tool with auto-refresh; the operations standard |
| Kibana | Elasticsearch-backed visualisation with auto-refresh; standard for log analytics |
| Apache Superset | Open-source BI with auto-refresh and SQL Lab |
| Use Cases for Real-Time | |
| Operations Control Tower | Vehicle, asset, route status updates within seconds for logistics and operations |
| Fraud Detection | Score a transaction before it clears for retail banking and payments |
| IoT and Predictive Maintenance | Detect equipment failure within seconds of sensor anomaly for factories |
| Real-Time Personalisation | Adjust pricing or recommendations during the user's session for digital products |
| Algorithmic Trading | Sub-millisecond market-data response for capital markets |
| Live Sports and Media | Game scores, viewer counts, ad metrics for live broadcasting |
| Cybersecurity | Detect intrusions while they happen for security operations |
| Customer Support Triage | Surface a customer's full context as the call begins for service desks |
| Network Operations Centres | Telecom and infrastructure monitoring at very high event rates |
| Cost Considerations | |
| Infrastructure Cost | Always-on streaming platforms cost more than scheduled batch |
| Engineering Complexity Cost | Stream processing handles late events, exactly-once semantics, back-pressure |
| Operational Overhead | A streaming pipeline running 24/7 needs monitoring and on-call rotation |
| Use Streaming Only Where It Earns | Use streaming only where it earns the latency cost; batch otherwise |
| Materialise Aggregates | A real-time store with one-minute aggregates is cheaper than full-event storage |
| Tiered Storage | Hot recent data in fast stores, older data in cheaper batch warehouses |
| Right-Size Infrastructure | Kafka cluster scaled to actual throughput; Flink parallelism tuned to workload |
| Best Practices | |
| Validate Latency Requirement | Ask precisely what decision changes if the data is an hour old |
| Decouple via Broker | Direct database-to-dashboard connections are fragile; brokers provide durability and replay |
| Plan for Late Events | Watermarks and triggers handle late arrivals; ignore them and aggregations are wrong |
| Idempotent Consumers | Replays should not double-count; use deduplication keys or upsert semantics |
| Schema Registry | Avro or Protobuf with a registry prevents producer-consumer schema drift |
| Backfill Plan | A streaming pipeline still needs to handle historical loads, recomputed aggregates |
| Real-Time Governance | Same data-quality, lineage, access-control discipline applies to streaming as batch |
| Latency SLA | Document the latency SLA explicitly so expectations do not drift |
| Common Pitfalls | |
| Real-Time Where Batch Suffices | Pitfall of building expensive streaming infrastructure for a quarterly board pack |
| No Late-Event Handling | Pitfall of stream processor producing aggregates that lose late-arriving events |
| Schema Drift | Pitfall of producer adding a field and consumer breaking with no schema registry |
| No Replay Path | Pitfall of streaming pipeline failing with no way to backfill the lost period |
| Direct Database Hits | Pitfall of every dashboard hitting the live OLTP database and degrading it |
| Hot-Cold Mismatch | Pitfall of real-time store full of years of events when only the last hour matters |
| Streaming Without Monitoring | Pitfall of pipeline failing silently; dashboard freezes; users discover hours later |
| No Idempotency | Pitfall of replays double-counting and diverging from batch reconciliation |
| Over-Engineering | Pitfall of adopting Lambda or Kappa when a single batch ETL would meet freshness need |
| Streaming for Trivial Status | Pitfall of building real-time for status when a polling refresh would suffice |