24  Data Integration Challenges and Solutions

24.1 Why Data Integration Matters

Most analytical questions cross the boundaries between systems, and most analytical projects fail at those boundaries.

A modern organisation runs on dozens — sometimes hundreds — of systems. Sales sit in the CRM, billing in the financial-ledger system, customer support in a separate ticket platform, marketing in a campaign tool, web behaviour in an analytics platform, and operational data in plant or branch systems. Each system was built to serve one function, with its own schema, definitions, and update cycle.

The moment an analyst asks a question that touches more than one of these systems — which marketing channels produce the most profitable customers, accounting for support cost? — they encounter the central problem of analytics infrastructure: data integration. The systems do not agree on customer identifiers, the definitions of “active” and “churn” differ, the time grains differ, data quality is uneven, and the access controls do not match.

The visualisation consequences are immediate. Almost every dashboard a business actually cares about is a cross-source dashboard: a Customer 360 view in Tableau or Power BI joins CRM, billing, and support; a marketing-effectiveness dashboard joins campaign data, web behaviour, and revenue; an operations dashboard joins ERP, plant sensors, and supplier data. When integration is shallow, dashboards either show inconsistent numbers across panels or simply cannot be built. When integration is deep, the dashboards across the firm finally agree on basic facts — the same customer count, the same revenue, the same definition of active — and the visualisation layer becomes trustworthy.

Data integration is therefore not an IT detail; it is the operational infrastructure on which every cross-functional dashboard, BI report, and visual decision depends. The standard textbook is Principles of Data Integration by AnHai Doan et al. (2012), and the foundational theoretical treatment remains Maurizio Lenzerini (2002).

24.2 Defining Data Integration

Data Integration is the discipline of combining data from multiple, heterogeneous sources into a unified view that supports analysis and decision making. It encompasses the technical work of moving and reshaping data, the modelling work of resolving entities and definitions, and the governance work of agreeing on what each field means across the firm.

A useful working definition: data integration succeeds when a question that crosses two systems can be answered as easily as a question that lives within one.

24.3 Common Challenges

flowchart TD
    I["Data<br>Integration<br>Challenges"]
    I --> S["Schema<br>Heterogeneity"]
    I --> D["Definitional<br>Differences"]
    I --> E["Entity<br>Resolution"]
    I --> Q["Data Quality<br>Differences"]
    I --> F["Format and<br>Encoding"]
    I --> T["Timing and<br>Freshness"]
    I --> V["Volume<br>and Scale"]
    I --> P["Privacy<br>and Access"]
    style I fill:#e3f2fd,stroke:#1976D2
    style S fill:#fce4ec,stroke:#AD1457
    style D fill:#fff3e0,stroke:#EF6C00
    style E fill:#fff8e1,stroke:#F9A825
    style Q fill:#e8f5e9,stroke:#388E3C
    style F fill:#ede7f6,stroke:#4527A0
    style T fill:#f3e5f5,stroke:#6A1B9A
    style V fill:#eceff1,stroke:#455A64
    style P fill:#cfd8dc,stroke:#37474F

Tip: The Eight Recurring Challenges
Challenge | What It Looks Like
Schema Heterogeneity | Two systems describing the same business entity with different field names, types, and structures
Definitional Differences | The same field name carrying different business meaning across systems (“active customer” in CRM vs Finance)
Entity Resolution | The same real-world entity (customer, product, supplier) represented by different identifiers in different systems
Data Quality Differences | One system has high-quality data and another has missing, stale, or incorrect values for the same entities
Format and Encoding | Currencies, dates, units, character encodings, and category labels expressed differently
Timing and Freshness | One source refreshes daily, another weekly; joins across them carry latent inconsistencies
Volume and Scale | Some systems hold millions of records, others billions; the integration architecture must handle both
Privacy and Access Control | Data sensitive in one context that must remain protected when joined with data from a less restricted source

Maurizio Lenzerini (2002) cast many of these challenges in formal terms — schema mappings, query reformulation, query answering over mediated schemas — and showed that they are deep problems with theoretical limits, not merely engineering inconveniences.

24.4 Integration Approaches

Tip: Common Integration Patterns
Pattern | Idea | When to Use
ETL (Extract-Transform-Load) | Extract from sources, transform in a staging environment, load into target | Traditional data-warehouse loads with strict schema
ELT (Extract-Load-Transform) | Extract and load raw data first, then transform inside the target | Modern cloud warehouses with cheap compute and storage
Data Federation / Virtualisation | Query data in place across systems without moving it | Real-time, low-latency, low-volume integration
Data Lake | Land raw data in a single low-cost store; transform on demand | Mixed-modality data, exploratory analytics
Lakehouse | Lake plus warehouse-like transactional and SQL layer | Combines flexibility of lake with discipline of warehouse
Master Data Management (MDM) | Maintain a single golden record per entity across systems | Customer, product, supplier, employee mastering
Change Data Capture (CDC) | Stream operational changes as they happen | Near-real-time integration
Data Mesh | Decentralised, domain-owned data products | Large enterprises with high domain autonomy
API-Based Integration | Source systems expose data through APIs the consumer pulls or subscribes to | Cross-organisational integration

The right pattern depends on the latency requirement, the volume of data, the maturity of the source systems, and the governance regime. Most mature organisations operate two or three patterns in parallel — ELT into a cloud warehouse for analytical workloads, CDC and streaming for operational integration, and federation for selected real-time queries.

24.4.1 ETL versus ELT

The distinction between ETL and ELT, often glossed over, is consequential:

  • ETL transforms data before loading. Suited to a traditional warehouse where storage is expensive and the schema is fixed. The transformation logic sits in the ETL tool.

  • ELT loads raw data first and transforms inside the warehouse. Suited to modern cloud warehouses where storage is cheap and compute is elastic. The transformation logic sits in SQL — typically expressed in dbt — making it versionable, testable, and reviewable like application code.

The shift from ETL to ELT is one of the defining trends of the modern data stack. It does not eliminate the transformation work; it relocates it to a place where engineers and analysts can collaborate on it.
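To make the relocation concrete, here is a minimal sketch of both flows in Python. It is illustrative only: the in-memory SQLite database stands in for a cloud warehouse, and the table names, sample rows, and the single cleaning rule are hypothetical.

import sqlite3
import pandas as pd

# Stand-in "warehouse": an in-memory SQLite database in place of Snowflake/BigQuery.
warehouse = sqlite3.connect(":memory:")

# Stand-in source extract: a handful of hypothetical order rows.
raw = pd.DataFrame({
    "order_id": ["O-1", "O-2", "O-3"],
    "customer_id": ["C-1", None, "C-2"],
    "order_date": ["2026-04-01", "2026-04-05", "2026-04-02"],
    "amount": [540, 1250, 425],
})

# ETL: transform BEFORE loading. Cleaning happens in the pipeline tool (pandas here);
# only the finished table reaches the warehouse.
clean = (
    raw.dropna(subset=["customer_id"])
       .assign(order_month=lambda d: d["order_date"].str[:7])
)
clean.to_sql("fact_orders_etl", warehouse, index=False)

# ELT: load the raw extract FIRST, then transform inside the warehouse with SQL
# (in practice a dbt model, versioned and tested like application code).
raw.to_sql("raw_orders", warehouse, index=False)
warehouse.execute("""
    CREATE TABLE fact_orders_elt AS
    SELECT *, substr(order_date, 1, 7) AS order_month
    FROM raw_orders
    WHERE customer_id IS NOT NULL
""")

Both routes end with the same table; what differs is where the transformation logic lives and who can maintain it.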

24.4.2 Master Data Management

Master Data Management (MDM) maintains a single trusted record for each business entity — customer, product, supplier, employee, location — across the firm. The MDM platform brings records together from multiple sources, applies entity-resolution rules to identify the same real-world entity, picks the most reliable values for each attribute, and serves the consolidated golden record back to the consuming systems.

Mature MDM is the single most effective answer to the entity-resolution challenge, particularly for customer master data. It also resolves much of the definitional disagreement, because the golden record carries an authoritative value for every disputed attribute.
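The survivorship step at the heart of a golden record can be sketched in a few lines. The source names, trust ordering, and attribute values below are hypothetical; production MDM platforms apply far richer match-and-merge rules, but the shape of the logic is the same.

# Candidate records for one resolved customer, keyed by source system (illustrative).
records = {
    "crm":     {"name": "Anita Sharma", "email": "anita@example.com",   "address": None},
    "billing": {"name": "A. Sharma",    "email": None,                  "address": "12 MG Road, Mumbai"},
    "support": {"name": "Anita Sharma", "email": "anita.s@example.com", "address": "12 MG Road, Mumbai"},
}

# Per-attribute survivorship rule: which source wins when values conflict (assumed ranking).
trust_order = {
    "name":    ["crm", "support", "billing"],
    "email":   ["crm", "support", "billing"],
    "address": ["billing", "support", "crm"],
}

def golden_record(records, trust_order):
    """Pick, for each attribute, the first non-missing value in trust order."""
    golden = {}
    for attribute, sources in trust_order.items():
        golden[attribute] = next(
            (records[s][attribute] for s in sources if records[s][attribute] is not None),
            None,
        )
    return golden

print(golden_record(records, trust_order))
# {'name': 'Anita Sharma', 'email': 'anita@example.com', 'address': '12 MG Road, Mumbai'}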

Common MDM platforms include Informatica MDM, Reltio, Stibo Systems, Profisee, Semarchy, and SAP Master Data Governance. Open-source and lightweight alternatives include Pimcore and Akeneo for product data.

24.4.3 Change Data Capture and Streaming

Change Data Capture (CDC) detects changes in source systems as they happen — inserts, updates, deletes — and propagates them to consumers in near real time. CDC is the foundation of streaming integration architectures.

Common building blocks:

  • Kafka, Pulsar, RabbitMQ: Message brokers that buffer and route streams.
  • Debezium, Maxwell, AWS DMS, Oracle GoldenGate: CDC tools that capture changes from operational databases.
  • Spark Streaming, Flink, Kafka Streams: Stream processors that transform data in motion.
  • Materialize, RisingWave, ksqlDB: Streaming SQL platforms.

CDC is more complex to operate than batch ETL — schema evolution, event ordering, exactly-once semantics, and downstream backfill are all non-trivial — but is essential when business questions require near-real-time integration.
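As a sketch of the consuming side, the snippet below reads Debezium-style change events from a Kafka topic with the kafka-python client. The topic name and broker address are placeholders, the envelope handling assumes Debezium's default event format, and a production consumer would also handle schema evolution, ordering, and replays.

import json
from kafka import KafkaConsumer   # pip install kafka-python

# Topic and broker address are illustrative placeholders.
consumer = KafkaConsumer(
    "shop.public.orders",                       # Debezium topic: <server>.<schema>.<table>
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value or {}
    payload = event.get("payload", event)       # envelope may or may not be unwrapped
    op = payload.get("op")                      # "c" insert, "u" update, "d" delete, "r" snapshot
    if op in ("c", "u", "r"):
        row = payload.get("after")              # row state after the change
        print("upsert", row)                    # in practice: merge into the warehouse table
    elif op == "d":
        row = payload.get("before")             # last known state before deletion
        print("delete", row)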

24.5 Architectural Patterns

The way integrations are wired together at the architectural level falls into a few recurring patterns:

  • Point-to-Point: Each pair of systems integrated directly. Simple at first; becomes a quadratic explosion of connections as the number of systems grows (quantified in the sketch after this list).

  • Hub-and-Spoke: A central integration platform connects all sources and consumers. Linear scaling with the number of systems, at the cost of a central single point of failure.

  • Enterprise Service Bus (ESB): Message-driven hub-and-spoke with formal routing, transformation, and orchestration. Standard in service-oriented architectures.

  • Data Hub: A central data store (warehouse, lake, lakehouse) acts as the integration point; consumers read from the hub, producers write to it.

  • Data Mesh: Decentralised, with each domain operating its own data products and federating governance. Trades simplicity for autonomy.
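The scaling difference between the first two patterns is easy to quantify. A tiny sketch, with arbitrary system counts:

def point_to_point_links(n: int) -> int:
    """Direct integrations needed to connect every pair of n systems."""
    return n * (n - 1) // 2

def hub_and_spoke_links(n: int) -> int:
    """Integrations needed when every system connects only to the central hub."""
    return n

for n in (5, 12, 30):
    print(n, point_to_point_links(n), hub_and_spoke_links(n))
# 5 systems: 10 vs 5;  12 systems: 66 vs 12;  30 systems: 435 vs 30

This is why point-to-point feels cheap for the first few systems and unmanageable later: the link count grows with the square of the system count, while hub-and-spoke grows linearly.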

24.6 Tools and Platforms

Tip: Common Integration Tools
Category | Tools
Batch ETL / ELT | Informatica, Talend, IBM DataStage, Microsoft SSIS, Pentaho
Cloud-Native ELT | Fivetran, Airbyte, Stitch, Hevo, Matillion
Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce
Streaming and CDC | Kafka, Confluent Cloud, Debezium, AWS DMS, Oracle GoldenGate, Striim
Master Data Management | Informatica MDM, Reltio, Stibo, Profisee, Semarchy, SAP MDG
Federation and Virtualisation | Denodo, TIBCO Data Virtualization, Starburst, Trino
Workflow and Orchestration | Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions
iPaaS (App-to-App) | MuleSoft, Boomi, Workato, Informatica IICS, Microsoft Logic Apps
API Management | Apigee, Kong, AWS API Gateway, MuleSoft Anypoint

The right toolkit depends on the architecture chosen, the scale of operations, the budget, and the in-house skills. Most large analytics programmes use four to six of these in combination.

24.7 Integration and the BI Layer

Data integration is the invisible infrastructure under every cross-source dashboard. Decisions about how integration is done shape what BI tools can show, how fast dashboards refresh, and whether the numbers across panels agree.

  • Pre-integrated semantic layers: Looker’s LookML, Tableau’s published data sources, and Power BI’s shared datasets express integration as a versioned semantic layer above the warehouse. Every dashboard that consumes them inherits the same definitions, joins, and filters. Without such a layer, every report builder reinvents the integration from scratch and the firm’s dashboards quietly disagree.

  • Conformed dimensions in the dashboard layer: A customer, product, date, or region dimension that is conformed across the warehouse should be conformed in the BI tool too — same identifier, same hierarchy, same display format. Inconsistent conformance is the root cause of “the marketing dashboard says one number and the finance dashboard says another” problems.

  • Live vs extract connections: Tableau’s live-versus-extract decision and Power BI’s DirectQuery-versus-Import choice are integration decisions in disguise. Live and DirectQuery push every dashboard query to the source; Extract and Import materialise an integrated dataset for the BI tool. The right choice depends on freshness needs, source load, and integration complexity.

  • Self-service blending versus governed joining: Tableau’s data blending and Power BI’s relationships allow analysts to join sources at the dashboard level. This is fast for prototyping but bypasses governance. Promote any blend that gets used regularly into the warehouse and the semantic layer.

  • Reverse-ETL for closing the loop: Tools such as Hightouch and Census push integrated warehouse data back into operational systems (CRM, marketing automation, customer support). This closes the integration loop: the dashboards that summarise customer behaviour drive the same operational record back into the systems where the front-line teams act.

  • Real-time integration for operational dashboards: A logistics control-tower dashboard or a fraud-monitoring screen needs sub-minute integration. CDC and streaming feed real-time visualisation tools (Grafana, Kibana, real-time Tableau, Power BI streaming datasets) where batch ETL would be hours too slow.

24.8 Common Pitfalls

  • Point-to-Point Sprawl: Every new analytical question producing a new direct integration. After a year, the firm has hundreds of brittle connections that no one fully understands.

  • Definitional Drift: Multiple integrations using the same field name with subtly different meanings. The aggregated dashboard shows numbers that no individual source agrees with.

  • Entity Resolution Skipped: Joining systems on best-effort identifier matching, accepting that 5–15 per cent of records will not match, without recording which records were lost.

  • No Master Data: Operating without a golden record for any major entity, leaving every analyst to do their own ad-hoc resolution.

  • ETL Code as Integration: Treating the ETL pipeline as the documentation of how the data flows. The pipeline becomes the only source of truth and is unmaintainable.

  • Quality Inherited Without Tagging: Joining a high-quality and a low-quality source without indicating which fields came from where, so the consumer cannot distinguish trustworthy from untrustworthy values.

  • Real-Time Where Daily Suffices: Building expensive streaming infrastructure for analytical questions that can be answered with daily batch data.

  • Daily Where Real-Time Is Needed: Building daily pipelines for operational decisions that require sub-minute freshness.

  • Ignoring Privacy on Join: Joining a privacy-restricted dataset with a less restricted one and treating the result as if the looser regime applied. Aggregation does not relax obligations.

  • No Schema Evolution Strategy: Source schemas change without notice; downstream pipelines fail silently or load wrong data.

  • Reverse-ETL as Afterthought: Investing heavily in moving data into the warehouse and ignoring the work of getting it back into the operational systems where decisions happen.

  • No Lineage or Catalogue: Consumers cannot tell where a number came from, what was done to it, or who is responsible for it.

  • Treating Integration as a Project: Integration is a continuous capability, not a project that finishes. Every new source, every schema change, every business definition update is more integration work.

24.9 Illustrative Cases

The following short cases illustrate integration challenges and the patterns used to solve them. They describe common situations and the architectural reasoning.

Customer 360 at a Retail Bank

A retail bank wants a single view of each customer combining accounts, cards, loans, investments, support tickets, and digital-channel behaviour. The integration challenges are entity resolution (the same customer carries different identifiers across product systems), definitional drift (active customer differs by product), and quality differences (legacy systems have stale addresses, new systems have current ones). The solution combines a Master Data Management platform that resolves entities and produces a golden record, a cloud warehouse loaded by ELT pipelines, and a governance council that authorises the canonical definition of every disputed term.

Operational Real-Time Dashboard for a Logistics Firm

A logistics firm runs a control-tower dashboard that must reflect vehicle, route, and exception status within seconds. Daily ETL is far too slow. The team builds a CDC-and-streaming integration: PostgreSQL changes captured by Debezium, streamed through Kafka, transformed in Flink, and served from a low-latency store to the dashboard. Batch analytical pipelines continue to run alongside, on a daily cadence, for longer-horizon work.

Supplier Data at a Manufacturing Group

A diversified manufacturing group has thirty plants, each with its own ERP, each with its own supplier master. Procurement cannot answer the simple question “how much did we spend with this supplier last year, across all plants?” The team builds a supplier MDM, applying entity resolution to deduplicate suppliers across the plant ERPs, with the golden record published back to a central procurement-analytics platform. The first quarter’s report identifies a consolidation opportunity worth several per cent of total spend.

Cross-Domain Analytics with Federation

A technology firm has analytical data in three major platforms — Snowflake, BigQuery, and Databricks. Migrating everything to one is politically and technically infeasible. The team deploys a federation layer (Trino) that allows analysts to write SQL across all three without physically moving the data. Federation handles the on-the-fly joins; selected hot datasets are physically replicated to one platform when the federation cost becomes the bottleneck.

A Failed Point-to-Point Sprawl

A mid-size firm builds direct integrations between systems as new analytical questions arise. After two years there are 64 named integrations, no central catalogue, no consistent definitions, and several pipelines whose owners have left the company. The integration backlog is now larger than the analytical backlog. The team commissions an integration platform (a hub-and-spoke ELT with dbt for transformation and an MDM layer) and begins a multi-year programme to migrate the point-to-point pipelines to the new platform.


24.10 Hands-On Exercise: Integration in Power BI

Aim: Integrate four tables drawn from different operational sources into a single trusted analytical model in Power BI, using Merge Queries for joining at the row level and Relationships for joining at the model level.

Scenario: An analyst at Yuvijen Stores Pvt Ltd must build a customer-360 view that combines:

  • customers.csv — internal customer master (CRM extract)
  • orders.csv — internal order ledger (e-commerce platform)
  • products.csv — internal product master (inventory system)
  • regions.csv — external geographic reference (data.gov.in, with slight name spelling variants)

Each table has its own grain and its own join keys. The integrated model must allow a single dashboard to show revenue by region by product category by customer segment.

Deliverable: A Power BI file with a clean star schema, demonstrating both Merge Queries and Relationships, with at least one DAX measure that traverses all four tables.

24.10.1 Step 1 — The Sample Tables

Tip: customers.csv (extract)
customer_id,name,city,region_name
C-1001,Anita Sharma,Mumbai,Maharashtra
C-1002,Rajesh Kumar,Delhi,NCR Delhi
C-1003,Priya Iyer,Chennai,Tamil Nadu

Tip: orders.csv (extract)
order_id,customer_id,order_date,product_id,quantity,amount
O-2001,C-1001,2026-04-01,P-1,2,540
O-2002,C-1001,2026-04-05,P-3,1,1250
O-2003,C-1002,2026-04-02,P-2,5,425

Tip: products.csv (extract)
product_id,product_name,category,unit_price
P-1,Steel Tiffin,Kitchen,270
P-2,Cotton Towel,Bath,85
P-3,Wall Clock,Home,1250

Tip: regions.csv (extract — note spelling variants)
region_code,region,population_lakh
MH,Maharashtra,1235
DL,Delhi (NCR),198
TN,Tamilnadu,760

Notice the spelling variants: NCR Delhi in customers.csv versus Delhi (NCR) in regions.csv, and Tamil Nadu versus Tamilnadu. A naive join fails on these.
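To reproduce the exercise, the four files can be typed by hand or generated with a short script such as the one below. pandas is used purely for convenience, and the script writes only the rows shown in the extracts above, so treat it as a starting point rather than the full sample data.

import pandas as pd

# Recreate the four sample files used in the exercise (rows as shown in the extracts).
pd.DataFrame(
    [["C-1001", "Anita Sharma", "Mumbai", "Maharashtra"],
     ["C-1002", "Rajesh Kumar", "Delhi", "NCR Delhi"],
     ["C-1003", "Priya Iyer", "Chennai", "Tamil Nadu"]],
    columns=["customer_id", "name", "city", "region_name"],
).to_csv("customers.csv", index=False)

pd.DataFrame(
    [["O-2001", "C-1001", "2026-04-01", "P-1", 2, 540],
     ["O-2002", "C-1001", "2026-04-05", "P-3", 1, 1250],
     ["O-2003", "C-1002", "2026-04-02", "P-2", 5, 425]],
    columns=["order_id", "customer_id", "order_date", "product_id", "quantity", "amount"],
).to_csv("orders.csv", index=False)

pd.DataFrame(
    [["P-1", "Steel Tiffin", "Kitchen", 270],
     ["P-2", "Cotton Towel", "Bath", 85],
     ["P-3", "Wall Clock", "Home", 1250]],
    columns=["product_id", "product_name", "category", "unit_price"],
).to_csv("products.csv", index=False)

pd.DataFrame(
    [["MH", "Maharashtra", 1235],
     ["DL", "Delhi (NCR)", 198],
     ["TN", "Tamilnadu", 760]],
    columns=["region_code", "region", "population_lakh"],
).to_csv("regions.csv", index=False)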

24.10.2 Step 2 — Load and Inspect Each Source

In Power BI Desktop:

  1. Home → Get Data → Text/CSV for each of the four files.
  2. Open the Power Query editor by clicking Transform Data.
  3. Verify each query’s schema, types, and approximate row counts.
  4. Rename queries cleanly: Customers, Orders, Products, Regions.

The four queries appear in the left-hand pane. None has been joined yet; they are still independent.

24.10.3 Step 3 — Resolve Region Spelling Variants Using Fuzzy Merge

The region name in Customers does not exactly match the region name in Regions. Power Query’s Merge Queries with fuzzy matching resolves this without scripting:

  1. With the Customers query selected, Home → Merge Queries.
  2. In the dialog, choose Regions as the second table.
  3. Select region_name from Customers and region from Regions as the join columns.
  4. Tick Use fuzzy matching to perform the merge and set similarity threshold to 0.8.
  5. Optionally configure the Match by combining text parts and Ignore case options for further tolerance.
  6. Click OK. A new column appears with matched Regions records.
  7. Expand the column to surface region_code and population_lakh.

The customer table now carries the canonical region_code from the regions table even though the original spellings disagreed. The fuzzy-merge step is recorded in the Applied Steps and will repeat reliably on every refresh.
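Power Query applies its own similarity scoring, so the numbers below are not what Power BI computes internally. The Python sketch only illustrates the idea behind fuzzy matching: normalise the strings, score every candidate, and keep the best match above a threshold (the 0.6 threshold here is arbitrary, not Power BI's default).

from difflib import SequenceMatcher

customer_regions = ["Maharashtra", "NCR Delhi", "Tamil Nadu"]     # customers.csv spellings
reference_regions = ["Maharashtra", "Delhi (NCR)", "Tamilnadu"]   # regions.csv spellings

def normalise(text: str) -> str:
    """Lowercase and keep only letters, so punctuation and spacing differences vanish."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def best_match(name: str, candidates: list[str], threshold: float = 0.6):
    """Return the highest-scoring candidate, if it clears the threshold."""
    scored = [(SequenceMatcher(None, normalise(name), normalise(c)).ratio(), c)
              for c in candidates]
    score, winner = max(scored)
    return (winner, round(score, 2)) if score >= threshold else (None, round(score, 2))

for region in customer_regions:
    print(region, "->", best_match(region, reference_regions))
# Maharashtra -> ('Maharashtra', 1.0)
# NCR Delhi   -> ('Delhi (NCR)', 0.62)   (an exact join would miss this one)
# Tamil Nadu  -> ('Tamilnadu', 1.0)

On this sample, the two spellings that break an exact join both recover their intended reference rows.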

24.10.4 Step 4 — Build the Star-Schema Skeleton

Apply the queries and switch to the Model view in Power BI Desktop. The four tables appear as boxes; the auto-detected relationships are usually a starting point.

Configure the schema explicitly:

  • Orders → Customers: many-to-one on customer_id. Single direction filter (Customers → Orders).
  • Orders → Products: many-to-one on product_id. Single direction filter (Products → Orders).
  • Customers → Regions: many-to-one on region_code. Single direction filter (Regions → Customers).

The resulting layout is a star with Orders (the fact table) at the centre and Customers, Products, Regions as dimension tables. Verify the relationship lines on the canvas — single arrowheads pointing from dimensions toward the fact.

If Power BI auto-creates a many-to-many relationship or a bidirectional filter, edit it (right-click the relationship line → Properties) to enforce the canonical star pattern.

24.10.5 Step 5 — Write Measures That Traverse the Model

From the Modeling ribbon, choose New measure and create DAX measures that exercise all four tables:

-- Base measure: aggregates the fact table alone
Total Revenue = SUM(Orders[amount])

-- Filters the fact table through the Products dimension
Kitchen Revenue =
CALCULATE(
    [Total Revenue],
    Products[category] = "Kitchen"
)

-- Combines Orders and Regions at whatever grain the visual slices by
Revenue per Lakh Population =
DIVIDE(
    [Total Revenue],
    SUM(Regions[population_lakh])
)

-- Ranks customers by the revenue measure and returns the top name
Top Customer by Revenue =
CONCATENATEX(
    TOPN(1, VALUES(Customers[name]), [Total Revenue]),
    Customers[name]
)

The first measure operates on the fact table alone. The second filters the fact table through the Products dimension. The third combines Orders and Regions, computed at whatever granularity the dashboard slicer selects. The fourth uses Customers and the fact table together. All four traverse the relationships defined in Step 4 — none requires explicit JOIN syntax.

24.10.6 Step 6 — Build a Single Dashboard That Uses All Four Sources

Switch to the Report view and build a single page demonstrating that the integration works end to end:

  • A bar chart of Revenue by Region (uses Orders + Customers + Regions).
  • A bar chart of Revenue by Category (uses Orders + Products).
  • A table showing the top five customers by revenue, with their region (uses Orders + Customers + Regions).
  • A KPI card showing Revenue per Lakh Population (uses Orders + Customers + Regions, normalised by population).

If any visual returns blank or shows unexpected zeros, the relationship direction is the most likely culprit; revisit the Model view and verify the filter direction.

24.10.7 Step 7 — Document Lineage and Save

Open Power BI’s Lineage view (in the Power BI Service) or the Performance Analyzer pane in Desktop to confirm that each visual draws on the expected upstream queries. Save the file with a descriptive name and a version suffix; commit the source CSVs alongside.

For governance, add a short Description to each query (Power Query editor → right-click the query → Properties) and to each measure (Model view → select the measure → Properties pane → Description). The descriptions appear as tooltips for any other team member who opens the file.

24.10.8 Step 8 — Connect to the Visualisation Layer

The integration work in this hands-on is what makes the cross-source visualisation possible:

  • Without the fuzzy merge, an exact join would match only one of the three regions, and the dashboard’s region bar chart would silently drop the customers whose region spellings differ.
  • Without conformed dimensions in the Model view, the customer count on one panel and the revenue on another would carry different filters.
  • Without traversal measures, the dashboard would need a separate query for each chart and the file would grow into an unmaintainable web of duplicated logic.

The hands-on illustrates the core lesson of the chapter: the dashboard’s correctness depends on the integration model behind it, not on the chart visuals on top. Time spent in the Model view repays itself many times over in the report view.

Tip: Files and Screen Recordings

Power BI file (yuvijen-customer-360.pbix), the four source CSVs, and screen recordings of the fuzzy-merge step, the Model view setup, and the DAX measures will be embedded here.


Summary

Concept | Description

Foundations
Why Data Integration Matters | Most analytical questions cross system boundaries; integration is the operational infrastructure under cross-functional analytics
Data Integration | Discipline of combining data from multiple heterogeneous sources into a unified analytical view

Eight Common Challenges
Schema Heterogeneity | Two systems describing the same entity with different names, types, and structures
Definitional Differences | The same field name carrying different business meaning across systems
Entity Resolution | Same real-world entity represented by different identifiers in different systems
Data Quality Differences | One system has high-quality data and another has missing, stale, or incorrect values
Format and Encoding | Currencies, dates, units, encodings, and category labels expressed differently
Timing and Freshness | Different refresh cadences across sources cause latent inconsistencies in joins
Volume and Scale | Some systems hold millions of records and others billions; architecture must handle both
Privacy and Access Control | Data sensitive in one context must remain protected when joined with less-restricted data

Integration Patterns
ETL | Extract from sources, transform in staging, load into target; traditional warehouse pattern
ELT | Extract and load raw data first, then transform inside the target; modern cloud-warehouse pattern
Data Federation | Query data in place across systems without moving it; real-time low-latency integration
Data Lake | Land raw data in a single low-cost store; transform on demand; mixed-modality and exploratory
Lakehouse | Lake plus warehouse-like transactional and SQL layer; combines flexibility with discipline
Master Data Management | Maintain a single golden record per entity across systems
Change Data Capture | Stream operational changes as they happen for near-real-time integration
Data Mesh | Decentralised domain-owned data products with federated governance
API-Based Integration | Source systems expose data through APIs that consumers pull or subscribe to

ETL versus ELT and MDM
ETL versus ELT Distinction | ETL transforms before loading; ELT loads raw and transforms in the target with elastic compute
MDM Golden Record | Single trusted record per entity across the firm, served back to consuming systems
MDM Platforms | Informatica MDM, Reltio, Stibo, Profisee, Semarchy, SAP MDG, with Pimcore and Akeneo for products

Streaming and CDC Building Blocks
Kafka and Pulsar | Message brokers that buffer and route streams in event-driven architectures
Debezium and AWS DMS | CDC tools that capture changes from operational databases
Stream Processors | Spark Streaming, Flink, Kafka Streams for transforming data in motion
Streaming SQL Platforms | Materialize, RisingWave, ksqlDB for SQL over streams

Architectural Patterns
Point-to-Point Architecture | Each pair of systems integrated directly; quadratic explosion as systems grow
Hub-and-Spoke Architecture | Central integration platform connects all sources and consumers; linear scaling
Enterprise Service Bus | Message-driven hub-and-spoke with formal routing and orchestration
Data Hub Architecture | Central data store acts as integration point; consumers read from the hub
Data Mesh Architecture | Decentralised with each domain owning its own data products

Tool Categories
Batch ETL Tools | Informatica, Talend, DataStage, SSIS, Pentaho
Cloud-Native ELT Tools | Fivetran, Airbyte, Stitch, Hevo, Matillion
Transformation in Warehouse | dbt, Dataform, SQLMesh, Coalesce
Streaming and CDC Tools | Kafka, Confluent, Debezium, AWS DMS, Oracle GoldenGate, Striim
MDM Tool Category | Master data platforms for golden-record management
Federation and Virtualisation | Denodo, TIBCO, Starburst, Trino for query without movement
Workflow Orchestration | Airflow, Prefect, Dagster, Azure Data Factory, AWS Step Functions
iPaaS | App-to-app integration: MuleSoft, Boomi, Workato, IICS, Logic Apps
API Management | Apigee, Kong, AWS API Gateway, MuleSoft Anypoint for API governance

Common Pitfalls
Point-to-Point Sprawl | Pitfall of every new question producing a new direct integration, ending in hundreds of brittle connections
Definitional Drift | Pitfall of multiple integrations using the same field name with subtly different meanings
Entity Resolution Skipped | Pitfall of joining on best-effort identifier matching and accepting unmatched losses without recording which
No Master Data | Pitfall of operating without a golden record, leaving every analyst to resolve entities ad hoc
ETL Code as Integration | Pitfall of treating the ETL pipeline as the documentation of how the data flows
Quality Inherited Without Tagging | Pitfall of joining sources of different quality without indicating which fields came from where
Real-Time Where Daily Suffices | Pitfall of building expensive streaming infrastructure for questions that batch could answer
Daily Where Real-Time Is Needed | Pitfall of building daily pipelines for decisions that need sub-minute freshness
Ignoring Privacy on Join | Pitfall of joining privacy-restricted data with looser data and treating the result by the looser rules
No Schema Evolution Strategy | Pitfall of source schemas changing without notice and downstream pipelines failing silently
Reverse-ETL as Afterthought | Pitfall of investing in moving data into the warehouse and ignoring how to get it back to operational systems
No Lineage or Catalogue | Pitfall of consumers unable to tell where a number came from or who is responsible for it
Integration Treated as a Project | Pitfall of treating integration as a project that finishes rather than a continuous capability