28  Correlation Analysis and Relationship Identification

28.1 Why Correlation Analysis Matters

Correlation does not imply causation, but it does mean something, and ignoring it leaves money on the table.

A central question in any analytical project is whether two variables move together — and if they do, in what direction, how strongly, and what the practical implications are. Correlation analysis is the toolkit that quantifies this, and it underpins much of the relationship-finding work analysts do before they reach for any predictive model.

The mathematical foundation was laid by Karl Pearson (1895) in his treatment of regression and correlation between paired variables. Nearly eight decades later, Frank Anscombe issued the famous warning that correlation coefficients alone are insufficient: in Graphs in Statistical Analysis (Anscombe, 1973), he constructed four very different datasets with identical means, variances, and correlations, demonstrating that correlation must always be checked visually, never relied on numerically alone.

For a visualisation-focused book, this is the central lesson of correlation analysis: the scatter plot, not the correlation coefficient, is the primary output. The number is a useful summary; the chart is the truth check.

28.2 Defining Correlation

Correlation is a statistical measure of the strength and direction of a linear (or in some forms, monotonic) relationship between two variables. The coefficient ranges from −1 to +1, where:

  • +1 is a perfect positive linear relationship.
  • 0 is no linear relationship.
  • −1 is a perfect negative linear relationship.

Correlation does not measure causation. Two variables can be highly correlated because one causes the other, because both are caused by a third variable, or by coincidence in a small sample.

28.3 Three Common Correlation Coefficients

Tip: Common Correlation Coefficients (all range from −1 to +1)

  • Pearson's r: both variables continuous, relationship roughly linear. The standard for most business work; sensitive to outliers.
  • Spearman's ρ: ordinal variables, or a non-linear monotonic relationship. Operates on ranks; outlier-resistant.
  • Kendall's τ: small samples, ordinal data. Computationally heavier; preferred in academic work.
  • Phi coefficient (φ): two binary variables. A special case of Pearson on dichotomous data.
  • Point-biserial: one continuous variable, one binary. A special case of Pearson.

For most exploratory work in business analytics, Pearson’s r is the default. Spearman’s ρ is the right choice when the relationship is monotonic but not linear, or when the data has substantial outliers.

28.4 Anscombe’s Quartet — Why You Always Visualise

Anscombe (1973) constructed four small datasets with identical means, variances, correlations (all approximately 0.816), and best-fit regression lines — yet they look completely different when plotted:

  • Dataset I: A genuine linear relationship with random scatter.
  • Dataset II: A clean curvilinear relationship; a parabola, not a line.
  • Dataset III: A perfect linear relationship pulled off course by a single outlier.
  • Dataset IV: All x-values identical except one outlier; the correlation is driven entirely by that single point.

The lesson: a correlation coefficient is a single number summary of a two-dimensional reality. Always look at the scatter plot before reporting the coefficient.

28.5 Correlation Versus Causation

Strong correlation can arise from any of:

  • A causes B — the relationship the analyst is looking for.
  • B causes A — reverse causation; intervention on A would not change B.
  • Both A and B are caused by C — confounding; the apparent relationship is spurious.
  • Coincidence — particularly in small samples or after many tests.

The classic illustration: monthly ice-cream sales correlate strongly with monthly drowning deaths. Neither causes the other; both are caused by summer.

The visualisation discipline: when reporting a correlation, always disclose the sample size, the time horizon, and any obvious confounders, and prefer multiple supporting visuals (scatter, time-series of both variables, residual plot) over a single coefficient.

28.6 Correlation in Multivariate Settings

For three or more variables:

  • Correlation matrix: A table of pairwise correlations. Visualised as a heatmap with diverging colour palette (positive blue, negative red, zero white).
  • Scatter-plot matrix (SPLOM): A grid of pairwise scatter plots; the visual companion to the correlation matrix.
  • Partial correlation: The correlation between two variables after removing the linear effect of a third. Useful for disentangling confounded relationships.
  • Network correlation graph: A graph in which nodes are variables and edges are strong correlations. Useful for very high-dimensional datasets.

28.7 Tools for Correlation in Excel and Power BI

  • Excel CORREL: =CORREL(range1, range2) returns Pearson’s r between two columns.
  • Excel Analysis ToolPak: Data → Data Analysis → Correlation produces a full correlation matrix in one click for any number of variables.
  • Excel scatter plot: Insert → Scatter; add a Linear Trendline and tick Display R-squared value on chart to see r² directly.
  • Power BI scatter plot: Drag two numeric variables to X-Axis and Y-Axis; add a category to Details for one mark per entity. Add a trend line via Format → Analytics → Trend Line.
  • Power BI Key Influencers: Identifies which variables most influence a target measure, with effect-size quantification — a richer, ML-based take on the correlation question.
  • Tableau: Drag two measures to Rows and Columns, then convert both to dimensions (or disaggregate measures) so each row is one mark; add a trend line via Analytics → Trend Line.

28.8 Common Pitfalls

  • Anscombe Blindness: Reporting r without examining the scatter plot.
  • Causation Claims: Treating correlation as evidence of causation.
  • Outlier Domination: A single extreme point producing a high correlation that disappears when removed.
  • Restricted Range: Correlation computed on a narrow slice of the data is typically much weaker than on the full range; a common trap in segmented analyses.
  • Aggregation Bias: Correlations at the aggregate level (regions) do not equal correlations at the individual level (customers); the ecological fallacy.
  • Multiple Testing: Computing dozens of pairwise correlations and reporting the highest as if it were significant.
  • Wrong Coefficient Choice: Using Pearson on ordinal data or non-linear monotonic relationships where Spearman is more appropriate.

28.9 Illustrative Cases

Marketing Spend and Revenue

A consumer-goods firm correlates monthly marketing spend with monthly revenue and reports r = 0.84. Plotting the data reveals a strong outlier — a single product launch month — driving the correlation. Without the launch month, r drops to 0.41.

Delivery Time and NPS

An operations team plots delivery hours against NPS and finds a Pearson correlation of −0.55. Spearman’s ρ is −0.72. The Spearman is the better measure here because the relationship is monotonic but not linear — improvements in delivery time matter more at the slow end of the distribution than at the fast end.


28.10 Hands-On Exercise: Correlation Analysis in Excel and Power BI

Aim: Compute correlations using Excel’s CORREL function and Analysis ToolPak; visualise the relationships using Power BI scatter plots and Key Influencers; and demonstrate why visual inspection is non-negotiable using a small Anscombe-style example.

Scenario: A 24-month dataset for Yuvijen Stores containing monthly revenue, marketing spend, NPS, fulfilment rate, delivery hours, and a public weather index for the firm’s primary market.

Deliverable: An Excel correlation matrix, a Power BI scatter-plot grid, and a Key Influencers analysis.

28.10.1 Step 1 — Excel CORREL for Pairwise Correlations

For two variables at a time:

=CORREL(B2:B25, C2:C25)

Build a small correlation block by computing one CORREL per pair. The result for the sample dataset might look like:

  • Revenue vs Marketing Spend: 0.78
  • Revenue vs NPS: 0.45
  • Delivery Hours vs NPS: −0.62
  • Marketing Spend vs Weather Index: 0.07

28.10.2 Step 2 — Excel Analysis ToolPak Correlation Matrix

For the full pairwise matrix in one click:

  1. Data → Data Analysis → Correlation.
  2. Select the entire numeric data range.
  3. Tick Labels in first row, output to a new sheet.
  4. The output is a lower-triangular correlation matrix with one row and column per variable.

For visualisation, apply Conditional Formatting → Color Scales to the matrix using a diverging palette (red-white-blue). The matrix becomes a quick-read heatmap of relationships.

28.10.3 Step 3 — Power BI Scatter Plot with Trend Line

  1. Insert a Scatter chart visual.
  2. Drag delivery_hours to X-Axis and nps to Y-Axis.
  3. Drag month to Details so each point represents one month.
  4. Format → Analytics → Trend Line → On, and read R² from the line tooltip.
  5. To compare multiple pairs side by side, build a small-multiple grid by repeating this for (marketing spend, revenue), (fulfilment, nps), and (weather index, revenue) on the same page.

The visual tells the analyst what the coefficient cannot — whether the relationship is linear, whether outliers dominate, whether there is heteroscedasticity.

28.10.4 Step 4 — Power BI Key Influencers

  1. Insert the Key Influencers visual.
  2. Drag revenue to Analyze.
  3. Drag marketing_spend, nps, fulfilment_pct, delivery_hours, weather_index to Explain by.
  4. The visual surfaces which variables most increase (or decrease) revenue, with effect sizes.

Key Influencers is a richer alternative to a flat correlation matrix because it considers each variable’s effect after controlling for the others — closer to a partial-correlation view.

28.10.5 Step 5 — Anscombe Demonstration

Reproduce Anscombe’s quartet in Excel:

x     y1      y2     y3      x4    y4
10    8.04    9.14   7.46    8     6.58
8     6.95    8.14   6.77    8     5.76
13    7.58    8.74   12.74   8     7.71
9     8.81    8.77   7.11    8     8.84
11    8.33    9.26   7.81    8     8.47
14    9.96    8.10   8.84    8     7.04
6     7.24    6.13   6.08    8     5.25
4     4.26    3.10   5.39    19    12.50
12    10.84   9.13   8.15    8     5.56
7     4.82    7.26   6.42    8     7.91
5     5.68    4.74   5.73    8     6.89

Datasets I–III share the x column; dataset IV pairs y4 with its own x4 column.

Compute Pearson’s r for each (x, y) pair — all four return approximately 0.816 with virtually identical means and standard deviations. Then plot each pair on a scatter chart. The four charts look completely different. The lesson — always plot — becomes self-evident.

28.10.6 Step 6 — Connect to the Visualisation Layer

The hands-on closes with a clear visualisation discipline:

  • A correlation reported in a dashboard should always be accompanied by the scatter plot that produced it.
  • A correlation matrix shown as a numerical table is harder to read than the same matrix as a heatmap.
  • Key Influencers visuals are more honest than flat correlations because they account for partial relationships, but should still be supplemented with the scatter plots that show the underlying shape.
Tip: Files and Screen Recordings

Excel workbook (yuvijen-correlation.xlsx), Power BI file (yuvijen-correlation.pbix), Anscombe quartet sheet, and screen recordings of CORREL, Analysis ToolPak, scatter trend lines, and Key Influencers will be embedded here.


Summary

Foundations
  • Why Correlation Matters: correlation does not imply causation, but it does mean something; ignoring it leaves money on the table.
  • Correlation: a statistical measure of the strength and direction of a linear or monotonic relationship.

Correlation Coefficients
  • Pearson's r: the standard for continuous linear relationships; sensitive to outliers.
  • Spearman's ρ: operates on ranks; for non-linear monotonic or ordinal data; outlier-resistant.
  • Kendall's τ: small samples and ordinal data; computationally heavier; preferred in academic work.
  • Phi coefficient: a special case of Pearson on two binary variables.
  • Point-biserial: a special case of Pearson with one continuous and one binary variable.

Anscombe's Quartet
  • Anscombe's Quartet: four datasets with identical r yet very different shapes; always plot before reporting.

Correlation versus Causation
  • Forward Causation: A causes B; the relationship the analyst is looking for.
  • Reverse Causation: B causes A; intervention on A would not change B.
  • Confounding: both caused by C; the apparent relationship is spurious.
  • Coincidence: particularly in small samples or after many tests.

Multivariate Correlation
  • Correlation Matrix: a table of pairwise correlations, visualised as a heatmap.
  • Scatter-Plot Matrix: a grid of pairwise scatter plots; the visual companion to the correlation matrix.
  • Partial Correlation: the correlation between two variables after removing the linear effect of a third.
  • Network Correlation Graph: a graph where nodes are variables and edges are strong correlations.

Tools
  • Excel CORREL: a two-range function returning Pearson's r between any two columns.
  • Excel Analysis ToolPak: Data Analysis → Correlation produces a full pairwise matrix in one click.
  • Excel Scatter with Trendline: Insert → Scatter, then Linear Trendline with Display R-squared value.
  • Power BI Scatter Plot: drag X- and Y-axis variables; add a trend line via the Analytics pane.
  • Power BI Key Influencers: an ML-based, richer alternative to flat correlation that considers effects after controls.
  • Tableau Trend Line: drag two measures, convert to dimensions, add a trend line via Analytics.

Common Pitfalls
  • Anscombe Blindness: reporting r without examining the scatter plot.
  • Causation Claims: treating correlation as evidence of causation.
  • Outlier Domination: a single extreme point producing a high correlation that disappears when removed.
  • Restricted Range: correlation on a narrow slice is typically much weaker than on the full range.
  • Aggregation Bias: treating aggregate-level correlations as individual-level relationships.
  • Multiple Testing: computing dozens of correlations and reporting the highest as if significant.
  • Wrong Coefficient Choice: using Pearson on ordinal or non-linear monotonic data where Spearman is appropriate.