| Concept | Description |
|---|---|
| Foundations | |
| Why Correlation Matters | Correlation does not imply causation but it does mean something; ignoring it leaves money on the table |
| Correlation | Statistical measure of the strength and direction of a linear or monotonic relationship |
| Correlation Coefficients | |
| Pearson's r | Standard for continuous linear relationships; sensitive to outliers |
| Spearman's rho | Operates on ranks; for non-linear monotonic or ordinal data; outlier-resistant |
| Kendall's tau | Small samples and ordinal data; computationally heavier; preferred in academic work |
| Phi Coefficient | Special case of Pearson on two binary variables |
| Point-Biserial | Special case of Pearson with one continuous and one binary variable |
| Anscombe's Quartet | |
| Anscombe's Quartet | Four datasets with identical r yet very different shapes; always plot before reporting |
| Correlation versus Causation | |
| Forward Causation | A causes B; the relationship the analyst is looking for |
| Reverse Causation | B causes A; intervention on A would not change B |
| Confounding | Both caused by C; the apparent relationship is spurious |
| Coincidence | Particularly in small samples or after many tests |
| Multivariate Correlation | |
| Correlation Matrix | Table of pairwise correlations visualised as a heatmap |
| Scatter-Plot Matrix | Grid of pairwise scatter plots; visual companion to the correlation matrix |
| Partial Correlation | Correlation between two variables after removing the linear effect of a third |
| Network Correlation Graph | Graph where nodes are variables and edges are strong correlations |
| Tools | |
| Excel CORREL | Two-range function returning Pearson's r between any two columns |
| Excel Analysis ToolPak Correlation | Data Analysis Correlation produces a full pairwise matrix in one click |
| Excel Scatter with Trendline | Insert Scatter then Linear Trendline with Display R-squared value |
| Power BI Scatter Plot | Drag X and Y axis variables; add trend line via Analytics pane |
| Power BI Key Influencers | ML-based richer alternative to flat correlation considering effects after controls |
| Tableau Trend Line | Drag two measures, set as dimensions, add trend line via Analytics |
| Common Pitfalls | |
| Anscombe Blindness | Pitfall of reporting r without examining the scatter plot |
| Causation Claims | Pitfall of treating correlation as evidence of causation |
| Outlier Domination | Pitfall of a single extreme point producing a high correlation that disappears when removed |
| Restricted Range | Pitfall of correlation on a narrow slice being much weaker than on the full range |
| Aggregation Bias | Pitfall of treating aggregate-level correlations as individual-level relationships |
| Multiple Testing | Pitfall of computing dozens of correlations and reporting the highest as if significant |
| Wrong Coefficient Choice | Pitfall of using Pearson on ordinal or non-linear monotonic data where Spearman is appropriate |
28 Correlation Analysis and Relationship Identification
28.1 Why Correlation Analysis Matters
Correlation does not imply causation, but it does mean something, and ignoring it leaves money on the table.
A central question in any analytical project is whether two variables move together — and if they do, in what direction, how strongly, and what the practical implications are. Correlation analysis is the toolkit that quantifies this, and it underpins much of the relationship-finding work analysts do before they reach for any predictive model.
The mathematical foundation was laid by Karl Pearson (Karl Pearson, 1895) in his treatment of regression and correlation between paired variables. Three quarters of a century later, Frank Anscombe issued the famous warning that correlation coefficients alone are insufficient: in Graphs in Statistical Analysis (Francis J. Anscombe, 1973), he constructed four very different datasets with identical means, variances, and correlations, demonstrating that correlation must always be checked visually, never relied on numerically alone.
For a visualisation-focused book, this is the central lesson of correlation analysis: the scatter plot, not the correlation coefficient, is the primary output. The number is a useful summary; the chart is the truth check.
28.2 Defining Correlation
Correlation is a statistical measure of the strength and direction of a linear (or in some forms, monotonic) relationship between two variables. The coefficient ranges from −1 to +1, where:
- +1 is a perfect positive linear relationship.
- 0 is no linear relationship.
- −1 is a perfect negative linear relationship.
Correlation does not measure causation. Two variables can be highly correlated because one causes the other, because both are caused by a third variable, or by coincidence in a small sample.
28.3 Three Common Correlation Coefficients
| Coefficient | When to Use | Range | Notes |
|---|---|---|---|
| Pearson’s r | Both variables continuous, relationship roughly linear | −1 to +1 | The standard for most business work; sensitive to outliers |
| Spearman’s ρ | Variables ordinal or non-linear monotonic relationship | −1 to +1 | Operates on ranks; outlier-resistant |
| Kendall’s τ | Small samples, ordinal data | −1 to +1 | Computationally heavier; preferred in academic work |
| Phi coefficient (φ) | Two binary variables | −1 to +1 | Special case of Pearson on dichotomous data |
| Point-biserial | One continuous, one binary | −1 to +1 | Special case of Pearson |
For most exploratory work in business analytics, Pearson’s r is the default. Spearman’s ρ is the right choice when the relationship is monotonic but not linear, or when the data has substantial outliers.
28.4 Anscombe’s Quartet — Why You Always Visualise
Francis J. Anscombe (1973) constructed four small datasets with identical means, variances, correlations (all approximately 0.816), and best-fit regression lines — yet they look completely different when plotted:
- Dataset I: A genuine linear relationship with random scatter.
- Dataset II: A clean curvilinear relationship; a parabola, not a line.
- Dataset III: A perfect linear relationship pulled off course by a single outlier.
- Dataset IV: All x-values identical except one outlier; the correlation is driven entirely by that single point.
The lesson: a correlation coefficient is a single number summary of a two-dimensional reality. Always look at the scatter plot before reporting the coefficient.
28.5 Correlation Versus Causation
Strong correlation can arise from any of:
- A causes B — the relationship the analyst is looking for.
- B causes A — reverse causation; intervention on A would not change B.
- Both A and B are caused by C — confounding; the apparent relationship is spurious.
- Coincidence — particularly in small samples or after many tests.
The classic illustration: monthly ice-cream sales correlate strongly with monthly drowning deaths. Neither causes the other; both are caused by summer.
The visualisation discipline: when reporting a correlation, always disclose the sample size, the time horizon, and any obvious confounders, and prefer multiple supporting visuals (scatter, time-series of both variables, residual plot) over a single coefficient.
28.6 Correlation in Multivariate Settings
For three or more variables:
- Correlation matrix: A table of pairwise correlations. Visualised as a heatmap with diverging colour palette (positive blue, negative red, zero white).
- Scatter-plot matrix (SPLOM): A grid of pairwise scatter plots; the visual companion to the correlation matrix.
- Partial correlation: The correlation between two variables after removing the linear effect of a third. Useful for disentangling confounded relationships.
- Network correlation graph: A graph in which nodes are variables and edges are strong correlations. Useful for very high-dimensional datasets.
28.7 Tools for Correlation in Excel and Power BI
-
Excel CORREL:
=CORREL(range1, range2)returns Pearson’s r between two columns. - Excel Analysis ToolPak: Data → Data Analysis → Correlation produces a full correlation matrix in one click for any number of variables.
- Excel scatter plot: Insert → Scatter; add a Linear Trendline and tick Display R-squared value on chart to see r² directly.
- Power BI scatter plot: Drag two numeric variables to X-Axis and Y-Axis; add a category to Details for one mark per entity. Add a trend line via Format → Analytics → Trend Line.
- Power BI Key Influencers: Identifies which variables most influence a target measure, with effect-size quantification — a richer, ML-based take on the correlation question.
- Tableau: Drag two measures to Rows and Columns, set both to dimension; add a trend line via Analytics → Trend Line.
28.8 Common Pitfalls
- Anscombe Blindness: Reporting r without examining the scatter plot.
- Causation Claims: Treating correlation as evidence of causation.
- Outlier Domination: A single extreme point producing a high correlation that disappears when removed.
- Restricted Range: Correlation computed on a narrow slice of the data is much weaker than on the full range; common in segmented analyses.
- Aggregation Bias: Correlations at the aggregate level (regions) do not equal correlations at the individual level (customers); the ecological fallacy.
- Multiple Testing: Computing dozens of pairwise correlations and reporting the highest as if it were significant.
- Wrong Coefficient Choice: Using Pearson on ordinal data or non-linear monotonic relationships where Spearman is more appropriate.
28.9 Illustrative Cases
Marketing Spend and Revenue
A consumer-goods firm correlates monthly marketing spend with monthly revenue and reports r = 0.84. Plotting the data reveals a strong outlier — a single product launch month — driving the correlation. Without the launch month, r drops to 0.41.
Delivery Time and NPS
An operations team plots delivery hours against NPS and finds a Pearson correlation of −0.55. Spearman’s ρ is −0.72. The Spearman is the better measure here because the relationship is monotonic but not linear — improvements in delivery time matter more at the slow end of the distribution than at the fast end.
28.10 Hands-On Exercise: Correlation Analysis in Excel and Power BI
Aim: Compute correlations using Excel’s CORREL function and Analysis ToolPak; visualise the relationships using Power BI scatter plots and Key Influencers; and demonstrate why visual inspection is non-negotiable using a small Anscombe-style example.
Scenario: A 24-month dataset for Yuvijen Stores containing monthly revenue, marketing spend, NPS, fulfilment rate, delivery hours, and a public weather index for the firm’s primary market.
Deliverable: An Excel correlation matrix, a Power BI scatter-plot grid, and a Key Influencers analysis.
28.10.1 Step 1 — Excel CORREL for Pairwise Correlations
For two variables at a time:
=CORREL(B2:B25, C2:C25)
Build a small correlation block by computing one CORREL per pair. The result for the sample dataset might look like:
| Pair | Pearson r |
|---|---|
| Revenue vs Marketing Spend | 0.78 |
| Revenue vs NPS | 0.45 |
| Delivery Hours vs NPS | -0.62 |
| Marketing Spend vs Weather Index | 0.07 |
28.10.2 Step 2 — Excel Analysis ToolPak Correlation Matrix
For the full pairwise matrix in one click:
- Data → Data Analysis → Correlation.
- Select the entire numeric data range.
- Tick Labels in first row, output to a new sheet.
- The output is a lower-triangular correlation matrix with one row and column per variable.
For visualisation, apply Conditional Formatting → Color Scales to the matrix using a diverging palette (red-white-blue). The matrix becomes a quick-read heatmap of relationships.
28.10.3 Step 3 — Power BI Scatter Plot with Trend Line
- Insert a Scatter chart visual.
- Drag delivery_hours to X-Axis and nps to Y-Axis.
- Drag month to Details so each point represents one month.
- Format → Analytics → Trend Line → On, and read R² from the line tooltip.
- To compare multiple pairs side by side, build a small-multiple grid by repeating this for (marketing spend, revenue), (fulfilment, nps), and (weather index, revenue) on the same page.
The visual tells the analyst what the coefficient cannot — whether the relationship is linear, whether outliers dominate, whether there is heteroscedasticity.
28.10.4 Step 4 — Power BI Key Influencers
- Insert the Key Influencers visual.
- Drag revenue to Analyze.
- Drag marketing_spend, nps, fulfilment_pct, delivery_hours, weather_index to Explain by.
- The visual surfaces which variables most increase (or decrease) revenue, with effect sizes.
Key Influencers is a richer alternative to a flat correlation matrix because it considers each variable’s effect after controlling for the others — closer to a partial-correlation view.
28.10.5 Step 5 — Anscombe Demonstration
Reproduce Anscombe’s quartet in Excel:
| x | y1 | y2 | y3 | y4 (x4) | y4 |
|---|---|---|---|---|---|
| 10 | 8.04 | 9.14 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8.14 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 8.74 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 8.77 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 9.26 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 8.10 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6.13 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 3.10 | 5.39 | 19 | 12.50 |
| 12 | 10.84 | 9.13 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7.26 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 4.74 | 5.73 | 8 | 6.89 |
Compute Pearson’s r for each (x, y) pair — all four return approximately 0.816 with virtually identical means and standard deviations. Then plot each pair on a scatter chart. The four charts look completely different. The lesson — always plot — becomes self-evident.
28.10.6 Step 6 — Connect to the Visualisation Layer
The hands-on closes with a clear visualisation discipline:
- A correlation reported in a dashboard should always be accompanied by the scatter plot that produced it.
- A correlation matrix shown as a numerical table is harder to read than the same matrix as a heatmap.
- Key Influencers visuals are more honest than flat correlations because they account for partial relationships, but should still be supplemented with the scatter plots that show the underlying shape.
Excel workbook (yuvijen-correlation.xlsx), Power BI file (yuvijen-correlation.pbix), Anscombe quartet sheet, and screen recordings of CORREL, Analysis ToolPak, scatter trend lines, and Key Influencers will be embedded here.