27  Statistical Profiling and Pattern Recognition

27.1 Why Statistical Profiling Matters

A picture of the data is incomplete without the numbers behind it.

Where Chapter 26 introduced the broad EDA workflow, this chapter focuses specifically on the statistical profile of a variable — the numerical measures of its centre, spread, shape, and tail behaviour — and on the pattern recognition that follows from that profile.

A statistical profile answers questions a single chart cannot: how variable is the data, how skewed is the distribution, how heavy are the tails, where do the quartiles fall, how does this variable’s distribution compare with another’s? Cleveland’s Visualizing Data (William S. Cleveland, 1993) established that statistical profiling and graphical methods together — never apart — are the foundation of analytical insight, and David J. Hand et al. (2001) codified pattern-recognition methods as a distinct sub-discipline of data mining.

For a visualisation-focused book, statistical profiling is what allows the analyst to choose chart parameters wisely — bin widths for histograms, axis ranges for box plots, colour breakpoints for choropleths — and to detect patterns that the eye alone may miss.

27.2 Measures of Central Tendency, Dispersion, and Shape

TipThe Statistical Profile of a Numerical Variable
Aspect Measure What It Tells
Central tendency Mean Arithmetic average; sensitive to outliers
Median Middle value; robust to outliers
Mode Most frequent value; useful for discrete or categorical
Dispersion Range Maximum minus minimum
Variance Average squared deviation from the mean
Standard deviation Square root of variance; same units as the variable
Interquartile range (IQR) Spread of the middle 50 %
Coefficient of variation Standard deviation / mean; unitless relative spread
Shape Skewness Asymmetry of the distribution
Kurtosis Tail heaviness compared with a normal distribution
Position Percentiles Value at a given rank
Quartiles 25th, 50th, 75th percentiles

A complete profile reports each of these for every numerical variable. The discipline is to read them together: a high skewness with a moderate standard deviation tells a different story from a low skewness with a high standard deviation.

27.3 Pattern Recognition Methods

Beyond single-variable profiling, pattern recognition examines combinations of variables and identifies recurring structures. The major method families:

  • Clustering: Groups records that resemble each other. K-means, hierarchical clustering, DBSCAN.
  • Classification rules: If-then patterns associated with a target variable. Decision trees, association rule mining.
  • Anomaly detection: Records that do not fit any pattern. Isolation Forest, Local Outlier Factor.
  • Sequence and trajectory mining: Recurring patterns over time or sequence. Used for customer-journey analysis.
  • Visual pattern detection: Patterns that the eye notices when the data is properly visualised — clusters in a scatter plot, gaps in a time series, modes in a histogram.

For most business analysts, the primary pattern-recognition tools are clustering (segmenting customers, products, or regions) and visual inspection (the eye remains the most flexible pattern detector when the data is properly displayed).

27.4 Statistical Profiling in Excel and Power BI

Both tools support statistical profiling out of the box:

  • Excel Analysis ToolPak: Data → Data Analysis → Descriptive Statistics produces a one-click profile (mean, median, mode, standard deviation, kurtosis, skewness, range, minimum, maximum, sum, count, confidence interval).
  • Excel built-in functions: AVERAGE, MEDIAN, MODE.SNGL, STDEV.S, VAR.S, SKEW, KURT, PERCENTILE.INC, QUARTILE.INC.
  • Power BI DAX measures: AVERAGE, MEDIAN, STDEV.S, VAR.S, MIN, MAX, PERCENTILE.INC. (SKEW and KURT are not built-in DAX functions; analysts compute them via formula or use Power Query M).
  • Power Query Column Profile: Visual profile of distribution, missing count, distinct count, and value frequency for each column.
  • Tableau Summary Card: Worksheet → Show Summary; surfaces count, sum, average, min, max, median, standard deviation.

27.5 Common Pitfalls

  • Mean Without Median: Reporting only the mean for a skewed distribution; the median tells a different and often truer story.
  • Standard Deviation Without Distribution: Reporting standard deviation for a distribution that is not approximately normal; the figure is uninformative.
  • No Skewness or Kurtosis: Skipping shape statistics for distributions that have heavy tails or are bimodal.
  • Anscombe Blindness: Treating identical summary statistics as identical data; Anscombe’s quartet shows four very different datasets with identical means, variances, and correlations.
  • Pattern Without Significance: Reporting clusters or rules without testing whether they are stronger than would be expected by chance.

27.6 Illustrative Cases

Customer Segmentation Reveals Three Genuine Clusters

A retail analyst clusters customers on recency, frequency, and monetary value (RFM). The k-means analysis identifies three clusters that map cleanly onto Premium, Regular, and Lapsed segments. The cluster centroids’ descriptive statistics let the marketing team set distinct targets for each segment.

A Sales Dataset Hides Bimodality Behind a Mean

A sales team reports an average order value of ₹1,800. EDA reveals the distribution is bimodal — a peak at ₹400 (small purchases) and another at ₹3,200 (large purchases) — with very few orders near the mean. The mean misrepresents the typical customer and the team rebuilds its forecasts using two separate distributions.


27.7 Hands-On Exercise: Statistical Profiling in Excel and Power BI

Aim: Produce a complete statistical profile of two key variables — transaction_amount and delivery_time — using both Excel’s Analysis ToolPak and Power BI’s DAX measures, and surface patterns using Power BI’s Decomposition Tree and Key Influencers visuals.

Scenario: The same retail dataset for Yuvijen Stores used in Chapter 26.

Deliverable: An Excel descriptive-statistics sheet, a Power BI profile page, and a short patterns page showing decomposition and key-influencer findings.

27.7.1 Step 1 — Excel Analysis ToolPak

  1. Enable the ToolPak: File → Options → Add-ins → Excel Add-ins → Go → Analysis ToolPak.
  2. Open sales_eda.csv in Excel.
  3. Data → Data Analysis → Descriptive Statistics.
  4. Select the transaction_amount range. Tick Summary statistics, Confidence Level for Mean: 95 %, Kth Largest: 1, Kth Smallest: 1.
  5. The output table contains mean, standard error, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, count, confidence interval.
  6. Repeat for delivery_time.

The Analysis ToolPak produces in one click what would otherwise take fifteen separate formulas.

27.7.2 Step 2 — Excel Built-In Functions

For finer control or for measures the ToolPak omits:

=AVERAGE(B2:B5001)
=MEDIAN(B2:B5001)
=STDEV.S(B2:B5001)
=SKEW(B2:B5001)
=KURT(B2:B5001)
=PERCENTILE.INC(B2:B5001, 0.25)
=PERCENTILE.INC(B2:B5001, 0.75)
=PERCENTILE.INC(B2:B5001, 0.75) - PERCENTILE.INC(B2:B5001, 0.25)

The last formula computes the IQR. Compare the IQR with the standard deviation: if they disagree sharply, the distribution is non-normal and the standard deviation should be supplemented with the IQR in any reporting.

27.7.3 Step 3 — Power BI DAX Measures

In Power BI, replicate the same statistics as DAX measures:

Avg Amount = AVERAGE(Sales[amount])
Median Amount = MEDIAN(Sales[amount])
StdDev Amount = STDEV.S(Sales[amount])
Min Amount = MIN(Sales[amount])
Max Amount = MAX(Sales[amount])
P25 Amount = PERCENTILE.INC(Sales[amount], 0.25)
P75 Amount = PERCENTILE.INC(Sales[amount], 0.75)
IQR Amount = [P75 Amount] - [P25 Amount]

Display these on a card-strip across the top of the profile page. Each card carries one measure plus a small sparkline of the variable over time.

27.7.4 Step 4 — Visual Profile

Build the visual profile alongside the numerical one:

  • Histogram of amount using a binned column chart in Power BI (or a built-in histogram in Excel via Insert → Statistic Chart → Histogram).
  • Box plot of amount by category in Tableau (or Power BI’s Vega-based custom visual; alternatively use a binned bar chart approximation).
  • Q-Q plot in Excel via the Analysis ToolPak → Random Number Generation and a paired sort, or in Power BI via DAX-computed quantile pairs.

The visual profile shows what the numerical profile cannot — for example, a bimodal distribution that the mean and standard deviation alone cannot reveal.

27.7.5 Step 5 — Pattern Recognition with Power BI

  • Decomposition Tree: Drag Sum of Amount to Analyze and category, segment, store_id, payment_method to Explain by. Click + on the highest-value node to drill down through the dimensions automatically.
  • Key Influencers: Drag amount to Analyze and the categorical variables to Explain by. The visual identifies the categories most associated with high or low amount and quantifies the effect.
  • Q&A: Type what are the top three product categories by amount in 2025; Power BI generates the chart.

27.7.6 Step 6 — Connect to the Visualisation Layer

The statistical profile guides chart parameters:

  • A high IQR-to-standard-deviation ratio suggests a box plot is more honest than a bar of means.
  • A high kurtosis (heavy tails) suggests a log-axis on a line chart.
  • A strong key-influencer pattern suggests a small-multiple or facet by that variable.

The numbers from this hands-on are not just diagnostic — they tell the analyst what the dashboard should look like.

TipFiles and Screen Recordings

Excel workbook (yuvijen-stat-profile.xlsx), Power BI file (yuvijen-stat-profile.pbix), the source data, and screen recordings of the Analysis ToolPak, DAX measures, Decomposition Tree, and Key Influencers will be embedded here.


Summary

Concept Description
Foundations
Why Statistical Profiling Matters A picture of the data is incomplete without the numbers behind it; profiling guides chart parameters
Central Tendency
Mean Arithmetic average; sensitive to outliers
Median Middle value; robust to outliers
Mode Most frequent value; useful for discrete or categorical
Dispersion
Range Maximum minus minimum
Variance Average squared deviation from the mean
Standard Deviation Square root of variance; same units as the variable
Interquartile Range Spread of the middle fifty per cent of values
Coefficient of Variation Standard deviation divided by mean; unitless relative spread
Shape and Position
Skewness Asymmetry of the distribution
Kurtosis Tail heaviness compared with a normal distribution
Percentiles Value at a given rank
Quartiles Twenty-fifth, fiftieth, and seventy-fifth percentiles
Pattern Recognition Methods
Clustering Groups records that resemble each other
Classification Rules If-then patterns associated with a target variable
Anomaly Detection Records that do not fit any pattern
Sequence Mining Recurring patterns over time or sequence
Visual Pattern Detection Patterns the eye notices when the data is properly visualised
Tools for Profiling
Excel Analysis ToolPak Data Analysis Descriptive Statistics produces a one-click full profile
Excel Built-In Functions AVERAGE, MEDIAN, STDEV.S, SKEW, KURT, PERCENTILE.INC for fine control
Power BI DAX Measures AVERAGE, MEDIAN, STDEV.S, MIN, MAX, PERCENTILE.INC for live measures
Power Query Column Profile Visual profile of distribution, missing count, distinct count, value frequency
Tableau Summary Card Worksheet Show Summary surfaces count sum average min max median standard deviation
Common Pitfalls
Mean Without Median Pitfall of reporting only the mean for a skewed distribution
SD Without Distribution Pitfall of reporting standard deviation for non-normal distributions
No Skewness or Kurtosis Pitfall of skipping shape statistics for heavy-tailed or bimodal data
Anscombe Blindness Pitfall of treating identical summary statistics as identical data
Pattern Without Significance Pitfall of reporting clusters or rules without testing significance