flowchart TD
T["Data<br>Transformation"]
T --> N["Normalisation<br>Scale to a fixed range"]
T --> S["Standardisation<br>Centre and scale by<br>distributional statistics"]
T --> M["Mathematical<br>Transformation<br>Reshape distribution"]
T --> A["Aggregation<br>Change level of<br>granularity"]
T --> E["Encoding<br>Convert categorical to<br>numerical representation"]
T --> D["Discretisation<br>Convert continuous to<br>categorical"]
style T fill:#e3f2fd,stroke:#1976D2
style N fill:#fce4ec,stroke:#AD1457
style S fill:#fff3e0,stroke:#EF6C00
style M fill:#fff8e1,stroke:#F9A825
style A fill:#e8f5e9,stroke:#388E3C
style E fill:#ede7f6,stroke:#4527A0
style D fill:#f3e5f5,stroke:#6A1B9A
23 Data Transformation: Normalisation, Standardisation, and Aggregation
23.1 Why Data Transformation Matters
A model is only as good as the variables it sees, and the variables it sees are the ones the analyst chooses to construct.
Even after a dataset has been cleaned, it is rarely ready for analysis or for visualisation. Variables may live on incompatible scales — a price in thousands of rupees and a quantity of single units cannot share an axis. Distributions may be heavily skewed, so a histogram becomes a single tall bar with a long invisible tail. Categorical labels need to be encoded numerically before a model can use them. The level of granularity may be daily transactions when a board dashboard expects monthly summaries. The analyst must therefore transform the data into a form that the chosen analytical method, and the chosen chart, can use.
The visualisation consequences of transformation are direct. A log transformation rescues a skewed distribution into a readable histogram. Z-score standardisation lets variables on radically different scales appear together on a single radar or parallel-coordinates chart. Time-based aggregation turns an unreadable daily sparkline into a board-ready quarterly bar chart. Tableau, Power BI, and Looker all expose transformations as first-class operations in their data-prep layers (calculated fields, Power Query M, LookML), and most of the chapter’s techniques are routinely applied inside those tools, not only in upstream code.
The discipline of feature transformation has been studied empirically for decades. The original Box–Cox transformation paper (George E. P. Box & David R. Cox, 1964) established the formal mathematical treatment of variance-stabilising and normality-inducing transformations, and the practitioner reference today is Feature Engineering and Selection by Max Kuhn & Kjell Johnson (2019), which catalogues the techniques that make raw data both analytically and visually useful.
23.2 The Major Categories of Transformation
The six recurring families of transformation:
- Normalisation: Scale a variable to a fixed range (typically 0 to 1).
- Standardisation: Centre and scale a variable using its distributional statistics.
- Mathematical Transformation: Apply a function (log, square root, Box-Cox) to reshape the distribution.
- Aggregation: Change the level of granularity at which the data is held.
- Encoding: Convert categorical values into numerical representations.
- Discretisation (Binning): Convert continuous variables into ordered categories.
23.3 Normalisation
Normalisation rescales a variable so that it falls within a defined range, most commonly 0 to 1. It is essential for distance-based algorithms (k-nearest neighbours, k-means, neural networks) where features on larger scales would dominate features on smaller scales.
| Method | Formula | When to Use |
|---|---|---|
| Min-Max Scaling | \(x' = (x - \min) / (\max - \min)\) | Bounded scale [0, 1]; familiar and easy to explain |
| Range Scaling | \(x' = a + (x - \min)(b - a) / (\max - \min)\) | Generalised range [a, b] |
| Decimal Scaling | \(x' = x / 10^k\) where \(k\) chosen so \(\max\|x'\| < 1\) | Quick scaling without distributional assumptions |
| Unit-Norm (L1) | \(x' = x / \sum\|x\|\) | Each row sums to 1; useful for proportional features |
| Unit-Norm (L2) | \(x' = x / \sqrt{\sum x^2}\) | Each row has unit Euclidean length; cosine similarity |
The principal limitation of min-max scaling is sensitivity to outliers: a single extreme value compresses the rest of the data into a narrow range near zero. When outliers are a concern, robust standardisation or a winsorising step beforehand is the right answer.
23.3.1 Worked Example: Min-Max Normalisation
Example
Consider the following exam scores: 65, 70, 78, 85, 90, 92, 100. The minimum is 65 and the maximum is 100, so the range is 35. The min-max-normalised value of 78 is
\[x' = \frac{78 - 65}{100 - 65} = \frac{13}{35} \approx 0.371\]
The smallest value maps to 0, the largest to 1, and every other value falls proportionally in between.
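The worked example can be reproduced with a minimal Python sketch (standard library only):

```python
def min_max(values):
    """Min-max normalise a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [65, 70, 78, 85, 90, 92, 100]
normalised = min_max(scores)
print(round(normalised[2], 3))   # the value 78 → 0.371
```

The smallest score maps to 0.0, the largest to 1.0, matching the hand calculation above.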
23.4 Standardisation
Standardisation centres the variable on its mean and scales it by its standard deviation, producing values with mean zero and standard deviation one. The transformed variable carries the interpretation of standard deviations from the mean, which is more interpretable across variables than absolute units.
| Method | Formula | When to Use |
|---|---|---|
| Z-Score (Standard Scaling) | \(x' = (x - \mu) / \sigma\) | Default for most parametric methods; assumes finite mean and variance |
| Robust Scaler | \(x' = (x - \mathrm{median}) / \mathrm{IQR}\) | Outlier-resistant alternative to z-score |
| MaxAbs Scaling | \(x' = x / \max(\|x\|)\) | Preserves zero entries; useful for sparse data |
| Mean Centring (Only) | \(x' = x - \mu\) | When the unit matters but the absolute level does not |
The z-score is the default for most regression, classification, and clustering pipelines. Trees and tree ensembles (random forests, gradient-boosted trees) are scale-invariant and do not require standardisation. Distance- and gradient-based methods do.
23.4.1 Worked Example: Z-Score Standardisation
Example
Take the same set of exam scores: 65, 70, 78, 85, 90, 92, 100. The mean is approximately 82.857 and the (sample) standard deviation is approximately 12.522. The z-score for the value 78 is
\[z = \frac{78 - 82.857}{12.522} \approx -0.388\]
The transformed value is interpreted directly as 0.39 standard deviations below the mean, regardless of the original unit. After standardisation, the entire variable has mean zero and standard deviation one, allowing comparison with any other variable on the same scale.
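The same check in Python, using the standard library's statistics module (statistics.stdev is the sample standard deviation, as in the worked example):

```python
import statistics

def z_scores(values):
    """Standardise using the sample mean and sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)   # n - 1 denominator
    return [(v - mu) / sigma for v in values]

scores = [65, 70, 78, 85, 90, 92, 100]
z = z_scores(scores)
print(round(z[2], 3))   # z-score of 78 → -0.388
```

After the transformation the series itself has mean zero (up to floating-point error) and unit sample standard deviation.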
23.5 Mathematical Transformations
Mathematical transformations apply a non-linear function to reshape the distribution of a variable, typically to reduce skewness, stabilise variance, or linearise a relationship.
| Transformation | Formula | When to Use |
|---|---|---|
| Logarithmic | \(x' = \log(x)\) or \(\log(1 + x)\) | Right-skewed data with positive values; income, prices, counts |
| Square Root | \(x' = \sqrt{x}\) | Mildly right-skewed positive data; counts |
| Reciprocal | \(x' = 1/x\) | Heavy right skew; physical rates |
| Box-Cox | \(x' = (x^\lambda - 1)/\lambda\) for \(\lambda \neq 0\), \(\log(x)\) for \(\lambda = 0\) | Strictly positive data; choose \(\lambda\) to maximise normality |
| Yeo-Johnson | Extension of Box-Cox to non-positive values | Mixed positive and negative data |
| Quantile (Rank) | Replace value with its rank or quantile | Map any distribution to a uniform or normal distribution |
| Power (Generic) | \(x' = x^p\) for arbitrary \(p\) | Linearise relationships in regression |
The Box-Cox transformation introduced by George E. P. Box & David R. Cox (1964) chooses \(\lambda\) data-adaptively to make the transformed distribution as nearly normal as possible. Its limitation is that the input must be strictly positive; the Yeo-Johnson transformation generalises it to any real input.
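The skew-reducing effect of the log transform can be seen in a minimal Python sketch. The income values and the simple skewness estimate below are illustrative, not from the chapter:

```python
import math
import statistics

def skewness(values):
    """A simple skewness estimate: mean cubed deviation over the cubed sample SD."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    n = len(values)
    return sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)

incomes = [20, 25, 30, 32, 40, 55, 70, 120, 400, 1500]   # long right tail
logged = [math.log1p(v) for v in incomes]                # log(1 + x)

print(round(skewness(incomes), 2))   # strongly right-skewed
print(round(skewness(logged), 2))    # much closer to symmetric
```

The logged series is far less skewed than the raw one, which is exactly why a histogram of the logged variable is readable where the raw histogram is a single tall bar.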
23.6 Aggregation
Aggregation changes the level of granularity. Daily transaction records are aggregated into monthly totals; individual employees are aggregated into team-level statistics; per-record errors are aggregated into per-product defect rates.
The principal aggregation patterns:
Group-By Aggregation: Group records by one or more keys (region, product, customer segment, time period) and apply a summary function (sum, mean, median, min, max, count, distinct count, percentile).
Time-Based Roll-Ups: Aggregate fine-grained time-stamped data into daily, weekly, monthly, or quarterly summaries. The choice of bucket size and the choice of summary statistic both affect what the analysis sees.
Hierarchical Aggregation: Roll up across a hierarchy — branch into region into country, SKU into product family into category, employee into team into department. The same data can be summarised at every level of the hierarchy.
Window Functions: Compute aggregates over a sliding or expanding window of records (rolling 7-day average, year-to-date sum, prior-month comparison). SQL window functions (OVER, PARTITION BY, ROWS BETWEEN), pandas rolling() and expanding(), and the slider package's slide_dbl() (commonly used alongside dplyr) are the standard tools.
Pivoting: Move from long-format to wide-format aggregations; rows of (region, month, sales) become a matrix of region-by-month sales.
Pre-aggregated Cubes: Precomputed multidimensional summaries served by OLAP engines for fast slice-and-dice analysis.
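The group-by pattern, stripped to its essentials, looks like this in plain Python (the transaction records are invented for illustration):

```python
from collections import defaultdict

# (month, amount) transaction records, rolled up to monthly summaries
transactions = [
    ("2025-04", 120.0), ("2025-04", 80.0),
    ("2025-05", 200.0), ("2025-05", 50.0), ("2025-05", 30.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for month, amount in transactions:
    totals[month] += amount
    counts[month] += 1

monthly = {m: {"sum": totals[m], "mean": totals[m] / counts[m]} for m in totals}
print(monthly["2025-05"])   # monthly total and mean for May
```

SQL GROUP BY, pandas groupby(), and a Power Query Group By step all perform this same key-then-summarise operation at scale.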
A common pitfall: aggregating before checking the underlying data. The mean of a daily series with weekend zeroes is not the same statistic as the mean of weekday-only values.
23.7 Encoding Categorical Variables
Most modelling techniques require numerical input. Categorical variables must therefore be encoded:
One-Hot Encoding (Dummy Variables): Each category becomes its own binary column. Suitable for unordered categorical variables with a moderate number of levels (under fifty or so). Drops one level as the reference to avoid the dummy-variable trap in linear models.
Label (Ordinal) Encoding: Each category gets an integer code. Appropriate for genuinely ordinal categories (low, medium, high) where the ordering matters. Misleading for nominal categories where it implies a non-existent order.
Target (Mean) Encoding: Each category is replaced by the mean of the target variable for that category. Powerful for high-cardinality categorical variables but risks data leakage; cross-validated or smoothed encoding is the safe form.
Frequency Encoding: Each category is replaced by its frequency or relative frequency in the dataset. Cheap and effective for high-cardinality variables.
Hash Encoding: Apply a hash function and bucket the result. Useful for very high-cardinality features when memory is constrained.
Binary Encoding: Convert the integer label to its binary representation across multiple columns. Compromise between one-hot and label encoding.
Embedding (Learned): Learn a low-dimensional vector representation of each category as part of the model. Standard in deep learning; requires sufficient data.
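A minimal Python sketch of two of these encoders, one-hot and frequency encoding, on an invented city column:

```python
def one_hot(values):
    """One-hot encode labels; returns (sorted levels, rows of 0/1 indicators)."""
    levels = sorted(set(values))
    return levels, [[1 if v == level else 0 for level in levels] for v in values]

def frequency_encode(values):
    """Replace each label with its relative frequency in the data."""
    n = len(values)
    freq = {v: values.count(v) / n for v in set(values)}
    return [freq[v] for v in values]

cities = ["Mumbai", "Delhi", "Mumbai", "Chennai"]
levels, rows = one_hot(cities)
print(levels)                     # → ['Chennai', 'Delhi', 'Mumbai']
print(rows[0])                    # → [0, 0, 1]  (first record is Mumbai)
print(frequency_encode(cities))   # → [0.5, 0.25, 0.5, 0.25]
```

In a linear model one of the one-hot columns would be dropped as the reference level, as noted above.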
23.8 Discretisation and Binning
Discretisation converts a continuous variable into a small number of ordered categories. It can simplify reporting, reduce noise, capture non-linear effects in linear models, and align with regulatory or business definitions (age bands, income brackets, severity levels).
Common methods:
Equal-Width Binning: Divide the range into bins of equal width. Easy to compute; sensitive to outliers and skew.
Equal-Frequency (Quantile) Binning: Divide the data into bins with equal counts. Robust to skew; useful for stratified comparisons.
Custom Bins: Bins chosen to align with business or regulatory categories — for example, age bands of 18-24, 25-34, 35-44, … .
Decision-Tree-Based Binning: Use a single-variable decision tree to find splits that best separate the target. The bins are explicitly informative for the model.
MDLP (Minimum Description Length Principle): Information-theoretic binning that balances bin count against fit.
The trade-off is information loss: discretisation discards the within-bin variation. Use it when the simplification is worth the loss; otherwise, leave the variable continuous.
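Equal-width and equal-frequency binning can be contrasted in a short Python sketch (the ages are illustrative); note how a single outlier distorts the equal-width bins but not the quantile bins:

```python
def equal_width_bins(values, k):
    """Assign bin index 0..k-1 by dividing the range into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

ages = [18, 19, 21, 24, 30, 35, 47, 90]    # one outlier at 90
print(equal_width_bins(ages, 4))       # → [0, 0, 0, 0, 0, 0, 1, 3]
print(equal_frequency_bins(ages, 4))   # → [0, 0, 1, 1, 2, 2, 3, 3]
```

The outlier at 90 forces six of the eight values into the first equal-width bin, while the quantile bins remain evenly populated.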
23.9 Feature Engineering: A Brief Note
Transformation, encoding, and aggregation are the building blocks of feature engineering — the deliberate construction of variables that capture the patterns relevant to a model. Max Kuhn & Kjell Johnson (2019) emphasise that feature engineering is often the highest-leverage step in a predictive pipeline; a thoughtfully engineered feature can outperform a great deal of model-tuning effort.
A few additional feature-engineering operations beyond the categories above:
- Interaction terms: Products or ratios of two variables that capture combined effect.
- Polynomial features: Powers of a variable to capture curvature.
- Date and time decomposition: Day-of-week, month, hour-of-day, holiday flag, week-of-year.
- Lagged and lead features: Prior-period values for time-series.
- Rolling-window aggregates: Trailing mean, max, count over a recent window.
- Cross-record features: Per-customer aggregates over recent transactions.
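Two of these operations, lags and rolling-window aggregates, in a minimal Python sketch (the sales series is invented):

```python
def lag(series, k):
    """Shift a series back by k periods; the first k entries have no prior value."""
    return [None] * k + series[:-k]

def rolling_mean(series, window):
    """Trailing mean over the last `window` observations; None until enough history."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = series[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

sales = [10, 12, 11, 15, 14, 18]
print(lag(sales, 1))           # → [None, 10, 12, 11, 15, 14]
print(rolling_mean(sales, 3))  # None until a full 3-period window is available
```

pandas shift() and rolling().mean() are the production equivalents of these two helpers.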
The discipline is to design features around the question, not to bolt them on by reflex.
23.10 Choosing the Right Transformation
| Goal | Recommended Transformation |
|---|---|
| Variables on incompatible scales for distance methods | Min-max normalisation or z-score standardisation |
| Outlier-prone variables | Robust scaler (median, IQR) or Winsorise then standardise |
| Heavily right-skewed positive variable | Log or Box-Cox |
| Skewed variable with negative values | Yeo-Johnson |
| Reduce noise, capture non-linearity for linear model | Discretisation with custom bins |
| Categorical variable with few levels | One-hot encoding |
| Categorical variable with many levels | Target, frequency, or hash encoding |
| Genuinely ordinal categorical | Ordinal (label) encoding |
| Time-series fine-grained data summarised at higher level | Time-based roll-up with appropriate window |
| Capture rate of change | Lag, difference, or rolling-window features |
| Tree-based model | Encoding required; scaling not needed |
| Linear, distance, or gradient-based model | Both encoding and scaling required |
23.11 Transformations and the Chart
Transformations are not only modelling preparation. They are visual decisions. The same dataset rendered with and without an appropriate transformation produces very different chart messages.
Log scales for skewed magnitudes: Income, sales, market capitalisation, and most counts are right-skewed. A linear-axis bar chart gives all the visual weight to the largest values; a log-axis chart compresses the range and lets the eye see the shape of the distribution. Tableau, Power BI, and ggplot2 all support log axes natively (in Tableau, Edit Axis → Logarithmic; in Power BI, Y-axis → Scale type: Log).
Standardisation for multi-variable comparison: A radar chart, a parallel-coordinates chart, or a heatmap of features must operate on standardised values. Without it, a single high-magnitude variable dominates and the others appear flat at the centre.
Aggregation for the audience cadence: A line chart of daily web visits looks chaotic; the same data rolled up to weekly or monthly looks like a trend. The cadence of the chart should match the cadence of the decision the dashboard supports — daily for operations, weekly or monthly for tactics, quarterly or annually for strategy.
Discretisation for grouped charts: A continuous age variable becomes a meaningful axis on a grouped bar chart only when binned into bands (18–24, 25–34, …). Quantile-based binning produces equally populated bins that read cleanly across categories.
Encoding for categorical visuals: One-hot encoding before clustering or modelling is invisible in the final chart, but the labels attached to the resulting clusters or bars come from the original categorical variable. Keep the human-readable labels alongside the encoded versions.
Reverse the transformation for reporting: A model that trains on log-transformed quantities must report in original units on the dashboard. A z-score reported on an executive dashboard means nothing without context — convert back to standard deviations from the target or percentage variance before display.
Tool-native transformations: Tableau’s Quick Table Calculations (running total, percent of total, year-over-year), Power BI’s DAX measures and Power Query M steps, and dbt’s snapshots and models are all transformation-as-visualisation primitives. Use them rather than transforming upstream when the audience’s question is intrinsic to the chart.
23.12 Common Pitfalls
Scaling Test Data with Test Statistics: Computing the scaling parameters on the test set, leaking information from the test back into training. Always fit the scaler on training data and apply to test data.
One-Hot Explosion: One-hot encoding a high-cardinality variable (postcode, customer ID), producing thousands of sparse columns. Use a high-cardinality encoder instead.
Target Encoding Without Cross-Validation: Computing the target mean per category over the entire training set, leaking the target into the feature. Use cross-validated or smoothed encoding.
Log of Zero or Negative: Applying log(x) to data containing zeroes or negatives. Use log(1 + x) for non-negative data, or Yeo-Johnson for mixed sign.
Min-Max Sensitivity to Outliers: A single extreme value compressing the rest of the variable into a narrow band near zero.
Ordinal Encoding of Nominal Categories: Assigning integers to Mumbai = 1, Delhi = 2, Chennai = 3 and feeding the result to a linear model that interprets the integers as ordered.
Aggregating Before Inspecting: Computing means and totals before checking for missing values, weekend zeroes, or pipeline gaps that will distort the aggregate.
Aggregating Across Heterogeneous Groups: Computing a single mean across populations that differ in important ways; Simpson’s-paradox-class errors.
Discretisation That Discards Signal: Binning a continuous predictor too coarsely and losing the relationship the model could have exploited.
Forgotten Reversal: Applying a transformation in training and forgetting to invert it when reporting predictions in original units.
Different Transforms in Train and Production: A subtle pipeline drift where training and production scale or encode differently. Codify the transformation pipeline and version it.
Magic Numbers in Bin Boundaries: Hard-coding bin boundaries that are not aligned with the business or regulatory definitions the audience expects.
Over-Engineering: Producing dozens of variants of every variable when a small thoughtful set is more useful.
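The first of these pitfalls — fitting scaling parameters on test data — is avoided by fitting on training data only and reusing those parameters everywhere. A minimal Python sketch:

```python
import statistics

def fit_scaler(train):
    """Learn scaling parameters (mean, sample SD) from the TRAINING data only."""
    return statistics.mean(train), statistics.stdev(train)

def apply_scaler(values, mu, sigma):
    """Apply previously fitted parameters — never refit on test data."""
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 22.0]

mu, sigma = fit_scaler(train)           # fitted once, on train
z_test = apply_scaler(test, mu, sigma)  # test scaled with TRAIN statistics
print([round(z, 2) for z in z_test])    # → [1.9, 2.53]
```

This is the same fit/transform split that scikit-learn's scalers enforce with separate fit() and transform() methods; codifying it also guards against the train-production drift pitfall above.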
23.13 Illustrative Cases
The following short cases illustrate transformation decisions in practice. They describe common situations and the reasoning behind the design.
Credit Risk Model with Skewed Income
A credit-scoring model has applicant income as a predictor. The raw distribution is right-skewed, with a long tail of high earners. Linear regression on the raw variable fits poorly; a log transformation \(\log(1 + \text{income})\) produces a much closer-to-normal distribution and improves both fit and interpretability — a one-unit increase on the log scale is a multiplicative effect on the original scale.
Clustering Customers with Variables on Different Scales
A retailer wants to cluster customers using annual spend in rupees (range 0 to 50,00,000), visit frequency per year (range 0 to 100), and recency in days (range 0 to 365). Without scaling, k-means is dominated by the first variable. The team applies z-score standardisation to all three before clustering, and the resulting segments make business sense for the first time.
Sales Roll-Up for a Quarterly Review
A sales dashboard shows daily transaction counts. The board reviews the firm quarterly, not daily. The team builds aggregation pipelines that roll daily transactions up to weekly, monthly, and quarterly summaries, with year-on-year comparisons computed at each level. The same underlying dataset now serves operational dashboards, weekly management views, and quarterly board reports without recalculation.
High-Cardinality Encoding for an Indian Pincode Variable
A churn model uses pincode — a six-digit Indian postal code with several thousand distinct values — as a predictor. One-hot encoding would produce a sparse matrix with thousands of columns. The team uses target encoding with cross-validation, replacing each pincode with the cross-validated mean churn rate among customers in that pincode, smoothed against the global mean to avoid noise on rare codes.
A Pitfall of Transformation Without Reversal
A demand-forecast model trains on log-transformed quantities and produces excellent log-scale forecasts. A junior analyst publishes the log values directly to the operations dashboard. Operators read the numbers as if they were quantities and order accordingly. The lesson: every transformation applied for modelling must be inverted when reporting in business units, and the inversion must be unit-tested.
23.14 Hands-On Exercise: Transformation and Standardisation in Power BI and Excel
Aim: Apply the three most common data transformations — min-max normalisation, z-score standardisation, and time-based aggregation — using Power BI and Excel, and visualise the before-and-after.
Scenario: An analyst at Yuvijen Stores Pvt Ltd must put four very different KPIs onto a single radar chart for the board’s quarterly review:
- Revenue — measured in lakh of rupees (range 50–120)
- NPS — measured on a 0–100 scale (range 30–60)
- Fulfilment rate — percentage (range 85–98)
- Delivery time — measured in hours (range 20–50; lower is better)
Without transformation, the radar chart is dominated by the largest-magnitude variable; standardisation puts them all on a comparable footing.
Deliverable: An Excel workbook and a Power BI report demonstrating the three transformations on the same dataset.
23.14.1 Step 1 — The Sample Data
| month | revenue_lakh | nps | fulfilment_pct | delivery_hours |
|---|---|---|---|---|
| 2025-04 | 64 | 38 | 88 | 44 |
| 2025-05 | 70 | 39 | 89 | 42 |
| 2025-06 | 72 | 40 | 91 | 41 |
| 2025-07 | 75 | 41 | 92 | 40 |
| 2025-08 | 78 | 43 | 92 | 39 |
| 2025-09 | 80 | 44 | 93 | 38 |
| 2025-10 | 84 | 45 | 94 | 37 |
| 2025-11 | 88 | 46 | 94 | 36 |
| 2025-12 | 96 | 46 | 95 | 35 |
| 2026-01 | 92 | 47 | 95 | 34 |
| 2026-02 | 100 | 48 | 96 | 33 |
| 2026-03 | 110 | 50 | 97 | 32 |
23.14.2 Step 2 — Min-Max Normalisation in Excel
Open the dataset in Excel. In a new column for each variable, apply min-max normalisation so every value falls between 0 and 1. The Excel formula:
=(B2 - MIN($B$2:$B$13)) / (MAX($B$2:$B$13) - MIN($B$2:$B$13))
For delivery_hours, where lower is better, reverse the direction by computing 1 - normalised_value so that the highest score still represents the best performance.
After applying the formula to all four KPI columns, every transformed value is between 0 and 1, and the four KPIs are directly comparable on the same scale.
| Metric | Min | Max | Latest Value | Normalised Latest |
|---|---|---|---|---|
| Revenue (lakh) | 64 | 110 | 110 | (110-64)/(110-64) = 1.000 |
| NPS | 38 | 50 | 50 | (50-38)/(50-38) = 1.000 |
| Fulfilment (%) | 88 | 97 | 97 | (97-88)/(97-88) = 1.000 |
| Delivery (hrs)* | 32 | 44 | 32 | 1 - (32-32)/(44-32) = 1.000 |
* Reversed because lower delivery time is better.
23.14.3 Step 3 — Min-Max Normalisation in Power BI
In Power BI Desktop, use DAX measures to compute the same normalisation dynamically:
Min Revenue =
CALCULATE(MIN('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Max Revenue =
CALCULATE(MAX('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Normalised Revenue =
DIVIDE(
    SELECTEDVALUE('kpi_quarterly'[revenue_lakh]) - [Min Revenue],
    [Max Revenue] - [Min Revenue]
)
In a month-level visual, SELECTEDVALUE returns the current month's value, while ALLSELECTED lets the min and max span every month the user's slicers allow rather than collapsing to the single row.
Repeat for NPS and fulfilment percentage. For delivery hours (lower is better):
Normalised Delivery =
VAR MinD = CALCULATE(MIN('kpi_quarterly'[delivery_hours]), ALLSELECTED('kpi_quarterly'))
VAR MaxD = CALCULATE(MAX('kpi_quarterly'[delivery_hours]), ALLSELECTED('kpi_quarterly'))
RETURN
    1 - DIVIDE(SELECTEDVALUE('kpi_quarterly'[delivery_hours]) - MinD, MaxD - MinD)
The DAX-based approach is dynamic — when filters or slicers narrow the dataset, the min and max recalculate, and the normalisation adjusts accordingly. The Excel approach pins the bounds to a fixed range; the Power BI approach reflects whatever subset the user has selected.
23.14.4 Step 4 — Z-Score Standardisation in Excel
Excel ships with a built-in STANDARDIZE function that returns the z-score directly:
=STANDARDIZE(B2, AVERAGE($B$2:$B$13), STDEV.S($B$2:$B$13))
The formula returns (value − mean) / sample-standard-deviation — the canonical z-score. After applying it to all four KPI columns, each value is interpreted as standard deviations from the mean, regardless of original units.
For delivery hours (lower is better), simply negate the result so that a high z-score still represents better performance:
=-STANDARDIZE(B2, AVERAGE($B$2:$B$13), STDEV.S($B$2:$B$13))
| Metric | Mean | Std Dev | Latest Value | Z-Score (Latest) |
|---|---|---|---|---|
| Revenue (lakh) | 84.08 | 13.58 | 110 | (110-84.08)/13.58 ≈ 1.91 |
| NPS | 43.92 | 3.78 | 50 | (50-43.92)/3.78 ≈ 1.61 |
| Fulfilment (%) | 93.00 | 2.73 | 97 | (97-93)/2.73 ≈ 1.47 |
| Delivery (hrs)* | 37.58 | 3.75 | 32 | -(32-37.58)/3.75 ≈ 1.49 |
* Negated because lower delivery time is better.
The four values land in a comparable range — between roughly 1.5 and 1.9 — and any of them could now sit on a shared axis.
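The latest-month z-scores can be recomputed in a few lines of Python; statistics.stdev uses the same sample (n − 1) convention as Excel's STDEV.S:

```python
import statistics

# Monthly KPI series from the Step 1 table, April 2025 to March 2026
kpis = {
    "revenue_lakh":   [64, 70, 72, 75, 78, 80, 84, 88, 96, 92, 100, 110],
    "nps":            [38, 39, 40, 41, 43, 44, 45, 46, 46, 47, 48, 50],
    "fulfilment_pct": [88, 89, 91, 92, 92, 93, 94, 94, 95, 95, 96, 97],
    "delivery_hours": [44, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32],
}

for name, series in kpis.items():
    mu, sigma = statistics.mean(series), statistics.stdev(series)
    z = (series[-1] - mu) / sigma
    if name == "delivery_hours":
        z = -z                       # negate: lower delivery time is better
    print(f"{name}: mean={mu:.2f} sd={sigma:.2f} z={z:.2f}")
```

The printed means, standard deviations, and z-scores match the hand-computed table values.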
23.14.5 Step 5 — Z-Score Standardisation in Power BI
In DAX:
Avg Revenue =
CALCULATE(AVERAGE('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Std Revenue =
CALCULATE(STDEV.S('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
ZScore Revenue =
DIVIDE(
    SELECTEDVALUE('kpi_quarterly'[revenue_lakh]) - [Avg Revenue],
    [Std Revenue]
)
Repeat for NPS, fulfilment percentage, and delivery hours (with the sign reversed for delivery as in Excel).
The DAX STDEV.S function returns the sample standard deviation, matching Excel’s STDEV.S — the convention adopted by most modern statistical software.
23.14.6 Step 6 — Time-Based Aggregation in Power BI
Beyond scaling, the most common business transformation is rolling up time. The board does not want to see twelve monthly numbers; it wants four quarterly summaries.
In Power BI, an aggregated DAX measure does this without altering the underlying table:
Quarterly Revenue =
CALCULATE(
SUM('kpi_quarterly'[revenue_lakh]),
DATESINPERIOD(
'Date'[Date],
MAX('Date'[Date]),
-3, MONTH
)
)
The same idea expressed as a Power Query step: Group By → group by quarter, aggregate revenue using Sum. The Power Query approach materialises a new table; the DAX approach computes on the fly.
For a YTD or running total: Revenue YTD = TOTALYTD(SUM('kpi_quarterly'[revenue_lakh]), 'Date'[Date]). Power BI’s time-intelligence functions cover most aggregation patterns without custom DAX.
23.14.7 Step 7 — Visualise Before and After
Build a small Power BI page demonstrating the effect of standardisation:
- Before: A radar (or polygon) chart with the four KPIs in their raw units. Revenue dominates the visual; the other three metrics hug the centre and are illegible.
- After (Min-Max): The same radar with min-max-normalised values. All four KPIs span the same 0-to-1 range; the comparison is visually fair.
- After (Z-Score): A multi-line chart of the four z-score series over twelve months. Each variable’s progression is comparable on a shared y-axis.
The before-and-after pair is the visual proof that transformation is itself a design decision, not just modelling preparation.
23.14.8 Step 8 — Connect to the Visualisation Layer
Several visualisation patterns from earlier modules now make sense as transformations applied at the BI layer:
- Radar and parallel-coordinates charts require either min-max or z-score scaling — without it, the largest-magnitude variable dominates and the others vanish.
- Heatmaps and small multiples of standardised values let the eye compare patterns across metrics with very different units.
- Index charts (a line chart starting at 100 for everyone in the base period) are a min-max-style transformation expressed as a visual idiom.
- Year-on-year and period-on-period comparisons are aggregation transformations that Power BI’s time-intelligence DAX functions render almost free of effort.
The hands-on demonstrates that transformation is rarely an upstream-only concern in modern BI work; it is built into Excel formulas, DAX measures, and Power Query steps that the dashboard developer applies as part of the visualisation pipeline.
Excel workbook (yuvijen-transformations.xlsx), Power BI file (yuvijen-transformations.pbix), the source kpi_quarterly.csv, and screen recordings of the min-max, z-score, and aggregation walk-throughs in both tools will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Transformation Matters | Variables on incompatible scales, skewed distributions, and wrong granularity all need transformation |
| Six Transformation Families | |
| Normalisation Family | Scale a variable to a fixed range, typically 0 to 1 |
| Standardisation Family | Centre and scale by distributional statistics, typically mean and standard deviation |
| Mathematical Transformation Family | Apply a non-linear function to reshape distribution, reduce skew, or stabilise variance |
| Aggregation Family | Change the level of granularity from fine-grained records to summary statistics |
| Encoding Family | Convert categorical values into numerical representations a model can use |
| Discretisation Family | Convert continuous variables into ordered categories or bins |
| Normalisation Methods | |
| Min-Max Scaling | Linear scaling to bounded range; sensitive to outliers |
| Range Scaling | Generalised range scaling to any interval |
| Decimal Scaling | Divide by a power of ten until the largest absolute value is below one |
| Unit-Norm L1 | Scale each row to sum to one across its features; useful for proportional features |
| Unit-Norm L2 | Scale each row to unit Euclidean length; basis for cosine similarity |
| Standardisation Methods | |
| Z-Score Standardisation | Subtract mean and divide by standard deviation; default for parametric methods |
| Robust Scaler | Subtract median and divide by interquartile range; outlier-resistant |
| MaxAbs Scaling | Divide by largest absolute value; preserves zero entries for sparse data |
| Mean Centring | Subtract mean only; preserve unit while removing absolute level |
| Mathematical Transformations | |
| Logarithmic Transformation | Reduce right skew in positive data; income, prices, counts |
| Square Root Transformation | Reduce mild right skew in positive data such as counts |
| Reciprocal Transformation | Reduce heavy right skew; physical rates and ratios |
| Box-Cox Transformation | Power transformation that finds the lambda producing the most normal distribution |
| Yeo-Johnson Transformation | Extension of Box-Cox to non-positive values; mixed sign data |
| Quantile (Rank) Transformation | Replace each value with its rank or quantile to map any distribution to uniform or normal |
| Power Transformation | Generic power transformation for linearising relationships in regression |
| Aggregation Patterns | |
| Group-By Aggregation | Group records by keys and apply summary functions like sum, mean, median, count |
| Time-Based Roll-Ups | Aggregate fine-grained time-stamped data into daily, weekly, monthly summaries |
| Hierarchical Aggregation | Roll up across a hierarchy from leaf entities to higher levels |
| Window Functions | Compute aggregates over sliding or expanding windows; rolling means and YTD sums |
| Pivoting | Move from long-format to wide-format aggregations; rows become matrix cells |
| Pre-Aggregated Cubes | Precomputed multidimensional summaries served by OLAP engines for fast analysis |
| Categorical Encoding | |
| One-Hot Encoding | Each category becomes its own binary column; suitable for unordered with moderate cardinality |
| Label (Ordinal) Encoding | Each category gets an integer code; appropriate for ordinal where order matters |
| Target (Mean) Encoding | Replace category by mean of target for that category; powerful but risks leakage |
| Frequency Encoding | Replace category by its frequency in the dataset; cheap for high cardinality |
| Hash Encoding | Apply hash function and bucket the result; for very high cardinality with memory constraints |
| Binary Encoding | Convert integer label to its binary representation across multiple columns |
| Embedding Encoding | Learned low-dimensional vector representation; standard in deep learning |
| Discretisation Methods | |
| Equal-Width Binning | Divide range into bins of equal width; sensitive to outliers and skew |
| Equal-Frequency Binning | Divide data into bins with equal counts; robust to skew |
| Custom Bins | Bins chosen to align with business or regulatory categories |
| Decision-Tree-Based Binning | Use a decision tree to find splits that best separate the target |
| MDLP Binning | Information-theoretic binning that balances bin count against fit |
| Feature Engineering Operations | |
| Interaction Terms | Products or ratios of two variables that capture combined effect |
| Polynomial Features | Powers of a variable to capture curvature in linear models |
| Date and Time Decomposition | Day-of-week, month, hour-of-day, holiday flag, week-of-year features |
| Lagged and Lead Features | Prior-period or future-period values for time-series modelling |
| Rolling-Window Aggregates | Trailing mean, max, count over a recent window for time-series |
| Cross-Record Features | Per-entity aggregates of recent transactions or events |
| Common Pitfalls | |
| Scaling Test Data with Test Statistics | Pitfall of computing scaling parameters on the test set, leaking information from test to training |
| One-Hot Explosion | Pitfall of one-hot encoding a high-cardinality variable producing thousands of sparse columns |
| Target Encoding Without CV | Pitfall of computing target mean per category over the entire training set, leaking target into feature |
| Log of Zero or Negative | Pitfall of applying log to data containing zero or negative values without offset or generalisation |
| Min-Max Outlier Sensitivity | Pitfall of a single extreme value compressing the rest of the variable near zero |
| Ordinal Encoding of Nominal | Pitfall of assigning integers to nominal categories that the model interprets as ordered |
| Aggregating Before Inspecting | Pitfall of computing means and totals before checking for missing values or pipeline gaps |
| Aggregating Heterogeneous Groups | Pitfall of a single mean across populations that differ in important ways; Simpson-class errors |
| Discretisation Discards Signal | Pitfall of binning too coarsely and losing the predictive relationship |
| Forgotten Reversal | Pitfall of applying a transformation for modelling and forgetting to invert it for reporting |
| Train-Production Pipeline Drift | Pitfall of training and production pipelines scaling or encoding differently |
| Magic Numbers in Bin Boundaries | Pitfall of hard-coded bin boundaries not aligned with business or regulatory definitions |
| Over-Engineering | Pitfall of producing dozens of variants of every variable when a small thoughtful set is more useful |