23  Data Transformation: Normalization, Standardization, and Aggregation

23.1 Why Data Transformation Matters

A model is only as good as the variables it sees, and the variables it sees are the ones the analyst chooses to construct.

Even after a dataset has been cleaned, it is rarely ready for analysis or for visualisation. Variables may live on incompatible scales — a price in thousands of rupees and a quantity of single units cannot share an axis. Distributions may be heavily skewed, so a histogram becomes a single tall bar with a long invisible tail. Categorical labels need to be encoded numerically before a model can use them. The level of granularity may be daily transactions when a board dashboard expects monthly summaries. The analyst must therefore transform the data into a form that the chosen analytical method, and the chosen chart, can use.

The visualisation consequences of transformation are direct. A log transformation rescues a skewed distribution into a readable histogram. Z-score standardisation lets variables on radically different scales appear together on a single radar or parallel-coordinates chart. Time-based aggregation turns an unreadable daily sparkline into a board-ready quarterly bar chart. Tableau, Power BI, and Looker all expose transformations as first-class operations in their data-prep layers (calculated fields, Power Query M, LookML), and most of the chapter’s techniques are routinely applied inside those tools, not only in upstream code.

The discipline of feature transformation has been studied empirically for decades. The original Box–Cox transformation paper (George E. P. Box & David R. Cox, 1964) established the formal mathematical treatment of variance-stabilising and normality-inducing transformations, and the practitioner reference today is Feature Engineering and Selection by Max Kuhn & Kjell Johnson (2019), which catalogues the techniques that make raw data both analytically and visually useful.

23.2 The Major Categories of Transformation

flowchart TD
    T["Data<br>Transformation"]
    T --> N["Normalisation<br>Scale to a fixed range"]
    T --> S["Standardisation<br>Centre and scale by<br>distributional statistics"]
    T --> M["Mathematical<br>Transformation<br>Reshape distribution"]
    T --> A["Aggregation<br>Change level of<br>granularity"]
    T --> E["Encoding<br>Convert categorical to<br>numerical representation"]
    T --> D["Discretisation<br>Convert continuous to<br>categorical"]
    style T fill:#e3f2fd,stroke:#1976D2
    style N fill:#fce4ec,stroke:#AD1457
    style S fill:#fff3e0,stroke:#EF6C00
    style M fill:#fff8e1,stroke:#F9A825
    style A fill:#e8f5e9,stroke:#388E3C
    style E fill:#ede7f6,stroke:#4527A0
    style D fill:#f3e5f5,stroke:#6A1B9A

The six recurring families of transformation:

  • Normalisation: Scale a variable to a fixed range (typically 0 to 1).
  • Standardisation: Centre and scale a variable using its distributional statistics.
  • Mathematical Transformation: Apply a function (log, square root, Box-Cox) to reshape the distribution.
  • Aggregation: Change the level of granularity at which the data is held.
  • Encoding: Convert categorical values into numerical representations.
  • Discretisation (Binning): Convert continuous variables into ordered categories.

23.3 Normalisation

Normalisation rescales a variable so that it falls within a defined range, most commonly 0 to 1. It is essential for distance-based algorithms (k-nearest neighbours, k-means, neural networks) where features on larger scales would dominate features on smaller scales.

Tip: Common Normalisation Methods

| Method | Formula | When to Use |
|---|---|---|
| Min-Max Scaling | \(x' = (x - \min) / (\max - \min)\) | Bounded scale [0, 1]; familiar and easy to explain |
| Range Scaling | \(x' = a + (x - \min)(b - a) / (\max - \min)\) | Generalised range [a, b] |
| Decimal Scaling | \(x' = x / 10^k\) where \(k\) is chosen so \(\max \lvert x' \rvert < 1\) | Quick scaling without distributional assumptions |
| Unit-Norm (L1) | \(x' = x / \sum \lvert x \rvert\) | Each row sums to 1; useful for proportional features |
| Unit-Norm (L2) | \(x' = x / \sqrt{\sum x^2}\) | Each row has unit Euclidean length; cosine similarity |

The principal limitation of min-max scaling is sensitivity to outliers: a single extreme value compresses the rest of the data into a narrow range near zero. When outliers are a concern, robust standardisation or a winsorising step beforehand is the right answer.

23.3.1 Worked Example: Min-Max Normalisation

Example

Consider the following exam scores: 65, 70, 78, 85, 90, 92, 100. The minimum is 65 and the maximum is 100, so the range is 35. The min-max-normalised value of 78 is

\[x' = \frac{78 - 65}{100 - 65} = \frac{13}{35} \approx 0.371\]

The smallest value maps to 0, the largest to 1, and every other value falls proportionally in between.
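
A minimal pandas/scikit-learn sketch of the same calculation (the variable and column names here are illustrative, not part of the worked example):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Exam scores from the worked example
scores = pd.DataFrame({"score": [65, 70, 78, 85, 90, 92, 100]})

# Manual min-max normalisation: (x - min) / (max - min)
lo, hi = scores["score"].min(), scores["score"].max()
scores["score_minmax"] = (scores["score"] - lo) / (hi - lo)

# The same result via scikit-learn's MinMaxScaler
scores["score_minmax_sk"] = MinMaxScaler().fit_transform(scores[["score"]]).ravel()

print(scores)  # 78 maps to 13/35 ≈ 0.371; 65 maps to 0.0, 100 maps to 1.0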

23.4 Standardisation

Standardisation centres the variable on its mean and scales it by its standard deviation, producing values with mean zero and standard deviation one. The transformed variable carries the interpretation of standard deviations from the mean, which is more interpretable across variables than absolute units.

Tip: Common Standardisation Methods

| Method | Formula | When to Use |
|---|---|---|
| Z-Score (Standard Scaling) | \(x' = (x - \mu) / \sigma\) | Default for most parametric methods; assumes finite mean and variance |
| Robust Scaler | \(x' = (x - \mathrm{median}) / \mathrm{IQR}\) | Outlier-resistant alternative to z-score |
| MaxAbs Scaling | \(x' = x / \max \lvert x \rvert\) | Preserves zero entries; useful for sparse data |
| Mean Centring (Only) | \(x' = x - \mu\) | When the unit matters but the absolute level does not |

The z-score is the default for most regression, classification, and clustering pipelines. Trees and tree ensembles (random forests, gradient-boosted trees) are scale-invariant and do not require standardisation. Distance- and gradient-based methods do.

23.4.1 Worked Example: Z-Score Standardisation

Example

Take the same set of exam scores: 65, 70, 78, 85, 90, 92, 100. The mean is approximately 82.857 and the (sample) standard deviation is approximately 12.522. The z-score for the value 78 is

\[z = \frac{78 - 82.857}{12.522} \approx -0.388\]

The transformed value is interpreted directly as about 0.39 standard deviations below the mean, regardless of the original unit. After standardisation, the entire variable has mean zero and standard deviation one, allowing comparison with any other variable on the same scale.
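
The same z-scores in a short pandas sketch. Note that pandas uses the sample standard deviation (ddof = 1) by default, matching the worked example, while scikit-learn's StandardScaler uses the population version, so its numbers differ slightly:

import pandas as pd

scores = pd.Series([65, 70, 78, 85, 90, 92, 100], name="score")

# Z-score: subtract the mean, divide by the sample standard deviation
z = (scores - scores.mean()) / scores.std()

print(round(scores.mean(), 3))  # 82.857
print(round(scores.std(), 3))   # 12.522
print(z.round(3))               # 78 maps to approximately -0.388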

23.5 Mathematical Transformations

Mathematical transformations apply a non-linear function to reshape the distribution of a variable, typically to reduce skewness, stabilise variance, or linearise a relationship.

Tip: Common Mathematical Transformations

| Transformation | Formula | When to Use |
|---|---|---|
| Logarithmic | \(x' = \log(x)\) or \(\log(1 + x)\) | Right-skewed data with positive values; income, prices, counts |
| Square Root | \(x' = \sqrt{x}\) | Mildly right-skewed positive data; counts |
| Reciprocal | \(x' = 1/x\) | Heavy right skew; physical rates |
| Box-Cox | \(x' = (x^\lambda - 1)/\lambda\) for \(\lambda \neq 0\), \(\log(x)\) for \(\lambda = 0\) | Strictly positive data; choose \(\lambda\) to maximise normality |
| Yeo-Johnson | Extension of Box-Cox to non-positive values | Mixed positive and negative data |
| Quantile (Rank) | Replace value with its rank or quantile | Map any distribution to a uniform or normal distribution |
| Power (Generic) | \(x' = x^p\) for arbitrary \(p\) | Linearise relationships in regression |

The Box-Cox transformation introduced by George E. P. Box & David R. Cox (1964) chooses \(\lambda\) data-adaptively to make the transformed distribution as nearly normal as possible. Its limitation is that the input must be strictly positive; the Yeo-Johnson transformation generalises it to any real input.
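
A short sketch of both transformations, assuming SciPy and scikit-learn are available (the simulated data is purely illustrative):

import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1, size=1000)  # right-skewed, strictly positive

# Box-Cox: SciPy estimates lambda by maximum likelihood; input must be strictly positive
income_bc, lam = stats.boxcox(income)
print(f"estimated lambda: {lam:.3f}")

# Yeo-Johnson: handles zero and negative values as well
mixed = rng.normal(size=(1000, 1)) ** 3  # mixed-sign, heavy-tailed
mixed_yj = PowerTransformer(method="yeo-johnson").fit_transform(mixed)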

23.6 Aggregation

Aggregation changes the level of granularity. Daily transaction records are aggregated into monthly totals; individual employees are aggregated into team-level statistics; per-record errors are aggregated into per-product defect rates.

The principal aggregation patterns:

  • Group-By Aggregation: Group records by one or more keys (region, product, customer segment, time period) and apply a summary function (sum, mean, median, min, max, count, distinct count, percentile).

  • Time-Based Roll-Ups: Aggregate fine-grained time-stamped data into daily, weekly, monthly, or quarterly summaries. The choice of bucket size and the choice of summary statistic both affect what the analysis sees.

  • Hierarchical Aggregation: Roll up across a hierarchy — branch into region into country, SKU into product family into category, employee into team into department. The same data can be summarised at every level of the hierarchy.

  • Window Functions: Compute aggregates over a sliding or expanding window of records (rolling 7-day average, year-to-date sum, prior-month comparison). SQL window functions (OVER, PARTITION BY, ROWS BETWEEN), pandas rolling() and expanding(), and R's slider package (slide_dbl() and related functions) are the standard tools.

  • Pivoting: Move from long-format to wide-format aggregations; rows of (region, month, sales) become a matrix of region-by-month sales.

  • Pre-aggregated Cubes: Precomputed multidimensional summaries served by OLAP engines for fast slice-and-dice analysis.

A common pitfall: aggregating before checking the underlying data. The mean of a daily series with weekend zeroes is not the same statistic as the mean of weekday-only values.
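
A pandas sketch of several of the patterns above (group-by, time-based roll-up, a window function, and pivoting); the table, column names, and values are assumptions made for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2025-04-01", periods=365, freq="D").repeat(2),
    "region": ["North", "South"] * 365,
    "sales": rng.integers(50, 500, size=730),
})

# Group-by aggregation: one summary row per region
by_region = daily.groupby("region")["sales"].agg(["sum", "mean", "count"])

# Time-based roll-up: monthly totals per region
monthly = daily.set_index("date").groupby("region")["sales"].resample("MS").sum()

# Window function: trailing 7-day rolling mean within each region
daily = daily.sort_values("date")
daily["sales_7d_avg"] = (daily.groupby("region")["sales"]
                              .transform(lambda s: s.rolling(7, min_periods=1).mean()))

# Pivoting: long (region, month, sales) to a wide region-by-month matrix
wide = monthly.reset_index().pivot(index="region", columns="date", values="sales")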

23.7 Encoding Categorical Variables

Most modelling techniques require numerical input. Categorical variables must therefore be encoded; the main options follow, with a short pandas sketch after the list:

  • One-Hot Encoding (Dummy Variables): Each category becomes its own binary column. Suitable for unordered categorical variables with a moderate number of levels (under fifty or so). Drops one level as the reference to avoid the dummy-variable trap in linear models.

  • Label (Ordinal) Encoding: Each category gets an integer code. Appropriate for genuinely ordinal categories (low, medium, high) where the ordering matters. Misleading for nominal categories where it implies a non-existent order.

  • Target (Mean) Encoding: Each category is replaced by the mean of the target variable for that category. Powerful for high-cardinality categorical variables but risks data leakage; cross-validated or smoothed encoding is the safe form.

  • Frequency Encoding: Each category is replaced by its frequency or relative frequency in the dataset. Cheap and effective for high-cardinality variables.

  • Hash Encoding: Apply a hash function and bucket the result. Useful for very high-cardinality features when memory is constrained.

  • Binary Encoding: Convert the integer label to its binary representation across multiple columns. Compromise between one-hot and label encoding.

  • Embedding (Learned): Learn a low-dimensional vector representation of each category as part of the model. Standard in deep learning; requires sufficient data.
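
A compact pandas sketch of three of the encodings above (one-hot, ordinal, and frequency); the example categories are illustrative:

import pandas as pd

df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi", "Mumbai"],
    "priority": ["low", "high", "medium", "low", "medium", "high"],
})

# One-hot encoding; drop_first avoids the dummy-variable trap in linear models
onehot = pd.get_dummies(df["city"], prefix="city", drop_first=True)

# Ordinal encoding for a genuinely ordered category
df["priority_code"] = df["priority"].map({"low": 0, "medium": 1, "high": 2})

# Frequency encoding: replace each city with its relative frequency
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))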

23.8 Discretisation and Binning

Discretisation converts a continuous variable into a small number of ordered categories. It can simplify reporting, reduce noise, capture non-linear effects in linear models, and align with regulatory or business definitions (age bands, income brackets, severity levels).

Common methods:

  • Equal-Width Binning: Divide the range into bins of equal width. Easy to compute; sensitive to outliers and skew.

  • Equal-Frequency (Quantile) Binning: Divide the data into bins with equal counts. Robust to skew; useful for stratified comparisons.

  • Custom Bins: Bins chosen to align with business or regulatory categories — for example, age bands of 18-24, 25-34, 35-44, … .

  • Decision-Tree-Based Binning: Use a single-variable decision tree to find splits that best separate the target. The bins are explicitly informative for the model.

  • MDLP (Minimum Description Length Principle): Information-theoretic binning that balances bin count against fit.

The trade-off is information loss: discretisation discards the within-bin variation. Use it when the simplification is worth the loss; otherwise, leave the variable continuous.
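
A short pandas sketch of the first three methods, using illustrative ages:

import pandas as pd

ages = pd.Series([19, 23, 27, 31, 38, 42, 45, 53, 61, 70], name="age")

# Equal-width binning: four bins of equal width across the observed range
equal_width = pd.cut(ages, bins=4)

# Equal-frequency (quantile) binning: four bins with roughly equal counts
equal_freq = pd.qcut(ages, q=4)

# Custom bins aligned with business-style age bands
bands = pd.cut(ages, bins=[18, 24, 34, 44, 54, 120],
               labels=["18-24", "25-34", "35-44", "45-54", "55+"])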

23.9 Feature Engineering: A Brief Note

Transformation, encoding, and aggregation are the building blocks of feature engineering — the deliberate construction of variables that capture the patterns relevant to a model. Max Kuhn & Kjell Johnson (2019) emphasise that feature engineering is often the highest-leverage step in a predictive pipeline; a thoughtfully engineered feature can outperform a great deal of model-tuning effort.

A few additional feature-engineering operations beyond the categories above:

  • Interaction terms: Products or ratios of two variables that capture combined effect.
  • Polynomial features: Powers of a variable to capture curvature.
  • Date and time decomposition: Day-of-week, month, hour-of-day, holiday flag, week-of-year.
  • Lagged and lead features: Prior-period values for time-series.
  • Rolling-window aggregates: Trailing mean, max, count over a recent window.
  • Cross-record features: Per-customer aggregates over recent transactions.

The discipline is to design features around the question, not to bolt them on by reflex.
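
A brief pandas sketch of a few of these operations on an illustrative daily series (names and values are assumptions):

import pandas as pd

demand = pd.DataFrame({
    "date": pd.date_range("2025-04-01", periods=90, freq="D"),
    "units": range(100, 190),
})

# Date and time decomposition
demand["day_of_week"] = demand["date"].dt.dayofweek
demand["month"] = demand["date"].dt.month
demand["is_weekend"] = demand["day_of_week"] >= 5

# Lagged features: prior day and prior week
demand["units_lag_1"] = demand["units"].shift(1)
demand["units_lag_7"] = demand["units"].shift(7)

# Rolling-window aggregate: trailing 7-day mean
demand["units_7d_mean"] = demand["units"].rolling(7, min_periods=1).mean()

# Interaction-style feature: today's demand relative to its recent trend
demand["units_vs_trend"] = demand["units"] / demand["units_7d_mean"]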

23.10 Choosing the Right Transformation

Tip: A Decision Guide

| Goal | Recommended Transformation |
|---|---|
| Variables on incompatible scales for distance methods | Min-max normalisation or z-score standardisation |
| Outlier-prone variables | Robust scaler (median, IQR) or winsorise then standardise |
| Heavily right-skewed positive variable | Log or Box-Cox |
| Skewed variable with negative values | Yeo-Johnson |
| Reduce noise, capture non-linearity for linear model | Discretisation with custom bins |
| Categorical variable with few levels | One-hot encoding |
| Categorical variable with many levels | Target, frequency, or hash encoding |
| Genuinely ordinal categorical | Ordinal (label) encoding |
| Time-series fine-grained data summarised at higher level | Time-based roll-up with appropriate window |
| Capture rate of change | Lag, difference, or rolling-window features |
| Tree-based model | Encoding required; scaling not needed |
| Linear, distance, or gradient-based model | Both encoding and scaling required |

23.11 Transformations and the Chart

Transformations are not only modelling preparation. They are visual decisions. The same dataset rendered with and without an appropriate transformation produces very different chart messages.

  • Log scales for skewed magnitudes: Income, sales, market capitalisation, and most counts are right-skewed. A linear-axis bar chart gives all the visual weight to the largest values; a log-axis chart compresses the range and lets the eye see the shape of the distribution. Tableau, Power BI, and ggplot2 all support log axes natively (in Tableau, Edit Axis → Logarithmic; in Power BI, Y-axis → Scale type: Log).

  • Standardisation for multi-variable comparison: A radar chart, a parallel-coordinates chart, or a heatmap of features must operate on standardised values. Without it, a single high-magnitude variable dominates and the others appear flat at the centre.

  • Aggregation for the audience cadence: A line chart of daily web visits looks chaotic; the same data rolled up to weekly or monthly looks like a trend. The cadence of the chart should match the cadence of the decision the dashboard supports — daily for operations, weekly or monthly for tactics, quarterly or annually for strategy.

  • Discretisation for grouped charts: A continuous age variable becomes a meaningful axis on a grouped bar chart only when binned into bands (18–24, 25–34, …). Quantile-based binning produces equally populated bins that read cleanly across categories.

  • Encoding for categorical visuals: One-hot encoding before clustering or modelling is invisible in the final chart, but the labels attached to the resulting clusters or bars come from the original categorical variable. Keep the human-readable labels alongside the encoded versions.

  • Reverse the transformation for reporting: A model that trains on log-transformed quantities must report in original units on the dashboard. A z-score reported on an executive dashboard means nothing without context — convert back to standard deviations from the target or percentage variance before display.

  • Tool-native transformations: Tableau’s Quick Table Calculations (running total, percent of total, year-over-year), Power BI’s DAX measures and Power Query M steps, and dbt’s snapshots and models are all transformation-as-visualisation primitives. Use them rather than transforming upstream when the audience’s question is intrinsic to the chart.

23.12 Common Pitfalls

  • Scaling Test Data with Test Statistics: Computing the scaling parameters on the test set, leaking information from the test back into training. Always fit the scaler on training data and apply it to test data (see the sketch after this list).

  • One-Hot Explosion: One-hot encoding a high-cardinality variable (postcode, customer ID), producing thousands of sparse columns. Use a high-cardinality encoder instead.

  • Target Encoding Without Cross-Validation: Computing the target mean per category over the entire training set, leaking the target into the feature. Use cross-validated or smoothed encoding.

  • Log of Zero or Negative: Applying log(x) to data containing zeroes or negatives. Use log(1 + x) for non-negative data, or Yeo-Johnson for mixed sign.

  • Min-Max Sensitivity to Outliers: A single extreme value compressing the rest of the variable into a narrow band near zero.

  • Ordinal Encoding of Nominal Categories: Assigning integers to Mumbai = 1, Delhi = 2, Chennai = 3 and feeding the result to a linear model that interprets the integers as ordered.

  • Aggregating Before Inspecting: Computing means and totals before checking for missing values, weekend zeroes, or pipeline gaps that will distort the aggregate.

  • Aggregating Across Heterogeneous Groups: Computing a single mean across populations that differ in important ways; Simpson’s-paradox-class errors.

  • Discretisation That Discards Signal: Binning a continuous predictor too coarsely and losing the relationship the model could have exploited.

  • Forgotten Reversal: Applying a transformation in training and forgetting to invert it when reporting predictions in original units.

  • Different Transforms in Train and Production: A subtle pipeline drift where training and production scale or encode differently. Codify the transformation pipeline and version it.

  • Magic Numbers in Bin Boundaries: Hard-coding bin boundaries that are not aligned with the business or regulatory definitions the audience expects.

  • Over-Engineering: Producing dozens of variants of every variable when a small thoughtful set is more useful.
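
To make the first pitfall concrete, here is a minimal scikit-learn sketch of the correct pattern: the scaler is fitted on the training data only, and its fitted parameters are reused on the test data (the data here is simulated purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline fits the scaler on the training data only; the test data is
# transformed with the training mean and standard deviation, so no test
# statistics leak back into training.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))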

23.13 Illustrative Cases

The following short cases illustrate transformation decisions in practice. They describe common situations and the reasoning behind the design.

Credit Risk Model with Skewed Income

A credit-scoring model has applicant income as a predictor. The raw distribution is right-skewed, with a long tail of high earners. Linear regression on the raw variable fits poorly; a log transformation \(\log(1 + \text{income})\) produces a much closer-to-normal distribution and improves both fit and interpretability — a one-unit increase on the log scale is a multiplicative effect on the original scale.

Clustering Customers with Variables on Different Scales

A retailer wants to cluster customers using annual spend in rupees (range 0 to 50,00,000), visit frequency per year (range 0 to 100), and recency in days (range 0 to 365). Without scaling, k-means is dominated by the first variable. The team applies z-score standardisation to all three before clustering, and the resulting segments make business sense for the first time.

Sales Roll-Up for a Quarterly Review

A sales dashboard shows daily transaction counts. The board reviews the firm quarterly, not daily. The team builds aggregation pipelines that roll daily transactions up to weekly, monthly, and quarterly summaries, with year-on-year comparisons computed at each level. The same underlying dataset now serves operational dashboards, weekly management views, and quarterly board reports without recalculation.

High-Cardinality Encoding for an Indian Pincode Variable

A churn model uses pincode — a six-digit Indian postal code with several thousand distinct values — as a predictor. One-hot encoding would produce a sparse matrix with thousands of columns. The team uses target encoding with cross-validation, replacing each pincode with the cross-validated mean churn rate among customers in that pincode, smoothed against the global mean to avoid noise on rare codes.

A Pitfall of Transformation Without Reversal

A demand-forecast model trains on log-transformed quantities and produces excellent log-scale forecasts. A junior analyst publishes the log values directly to the operations dashboard. Operators read the numbers as if they were quantities and order accordingly. The lesson: every transformation applied for modelling must be inverted when reporting in business units, and the inversion must be unit-tested.


23.14 Hands-On Exercise: Transformation and Standardisation in Power BI and Excel

Aim: Apply the three most common data transformations — min-max normalisation, z-score standardisation, and time-based aggregation — using Power BI and Excel, and visualise the before-and-after.

Scenario: An analyst at Yuvijen Stores Pvt Ltd must put four very different KPIs onto a single radar chart for the board’s quarterly review:

  • Revenue — measured in lakh of rupees (range 50–120)
  • NPS — measured on a 0–100 scale (range 30–60)
  • Fulfilment rate — percentage (range 85–98)
  • Delivery time — measured in hours (range 20–50; lower is better)

Without transformation, the radar chart is dominated by the largest-magnitude variable; standardisation puts them all on a comparable footing.

Deliverable: An Excel workbook and a Power BI report demonstrating the three transformations on the same dataset.

23.14.1 Step 1 — The Sample Data

Tip: kpi_quarterly.csv (twelve months)
month revenue_lakh nps fulfilment_pct delivery_hours
2025-04 64 38 88 44
2025-05 70 39 89 42
2025-06 72 40 91 41
2025-07 75 41 92 40
2025-08 78 43 92 39
2025-09 80 44 93 38
2025-10 84 45 94 37
2025-11 88 46 94 36
2025-12 96 46 95 35
2026-01 92 47 95 34
2026-02 100 48 96 33
2026-03 110 50 97 32

23.14.2 Step 2 — Min-Max Normalisation in Excel

Open the dataset in Excel. In a new column for each variable, apply min-max normalisation so every value falls between 0 and 1. The Excel formula:

=(B2 - MIN($B$2:$B$13)) / (MAX($B$2:$B$13) - MIN($B$2:$B$13))

For delivery_hours, where lower is better, reverse the direction by computing 1 - normalised value so that the highest score still represents the best performance.

After applying the formula to all four KPI columns, every transformed value is between 0 and 1, and the four KPIs are directly comparable on the same scale.

Tip: Excel Min-Max Worked Example

| Metric | Min | Max | Latest Value | Normalised Latest |
|---|---|---|---|---|
| Revenue (lakh) | 64 | 110 | 110 | (110-64)/(110-64) = 1.000 |
| NPS | 38 | 50 | 50 | (50-38)/(50-38) = 1.000 |
| Fulfilment (%) | 88 | 97 | 97 | (97-88)/(97-88) = 1.000 |
| Delivery (hrs)* | 32 | 44 | 32 | 1 - (32-32)/(44-32) = 1.000 |

* Reversed because lower delivery time is better.

23.14.3 Step 3 — Min-Max Normalisation in Power BI

In Power BI Desktop, use DAX measures to compute the same normalisation dynamically:

Min Revenue = CALCULATE(MIN('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Max Revenue = CALCULATE(MAX('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Normalised Revenue =
DIVIDE(
    SELECTEDVALUE('kpi_quarterly'[revenue_lakh]) - [Min Revenue],
    [Max Revenue] - [Min Revenue]
)

Repeat for NPS and fulfilment percentage. For delivery hours (lower is better):

Normalised Delivery =
1 - DIVIDE(
    SELECTEDVALUE('kpi_quarterly'[delivery_hours])
        - CALCULATE(MIN('kpi_quarterly'[delivery_hours]), ALLSELECTED('kpi_quarterly')),
    CALCULATE(MAX('kpi_quarterly'[delivery_hours]), ALLSELECTED('kpi_quarterly'))
        - CALCULATE(MIN('kpi_quarterly'[delivery_hours]), ALLSELECTED('kpi_quarterly'))
)

The DAX-based approach is dynamic: SELECTEDVALUE returns the current month's value when the visual is at month grain, while ALLSELECTED keeps the minimum and maximum computed across every month visible to the current slicers, so when filters narrow the dataset the normalisation adjusts accordingly. The Excel approach pins the bounds to a fixed range; the Power BI approach reflects whatever subset the user has selected.

23.14.4 Step 4 — Z-Score Standardisation in Excel

Excel ships with a built-in STANDARDIZE function that returns the z-score directly:

=STANDARDIZE(B2, AVERAGE($B$2:$B$13), STDEV.S($B$2:$B$13))

The formula returns (value − mean) / sample-standard-deviation — the canonical z-score. After applying it to all four KPI columns, each value is interpreted as standard deviations from the mean, regardless of original units.

For delivery hours (lower is better), simply negate the result so that a high z-score still represents better performance:

=-STANDARDIZE(B2, AVERAGE($B$2:$B$13), STDEV.S($B$2:$B$13))

Tip: Excel Z-Score Worked Example

| Metric | Mean | Std Dev | Latest Value | Z-Score (Latest) |
|---|---|---|---|---|
| Revenue (lakh) | 84.08 | 13.58 | 110 | (110-84.08)/13.58 ≈ 1.91 |
| NPS | 43.92 | 3.78 | 50 | (50-43.92)/3.78 ≈ 1.61 |
| Fulfilment (%) | 93.00 | 2.73 | 97 | (97-93)/2.73 ≈ 1.47 |
| Delivery (hrs)* | 37.58 | 3.75 | 32 | -(32-37.58)/3.75 ≈ 1.49 |

* Negated because lower delivery time is better.

The four values land in a comparable range, roughly 1.5 to 1.9, and any of them could now sit on a shared axis.

23.14.5 Step 5 — Z-Score Standardisation in Power BI

In DAX:

Avg Revenue = CALCULATE(AVERAGE('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))
Std Revenue = CALCULATE(STDEV.S('kpi_quarterly'[revenue_lakh]), ALLSELECTED('kpi_quarterly'))

ZScore Revenue =
DIVIDE(
    SELECTEDVALUE('kpi_quarterly'[revenue_lakh]) - [Avg Revenue],
    [Std Revenue]
)

Repeat for NPS, fulfilment percentage, and delivery hours (with the sign reversed for delivery as in Excel).

The DAX STDEV.S function returns the sample standard deviation, matching Excel’s STDEV.S — the convention adopted by most modern statistical software.

23.14.6 Step 6 — Time-Based Aggregation in Power BI

Beyond scaling, the most common business transformation is rolling up time. The board does not want to see twelve monthly numbers; it wants four quarterly summaries.

In Power BI, an aggregated DAX measure does this without altering the underlying table:

Quarterly Revenue =
CALCULATE(
    SUM('kpi_quarterly'[revenue_lakh]),
    DATESINPERIOD(
        'Date'[Date],
        MAX('Date'[Date]),
        -3, MONTH
    )
)

The same idea expressed as a Power Query step: Group By → group by quarter, aggregate revenue using Sum. The Power Query approach materialises a new table; the DAX approach computes on the fly.

For a YTD or running total: Revenue YTD = TOTALYTD(SUM('kpi_quarterly'[revenue_lakh]), 'Date'[Date]). Power BI’s time-intelligence functions cover most aggregation patterns without custom DAX.

23.14.7 Step 7 — Visualise Before and After

Build a small Power BI page demonstrating the effect of standardisation:

  • Before: A radar (or polygon) chart with the four KPIs in their raw units. Revenue dominates the visual; the other three metrics hug the centre and are illegible.
  • After (Min-Max): The same radar with min-max-normalised values. All four KPIs span the same 0-to-1 range; the comparison is visually fair.
  • After (Z-Score): A multi-line chart of the four z-score series over twelve months. Each variable’s progression is comparable on a shared y-axis.

The before-and-after pair is the visual proof that transformation is itself a design decision, not just modelling preparation.

23.14.8 Step 8 — Connect to the Visualisation Layer

Several visualisation patterns from earlier modules now make sense as transformations applied at the BI layer:

  • Radar and parallel-coordinates charts require either min-max or z-score scaling — without it, the largest-magnitude variable dominates and the others vanish.
  • Heatmaps and small multiples of standardised values let the eye compare patterns across metrics with very different units.
  • Index charts (a line chart in which every series is divided by its base-period value and multiplied by 100, so all series start at 100) are a rescaling transformation expressed as a visual idiom.
  • Year-on-year and period-on-period comparisons are aggregation transformations that Power BI’s time-intelligence DAX functions render almost free of effort.

The hands-on demonstrates that transformation is rarely an upstream-only concern in modern BI work; it is built into Excel formulas, DAX measures, and Power Query steps that the dashboard developer applies as part of the visualisation pipeline.

Tip: Files and Screen Recordings

Excel workbook (yuvijen-transformations.xlsx), Power BI file (yuvijen-transformations.pbix), the source kpi_quarterly.csv, and screen recordings of the min-max, z-score, and aggregation walk-throughs in both tools will be embedded here.


Summary

Concept Description
Foundations
Why Data Transformation Matters Variables on incompatible scales, skewed distributions, and wrong granularity all need transformation
Six Transformation Families
Normalisation Family Scale a variable to a fixed range, typically 0 to 1
Standardisation Family Centre and scale by distributional statistics, typically mean and standard deviation
Mathematical Transformation Family Apply a non-linear function to reshape distribution, reduce skew, or stabilise variance
Aggregation Family Change the level of granularity from fine-grained records to summary statistics
Encoding Family Convert categorical values into numerical representations a model can use
Discretisation Family Convert continuous variables into ordered categories or bins
Normalisation Methods
Min-Max Scaling Linear scaling to bounded range; sensitive to outliers
Range Scaling Generalised range scaling to any interval
Decimal Scaling Divide by a power of ten until the largest absolute value is below one
Unit-Norm L1 Scale each row to sum to one across its features; useful for proportional features
Unit-Norm L2 Scale each row to unit Euclidean length; basis for cosine similarity
Standardisation Methods
Z-Score Standardisation Subtract mean and divide by standard deviation; default for parametric methods
Robust Scaler Subtract median and divide by interquartile range; outlier-resistant
MaxAbs Scaling Divide by largest absolute value; preserves zero entries for sparse data
Mean Centring Subtract mean only; preserve unit while removing absolute level
Mathematical Transformations
Logarithmic Transformation Reduce right skew in positive data; income, prices, counts
Square Root Transformation Reduce mild right skew in positive data such as counts
Reciprocal Transformation Reduce heavy right skew; physical rates and ratios
Box-Cox Transformation Power transformation that finds the lambda producing the most normal distribution
Yeo-Johnson Transformation Extension of Box-Cox to non-positive values; mixed sign data
Quantile (Rank) Transformation Replace each value with its rank or quantile to map any distribution to uniform or normal
Power Transformation Generic power transformation for linearising relationships in regression
Aggregation Patterns
Group-By Aggregation Group records by keys and apply summary functions like sum, mean, median, count
Time-Based Roll-Ups Aggregate fine-grained time-stamped data into daily, weekly, monthly summaries
Hierarchical Aggregation Roll up across a hierarchy from leaf entities to higher levels
Window Functions Compute aggregates over sliding or expanding windows; rolling means and YTD sums
Pivoting Move from long-format to wide-format aggregations; rows become matrix cells
Pre-Aggregated Cubes Precomputed multidimensional summaries served by OLAP engines for fast analysis
Categorical Encoding
One-Hot Encoding Each category becomes its own binary column; suitable for unordered with moderate cardinality
Label (Ordinal) Encoding Each category gets an integer code; appropriate for ordinal where order matters
Target (Mean) Encoding Replace category by mean of target for that category; powerful but risks leakage
Frequency Encoding Replace category by its frequency in the dataset; cheap for high cardinality
Hash Encoding Apply hash function and bucket the result; for very high cardinality with memory constraints
Binary Encoding Convert integer label to its binary representation across multiple columns
Embedding Encoding Learned low-dimensional vector representation; standard in deep learning
Discretisation Methods
Equal-Width Binning Divide range into bins of equal width; sensitive to outliers and skew
Equal-Frequency Binning Divide data into bins with equal counts; robust to skew
Custom Bins Bins chosen to align with business or regulatory categories
Decision-Tree-Based Binning Use a decision tree to find splits that best separate the target
MDLP Binning Information-theoretic binning that balances bin count against fit
Feature Engineering Operations
Interaction Terms Products or ratios of two variables that capture combined effect
Polynomial Features Powers of a variable to capture curvature in linear models
Date and Time Decomposition Day-of-week, month, hour-of-day, holiday flag, week-of-year features
Lagged and Lead Features Prior-period or future-period values for time-series modelling
Rolling-Window Aggregates Trailing mean, max, count over a recent window for time-series
Cross-Record Features Per-entity aggregates of recent transactions or events
Common Pitfalls
Scaling Test Data with Test Statistics Pitfall of computing scaling parameters on the test set, leaking information from test to training
One-Hot Explosion Pitfall of one-hot encoding a high-cardinality variable producing thousands of sparse columns
Target Encoding Without CV Pitfall of computing target mean per category over the entire training set, leaking target into feature
Log of Zero or Negative Pitfall of applying log to data containing zero or negative values without offset or generalisation
Min-Max Outlier Sensitivity Pitfall of a single extreme value compressing the rest of the variable near zero
Ordinal Encoding of Nominal Pitfall of assigning integers to nominal categories that the model interprets as ordered
Aggregating Before Inspecting Pitfall of computing means and totals before checking for missing values or pipeline gaps
Aggregating Heterogeneous Groups Pitfall of a single mean across populations that differ in important ways; Simpson-class errors
Discretisation Discards Signal Pitfall of binning too coarsely and losing the predictive relationship
Forgotten Reversal Pitfall of applying a transformation for modelling and forgetting to invert it for reporting
Train-Production Pipeline Drift Pitfall of training and production pipelines scaling or encoding differently
Magic Numbers in Bin Boundaries Pitfall of hard-coded bin boundaries not aligned with business or regulatory definitions
Over-Engineering Pitfall of producing dozens of variants of every variable when a small thoughtful set is more useful