22  Data Cleaning Techniques

22.1 Why Data Cleaning Matters

“Eighty per cent of an analytics project is data preparation” is not a saying because data preparation is interesting. It is a saying because it is true.

Real-world data arrives messy. Fields are missing, duplicates lurk, formats disagree, units are inconsistent, and outliers — some genuine, some erroneous — skew every summary statistic. Data cleaning is the disciplined process of turning this raw input into a dataset that can be analysed with confidence. It is the unglamorous work on which every model, dashboard, and report ultimately rests.

The visualisation consequences are everywhere. A dashboard that shows duplicated customers will overstate revenue. A bar chart whose y-axis is dominated by an erroneous outlier will compress the meaningful comparisons into a flat band. A choropleth map shaded by counts that should have been rates will look impressive and tell the wrong story. Dashboards in Tableau, Power BI, Looker, or Qlik inherit every cleaning failure in the data feeding them — and the visual confidence of the chart obscures the underlying problem. Cleaning is therefore not a back-office concern; it is a precondition for every chart the firm draws.

The classic survey by Erhard Rahm & Hong Hai Do (2000) catalogues the recurring families of cleaning problems and the approaches that have been developed to address them. The conceptual reorganisation that turned the practitioner art into a teachable discipline came with Hadley Wickham (2014) and the tidy data principles, which are now embedded in the structure of the modern Python and R analytical stacks — and increasingly in the data-preparation layers of Tableau Prep, Power Query, and dbt.

22.2 The Data Cleaning Process

```mermaid
flowchart LR
    A["1. Profile<br>and assess"] --> B["2. Standardise<br>structure"]
    B --> C["3. Resolve<br>inconsistencies"]
    C --> D["4. Handle<br>missing values"]
    D --> E["5. Detect and treat<br>outliers"]
    E --> F["6. Remove<br>duplicates"]
    F --> G["7. Validate<br>and type"]
    G --> H["8. Document<br>and version"]
    H -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C
    style G fill:#f3e5f5,stroke:#6A1B9A
    style H fill:#eceff1,stroke:#455A64
```

A pragmatic eight-step cleaning workflow:

  • Profile and assess: Understand the structure, distributions, missingness, and oddities of the dataset before changing anything.
  • Standardise structure: Reshape the data into a tidy form — one observation per row, one variable per column.
  • Resolve inconsistencies: Case, whitespace, encoding, units, and category labels.
  • Handle missing values: Deletion or imputation, chosen according to the missingness mechanism.
  • Detect and treat outliers: Distinguish genuine extreme values from data errors.
  • Remove duplicates: Exact and fuzzy duplicates of records and entities.
  • Validate and type: Enforce constraints, ranges, formats, and correct data types.
  • Document and version: Record what was changed, why, and produce reproducible cleaning code.

A clean dataset that no one knows how to reproduce is a dataset that will quietly diverge from the source over time.
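
Step 1 (profile and assess) in a minimal pandas sketch; the file name customers.csv and its columns are hypothetical, but the checks are the ones every profiling pass should run before anything is changed:

```python
import pandas as pd

# Hypothetical customer extract; any CSV with an ID, an email, and a date column will do.
df = pd.read_csv("customers.csv")

# Structure: row and column counts, and the types pandas inferred.
print(df.shape)
print(df.dtypes)

# Missingness: share of nulls per column, worst first.
print(df.isna().mean().sort_values(ascending=False))

# Oddities: exact duplicate rows and a quick look at every distribution.
print("exact duplicate rows:", df.duplicated().sum())
print(df.describe(include="all").T)
```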

22.3 Tidy Data Principles

Hadley Wickham (2014) set out three structural principles for analytical-ready datasets, which together form the tidy data discipline:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

A dataset that obeys all three is tidy; one that does not is messy. Tidy data is not pretty data — it is data structured so that the computer (and the analyst) can manipulate, summarise, and visualise it without first untangling it.

Common forms of messiness Wickham identifies:

  • Column headers are values, not variable names: A wide table with columns like 2022, 2023, 2024 should be reshaped to two columns, year and value.
  • Multiple variables stored in one column: A column that mixes a date and a category needs to be split.
  • Variables stored in both rows and columns: Mixed wide-and-long shapes that require pivoting.
  • Multiple types of observational unit in one table: Customer attributes, order attributes, and product attributes that should live in separate tables.
  • A single observational unit stored across multiple tables: Data fragmented across files that should be a single unified table.

The pivot operations in tidyverse and pandas — pivot_longer, pivot_wider, melt, pivot_table — exist precisely to turn messy shapes into tidy ones.
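
A minimal pandas sketch of the first form of messiness (column headers that are values), assuming an illustrative wide table of revenue by region and year:

```python
import pandas as pd

# Wide (messy): one column per year, so the year is a value hidden in the header.
wide = pd.DataFrame({
    "region": ["North", "South"],
    "2022": [120, 95],
    "2023": [130, 101],
    "2024": [142, 99],
})

# Tidy (long): one observation per row, with explicit year and revenue columns.
tidy = wide.melt(id_vars="region", var_name="year", value_name="revenue")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```

The same reshape is pivot_longer() in tidyverse and Unpivot Columns in Power Query.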

22.4 Handling Missing Values

Missing values are the most common cleaning problem. Their treatment depends on why they are missing, which is rarely obvious from the data alone.

Tip: The Three Mechanisms of Missingness

| Mechanism | Meaning | Example | Treatment Implication |
|---|---|---|---|
| MCAR (Missing Completely at Random) | Missingness has no relation to any value, observed or unobserved | A sensor fails at random times | Listwise deletion is unbiased but loses sample size |
| MAR (Missing at Random) | Missingness depends on observed variables but not on the missing value itself | Older respondents skip an income question | Multiple imputation can recover unbiased estimates |
| MNAR (Missing Not at Random) | Missingness depends on the missing value itself | High earners refuse to disclose income | Hard problem; requires modelling the missingness mechanism explicitly |
Tip: Common Treatments

| Method | When to Use | Caveat |
|---|---|---|
| Listwise Deletion | MCAR, plenty of data | Reduces sample size; biased under MAR or MNAR |
| Pairwise Deletion | When complete-case analysis loses too many rows | Different statistics computed on different samples |
| Mean / Median Imputation | Quick exploratory work | Distorts variance and correlations |
| Mode Imputation (categorical) | Categorical fields with few missing | Distorts category proportions |
| Forward / Backward Fill | Time-series with persistence | Inappropriate for non-persistent variables |
| Regression Imputation | When a related variable predicts the missing one | Ignores prediction uncertainty |
| K-Nearest-Neighbour Imputation | Mixed-type data with similar records | Distance metric and k must be chosen carefully |
| Multiple Imputation (MICE) | MAR data, formal inference required | Computationally heavier; produces several imputed datasets |
| Indicator Variable | Missingness itself may be informative | Add a flag column alongside imputation |

The right choice depends on the missingness mechanism, the analytical goal, and the cost of error. For exploratory dashboards, median imputation may be acceptable; for regulatory or scientific work, multiple imputation is the modern standard.
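
A hedged sketch of two of the treatments above in pandas, on a hypothetical income field: median imputation paired with an indicator column. MICE-style multiple imputation would instead use a dedicated library (mice in R, or scikit-learn's IterativeImputer).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 29, 62, 45],
                   "income": [52000, np.nan, 38000, np.nan, 61000]})

# Keep the information that the value was missing before filling it in.
df["income_missing"] = df["income"].isna()

# Median imputation: acceptable for exploratory work, but it shrinks variance
# and weakens correlations, so flag it and document it.
df["income_imputed"] = df["income"].fillna(df["income"].median())
print(df)
```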

22.5 Detecting and Handling Outliers

An outlier is an observation that lies far from the bulk of the data. Outliers can be genuine extreme values (a billionaire in a customer base) or data errors (a typed extra zero). Treatment must distinguish the two.

Detection methods:

  • IQR Rule: Values below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\). Robust, distribution-free, the default behind boxplot whiskers.

  • Z-Score: Values more than three standard deviations from the mean. Assumes approximate normality.

  • Modified Z-Score (Median Absolute Deviation): Robust alternative for non-normal data.

  • Tukey’s Fences: The formal statement of the IQR rule, with inner fences at 1.5 × IQR and outer fences at 3 × IQR for flagging extreme points.

  • Mahalanobis Distance: Multivariate detection of points far from the multivariate centre.

  • Isolation Forest, Local Outlier Factor (LOF), DBSCAN: Machine-learning methods for high-dimensional data.

Treatment options:

  • Investigate first: A genuine outlier is information; an erroneous one is noise. Always look at the records before acting.
  • Correct if erroneous: Fix typos, unit errors, or pipeline bugs upstream.
  • Cap (winsorise): Replace values beyond a threshold with the threshold. Preserves sample size; bounds influence on summary statistics.
  • Transform: Log, square root, or Box-Cox transformations reduce the influence of long-tailed extremes.
  • Separate analysis: Analyse outliers as their own segment.
  • Remove only as last resort: Deletion is the most aggressive treatment and should require a documented reason.

The cardinal rule: never silently delete an outlier. Every removal should have a recorded justification.
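
A sketch of the IQR rule and winsorising in pandas, on an illustrative series in which one value is a suspected data-entry error:

```python
import pandas as pd

s = pd.Series([120, 135, 128, 142, 9100, 131, 125])  # 9100 looks like an extra digit

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: flag values outside the fences for investigation, not deletion.
outliers = s[(s < lower) | (s > upper)]
print("flagged for investigation:", outliers.to_dict())

# Winsorising: cap at the fences, preserving sample size and bounding influence.
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```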

22.6 Removing Duplicates

Duplicates come in two flavours:

  • Exact duplicates: Identical rows produced by data-entry repetition or pipeline mistakes. Easy to identify with duplicated() (R), drop_duplicates() (pandas), or DISTINCT in SQL.

  • Fuzzy duplicates (entity resolution): The same real-world entity represented by two or more records that differ in spelling, formatting, or completeness. “Acme Corp Ltd”, “ACME CORPORATION LTD.”, and “Acme Corp.” are typically the same firm.

Fuzzy-duplicate detection techniques:

  • String similarity: Levenshtein, Jaro-Winkler, n-gram, or token-set ratios.
  • Phonetic algorithms: Soundex, Metaphone, Double Metaphone for names that sound similar.
  • Standardisation first: Normalise case, trim whitespace, and expand abbreviations before comparing.
  • Blocking and clustering: For very large datasets, only compare records sharing a key prefix or block.
  • Probabilistic record linkage: Fellegi-Sunter and modern Bayesian variants quantify the probability that two records are the same.
  • Specialist tools: dedupe.io, Splink, RecordLinkage in R; many commercial master-data-management platforms.

Always retain a mapping from the original duplicates to the resolved master record. Without it, the deduplication is irreproducible.
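
A sketch of both layers in Python: drop_duplicates() for exact duplicates, and the standard library's difflib as a stand-in for the Levenshtein or Jaro-Winkler scoring that a dedicated library (rapidfuzz, Splink, recordlinkage) would provide at scale:

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"firm": ["Acme Corp Ltd", "ACME CORPORATION LTD.",
                            "Acme Corp.", "Bharat Steels"]})

# Layer 1: exact duplicates (none here, because every spelling differs).
df = df.drop_duplicates()

# Layer 2: standardise first, then score pairwise similarity between names.
def normalise(name: str) -> str:
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

names = df["firm"].map(normalise).tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = SequenceMatcher(None, names[i], names[j]).ratio()
        if score > 0.7:  # candidates for human review, not silent deletion
            print(f"{df['firm'].iloc[i]!r} ~ {df['firm'].iloc[j]!r}: {score:.2f}")
```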

22.7 Standardisation and Inconsistency Resolution

The most common low-glamour cleaning issues:

  • Case Normalisation: Mumbai, MUMBAI, mumbai are the same city; the analyst should choose a canonical case and stick to it.

  • Whitespace Trimming: Leading, trailing, and double internal spaces in text fields. “ Acme ” is not the same as “Acme” until trimmed.

  • Encoding Resolution: Files arriving in mixed UTF-8, Latin-1, or local encodings produce mojibake (garbled characters such as â€™ where an apostrophe should be) unless decoded consistently.

  • Unit Standardisation: Converting kilograms and grams, lakhs and crores, US and metric units to a single base.

  • Currency Normalisation: Converting all monetary values to a single currency at a documented exchange rate, with the original currency preserved in a sidecar field.

  • Date and Time Normalisation: Reading mixed DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD, and timezone-bearing strings into a single ISO 8601 representation.

  • Categorical Recoding: Male, M, m, Mr in a single field reduced to a documented coded set.

  • Phone, Email, and Identifier Validation: Format checks against canonical patterns (E.164 for phones, RFC standard for emails, regex for tax identifiers).

  • Address Standardisation: Postal-code expansion, town-name normalisation, country-code resolution; specialised libraries and postal-authority services (libpostal, USPS, India Post) handle this.

  • Free-Text Cleaning: Stripping HTML tags, controlling for line breaks, normalising punctuation and Unicode forms before any text analytics.
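
A sketch of a few of these fixes in pandas; the column names, the gender mapping, and the day-first date format are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["  Mumbai ", "MUMBAI", "mumbai"],
    "gender": ["Male", "M", "f"],
    "signup": ["03/07/2023", "04/07/2023", "05/07/2023"],  # DD/MM/YYYY at source
})

# Case and whitespace: trim, collapse internal runs of spaces, pick one canonical case.
df["city"] = df["city"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()

# Categorical recoding: map every observed variant onto a documented code set.
df["gender"] = (df["gender"].str.strip().str.lower()
                .map({"male": "M", "m": "M", "female": "F", "f": "F"}))

# Dates: parse the day-first strings, then emit ISO 8601 (or keep the datetime dtype).
df["signup"] = pd.to_datetime(df["signup"], dayfirst=True).dt.strftime("%Y-%m-%d")
print(df)
```

Genuinely mixed date formats in one column need per-format parsing (or pandas 2's format="mixed"), because a blanket parse cannot tell 03/07 from 07/03.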

22.8 Data Validation and Type Conversion

After cleaning, before analysis, every analytical dataset should pass a defined set of validation checks:

  • Type Constraints: Numeric columns are numeric, dates are dates, booleans are booleans.
  • Range Constraints: Ages between 0 and 120, percentages between 0 and 100, quantities non-negative.
  • Set Membership: Categorical fields take only values from a defined list.
  • Cross-Field Constraints: End date is after start date; total equals the sum of components; ratios equal the ratio of their parts.
  • Foreign Key Integrity: Every customer ID in an order table exists in the customer master.
  • Cardinality Constraints: A primary key is unique; required fields are not missing.

Modern tooling — Great Expectations, Soda, dbt tests, pandera, validate (R) — turns these into automated checks that fail the pipeline when violated. The discipline is to write the validations first and let them gatekeep every refresh.
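
A sketch of several of these checks as plain pandas assertions on hypothetical customer and order tables; the frameworks named above codify the same checks with reporting and pipeline integration:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 29],
                          "segment": ["retail", "sme", "retail"]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 3],
                       "start": pd.to_datetime(["2024-01-05", "2024-02-01"]),
                       "end": pd.to_datetime(["2024-01-20", "2024-02-15"])})

# Type constraint: age must be numeric.
assert pd.api.types.is_numeric_dtype(customers["age"])
# Range constraint: plausible human ages.
assert customers["age"].between(0, 120).all()
# Set membership: only documented segment codes.
assert customers["segment"].isin({"retail", "sme", "corporate"}).all()
# Cross-field constraint: end date after start date.
assert (orders["end"] > orders["start"]).all()
# Foreign key integrity: every order points at a known customer.
assert orders["customer_id"].isin(customers["customer_id"]).all()
# Cardinality constraint: the primary key is unique.
assert customers["customer_id"].is_unique
print("all validation checks passed")
```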

22.9 Tools for Data Cleaning

Tip: Common Tools

| Stack | Tools | Use |
|---|---|---|
| Python | pandas, numpy, polars, fuzzywuzzy, recordlinkage, missingno, ydata-profiling | General-purpose cleaning |
| R | dplyr, tidyr, janitor, naniar, mice, validate | tidyverse-aligned cleaning workflows |
| SQL | Window functions, CASE, CTEs, dbt tests | In-warehouse cleaning |
| Specialist DQ | Talend Data Quality, Informatica DQ, Ataccama, Great Expectations, Soda | Enterprise data-quality platforms |
| Interactive | OpenRefine, Trifacta (Alteryx Designer Cloud), Power Query, Tableau Prep | Visual exploration and transformation |
| Entity Resolution | dedupe.io, Splink, Senzing, commercial MDM platforms | Fuzzy matching at scale |
| Text Cleaning | spaCy, NLTK, regex libraries, ftfy, libpostal | Text and address normalisation |

The right tool depends on the scale, the language preference of the team, and whether the cleaning needs to be interactive (exploratory) or codified (reproducible). Most mature teams use a combination — SQL or dbt for warehouse-level cleaning, Python or R for analytical work, and a DQ platform for governance and monitoring.

22.10 Automation and Reproducibility

Manual cleaning is a research debt the firm pays back forever. The disciplines that turn cleaning into a sustainable practice:

  • Codified pipelines: Cleaning steps live in version-controlled scripts or dbt models, not in interactive notebooks.
  • Idempotency: Running the cleaning twice produces the same result as running it once.
  • Tested validations: Validation checks run automatically on every refresh and fail loudly when violated.
  • Lineage and provenance: Every cleaned field traces back to its source field and the transformation that produced it.
  • Change logs: Cleaning rules evolve; the change log records what changed, when, and why.
  • Notebook-to-pipeline migration: An analyst’s exploratory notebook is the start of a cleaning rule, not its end. Promote it to a pipeline.
  • Continuous monitoring: Quality scorecards run on every refresh and trend over time, so degradation is visible.

A reproducible pipeline that any analyst can run, inspect, and modify is the operational form of a clean dataset.
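
A minimal sketch of the idempotency discipline: a cleaning function whose second application changes nothing, which a pipeline test can assert on every refresh. The function body uses transformations introduced earlier in the chapter.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Codified cleaning rules: trim, recase, and drop exact duplicates."""
    out = df.copy()
    out["city"] = out["city"].str.strip().str.title()
    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({"city": ["  mumbai", "MUMBAI ", "Delhi"]})
once = clean(raw)
twice = clean(once)

# Idempotency: clean(clean(x)) equals clean(x). If this fails, the pipeline is
# accumulating changes on every refresh instead of converging.
assert once.equals(twice)
```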

22.11 Cleaning for Visualisation

Several cleaning decisions matter especially for visualisation, because their consequences are visible in the chart itself rather than buried in a model coefficient.

  • Outlier handling shapes the axis: A single un-cleansed extreme value can collapse the rest of a chart into a thin band near the origin. For dashboards, capping or annotating outliers is often more useful than removing them — the audience needs to see that the outlier exists, not have it silently disappear.

  • Duplicate resolution shapes every count: Customer-360 dashboards in Tableau or Power BI overstate revenue, headcount, and distinct counts whenever fuzzy duplicates have not been resolved. A KPI card showing Active Customers: 1.4 million may carry an unspoken multiplier of 1.4 across the firm’s master data.

  • Categorical recoding shapes the legend: A chart that displays Male, M, m, Mr as four separate categories tells the audience the data is dirty before any analysis begins. Standardise category labels before they enter a chart’s colour palette or grouping axis.

  • Date and time normalisation shape every time series: A line chart with mixed DD/MM/YYYY and MM/DD/YYYY dates produces gaps, double-counts, and apparent seasonality that does not exist. ISO-8601 normalisation is the single most consequential cleaning step for time-series visuals.

  • Currency and unit standardisation shape every comparison: A bar chart comparing revenue across regions in unstandardised currencies or magnitudes (lakhs vs millions) will produce false rankings and misleading proportions.

  • Missingness affects what the dashboard can claim: A dashboard that silently imputes the mean for missing values misleads the audience about confidence. Where missingness is meaningful, surface it as a separate visual (a coverage indicator alongside each KPI), not as imputed silence.

  • Tooling choice matters: Tableau Prep, Power Query, and dbt all expose cleaning as part of the analytical pipeline rather than as a hidden upstream process. Codifying cleaning in these tools makes the rules visible to dashboard developers and auditable by data stewards.

A general rule for visualisation: the cleaning that the audience cannot see should still be visible to the analyst. Document every transformation in the dashboard’s data dictionary, surface a data quality indicator on every dashboard refresh, and treat cleaning rules as part of the published artefact alongside the chart itself.
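
A sketch of the coverage indicator suggested above, computed in pandas on hypothetical email and phone fields and formatted for a dashboard footer:

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "phone": ["9876543210", "9123456780", None, "9988776655"],
})

# Share of non-missing values per field, as a percentage for the footer or a quality card.
coverage = customers.notna().mean().mul(100).round(1)
for field, pct in coverage.items():
    print(f"{field} coverage: {pct} %")
```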

22.12 Common Pitfalls

  • Cleaning Without Profiling First: Reaching for transformations before understanding the structure, distributions, and missingness of the data.

  • Silent Deletion: Removing rows and columns without recording why, leaving a “clean” dataset whose history is lost.

  • Mean Imputation by Default: Filling missing values with the mean across the board, distorting variance, correlation, and distribution.

  • Treating MNAR as MAR: Imputing data whose missingness is informative as if it were random, baking the bias into the analysis.

  • Removing All Outliers: Treating every extreme value as an error, eliminating the genuinely informative tail of the distribution.

  • Dedup Without Mapping: Removing duplicates without recording which records were merged into which master, breaking the link to source.

  • Inconsistent Standardisation: Standardising on Monday and undoing it on Friday because the upstream pipeline overwrites the changes.

  • Cleaning Once, Trusting Forever: A dataset cleaned at project start that is never re-validated as new data arrives.

  • Manual Cleaning in Spreadsheets: One-off fixes typed into a spreadsheet that nobody can reproduce next month.

  • Ignoring the Mechanism: Choosing imputation methods based on convenience rather than on the missingness mechanism.

  • Validation as Afterthought: Writing validation checks only after a quality incident, when the discipline is to write them first.

  • Ignoring Privacy in Cleaning: Building cleaning pipelines that quietly retain personal data fields no longer needed for analysis.

  • Over-Cleaning: Smoothing every rough edge and erasing the variability that the analysis was supposed to study.

22.13 Illustrative Cases

The following cases illustrate cleaning decisions in practice. They describe common situations and the reasoning behind the design choices.

Customer Master in a Retail Bank

A retail bank’s customer master contains 1.4 records per real-world customer due to free-text name entry across a decade of branch operations. The team builds a deduplication pipeline using standardisation (uppercase, whitespace trim, abbreviation expansion), Jaro-Winkler string similarity on name plus exact match on national identifier, and clustering by address postcode. A retained mapping table records which legacy IDs were merged into which master record. The deduplication ratio falls from 1.4 to 1.05 within six months and downstream reporting across marketing, risk, and finance begins to agree.

Sensor Data in a Manufacturing Line

A manufacturing line’s vibration sensor produces a stream that includes occasional spikes that look like outliers. Investigation finds three causes: genuine machine events (kept), brief sensor failures (capped), and a re-instrumentation period when a new sensor was installed without adjusting the calibration constant (corrected with a documented offset, applied retroactively to the affected period). The cleaning pipeline records each treatment by source so the corrections can be audited.

Survey Data with Missing Income

A consumer-research survey has 18 per cent missing values in the income question. Profiling reveals that older and higher-income respondents are more likely to skip the question (MAR with respect to age and education, possibly MNAR with respect to income itself). The team uses multiple imputation (MICE) on the MAR-justified portion and adds an indicator variable for missingness so the analysis can detect and report whether the missingness is itself informative.

Financial Statements in Different Reporting Standards

An analyst aggregates financial statements from twenty firms reporting under different standards (Indian Accounting Standards, IFRS, US GAAP) and different fiscal-year ends. The cleaning pipeline standardises currency to USD using documented exchange rates, fiscal years to calendar years using documented mappings, accounting categories to a single chart of accounts, and reporting period to monthly. The original fields are preserved; the standardised fields are added alongside.

A Tidy-Data Refactor

A spreadsheet of student grades has columns named Test 1, Test 2, Test 3, with grades in the cells. The analyst’s regression model expects one row per student-test observation. The team applies a tidy-data pivot to reshape the spreadsheet from wide to long, producing columns student, test, grade. The model now runs without further wrangling and the same dataset can drive any number of subsequent analyses.


22.14 Hands-On Exercise: Cleaning Workflow in Power Query

Aim: Apply the eight-step cleaning workflow from this chapter end-to-end, entirely inside Power Query, in both Power BI and Excel. Where Chapter 6’s hands-on focused on assessment (Column Profile, Quality, Distribution), this hands-on focuses on remediation — actually fixing the issues.

Scenario: The same 200-row customer master from Chapter 6 — for Yuvijen Stores Pvt Ltd — but this time the goal is to produce a clean, governed, refresh-safe table that the reporting layer can rely on.

Deliverable: A Power BI file (and equivalent Excel workbook) with two queries — customers_raw and customers_clean — and a documented Applied Steps pipeline that turns one into the other.

22.14.1 Step 1 — Profile First (Recap)

Before any cleaning, the analyst confirms the issues using the diagnostic features already covered in Chapter 6:

  1. Load customers.csv via Home → Get Data → Text/CSV.
  2. Click Transform Data to enter Power Query.
  3. Enable View → Column quality, Column distribution, Column profile and switch to Profiling based on entire dataset.

Issues to remediate, recap from Chapter 6: missing emails, fuzzy duplicates, mixed-case city values, invalid dates, inconsistent gender codes, implausible ages, and a small number of missing signup dates.

22.14.2 Step 2 — Standardise Structure

Most well-formed CSVs are already in tidy shape — one row per observation, one variable per column. If not, Power Query offers Transform → Unpivot Columns and Use First Row as Headers to reshape the data. For this dataset, no reshaping is needed; mark the step as No-op and proceed.

22.14.3 Step 3 — Resolve Inconsistencies

In Power Query:

  • Trim and Clean text columns: Right-click each text column → Transform → Trim (removes leading/trailing whitespace) and Clean (removes non-printing characters).
  • Capitalise city names: Right-click city → Transform → Capitalize Each Word. The city values Mumbai, MUMBAI, mumbai now collapse to a single Mumbai.
  • Replace inconsistent gender codes: Right-click gender → Replace Values. Map M and Male to M; F, f, and Female to F. Use successive Replace Values steps for each substitution; each one becomes a recorded step in the Applied Steps pane.
  • Standardise phone format: Use Add Column → From Text → Extract → Length to flag values of the wrong length, then text transformations (Power Query has no native regex) to normalise to a 10-digit pattern. Power Query’s Replace Values with the Match entire cell contents option is sufficient for most cases.

22.14.4 Step 4 — Handle Missing Values

Power Query treats null and empty string differently. Reconcile both:

  • Replace empty strings with null: Right-click each text column → Replace Values → Find: (leave empty), Replace with: null (toggle the Match entire cell contents option). This makes downstream missing-value treatment consistent.
  • Add a flag column: Add Column → Conditional Column → if [email] is null then “Missing Email” else “Has Email” (and similarly for phone, signup_date). The flag preserves missingness information after any subsequent imputation.
  • Imputation choices: For analytical reporting, leave nulls as null and let the dashboard surface them as a Coverage indicator. For the rare case where imputation is appropriate, Transform → Replace Values → Replace null with [value] records the substitution as a versioned step.

The cleaning pipeline is now explicit about which fields had what missingness, rather than silently substituting.

22.14.5 Step 5 — Detect and Treat Outliers

For numeric and date fields:

  • Detect: Use the Column Profile to find min, max, and value-frequency histograms. The dob column shows a 2010 birth date, implying an under-age customer, which violates the firm’s adult-only policy.
  • Filter rows: Click the filter icon on dob → Date Filters → Before… and set the threshold (e.g. 2008-01-01).
  • Cap (winsorise) as an alternative: Add Column → Conditional Column to cap any total-spend value above the 99th percentile to the 99th-percentile value, preserving the row but limiting its influence.
  • Document: Each filter or cap is recorded as an Applied Step. A short note in the step name (#"Removed minor customers") documents the rationale for future reviewers.

22.14.6 Step 6 — Remove Duplicates

Two layers of deduplication:

  • Exact duplicates: Home → Remove Rows → Remove Duplicates with all columns selected. Rows that are byte-for-byte identical are eliminated.
  • Composite-key duplicates: To find fuzzy duplicates that share (name, dob, phone) but differ on case or whitespace, Add Column → Index Column to keep a record of original ordering, then group by the composite key (Transform → Group By) and either keep the most complete row using Max aggregations across the other columns or pass each group through a First Non-Null aggregation.
  • Fuzzy matching: For free-text name fields, Power Query offers a Use fuzzy matching option in Merge Queries (Power BI 2020+). Match the table to itself with a similarity threshold of 0.85 to surface candidate fuzzy duplicates without immediately deleting them.

22.14.7 Step 7 — Validate and Type

The final structural pass:

  • Change Type with Locale: Right-click each column → Change Type → Using Locale. Choose Date with locale English (India) for date columns to handle the local format. Choose Whole Number or Decimal Number for numeric columns. Power Query records the type change as a step.
  • Add validation flags: Add Column → Conditional Column to flag rows that violate business rules (e.g. if [TotalSpend] < 0 then "Negative Spend" else "OK"). Filter to OK before loading, or keep the flag and surface invalid rows to a separate quality-review query.
  • Schema validation: Before clicking Close & Apply, scan every column header in the data preview to confirm each column shows the expected type indicator (123 for whole numbers, ABC for text, a calendar icon for dates).

22.14.8 Step 8 — Document and Version

The Applied Steps pane in Power Query is the documentation of the cleaning pipeline. Two practices turn it into a maintained artefact:

  • Rename steps to describe their purpose: instead of #"Replaced Value", rename to #"Standardised Gender to M/F". Right-click any step → Rename.
  • Annotate non-obvious steps with comments via Properties (right-click the step → Properties → enter a description). The descriptions appear as tooltips when other team members open the file later.

Save the Power BI file to a version-controlled location (SharePoint, OneDrive with version history, or Git via the Power BI Project (.pbip) save format).

22.14.9 Step 9 — The Excel Equivalent

Excel ships with the same Power Query engine. The workflow is virtually identical:

  1. In Excel, Data → Get Data → From Text/CSV (or any other source).
  2. The Power Query editor opens. Every step described above (Trim, Clean, Replace Values, Group By, Remove Duplicates, Change Type with Locale, Add Conditional Column) is available in the same ribbon.
  3. Close & Load writes the cleaned table to a worksheet, refreshable on demand.

The same Applied Steps pipeline travels between Excel and Power BI through copy-and-paste of the underlying M code (visible via View → Advanced Editor). The cleaning logic is portable.

22.14.10 Step 10 — Connect to the Visualisation Layer

The cleaned table is the input to every dashboard the firm builds on this customer master. A few specific visualisation consequences:

  • A KPI tile of Active Customers now shows a defensible number rather than a duplicated count.
  • A choropleth of customer density by city now uses standardised city names rather than three spelling variants.
  • A trend chart of signups over time now reads cleanly because invalid dates have been filtered.
  • A coverage indicator in the dashboard footer (e.g. Email coverage: 94 %, Phone coverage: 96 %) surfaces the residual missingness honestly, without imputing it away.

The hands-on closes with the discipline introduced in Chapter 6: the cleaning the audience cannot see should still be visible to the analyst. Every Applied Step is part of the published artefact.

Tip: Files and Screen Recordings

Power BI file (yuvijen-customer-clean.pbix), the equivalent Excel workbook (yuvijen-customer-clean.xlsx), the original customers.csv, and screen recordings of the full ten-step Power Query workflow in both tools will be embedded here.


Summary

| Concept | Description |
|---|---|
| **Foundations** | |
| Why Data Cleaning Matters | Real-world data arrives messy; cleaning is the unglamorous foundation under every model and dashboard |
| **Eight-Step Cleaning Process** | |
| Profile and Assess | Understand structure, distributions, missingness, and oddities before changing anything |
| Standardise Structure | Reshape into a tidy form: one observation per row, one variable per column |
| Resolve Inconsistencies | Resolve case, whitespace, encoding, units, and category labels |
| Handle Missing Values | Choose deletion or imputation based on the missingness mechanism |
| Detect and Treat Outliers | Distinguish genuine extreme values from data errors before treating |
| Remove Duplicates | Identify exact and fuzzy duplicates and resolve to a master record |
| Validate and Type | Enforce constraints, ranges, formats, and correct data types |
| Document and Version | Record what was changed, why, and produce reproducible cleaning code |
| **Tidy Data Principles** | |
| Tidy Data Principles | Wickham's three structural principles for analytical-ready datasets |
| Each Variable a Column | Each variable forms a column |
| Each Observation a Row | Each observation forms a row |
| Each Unit a Table | Each type of observational unit forms a table |
| Pivot Operations | pivot_longer, pivot_wider, melt, pivot_table to reshape between wide and long |
| **Mechanisms of Missingness** | |
| MCAR | Missing Completely at Random; missingness has no relation to any value |
| MAR | Missing at Random; missingness depends on observed variables |
| MNAR | Missing Not at Random; missingness depends on the missing value itself |
| **Missing-Value Treatments** | |
| Listwise Deletion | Drop any row with a missing value; unbiased under MCAR but loses sample |
| Pairwise Deletion | Use complete cases per pair of variables; statistics computed on different samples |
| Mean and Median Imputation | Quick exploratory fix; distorts variance and correlations |
| Mode Imputation | Fill categorical missing with the most frequent category; distorts proportions |
| Forward and Backward Fill | Carry the previous or next value forward in time-series data |
| Regression Imputation | Predict the missing variable from a related observed variable |
| K-Nearest-Neighbour Imputation | Use similar records to fill the missing value; sensitive to distance metric |
| Multiple Imputation (MICE) | Modern multivariate imputation that produces several imputed datasets and combines results |
| Indicator Variable | Add a flag column when missingness itself may be informative |
| **Outlier Detection Methods** | |
| IQR Rule | Values below Q1 minus 1.5 IQR or above Q3 plus 1.5 IQR; robust and distribution-free |
| Z-Score | Values more than three standard deviations from the mean; assumes normality |
| Modified Z-Score | Robust alternative based on the median absolute deviation |
| Mahalanobis Distance | Multivariate detection of points far from the multivariate centre |
| Isolation Forest | Machine-learning method for high-dimensional outlier detection |
| Local Outlier Factor | Density-based outlier detection in high-dimensional data |
| **Outlier Treatments** | |
| Investigate Outliers First | A genuine outlier is information; an erroneous one is noise; always look first |
| Correct if Erroneous | Fix typos, unit errors, or pipeline bugs upstream where possible |
| Cap or Winsorise | Replace values beyond a threshold with the threshold to bound influence |
| Transform Variables | Log, square root, or Box-Cox to reduce the influence of long-tailed extremes |
| Separate Outlier Analysis | Analyse the outliers as their own segment rather than discarding |
| Remove Outliers as Last Resort | Deletion is the most aggressive treatment; require a documented reason |
| **Duplicate Resolution** | |
| Exact Duplicates | Identical rows produced by data-entry repetition or pipeline mistakes |
| Fuzzy Duplicates | Same real-world entity represented by records that differ in spelling or formatting |
| String Similarity | Levenshtein, Jaro-Winkler, n-gram, or token-set ratios for fuzzy matching |
| Phonetic Algorithms | Soundex, Metaphone, Double Metaphone for names that sound similar |
| Standardisation First | Normalise case, trim whitespace, expand abbreviations before comparing |
| Blocking and Clustering | For very large datasets, compare only records sharing a key prefix or block |
| Probabilistic Record Linkage | Fellegi-Sunter and Bayesian variants quantify the probability that records are the same |
| **Standardisation and Inconsistency** | |
| Case Normalisation | Choose a canonical case and stick to it: Mumbai, MUMBAI, mumbai resolve to one form |
| Whitespace Trimming | Strip leading, trailing, and double internal spaces from text fields |
| Encoding Resolution | Decode files arriving in mixed UTF-8, Latin-1, or local encodings consistently |
| Unit Standardisation | Convert kilograms and grams, lakhs and crores, US and metric to a single base |
| Currency Normalisation | Convert all monetary values to a single currency at a documented exchange rate |
| Date and Time Normalisation | Read mixed date formats into a single ISO 8601 representation |
| Categorical Recoding | Reduce inconsistent labels like Male, M, m, Mr to a documented coded set |
| Phone, Email, and Identifier Validation | Format checks against canonical patterns: E.164, RFC email, regex for tax IDs |
| Address Standardisation | Postal-code expansion, town-name normalisation, country-code resolution with libpostal-class libraries |
| Free-Text Cleaning | Strip HTML, control line breaks, normalise punctuation and Unicode before text analytics |
| **Validation Checks** | |
| Type Constraints | Numeric columns are numeric, dates are dates, booleans are booleans |
| Range Constraints | Ages 0 to 120, percentages 0 to 100, quantities non-negative |
| Set Membership | Categorical fields take only values from a defined list |
| Cross-Field Constraints | End date after start date, total equals sum of parts, ratios equal ratios |
| Foreign Key Integrity | Every foreign key value in a child table exists in the parent |
| Cardinality Constraints | Primary keys unique; required fields not missing |
| **Automation and Reproducibility** | |
| Codified Pipelines | Cleaning steps live in version-controlled scripts or dbt models, not in notebooks |
| Idempotency | Running the cleaning twice produces the same result as running it once |
| Tested Validations | Validation checks run automatically on every refresh and fail loudly when violated |
| Lineage and Provenance | Every cleaned field traces back to its source field and the transformation |
| Change Logs | Cleaning rules evolve; the change log records what changed, when, and why |
| Notebook-to-Pipeline Migration | Promote exploratory notebook fixes to versioned reproducible pipelines |
| Continuous Monitoring | Quality scorecards run on every refresh and trend over time so degradation is visible |
| **Common Pitfalls** | |
| Cleaning Without Profiling | Reaching for transformations before understanding structure and missingness |
| Silent Deletion | Removing rows and columns without recording why |
| Mean Imputation by Default | Filling missing values with the mean across the board, distorting structure |
| Treating MNAR as MAR | Imputing informative missingness as if it were random |
| Removing All Outliers | Treating every extreme value as error and eliminating informative tails |
| Dedup Without Mapping | Removing duplicates without recording which legacy IDs map to which master |
| Inconsistent Standardisation | Standardising on Monday and undoing it on Friday because upstream overwrites |
| Cleaning Once, Trusting Forever | Cleaning once at project start and never re-validating as new data arrives |
| Manual Cleaning in Spreadsheets | One-off fixes typed into a spreadsheet that nobody can reproduce next month |
| Ignoring the Mechanism | Choosing imputation methods based on convenience rather than the mechanism |
| Validation as Afterthought | Writing validation checks only after a quality incident |
| Ignoring Privacy in Cleaning | Cleaning pipelines that quietly retain personal data no longer needed |
| Over-Cleaning | Smoothing every rough edge and erasing the variability the analysis was meant to study |