6  Data Quality Assessment

6.1 Why Data Quality Matters

“Garbage in, garbage out” is one of the oldest aphorisms in computing, and the one most often relearned by analytics teams.

Every analytical result, every dashboard, every machine-learning model rests on the data underneath it. When that data is wrong, late, incomplete, or inconsistent, the analysis is wrong in ways that are often invisible until the decision has been made. Poor data quality is the largest single reason analytics projects fail to deliver business value.

The cost of poor data quality is rarely visible on a single line of the profit and loss statement, but it shows up everywhere — in misdirected marketing spend, in regulatory fines, in operational rework, in management decisions made on the wrong information. A disciplined approach to data quality assessment is therefore not a back-office concern; it is a precondition for every form of analytics covered in this book.

6.2 Defining Data Quality

Data Quality is the degree to which a set of data is fit for the purpose for which it is being used. The seminal work of Richard Y. Wang & Diane M. Strong (1996) in the Journal of Management Information Systems established the modern, multi-dimensional view of data quality, arguing that quality cannot be reduced to accuracy alone. It depends on the consumer of the data, the task, and the context.

Data quality is therefore relative. A customer dataset that is perfectly adequate for a monthly newsletter mail-out may be entirely inadequate for a regulatory return.

6.3 The Dimensions of Data Quality

flowchart TD
    DQ["Data<br>Quality"] --> A["Accuracy"]
    DQ --> B["Completeness"]
    DQ --> C["Consistency"]
    DQ --> D["Timeliness"]
    DQ --> E["Validity"]
    DQ --> F["Uniqueness"]
    DQ --> G["Integrity"]
    DQ --> H["Relevance"]
    style DQ fill:#e3f2fd,stroke:#1976D2
    style A fill:#e8f5e9,stroke:#388E3C
    style B fill:#e8f5e9,stroke:#388E3C
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#fff8e1,stroke:#F9A825
    style E fill:#fff3e0,stroke:#EF6C00
    style F fill:#fff3e0,stroke:#EF6C00
    style G fill:#fce4ec,stroke:#AD1457
    style H fill:#ede7f6,stroke:#4527A0

Tip: The Eight Core Dimensions of Data Quality

  • Accuracy: Does the data correctly describe the real-world entity it represents? Typical failure: a customer’s address records a city in which they no longer live.
  • Completeness: Are all the required values present? Typical failure: twenty per cent of customer records have no email address.
  • Consistency: Does the data agree with itself across systems? Typical failure: customer status is “active” in CRM but “closed” in billing.
  • Timeliness: Is the data current enough for the decision it supports? Typical failure: yesterday’s stock position is used to plan tomorrow’s promotion.
  • Validity: Does each value conform to the defined format and rules? Typical failure: date of birth recorded as 31 February or as a free-text string.
  • Uniqueness: Is each real-world entity represented exactly once? Typical failure: the same customer appears as three records under three name spellings.
  • Integrity: Are relationships across tables and systems preserved? Typical failure: an order references a customer ID that does not exist.
  • Relevance: Is the data appropriate to the question being asked? Typical failure: a national dataset is used to support a city-level decision.

The eight dimensions above are the practical core. The full Wang and Strong framework groups dimensions into four broader categories — intrinsic, contextual, representational, and accessibility — which is useful for academic and governance work but more than is needed for day-to-day assessment.

6.3.1 Quantifying Each Dimension

Leo L. Pipino et al. (2002) in Communications of the ACM proposed a small set of standard formulas for turning each dimension into a measurable score, typically a number between zero and one. The most widely used are:

  • Completeness ratio = number of non-missing values / total expected values.
  • Uniqueness ratio = number of distinct entities / total records.
  • Validity ratio = number of records passing format and rule checks / total records.
  • Consistency ratio = number of records that agree across the systems compared / total records.
  • Timeliness ratio = a function of the gap between the data’s reference date and the decision’s reference date, scaled by the maximum acceptable lag.

Once each dimension carries a score, the organisation can set thresholds — for example, no dataset is permitted to enter the production warehouse with a completeness ratio below 0.95 — and treat data quality as something that is monitored, not just complained about.
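These ratios are one-liners in any analysis environment. A minimal pandas sketch, using a small hypothetical frame; the column names and the email pattern are illustrative, not the firm's actual rules:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-2", "C-3"],        # one duplicate id
    "email": ["a@example.com", None, "b@example", "c@example.com"],
})

# Completeness ratio: non-missing values / total expected values
completeness_email = df["email"].notna().sum() / len(df)

# Uniqueness ratio: distinct entities / total records
uniqueness_id = df["customer_id"].nunique() / len(df)

# Validity ratio: records passing a format check / total records
valid = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
validity_email = valid.sum() / len(df)

print(completeness_email, uniqueness_id, validity_email)  # 0.75 0.75 0.5
```

Each score lands between zero and one, so the same thresholding logic applies regardless of which dimension is being measured.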

6.4 The Data Quality Assessment Process

flowchart LR
    A["1. Define quality<br>requirements"] --> B["2. Profile<br>the data"]
    B --> C["3. Measure<br>each dimension"]
    C --> D["4. Diagnose<br>root causes"]
    D --> E["5. Remediate<br>and improve"]
    E --> F["6. Monitor<br>continuously"]
    F -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C

6.4.1 Step 1 — Define Quality Requirements

Quality is fitness for purpose, and purpose differs by use case. The first step is therefore to specify, with the business owner of the dataset, what good enough means for the decisions the data is supporting. This produces a per-dimension expectation — for example, 95 per cent completeness on customer email, 99 per cent validity on tax identifier, daily refresh latency of no more than 24 hours.

6.4.2 Step 2 — Profile the Data

Data profiling is the diagnostic examination of a dataset to discover its actual structure, content, and quality. A typical profiling exercise asks:

  • What is the distribution of each field? Are there unexpected values?
  • What proportion of values is missing in each field?
  • How many distinct values does each field carry? Is the cardinality plausible?
  • Are there duplicates by candidate keys, by name, by address, by hash?
  • Do dates and numerics fall in plausible ranges?
  • Do referenced foreign keys exist in the parent tables?

Modern data platforms ship with profiling tools — Power BI’s column profiling, Tableau Prep’s profile pane, Python’s ydata-profiling, R’s skimr and DataExplorer, and dedicated platforms such as Informatica DQ and Talend DQ.
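The profiling questions above also translate directly into code. A minimal pandas sketch on a hypothetical three-row extract (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1001", "C-1002", "C-1002"],      # duplicate key
    "city": ["Mumbai", "mumbai", "Delhi"],              # case variants
    "dob": pd.to_datetime(["1985-04-12", "1985-04-12", "2010-02-14"]),
})

# What proportion of values is missing in each field?
print(df.isna().mean())

# How many distinct values does each field carry? Is the cardinality plausible?
print(df.nunique())

# Are there duplicates by candidate key?
print(df["customer_id"].duplicated().sum())

# Do dates fall in plausible ranges?
print(df["dob"].min(), df["dob"].max())
```

The dedicated tools automate exactly these checks at scale; the underlying questions are the same.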

6.4.3 Step 3 — Measure Each Dimension

Apply the formulas from the previous section to compute a numerical score for each of the eight dimensions on the dataset under review. Compare the scores to the requirements set in Step 1 and identify the gaps.

A data quality scorecard is a useful artefact: a one-page summary that lists each critical dataset, the score against each dimension, and the trend over time.
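A scorecard of this kind can be assembled directly from the dimension scores. A minimal pandas sketch, with hypothetical datasets, scores, and thresholds:

```python
import pandas as pd

# One row per critical dataset, one column per dimension score (values hypothetical).
scorecard = pd.DataFrame(
    [
        {"dataset": "customer_master", "completeness": 0.94, "validity": 0.97, "uniqueness": 0.975},
        {"dataset": "transactions",    "completeness": 0.99, "validity": 0.99, "uniqueness": 1.00},
    ]
).set_index("dataset")

# Compare each score to the Step 1 thresholds and flag the gaps.
thresholds = pd.Series({"completeness": 0.95, "validity": 0.99, "uniqueness": 0.99})
gaps = scorecard.lt(thresholds)   # True where a score falls below its threshold
print(gaps)
```

Refreshing this table on a schedule and keeping the history gives the trend-over-time view the scorecard calls for.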

6.4.4 Step 4 — Diagnose Root Causes

Once a gap is identified, the goal is to find the cause of the gap, not just to clean its symptoms. Common causes include:

  • Missing or inadequate validation at the point of data entry.
  • Manual processes that allow free-text fields where coded values were intended.
  • Integration mismatches between source systems.
  • Definitional drift, where the same field name carries different meanings in different systems.
  • Inadequate or missing master data — customer, product, employee, location reference data.
  • Process changes that were not propagated to data consumers.

A clean dataset that re-fills with bad data each week has not been remediated; it has been swept.

6.4.5 Step 5 — Remediate and Improve

Remediation has two layers: cleaning the existing data, and fixing the upstream cause so the problem does not recur.

  • Cleansing corrects, completes, deduplicates, and standardises the existing data.
  • Process redesign changes the upstream process so the cleansed state becomes the default. Examples include adding validation to data-entry forms, introducing reference-data lookups, mandating coded fields, and reconciling cross-system definitions.

A common heuristic is the one-ten-hundred rule: it costs roughly one unit to prevent a data-quality defect at the source, ten units to correct it after capture, and one hundred units to deal with the consequences after a decision has been taken on the bad data.

6.4.6 Step 6 — Monitor Continuously

Data quality is not a project; it is a continuous concern. Mature organisations operate continuous data-quality monitoring, in which the eight dimensions are scored automatically on every refresh, alerts fire when scores fall below thresholds, and trends are reviewed by the data-governance committee on a regular cadence.

Monitoring closes the loop, so that the assessment process feeds back into the quality requirements as the business evolves.
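At its core, the monitoring loop is a threshold check on every refresh. A minimal sketch, with hypothetical dimension names and thresholds:

```python
# Per-dimension thresholds agreed in Step 1 (values hypothetical).
THRESHOLDS = {"completeness": 0.95, "validity": 0.99, "uniqueness": 0.99}

def dq_breaches(scores: dict) -> dict:
    """Return the dimensions whose latest score falls below its threshold."""
    return {dim: s for dim, s in scores.items()
            if dim in THRESHOLDS and s < THRESHOLDS[dim]}

# On each refresh the pipeline scores the dataset and alerts on any breach.
latest = {"completeness": 0.93, "validity": 0.995, "uniqueness": 1.0}
print(dq_breaches(latest))   # only completeness is below its threshold
```

In production the print would be an alert to the data steward and a failed pipeline gate; the logic is the same.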

6.5 Common Causes of Poor Data Quality

  • Manual data entry without validation: Free-text fields invite spelling variants, abbreviations, and typos. The same customer becomes three records.

  • System-to-system integration without reconciliation: When two systems were never designed to talk to each other, their identifiers, codes, and definitions diverge.

  • Definitional disagreement: Marketing’s active customer and Finance’s active customer are not the same customer. Reports built on either definition disagree without anyone being technically wrong.

  • Unmaintained reference data: Product, location, and organisational hierarchies that are not actively curated drift out of date as the business changes.

  • Migration scars: Each system migration leaves behind orphan records, fields used for purposes their original designers did not anticipate, and codes whose meaning is lost.

  • Process changes not reflected in data: A change to the operational process that is not communicated to data consumers produces invisible breaks in time series.

  • Sensor and instrumentation drift: Industrial and IoT data degrades silently as sensors age, are replaced, or are recalibrated.

  • Cultural neglect: When data quality is no one’s job, it is no one’s success. Without ownership, it decays.

6.6 Data Quality Governance

Sustained data quality requires more than tools. It requires a governance structure that assigns clear responsibility:

  • Data owner: A senior business leader accountable for the quality of a domain of data — customer, product, finance, employee.
  • Data steward: A subject-matter expert responsible for definitions, rules, and quality monitoring within a domain.
  • Data custodian: The technical role that operates the platforms and pipelines on which the data lives.
  • Data quality forum or council: A cross-functional body that resolves definitional disputes, prioritises remediation, and reports quality trends to leadership.

The DAMA Data Management Body of Knowledge (DAMA-DMBOK) is the most widely adopted industry framework for this governance structure and is often used as a reference by Indian and global organisations setting up enterprise data programmes.

6.7 Common Pitfalls

  • Equating data quality with accuracy alone: Accuracy is one dimension of eight. A perfectly accurate but stale dataset is still poor quality.

  • Cleaning symptoms without fixing causes: Recurring data-quality defects indicate an upstream problem that no amount of downstream cleansing will resolve.

  • Boiling the ocean: Trying to assess every field of every dataset at once. Start with the data that supports the decisions that matter most.

  • Tooling without ownership: Buying a data-quality platform without naming a data owner produces an unloved console of red lights.

  • One-off assessments: Conducting a quality assessment, fixing the issues, and never measuring again. Quality decays without continuous monitoring.

  • Measurement without thresholds: Reporting completeness ratios without saying which datasets must clear which thresholds turns the scorecard into wallpaper.

  • Confusing volume with quality: A larger dataset is not a better dataset. More records of poor quality compound the problem.

  • Ignoring metadata: Lineage, definitions, and provenance are part of data quality. A field whose meaning no one can confirm is, by definition, low-quality.

6.8 Illustrative Cases

The following cases illustrate how the eight dimensions and the six-step process play out in practice. They are based on the kinds of work commonly seen in industry; the framing is the author’s.

Customer Master in a Retail Bank

A retail bank discovers that a single customer is, on average, represented by 1.4 records in its customer master. The cause is years of free-text name entry, branch-level merger of legacy systems, and inconsistent use of the national identifier as a key. Uniqueness is the failed dimension. The bank deploys a deterministic-plus-fuzzy matching pipeline to identify duplicates, introduces strict validation on the national identifier at all entry points, and creates a steward for the customer master. Within six months, the duplicate ratio falls from 0.4 to 0.05, and downstream reports across marketing, risk, and finance begin to agree.

Product Master in a Manufacturing Firm

A manufacturer cannot reconcile its monthly sales report with its monthly production report. Investigation reveals that the same physical SKU carries two different codes in two systems, a legacy of a partial migration. Consistency and integrity are the failed dimensions. Remediation is a one-time mapping followed by retirement of the older code, plus a cross-system reference-data process that prevents the issue recurring.

Sensor Data in a Power Plant

A power plant’s predictive-maintenance model begins issuing implausible alerts. Investigation finds that two vibration sensors were replaced during a routine overhaul without the change being recorded; their calibrated baselines no longer match the model’s training data. Accuracy, timeliness, and integrity are simultaneously failed. Remediation includes recalibration, retroactive correction of the affected period, and a sensor-change-control procedure tied directly to the data pipeline.

Loan-Application Data in a Digital Lender

A digital lender’s credit model degrades over a quarter without any change to the model itself. Investigation finds that one upstream third-party data provider has changed the format of a key field from numeric to a comma-formatted string, and the parser has been silently inserting nulls. Validity and completeness are the failed dimensions. The lender introduces schema-validation contracts with all third-party providers and adds automated tests that fail the pipeline when validity drops below threshold.


6.9 Hands-On Exercise: Data Quality Assessment in Power BI

Aim: Use Power BI’s built-in data-quality features — Column Quality, Column Distribution, and Column Profile — to assess a customer master, identify issues against the eight data-quality dimensions, and build a one-page DQ scorecard.

Scenario: A customer master file at Yuvijen Stores Pvt Ltd — 200 customer records — has accumulated quality issues over three years of branch operations. Marketing has commissioned a one-week DQ assessment before the next campaign launch.

Deliverable: A two-page Power BI report — Page 1 with the diagnostic profile, Page 2 with the DQ scorecard.

6.9.1 Step 1 — The Sample Customer Master

Tip: customers.csv (extract)

customer_id,name,email,phone,city,dob,signup_date,gender,total_spend
C-1001,Anita Sharma,anita.sharma@example.com,9876543210,Mumbai,1985-04-12,2023-03-14,F,12450
C-1002,ANITA SHARMA,,9876543210,mumbai,1985-04-12,2024-09-02,F,0
C-1003,Rajesh Kumar,rajesh.example.com,98765-43211,Delhi,1979-13-05,2025-01-10,M,32100
C-1004,Priya Iyer,priya.iyer@example.com,,Chennai,1992-07-22,2024-06-30,f,8900
C-1005,Vinod Patel,vinod.patel@example.com,+919876543212,Ahmedabad,1988-11-03,2023-12-01,Male,0
C-1006,Vinod Patel,vinod.patel@example.com,+919876543212,Ahmedabad,1988-11-03,2023-12-01,Male,0
C-1007,Mohammed Khan,m.khan@example.com,9123456780,Hyderabad,2010-02-14,2025-04-01,M,1200
C-1008,Sunita Devi,sunita@example.com,9000000000,Patna,1965-09-18,,F,5400

The extract above is representative of the 200-row file. It contains intentional issues: an exact duplicate (C-1005 and C-1006), a fuzzy duplicate (C-1001 and C-1002), invalid emails, an invalid date (1979-13-05), missing values, mixed-case city values, inconsistent gender codes (M, Male, f, F), an implausible date of birth (a customer born in 2010, under-aged under the firm’s adult-only policy), and missing signup dates.

6.9.2 Step 2 — Load and Enable the Data-Quality Views

In Power BI Desktop:

  1. Home → Get Data → Text/CSV and select customers.csv.
  2. Click Transform Data to open the Power Query editor.
  3. On the View ribbon, enable all three data-quality views:
    • Column quality — adds a small bar at the top of each column showing the share of Valid, Error, and Empty values.
    • Column distribution — adds a histogram showing the distinct and unique counts.
    • Column profile — adds a detailed bottom panel showing minimum, maximum, average, distinct count, unique count, missing count, and a value-frequency chart for the selected column.
  4. By default, the profiler runs on the first 1,000 rows. For full coverage, click the status-bar message Column profiling based on top 1000 rows and switch to Column profiling based on entire data set.

The three views together are Power BI’s built-in answer to the data-profiling step in the eight-step cleaning process from Chapter 22.

6.9.3 Step 3 — Diagnose the Eight Data-Quality Dimensions

Tip: Mapping the Eight Dimensions to Power BI Features

  • Completeness: a high Empty percentage in the Column Quality bar. Feature: Column Quality.
  • Validity: a high Error percentage; values that fail the expected format. Feature: Column Quality plus Column Profile.
  • Uniqueness: a distinct count near the total count for fields that should be unique. Feature: Column Distribution.
  • Consistency: a value-frequency chart showing case or spelling variants of the same value. Feature: Column Profile (value-frequency panel).
  • Accuracy: out-of-range numeric values; implausible dates. Feature: Column Profile (min, max, average).
  • Timeliness: stale signup dates or missing as-of timestamps. Feature: Column Profile on date fields.
  • Integrity: foreign-key references that do not exist in the parent table. Feature: a manual cross-table check via Merge Queries.
  • Relevance: fields the marketing team will not actually use. Feature: review against the campaign brief.

For each dimension, the diagnostic step is:

  • Click the column of interest in the Power Query editor.
  • Inspect the three quality views (top bar, distribution, bottom profile).
  • Note the issue with a one-line comment in the Power Query Applied Steps pane.

6.9.4 Step 4 — Specific Findings for the Sample Dataset

Running the three views on the sample customer master surfaces the following:

  • Completeness: email is empty for ~6 % of rows; phone for ~4 %; signup_date for ~3 %. total_spend has zero values that may be valid (new signups) or invalid (data-entry errors).
  • Validity: email Column Quality shows ~5 % errors after applying a regex check; dob shows several errors due to invalid month values such as 1979-13-05; phone has multiple formats (10-digit, hyphenated, with country code).
  • Uniqueness: customer_id shows 200 distinct of 200 — clean. But the combination (name, dob, phone) shows distinct count of 195 against 200 records — five duplicates exist when matched on a composite key.
  • Consistency: city Column Profile shows Mumbai, mumbai, Mumbai (with trailing space) as separate values; gender shows M, Male, m, F, f, Female as six values for what should be two or three categories.
  • Accuracy: the dob maximum is 2010-02-14, implying a customer in their mid-teens, which violates the firm’s adult-only policy. The date range needs review.
  • Timeliness: signup_date maximum is current; data appears reasonably fresh.
  • Integrity: store_id references in a separate transactions file would be checked here via Merge Queries against the store master.
  • Relevance: All eight columns are used by the campaign brief; nothing is irrelevant for this purpose.
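The same findings can be reproduced outside Power BI. A pandas sketch over a five-row slice of the extract; the email pattern and the composite key mirror the checks described above, not Power BI’s internals:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Anita Sharma", "ANITA SHARMA", "Rajesh Kumar", "Vinod Patel", "Vinod Patel"],
    "email": ["anita.sharma@example.com", None, "rajesh.example.com",
              "vinod.patel@example.com", "vinod.patel@example.com"],
    "dob":   ["1985-04-12", "1985-04-12", "1979-13-05", "1988-11-03", "1988-11-03"],
    "phone": ["9876543210", "9876543210", "98765-43211", "+919876543212", "+919876543212"],
})

# Validity: emails failing a format check (missing values are a completeness issue,
# so na=True leaves them out of the validity count); dates with impossible months.
bad_email = ~df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=True)
bad_dob = pd.to_datetime(df["dob"], errors="coerce").isna()

# Uniqueness: duplicates on the composite key (name, dob, phone), case-folded.
key = df["name"].str.lower() + "|" + df["dob"] + "|" + df["phone"]
dup_count = key.duplicated().sum()

print(bad_email.sum(), bad_dob.sum(), dup_count)  # 1 invalid email, 1 invalid date, 2 duplicates
```

Note that the case-folded composite key catches both the exact duplicate and the fuzzy one, which a raw key comparison would miss.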

6.9.5 Step 5 — Apply Quick Cleaning Steps in Power Query

The same Power Query editor that diagnoses the issues can also remediate many of them, in a recorded sequence the next refresh will repeat:

  • Trim and Clean text columns: Right-click each text column → Transform → Trim and Clean (removes whitespace and non-printing characters).
  • Standardise case: Right-click city → Transform → Capitalize Each Word.
  • Replace inconsistent gender codes: Right-click gender → Replace Values to map M and Male to M, and F, f, and Female to F.
  • Filter invalid rows: On dob, click the filter icon and remove rows with errors; do the same for email after a regex validation step.
  • Add a duplicate flag: Add Column → Index Column, then group by composite key and merge back to flag duplicates without immediately deleting them.

Each step is recorded in the Applied Steps pane and is part of the reproducible pipeline.
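For readers working outside Power Query, a pandas equivalent of the same cleaning sequence (the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   [" mumbai", "Mumbai ", "DELHI"],
    "gender": ["Male", "f", "F"],
})

# Trim: strip leading and trailing whitespace (Power Query's Trim step)
df["city"] = df["city"].str.strip()

# Standardise case: Capitalize Each Word
df["city"] = df["city"].str.title()

# Replace inconsistent gender codes with canonical values
df["gender"] = df["gender"].replace({"Male": "M", "male": "M", "Female": "F", "f": "F"})

print(df)  # city: Mumbai, Mumbai, Delhi; gender: M, F, F
```

As in Power Query, the point is that the sequence is scripted and repeatable: the next file refresh runs the same steps without manual intervention.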

6.9.6 Step 6 — Build the DQ Scorecard Page

Switch from Power Query to the Power BI Report view and build the DQ scorecard. Suggested layout:

  • Top strip: Eight KPI cards, one per dimension, each showing the dimension’s score (a number between 0 and 1) computed as a DAX measure on the cleaned table.
  • Primary panel (upper-left): A table listing each column with its completeness, validity, and uniqueness ratios, conditionally formatted in a red-amber-green scale.
  • Supporting panels (middle): A pair of bar charts — Issues by dimension and Issues by column — pointing the team at the highest-leverage remediation.
  • Footer: As-of timestamp from MAX(Customers[signup_date]), source file name, and a small text annotation summarising the top three remediation priorities.

Sample DAX measures:

Completeness Email = DIVIDE(
  COUNTROWS(FILTER(Customers, NOT(ISBLANK(Customers[email])))),
  COUNTROWS(Customers)
)

Uniqueness Composite = DIVIDE(
  DISTINCTCOUNT(Customers[CompositeKey]),
  COUNTROWS(Customers)
)

The scorecard turns a one-week diagnostic exercise into a piece of standing infrastructure: it can be refreshed monthly and trended over time so the firm sees data quality improving (or not).

6.9.7 Step 7 — Connect to the Visualisation Layer

The DQ scorecard is itself a visualisation, but its purpose is to protect the other visualisations the firm produces.

  • Before any campaign dashboard refreshes, its underlying table should pass the DQ thresholds set in this scorecard.
  • Audience-facing dashboards should reference the DQ score for their underlying source — for example, a small Data Quality: 96 % indicator in the footer of the marketing dashboard.
  • A trend chart of DQ scores over time is itself a board-level KPI for any mature analytics function.

The same Column Quality, Distribution, and Profile features in Power BI can be combined with Tableau Prep’s Profile Pane and Excel’s Data Validation to produce a multi-tool DQ programme. The principles transfer; the buttons differ.

Tip: Files and Screen Recordings

Power BI file (yuvijen-customer-dq.pbix), the sample customers.csv, and screen recordings of the Column Quality, Column Distribution, and Column Profile walk-throughs will be embedded here.


Summary

Foundations

  • Why Data Quality Matters: Poor data quality is the largest single reason analytics projects fail to deliver value.
  • Data Quality: The degree to which data is fit for the purpose for which it is being used.

The Eight Dimensions

  • Accuracy: Does the data correctly describe the real-world entity it represents?
  • Completeness: Are all the required values present in the dataset?
  • Consistency: Does the data agree with itself across systems and reports?
  • Timeliness: Is the data current enough for the decision it supports?
  • Validity: Does each value conform to the defined format, type, and rules?
  • Uniqueness: Is each real-world entity represented exactly once?
  • Integrity: Are relationships across tables and systems preserved?
  • Relevance: Is the data appropriate to the question being asked?

Quantifying the Dimensions

  • Completeness Ratio: Non-missing values divided by total expected values.
  • Uniqueness Ratio: Distinct entities divided by total records.
  • Validity Ratio: Records passing format and rule checks divided by total records.
  • Consistency Ratio: Records that agree across compared systems divided by total records.
  • Timeliness Ratio: A function of the gap between the data's reference date and the decision's reference date.

The Six-Step Assessment Process

  • Define Quality Requirements: Specify per-dimension expectations with the business owner of the dataset.
  • Profile the Data: Diagnostically examine a dataset to discover its actual structure, content, and quality.
  • Measure Each Dimension: Compute a numerical score for each dimension and compare it to the requirements.
  • Diagnose Root Causes: Find the upstream cause of a quality gap rather than only treating its symptoms.
  • Remediate and Improve: Cleanse existing data and redesign the upstream process so the cleansed state is the default.
  • Monitor Continuously: Score dimensions automatically on every refresh, alert on threshold breaches, and review trends.

Tools and Heuristics

  • Data Profiling: Tools and techniques to surface distributions, missingness, cardinality, duplicates, and outliers.
  • One-Ten-Hundred Rule: Heuristic that prevention costs roughly one unit, correction ten units, and downstream consequences one hundred.
  • Continuous Data-Quality Monitoring: Mature practice of automated, threshold-driven quality monitoring on every dataset refresh.

Common Causes of Poor Data Quality

  • Manual Entry Without Validation: Free-text fields without validation invite spelling variants and duplicates.
  • Integration Without Reconciliation: Two systems whose identifiers, codes, and definitions diverge over time.
  • Definitional Disagreement: The same field name carrying different meanings in different functions.
  • Unmaintained Reference Data: Reference data such as products, locations, and hierarchies drifting out of date.
  • Migration Scars: Each migration leaves orphan records, repurposed fields, and codes of lost meaning.
  • Sensor and Instrument Drift: Industrial and IoT data degrades silently as sensors age, fail, or are recalibrated.
  • Cultural Neglect: When data quality is no one's responsibility, it decays continuously.

Data Quality Governance

  • Data Owner: Senior business leader accountable for the quality of a data domain.
  • Data Steward: Subject-matter expert responsible for definitions, rules, and monitoring within a domain.
  • Data Custodian: Technical role that operates the platforms and pipelines on which the data lives.
  • Data Quality Council: Cross-functional body that resolves disputes, prioritises remediation, and reports quality trends.
  • DAMA-DMBOK: Data Management Body of Knowledge; the most widely adopted industry data-management framework.

Common Pitfalls

  • Equating Quality with Accuracy: Treating accuracy as the whole of data quality and ignoring the other seven dimensions.
  • Cleaning Symptoms: Cleansing recurring defects without addressing the upstream cause.
  • Boiling the Ocean: Trying to assess every field of every dataset rather than starting with what matters.
  • Tooling Without Ownership: Buying a quality platform without naming a business owner for the data.
  • One-Off Assessments: Conducting one assessment, fixing issues, and never measuring again.
  • Measurement Without Thresholds: Reporting scores without thresholds, which turns the scorecard into wallpaper.
  • Confusing Volume with Quality: Treating bigger data as better data, when more poor records compound the problem.
  • Ignoring Metadata: Ignoring lineage, definitions, and provenance, which are part of data quality.