26 Exploratory Data Analysis Techniques and Workflows

26.1 Why Exploratory Data Analysis Matters

Before any chart, dashboard, or model is built, the analyst must look at the data.

Exploratory Data Analysis (EDA) is the disciplined initial examination of a dataset using summary statistics and graphical methods, before any formal modelling or finalised reporting. It surfaces the structure, distributions, anomalies, and relationships in the data — the patterns the analyst must understand before they can choose the right chart, the right model, or the right business question to ask of the data.

The practice was formalised by John Tukey in Exploratory Data Analysis (John W. Tukey, 1977), the book that named the discipline and introduced the boxplot, the stem-and-leaf display, and the fence-based rule for flagging outliers. Tukey’s central insight was that graphical methods are not optional decoration: they are the most efficient way for the analyst’s eye to detect what the data actually contains, and they should run before hypothesis testing, not after. John T. Behrens (1997) distilled the practice into a set of teachable principles for working analysts.

For a visualisation-focused analytics book, EDA is doubly important. The dashboards and reports the firm publishes are only as good as the analyst’s understanding of the underlying data — and that understanding comes from a structured EDA pass.

26.2 Defining EDA

Exploratory Data Analysis is the iterative, visual, hypothesis-generating examination of a dataset. Its purpose is to:

Understand structure — what variables exist, how they are distributed, what their data types and ranges are.
Detect anomalies — outliers, missing patterns, duplicates, suspicious values.
Identify relationships — between pairs and groups of variables.
Generate hypotheses — questions worth investigating with formal modelling later.
Inform chart selection — what visualisation will the audience need?

EDA is exploratory by design. Unlike confirmatory analysis, which tests a pre-specified hypothesis, EDA is open-ended and iterative. The analyst follows the data wherever it leads.

26.3 The EDA Workflow

flowchart LR
    A["1. Understand<br>the dataset"] --> B["2. Univariate<br>profile"]
    B --> C["3. Bivariate<br>exploration"]
    C --> D["4. Multivariate<br>exploration"]
    D --> E["5. Detect<br>anomalies"]
    E --> F["6. Generate<br>hypotheses"]
    F --> G["7. Document<br>findings"]
    G -.-> A
    style A fill:#fce4ec,stroke:#AD1457
    style B fill:#fff3e0,stroke:#EF6C00
    style C fill:#fff8e1,stroke:#F9A825
    style D fill:#e3f2fd,stroke:#1976D2
    style E fill:#ede7f6,stroke:#4527A0
    style F fill:#e8f5e9,stroke:#388E3C
    style G fill:#f3e5f5,stroke:#6A1B9A

A pragmatic seven-step workflow:

Understand the dataset: Read the schema, the documentation, and the source. Confirm what each variable means before computing anything.
Univariate profile: For each variable individually — distribution, central tendency, dispersion, missingness, value frequencies.
Bivariate exploration: Pairs of variables — scatter plots, grouped bars, box plots by category.
Multivariate exploration: Three or more variables together — small multiples, faceted plots, correlation matrices.
Detect anomalies: Outliers, broken time series, suspicious patterns the analyst could not have predicted.
Generate hypotheses: Translate observations into questions for formal analysis.
Document findings: A short narrative record of what was looked at and what was found.

EDA is iterative; later steps often send the analyst back to earlier ones with sharper questions.

26.4 Univariate Techniques

For each individual variable, the EDA toolkit is well established:

Numerical variables:
- Summary statistics (mean, median, mode, standard deviation, quartiles, range).
- Histogram for distribution shape.
- Box plot for centre, spread, outliers.
- Density plot for smoothed shape.
- Q-Q plot for distributional checks against normality.
Categorical variables:
- Frequency table.
- Bar chart for comparison of category counts.
- Pareto chart when the analyst wants to highlight the dominant categories.
Date and time variables:
- Time-series line chart.
- Calendar heatmap for day-of-week and month-of-year patterns.
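
The numerical summaries and the frequency table listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed method: the `amounts` and `categories` values are made up for the example, and the helper names `numeric_profile` and `frequency_table` are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical order amounts (illustrative values, not the chapter's dataset).
amounts = [270, 310, 425, 450, 480, 540, 610, 720, 1250, 4800]

def numeric_profile(values):
    """Return basic summary statistics for one numeric variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }

def frequency_table(values):
    """Return (value, count, share) rows sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, count / total) for value, count in counts.most_common()]

profile = numeric_profile(amounts)
# A mean well above the median is a quick hint of right skew.
right_skewed = profile["mean"] > profile["median"]

# Hypothetical category values for the frequency table.
categories = ["Kitchen", "Bath", "Kitchen", "Apparel", "Kitchen", "Bath"]
freq = frequency_table(categories)

print(profile)
print("right-skewed hint:", right_skewed)
print(freq)
```

Comparing the mean with the median is only a rough skewness check; a histogram or density plot remains the better way to judge shape.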

26.5 Bivariate and Multivariate Techniques

Two numeric variables: Scatter plot is the default. Add a trend line or smoothed loess curve to see the shape of the relationship.
Numeric by categorical: Box plot, violin plot, or grouped histogram per category.
Two categorical variables: Cross-tabulation as a heatmap or stacked bar.
Three or more variables: Small multiples (Tufte’s idea, covered in Module 2), a grid of charts faceted by a third variable. Pair plots / scatter-plot matrices for many numeric variables. Parallel coordinates plots for comparing observations across many numeric variables at once.
Correlation matrix as a heatmap for many numeric variables.

The choice between these is governed by the chart-selection framework in Chapter 13. EDA is where that framework gets its heaviest workout.
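
The numbers behind a correlation heatmap can be computed directly. The sketch below assumes Pearson correlation, which is one common choice and only captures linear association; the column values are hypothetical, `discount_pct` is an invented variable, and `pearson` and `correlation_matrix` are illustrative helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns; discount_pct is invented for the example.
columns = {
    "quantity": [1, 2, 3, 4, 5],
    "amount": [120, 260, 330, 480, 610],
    "discount_pct": [10, 8, 6, 4, 2],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

A coefficient near zero does not rule out a curved relationship, which is why the scatter plot stays the default first view.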

26.6 Anomaly Detection in EDA

A meaningful share of EDA effort goes into finding things that are wrong in the data, not just summarising what is right:

Outliers: single observations far from the bulk of the distribution. Detected via box plots (Tukey’s 1.5 × IQR fences), scatter plots, and z-score thresholds.
Missing patterns: fields that are missing for some segments but not others. Detected via missingness heatmaps.
Constant or near-constant variables: fields that take the same value for almost every record; useless for modelling but easy to miss.
Duplicates and near-duplicates: two or more records that represent the same entity.
Implausible values: negative ages, dates in the future, percentages above 100.
Schema drift: a variable’s format, units, or distribution changing abruptly between time periods, indicating an upstream pipeline issue.

A disciplined EDA pass surfaces these before they bias every downstream analysis.
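
Two of the checks above, outliers and near-constant variables, are easy to sketch in standard-library Python. The thresholds used here (1.5 × IQR, a z-score of 3, a 95% share) are conventional defaults chosen for illustration, and the data values and helper names are hypothetical.

```python
import statistics
from collections import Counter

def tukey_outliers(values, k=1.5):
    """Values outside the box-plot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]

def is_near_constant(values, share=0.95):
    """True when the most frequent value covers at least `share` of the records."""
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= share

# Hypothetical order amounts with one extreme value.
amounts = [270, 310, 425, 450, 480, 540, 610, 720, 1250, 4800]
fence_flags = tukey_outliers(amounts)
z_flags = zscore_outliers(amounts)

# Hypothetical currency column: 99 identical values and one exception.
currency = ["INR"] * 99 + ["USD"]
constant_flag = is_near_constant(currency)

print(fence_flags, z_flags, constant_flag)
```

On this ten-value sample the fence rule flags 4800 while the z-score rule at a threshold of 3 flags nothing, because the extreme value inflates the standard deviation it is measured against. That is one reason to pair numeric thresholds with a box plot rather than rely on either alone.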

26.7 EDA in Power BI and Tableau

Both major BI tools provide EDA-specific features alongside their dashboarding capabilities:

Power BI:
- Quick Insights runs algorithmic exploration and surfaces patterns automatically.
- Decomposition Tree lets the analyst drill into a measure across multiple dimensions interactively.
- Smart Narrative generates plain-language summaries of a visual.
- Q&A lets the analyst type natural-language questions and receive auto-generated charts.
- Key Influencers identifies the variables most associated with a target outcome.
Tableau:
- Show Me recommends chart types based on the variables selected.
- Explain Data uses statistical methods to explain why a particular mark is high or low.
- Summary Card shows descriptive statistics for any selection.
- Forecast and Trend Line in the Analytics pane add quick statistical overlays.

These features do not replace the analyst’s judgement; they accelerate the early passes of the EDA workflow.

26.8 Common Pitfalls

Skipping EDA: Going straight to modelling or dashboarding without examining the data. A common cause of analytical projects that produce confident but wrong conclusions.
Univariate-Only EDA: Looking only at distributions and missing rates without examining relationships between variables.
Confirmation Bias: Looking only for patterns that match the analyst’s existing hypothesis.
Over-Aggregating: Summarising the data to the point where the structure that EDA was meant to surface is averaged away.
No Documentation: Conducting EDA in throwaway notebooks and not capturing the findings; the next analyst repeats the work.
Ignoring Anomalies: Treating outliers as noise to be removed rather than as signal to be investigated.
Premature Modelling: Fitting models on data the analyst has not yet understood.

26.9 Illustrative Cases

A Retail Dataset Reveals Channel Bias

A retail analyst conducts EDA on a year of transaction data. The univariate profile shows that revenue per order is right-skewed, which is typical for retail. But small multiples of histograms by channel reveal that in-store amounts are whole rupees while online amounts carry two decimal places. Investigation finds that an upstream rounding step in the in-store pipeline was silently introducing bias. The model the team was about to build would have inherited it.
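
A check of this kind can also be run numerically. The sketch below computes, per channel, the share of amounts with no fractional part; the records, the channel labels, and the helper name `whole_number_share` are all hypothetical and stand in for the case's real data.

```python
from collections import defaultdict

# Hypothetical (channel, amount) records; values are illustrative only.
orders = [
    ("in_store", 540.0), ("in_store", 425.0), ("in_store", 1250.0),
    ("online", 539.99), ("online", 424.5), ("online", 270.0),
]

def whole_number_share(records):
    """Share of amounts with no fractional part, per channel."""
    totals = defaultdict(int)
    whole = defaultdict(int)
    for channel, amount in records:
        totals[channel] += 1
        if float(amount).is_integer():
            whole[channel] += 1
    return {channel: whole[channel] / totals[channel] for channel in totals}

shares = whole_number_share(orders)
print(shares)
```

A share close to 1.0 in one channel and much lower in another is a prompt to ask how each pipeline handles rounding, not proof of an error in itself.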

Customer Churn Reveals a Survivorship Pattern

A churn-modelling team plots churn by tenure and notices that customer counts drop sharply at month 12, the point of the firm’s first contract renewal. The abruptness of the drop turns out to be a data artefact: customers who churned before month 12 had been excluded from the dataset by an earlier filter, so no early attrition was visible. The discovery reshapes the modelling design entirely.

26.10 Hands-On Exercise: End-to-End EDA Project in Power BI and Tableau

Aim: Conduct a complete exploratory data analysis on a retail dataset, in both Power BI and Tableau, applying the seven-step EDA workflow.

Scenario: A 5,000-row sales dataset for Yuvijen Stores Pvt Ltd covering 12 months across 6 stores, 4 product categories, and 3 customer segments.

Deliverable: A Power BI EDA report (.pbix) and an equivalent Tableau workbook, plus a one-page findings summary listing three or four insights for further investigation.

26.10.1 Step 1 — Load the Data and Understand the Schema

sales_eda.csv (extract)

order_id	order_date	store_id	category	segment	payment_method	quantity	amount
O-2001	2025-04-01	S01	Kitchen	Premium	Card	2	540
O-2002	2025-04-01	S02	Bath	Standard	UPI	5	425
O-2003	2025-04-02	S01	Apparel	Premium	Card	1	1250
O-2004	2025-04-02	S03	Kitchen	Budget	Cash	3	270

In Power BI: Get Data → Text/CSV. In Tableau: connect via Text File. In both, verify each column’s type before proceeding.
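
The same type check can be scripted outside the BI tools. The sketch below parses the four extract rows with Python's `csv` module and converts each column to an expected type. The comma delimiter, the `SCHEMA` mapping, and the helper name `load_typed` are assumptions made for the example; the real file may differ.

```python
import csv
import io
from datetime import date

# The four rows from the extract, written here as comma-separated text (assumed delimiter).
SAMPLE = """order_id,order_date,store_id,category,segment,payment_method,quantity,amount
O-2001,2025-04-01,S01,Kitchen,Premium,Card,2,540
O-2002,2025-04-01,S02,Bath,Standard,UPI,5,425
O-2003,2025-04-02,S01,Apparel,Premium,Card,1,1250
O-2004,2025-04-02,S03,Kitchen,Budget,Cash,3,270
"""

# Expected parser for each column (an assumption based on the extract).
SCHEMA = {
    "order_id": str,
    "order_date": date.fromisoformat,
    "store_id": str,
    "category": str,
    "segment": str,
    "payment_method": str,
    "quantity": int,
    "amount": float,
}

def load_typed(text):
    """Parse CSV text; return (typed rows, list of (line number, error message))."""
    rows, problems = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({col: parse(raw[col]) for col, parse in SCHEMA.items()})
        except (KeyError, ValueError, TypeError) as exc:
            problems.append((line_no, str(exc)))
    return rows, problems

rows, problems = load_typed(SAMPLE)
print(len(rows), "rows parsed;", len(problems), "problem rows")
```

Collecting failures instead of stopping at the first one gives the analyst a list of rows to inspect, which fits the exploratory purpose of this step.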

26.10.2 Step 2 — Univariate Profile

In Power BI, open Power Query Editor and, on the View tab, enable Column quality, Column distribution, and Column profile. Profiling defaults to the top 1,000 rows, so switch the status-bar setting to profile the entire dataset. For each variable note:

amount — distribution, mean, median, range, skewness.
quantity — discrete distribution.
category, segment, payment_method — frequency distribution.
order_date — earliest, latest, gaps.

In Tableau, drag each variable to a worksheet and use Show Me to surface the appropriate univariate chart automatically. Use the Summary Card (Worksheet → Show Summary) for descriptive statistics.

Document one observation per variable in a short comment block.
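
For order_date, the "earliest, latest, gaps" check can be sketched as follows. The dates are hypothetical and `date_coverage` is an illustrative helper name; the sketch assumes daily granularity, so any calendar day with no orders counts as a gap.

```python
from datetime import date, timedelta

def date_coverage(dates):
    """Return (earliest, latest, sorted list of calendar days with no records)."""
    present = set(dates)
    first, last = min(present), max(present)
    span_days = (last - first).days + 1
    all_days = (first + timedelta(days=offset) for offset in range(span_days))
    missing = [day for day in all_days if day not in present]
    return first, last, missing

# Hypothetical order dates with a two-day gap.
order_dates = [
    date(2025, 4, 1), date(2025, 4, 1), date(2025, 4, 2), date(2025, 4, 5),
]
first, last, missing = date_coverage(order_dates)
print(first, last, missing)
```

A missing day may be a legitimate store closure rather than a data fault, so each gap is a question to raise, not a verdict.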

26.10.3 Step 3 — Bivariate Exploration

Build five bivariate views in each tool:

Amount by category — box plot.
Amount by segment — box plot.
Quantity vs amount — scatter plot with trend line.
Amount over order_date — line chart with monthly aggregation.
Payment method by segment — heatmap of cross-tabulation counts.

In Power BI, use the scatter chart with Analytics → Trend Line. In Tableau, Analytics → Trend Line on the scatter.
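
To see what a straight trend line on the quantity-vs-amount scatter represents, the sketch below fits an ordinary least-squares line to the four extract rows. It assumes a linear model; the BI tools may offer other model types, and `linear_trend` is an illustrative helper name. Four points are far too few for a real conclusion and serve only to show the arithmetic.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# quantity and amount from the four rows of the extract.
quantity = [2, 5, 1, 3]
amount = [540, 425, 1250, 270]

slope, intercept = linear_trend(quantity, amount)
print(f"amount = {slope:.2f} * quantity + {intercept:.2f}")
```

The slope comes out negative on these rows because the single-unit Apparel order carries the highest amount, a reminder that a trend across mixed categories can mislead until the scatter is split by category.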

26.10.4 Step 4 — Multivariate Exploration

Power BI Decomposition Tree: Drag amount to Analyze and add category, store_id, segment to Explain by. Click + on a high-value node to drill into what is driving the figure.
Power BI Key Influencers: Drag a target measure (amount) to Analyze and the categorical variables to Explain by. The visual surfaces the strongest drivers.
Tableau small multiples: Build a grid of amount over time faceted by store_id and category.

These multivariate views reveal patterns that bivariate ones miss — for example, a category whose performance varies by store in a way that the overall category trend hides.
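
The aggregation behind a small-multiples grid can be sketched in plain Python: one monthly series per (store, category) panel, set beside the category-level series that hides the store differences. The rows are hypothetical toy data, and `facet_totals` and `category_totals` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical (month, store_id, category, amount) rows.
orders = [
    ("2025-04", "S01", "Kitchen", 500), ("2025-05", "S01", "Kitchen", 700),
    ("2025-04", "S02", "Kitchen", 600), ("2025-05", "S02", "Kitchen", 400),
]

def facet_totals(rows):
    """One monthly series per panel: (store_id, category) -> {month: total}."""
    panels = defaultdict(lambda: defaultdict(float))
    for month, store, category, amount in rows:
        panels[(store, category)][month] += amount
    return {key: dict(sorted(series.items())) for key, series in panels.items()}

def category_totals(rows):
    """The aggregated view: category -> {month: total} across all stores."""
    totals = defaultdict(lambda: defaultdict(float))
    for month, _store, category, amount in rows:
        totals[category][month] += amount
    return {cat: dict(sorted(series.items())) for cat, series in totals.items()}

panels = facet_totals(orders)
overall = category_totals(orders)

print(overall)
for key, series in sorted(panels.items()):
    print(key, series)
```

In this toy data the Kitchen total is flat across the two months while S01 rises and S02 falls, which is the kind of pattern only the faceted view exposes.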

26.10.5 Step 5 — Detect Anomalies

Use the box plots from Step 3 to identify outlier transactions. In Tableau’s box plot, the outliers are individual marks beyond the whiskers; click each to inspect.
In Power BI, use Find Anomalies on a time-series line chart (Analytics pane) to flag unusual values automatically.
Look for missing patterns by colouring a heatmap of missing count by date and store.
Check for implausible values: zero or negative amounts, future dates, quantities above a sensible maximum.

Investigate each anomaly. Some are data errors (fix them); some are genuine outliers (note them).
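
The implausible-value checks in this step can be written as explicit rules. In the sketch below the rows, the order IDs, the `MAX_QUANTITY` ceiling of 100, and the helper name `implausible_reasons` are all hypothetical; a sensible maximum depends on the business.

```python
from datetime import date

MAX_QUANTITY = 100  # hypothetical ceiling; set from business knowledge

def implausible_reasons(row, today):
    """Return the list of rule violations for one order row."""
    reasons = []
    if row["amount"] <= 0:
        reasons.append("non-positive amount")
    if row["order_date"] > today:
        reasons.append("future date")
    if row["quantity"] > MAX_QUANTITY:
        reasons.append("quantity above maximum")
    return reasons

# Hypothetical rows; the order IDs are invented for the example.
rows = [
    {"order_id": "O-9001", "order_date": date(2025, 4, 1), "quantity": 2, "amount": 540.0},
    {"order_id": "O-9002", "order_date": date(2025, 4, 2), "quantity": 1, "amount": 0.0},
    {"order_id": "O-9003", "order_date": date(2031, 1, 1), "quantity": 500, "amount": 99.0},
]

# A fixed reference date keeps the example reproducible.
reference_day = date(2025, 12, 31)
checked = {row["order_id"]: implausible_reasons(row, reference_day) for row in rows}
flagged = {order_id: reasons for order_id, reasons in checked.items() if reasons}

print(flagged)
```

Recording the reason alongside each flagged row makes the later decision, fix or note, easier to document.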

26.10.6 Step 6 — AI-Assisted Insights

Power BI Smart Narrative: Right-click on a visual → Summarize; Power BI generates a plain-language summary identifying highest, lowest, biggest changes.
Power BI Q&A: Type natural-language questions like “top 5 stores by amount last quarter” and receive auto-built visuals.
Tableau Explain Data: Select a mark on any chart → Explain Data; Tableau identifies likely explanatory variables for that point.

Treat these features as accelerators, not as substitutes for analyst judgement. Verify any auto-generated finding against the underlying data before reporting it.

26.10.7 Step 7 — Document the Findings

The One-Page Findings Summary

Finding	Where Detected	Hypothesis to Investigate
Premium-segment baskets are worth 3× the average basket value	Box plot, segment vs amount	Is segment definition correct, or is segment a proxy for store location?
Store S04 declines mid-year while others grow	Small multiple of stores by month	Is S04 affected by a local event the data does not capture?
UPI payment method dominates Standard segment but not Premium	Cross-tab heatmap	Should the firm push UPI adoption in Premium?
3 % of orders have zero amount	Anomaly detection	Are these refunds wrongly captured as orders, or genuine free-promo events?

The findings summary is the deliverable that travels back to the business. Each finding becomes a hypothesis the team can investigate with more focused analysis.

26.10.8 Connect to the Visualisation Layer

A good EDA does more than reveal patterns to the analyst. It also tells the analyst which charts the dashboard should carry:

The strong segment-by-amount difference suggests the executive dashboard should facet revenue by segment.
The S04 anomaly suggests a per-store dashboard with explicit period-over-period indicators.
The payment-method-by-segment pattern suggests a heatmap on the marketing dashboard.

Every insight from EDA is a candidate for a dashboard visual. The discipline of recording the EDA findings is also the discipline of designing the dashboards that follow.

Files and Screen Recordings

Power BI file (yuvijen-eda.pbix), Tableau workbook (yuvijen-eda.twbx), the source sales_eda.csv, and screen recordings of the seven-step workflow in both tools will be embedded here.

Summary

Concept	Description
Foundations
Why EDA Matters	Before any chart, dashboard, or model is built, the analyst must look at the data
Exploratory Data Analysis	Iterative visual hypothesis-generating examination of a dataset before formal modelling
The Seven-Step Workflow
Understand the Dataset	Read the schema, documentation, and source; confirm each variable means what it says
Univariate Profile	Distribution, centre, dispersion, missingness, and value frequencies for each variable individually
Bivariate Exploration	Pairs of variables: scatter, grouped bar, box plot by category
Multivariate Exploration	Three or more variables together: small multiples, faceted plots, correlation matrices
Detect Anomalies	Outliers, missing patterns, constant variables, duplicates, implausible values, schema drift
Generate Hypotheses	Translate observations into questions for formal analysis
Document Findings	Short narrative record of what was examined and what was found
Univariate Techniques
Numerical Summary Stats	Mean, median, mode, standard deviation, quartiles, range
Histogram	Distribution shape for a numerical variable
Box Plot	Centre, spread, and outliers for a numerical variable
Density Plot	Smoothed distribution shape for a numerical variable
Q-Q Plot	Distributional check against normality or another reference distribution
Categorical Frequency Table	Counts and proportions of each category value
Bar Chart	Comparison of category counts
Pareto Chart	Bar chart sorted to highlight dominant categories with cumulative line
Time-Series Line Chart	Default for any time-stamped variable
Calendar Heatmap	Day-of-week and month-of-year patterns rendered as a coloured grid
Bivariate and Multivariate
Scatter Plot	Default for two numerical variables; add trend line for the relationship shape
Grouped Box Plot	Distribution of a numerical variable across categories
Cross-Tab Heatmap	Two categorical variables rendered as a heatmap or stacked bar
Small Multiples	Grid of charts faceted by a third variable; the multivariate workhorse
Pair Plot	Grid of pairwise scatter plots covering many numeric variables
Parallel Coordinates	Comparison of observations across many numeric variables at once
Correlation Heatmap	Coloured matrix of pairwise correlations across many numeric variables
Anomaly Detection
Outlier	Single observation far from the bulk of the distribution
Missing Pattern	Field missing for some segments but not others, suggesting upstream cause
Constant Variable	Field that takes the same value for almost every record; useless and easy to miss
Duplicate	Two or more records that represent the same entity
Implausible Values	Negative ages, future dates, percentages above 100
Schema Drift	Distribution changing abruptly between periods, indicating pipeline issue
Power BI EDA Features
Power BI Quick Insights	Algorithmic exploration that surfaces patterns automatically
Power BI Decomposition Tree	Interactive drill into a measure across multiple dimensions
Power BI Smart Narrative	Plain-language auto-summary of a selected visual
Power BI Q and A	Natural-language questions producing auto-built visuals
Power BI Key Influencers	Identifies variables most associated with a target outcome
Tableau EDA Features
Tableau Show Me	Recommends chart types automatically based on selected variables
Tableau Explain Data	Statistical method to explain why a particular mark is high or low
Tableau Summary Card	Descriptive statistics for any selection in a Tableau worksheet
Tableau Forecast	Quick exponential-smoothing forecast added from the Analytics pane
Common Pitfalls
Skipping EDA	Pitfall of going to modelling or dashboarding without examining the data
Univariate-Only EDA	Pitfall of looking only at single-variable summaries while ignoring relationships
Confirmation Bias	Pitfall of looking only for patterns that match an existing hypothesis
Over-Aggregating	Pitfall of summarising until the structure EDA was meant to find is averaged away
No Documentation	Pitfall of conducting EDA in throwaway notebooks; the next analyst repeats it
Ignoring Anomalies	Pitfall of treating outliers as noise to be removed rather than signal to be investigated
Premature Modelling	Pitfall of fitting models on data the analyst has not yet understood