flowchart LR
A["1. Understand<br>the dataset"] --> B["2. Univariate<br>profile"]
B --> C["3. Bivariate<br>exploration"]
C --> D["4. Multivariate<br>exploration"]
D --> E["5. Detect<br>anomalies"]
E --> F["6. Generate<br>hypotheses"]
F --> G["7. Document<br>findings"]
G -.-> A
style A fill:#fce4ec,stroke:#AD1457
style B fill:#fff3e0,stroke:#EF6C00
style C fill:#fff8e1,stroke:#F9A825
style D fill:#e3f2fd,stroke:#1976D2
style E fill:#ede7f6,stroke:#4527A0
style F fill:#e8f5e9,stroke:#388E3C
style G fill:#f3e5f5,stroke:#6A1B9A
26 Exploratory Data Analysis Techniques and Workflows
26.1 Why Exploratory Data Analysis Matters
Before any chart, dashboard, or model is built, the analyst must look at the data.
Exploratory Data Analysis (EDA) is the disciplined initial examination of a dataset using summary statistics and graphical methods, before any formal modelling or finalised reporting. It surfaces the structure, distributions, anomalies, and relationships in the data — the patterns the analyst must understand before they can choose the right chart, the right model, or the right business question to ask of the data.
The practice was formalised by John Tukey in Exploratory Data Analysis (John W. Tukey, 1977), the book that named the discipline and popularised the boxplot, the stem-and-leaf display, and a simple fence rule for flagging outliers. Tukey’s central insight was that graphical methods are not optional decoration: they are the most efficient way for the analyst’s eye to detect what the data actually contains, and they should run before hypothesis testing, not after. John T. Behrens (1997) distilled the practice into a set of teachable principles for working analysts.
For a visualisation-focused analytics book, EDA is doubly important. The dashboards and reports the firm publishes are only as good as the analyst’s understanding of the underlying data — and that understanding comes from a structured EDA pass.
26.2 Defining EDA
Exploratory Data Analysis is the iterative, visual, hypothesis-generating examination of a dataset. Its purpose is to:
- Understand structure — what variables exist, how they are distributed, what their data types and ranges are.
- Detect anomalies — outliers, missing patterns, duplicates, suspicious values.
- Identify relationships — between pairs and groups of variables.
- Generate hypotheses — questions worth investigating with formal modelling later.
- Inform chart selection — what visualisation will the audience need?
EDA is exploratory by design. Unlike confirmatory analysis, which tests a pre-specified hypothesis, EDA is open-ended and iterative. The analyst follows the data wherever it leads.
26.3 The EDA Workflow
A pragmatic seven-step workflow:
- Understand the dataset: Read the schema, the documentation, and the source. Confirm what each variable means before computing anything.
- Univariate profile: For each variable individually — distribution, central tendency, dispersion, missingness, value frequencies.
- Bivariate exploration: Pairs of variables — scatter plots, grouped bars, box plots by category.
- Multivariate exploration: Three or more variables together — small multiples, faceted plots, correlation matrices.
- Detect anomalies: Outliers, broken time series, suspicious patterns the analyst could not have predicted.
- Generate hypotheses: Translate observations into questions for formal analysis.
- Document findings: A short narrative record of what was looked at and what was found.
EDA is iterative; later steps often send the analyst back to earlier ones with sharper questions.
26.4 Univariate Techniques
For each individual variable, the EDA toolkit is well established:
- Numerical variables:
  - Summary statistics (mean, median, mode, standard deviation, quartiles, range).
  - Histogram for distribution shape.
  - Box plot for centre, spread, outliers.
  - Density plot for smoothed shape.
  - Q-Q plot for distributional checks against normality.
- Categorical variables:
  - Frequency table.
  - Bar chart for comparison of category counts.
  - Pareto chart when the analyst wants to highlight the dominant categories.
- Date and time variables:
  - Time-series line chart.
  - Calendar heatmap for day-of-week and month-of-year patterns.
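The numerical and categorical summaries listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `amounts` and `categories` values are made up for the example and are not drawn from any dataset in this chapter.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev

# Hypothetical order amounts; the single large value mimics retail right skew.
amounts = [270, 425, 540, 310, 295, 480, 1250, 360, 415, 9800]

q1, _, q3 = quantiles(amounts, n=4)  # quartile cut points (default method)
profile = {
    "n": len(amounts),
    "mean": mean(amounts),
    "median": median(amounts),
    "stdev": stdev(amounts),
    "q1": q1,
    "q3": q3,
    "min": min(amounts),
    "max": max(amounts),
}

# A mean well above the median is a quick numeric hint of right skew.
right_skew_hint = profile["mean"] > profile["median"]

# Frequency table for a categorical variable, most common value first.
categories = ["Kitchen", "Bath", "Apparel", "Kitchen", "Kitchen", "Bath"]
frequency = Counter(categories).most_common()

print(profile)
print(frequency)
```

The numbers are a complement to the histogram and box plot, not a replacement: the gap between mean and median says that skew exists, while the chart shows where it comes from.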
26.5 Bivariate and Multivariate Techniques
- Two numeric variables: Scatter plot is the default. Add a trend line or smoothed loess curve to see the shape of the relationship.
- Numeric by categorical: Box plot, violin plot, or grouped histogram per category.
- Two categorical variables: Cross-tabulation as a heatmap or stacked bar.
- Three or more variables: Small multiples (Tufte’s idea, covered in Module 2), a grid of charts faceted by a third variable. Pair plots (scatter-plot matrices) for many numeric variables. Parallel coordinates plots for comparing records or groups across many variables at once.
- Correlation matrix as a heatmap for many numeric variables.
The choice between these is governed by the chart-selection framework in Chapter 13. EDA is where that framework gets its heaviest workout.
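Two of the bivariate summaries above have simple numeric counterparts: a correlation coefficient behind the scatter plot and a table of counts behind the cross-tab heatmap. The sketch below assumes small made-up lists; `pearson` is a helper written for this example, not a library function.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Two numeric variables (hypothetical values).
quantity = [1, 2, 3, 4, 5]
amount = [120, 260, 330, 480, 610]
r = pearson(quantity, amount)

# Two categorical variables: count each (segment, payment_method) pair.
pairs = [
    ("Premium", "Card"), ("Standard", "UPI"), ("Standard", "UPI"),
    ("Budget", "Cash"), ("Premium", "Card"), ("Standard", "Card"),
]
crosstab = Counter(pairs)

segments = sorted({segment for segment, _ in pairs})
methods = sorted({method for _, method in pairs})
for segment in segments:
    # A Counter returns 0 for combinations that never occur.
    print(segment, [crosstab[(segment, method)] for method in methods])
```

A single coefficient can hide a curved or clustered relationship, which is why the scatter plot remains the default view and the coefficient only a summary of it.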
26.6 Anomaly Detection in EDA
A meaningful share of EDA effort goes into finding things that are wrong in the data, not just summarising what is right:
- Outliers — single observations far from the bulk of the distribution. Detected via box plots, scatter plots, and z-score thresholds.
- Missing patterns — fields that are missing for some segments but not others. Detected via missingness heatmaps.
- Constant or near-constant variables — fields that take the same value for almost every record; useless for modelling but easy to miss.
- Duplicates and near-duplicates — pairs of records that should be the same entity.
- Implausible values — negative ages, dates in the future, percentages above 100.
- Schema drift — a variable’s distribution changing abruptly between time periods, indicating an upstream pipeline issue.
A disciplined EDA pass surfaces these before they bias every downstream analysis.
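Three of these checks are easy to automate. The sketch below uses the box-plot fence rule (1.5 times the interquartile range) for outliers, a share-of-most-common-value test for near-constant fields, and a key count for duplicates. The 1.5 multiplier and the 0.95 threshold are conventional starting points assumed for the example, and all data values are made up.

```python
from collections import Counter
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def is_near_constant(values, threshold=0.95):
    """True when one value accounts for at least `threshold` of the records."""
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= threshold

def duplicate_keys(rows, key):
    """Key values that appear on more than one record."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)

amounts = [270, 295, 310, 360, 415, 425, 480, 540, 1250, 9800]
currency = ["INR"] * 99 + ["USD"]
rows = [{"order_id": "O-2001"}, {"order_id": "O-2002"}, {"order_id": "O-2002"}]

print(iqr_outliers(amounts))
print(is_near_constant(currency))
print(duplicate_keys(rows, "order_id"))
```

A flagged value is a prompt to investigate, not a verdict: the two large amounts here could be bulk orders or entry errors, and only the source records can say which.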
26.7 EDA in Power BI and Tableau
Both major BI tools provide EDA-specific features alongside their dashboarding capabilities:
- Power BI:
  - Quick Insights runs algorithmic exploration and surfaces patterns automatically.
  - Decomposition Tree lets the analyst drill into a measure across multiple dimensions interactively.
  - Smart Narrative generates plain-language summaries of a visual.
  - Q&A lets the analyst type natural-language questions and receive auto-generated charts.
  - Key Influencers identifies the variables most associated with a target outcome.
- Tableau:
  - Show Me recommends chart types based on the variables selected.
  - Explain Data uses statistical methods to explain why a particular mark is high or low.
  - Summary Card shows descriptive statistics for any selection.
  - Forecast and Trend Line in the Analytics pane add quick statistical overlays.
These features do not replace the analyst’s judgement; they accelerate the early passes of the EDA workflow.
26.8 Common Pitfalls
- Skipping EDA: Going straight to modelling or dashboarding without examining the data. This is a frequent cause of analytical projects that produce confident but wrong conclusions.
- Univariate-Only EDA: Looking only at distributions and missing rates without examining relationships between variables.
- Confirmation Bias: Looking only for patterns that match the analyst’s existing hypothesis.
- Over-Aggregating: Summarising the data to the point where the structure that EDA was meant to surface is averaged away.
- No Documentation: Conducting EDA in throwaway notebooks and not capturing the findings; the next analyst repeats the work.
- Ignoring Anomalies: Treating outliers as noise to be removed rather than as signal to be investigated.
- Premature Modelling: Fitting models on data the analyst has not yet understood.
26.9 Illustrative Cases
A Retail Dataset Reveals Channel Bias
A retail analyst conducts EDA on a year of transaction data. The univariate profile shows that revenue per order is right-skewed, which is typical for retail. But small multiples of histograms by channel reveal that in-store amounts are whole rupees while online amounts carry two decimals. Investigation finds that an upstream rounding step in the in-store pipeline was silently introducing bias. The model the team was about to build would have inherited it.
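The granularity difference in this case can also be checked numerically rather than by eye. A minimal sketch, assuming hypothetical `(channel, amount)` pairs; `share_with_decimals` is a helper invented for the example.

```python
def share_with_decimals(values):
    """Fraction of values that carry a non-zero fractional part."""
    return sum(1 for v in values if v != int(v)) / len(values)

# Hypothetical orders: (channel, amount in rupees).
orders = [
    ("in_store", 540.0), ("in_store", 425.0), ("in_store", 1250.0),
    ("online", 539.50), ("online", 424.99), ("online", 1250.0),
]

# Group amounts by channel, then compare the share of fractional values.
by_channel = {}
for channel, amount in orders:
    by_channel.setdefault(channel, []).append(amount)

decimal_share = {
    channel: share_with_decimals(values) for channel, values in by_channel.items()
}
print(decimal_share)
```

A share of exactly zero in one channel and a high share in another is the numeric signature of the rounding step described above; whether it matters depends on how the downstream analysis uses the amounts.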
Customer Churn Reveals a Survivorship Pattern
A churn-modelling team plots tenure by churn status and notices that customer counts drop sharply at month 12, the point of the firm’s first contract renewal. The drop turns out to be a data artefact: customers who churned before month 12 had been excluded from the dataset by an earlier filter. The discovery reshapes the modelling design entirely.
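One quick check for this kind of artefact is to count churned customers by tenure and look at the earliest month in which any churn appears. The sketch below assumes hypothetical `(tenure_months, churned)` pairs and a renewal point at month 12, as in the case.

```python
from collections import Counter

FIRST_RENEWAL_MONTH = 12  # renewal point from the case above

# Hypothetical customers: (tenure in months, churned flag).
customers = [
    (14, True), (12, True), (12, True), (30, False),
    (13, True), (8, False), (20, False), (12, True),
]

# Number of churned customers at each tenure value.
churn_by_tenure = Counter(tenure for tenure, churned in customers if churned)
earliest_churn = min(churn_by_tenure)

# No churn at all before the renewal month is a warning sign of filtering.
no_early_churn = earliest_churn >= FIRST_RENEWAL_MONTH
print(sorted(churn_by_tenure.items()), no_early_churn)
```

The flag is a prompt rather than proof: a contract that genuinely prevents early exit would produce the same pattern, so the next step is to read the extraction logic.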
26.10 Hands-On Exercise: End-to-End EDA Project in Power BI and Tableau
Aim: Conduct a complete exploratory data analysis on a retail dataset, in both Power BI and Tableau, applying the seven-step EDA workflow.
Scenario: A 5,000-row sales dataset for Yuvijen Stores Pvt Ltd covering 12 months across 6 stores, 4 product categories, and 3 customer segments.
Deliverable: A Power BI EDA workbook plus the equivalent in Tableau, plus a one-page findings summary listing three or four insights for further investigation.
26.10.1 Step 1 — Load the Data and Understand the Schema
| order_id | order_date | store_id | category | segment | payment_method | quantity | amount |
|---|---|---|---|---|---|---|---|
| O-2001 | 2025-04-01 | S01 | Kitchen | Premium | Card | 2 | 540 |
| O-2002 | 2025-04-01 | S02 | Bath | Standard | UPI | 5 | 425 |
| O-2003 | 2025-04-02 | S01 | Apparel | Premium | Card | 1 | 1250 |
| O-2004 | 2025-04-02 | S03 | Kitchen | Budget | Cash | 3 | 270 |
In Power BI: Get Data → Text/CSV. In Tableau: connect via Text File. In both, verify each column’s type before proceeding.
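The same type check can be done outside the BI tools. The sketch below parses the four sample rows shown above with Python's `csv` module and fails loudly when a value does not fit its expected type. The `SCHEMA` mapping is an assumption for this example (in particular, treating `amount` as a float), and a real run would read the CSV file rather than an inline string.

```python
import csv
import io
from datetime import date

# The four sample rows from the table above, inlined for a runnable example.
SAMPLE_CSV = """order_id,order_date,store_id,category,segment,payment_method,quantity,amount
O-2001,2025-04-01,S01,Kitchen,Premium,Card,2,540
O-2002,2025-04-01,S02,Bath,Standard,UPI,5,425
O-2003,2025-04-02,S01,Apparel,Premium,Card,1,1250
O-2004,2025-04-02,S03,Kitchen,Budget,Cash,3,270
"""

# Assumed parser per column; adjust to the documented schema.
SCHEMA = {
    "order_id": str,
    "order_date": date.fromisoformat,
    "store_id": str,
    "category": str,
    "segment": str,
    "payment_method": str,
    "quantity": int,
    "amount": float,
}

def load_orders(text):
    """Parse CSV text into typed dicts, reporting the first bad line."""
    rows = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({col: parse(raw[col]) for col, parse in SCHEMA.items()})
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows

orders = load_orders(SAMPLE_CSV)
print(len(orders), orders[0])
```

Failing on the first bad value is a deliberate choice for a first pass: a silent coercion to text or null at load time is exactly the kind of problem EDA is meant to catch.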
26.10.2 Step 2 — Univariate Profile
In Power BI, open Power Query and, on the View tab, enable Column quality, Column distribution, and Column profile. Switch the profiling scope in the status bar from the top 1,000 rows to the entire dataset. For each variable note:
- amount — distribution, mean, median, range, skewness.
- quantity — discrete distribution.
- category, segment, payment_method — frequency distribution.
- order_date — earliest, latest, gaps.
In Tableau, drag each variable to a worksheet and use Show Me to surface the appropriate univariate chart automatically. Use the Summary Card (Worksheet → Show Summary) for descriptive statistics.
Document one observation per variable in a short comment block.
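The `order_date` check (earliest, latest, gaps) is the one item in this step that a histogram does not answer directly. A minimal sketch, assuming a short made-up list of dates; `missing_dates` is a helper written for the example.

```python
from datetime import date, timedelta

def missing_dates(dates):
    """Calendar days between the earliest and latest date with no record."""
    present = set(dates)
    start, end = min(present), max(present)
    all_days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in all_days if day not in present]

# Hypothetical order dates with a two-day hole.
order_dates = [date(2025, 4, 1), date(2025, 4, 1), date(2025, 4, 2), date(2025, 4, 5)]

print(min(order_dates), max(order_dates))
gaps = missing_dates(order_dates)
print(gaps)
```

Whether a gap is a problem depends on the business: a store closed on a holiday produces a legitimate gap, while a gap shared by every store on the same day points at the pipeline.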
26.10.3 Step 3 — Bivariate Exploration
Build five bivariate views in each tool:
- Amount by category — box plot.
- Amount by segment — box plot.
- Quantity vs amount — scatter plot with trend line.
- Amount over order_date — line chart with monthly aggregation.
- Payment method by segment — heatmap of cross-tabulation counts.
In Power BI, select the scatter chart and add a Trend line from the Analytics pane. In Tableau, drag Trend Line from the Analytics pane onto the scatter plot.
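For readers who want to see what the trend line and the monthly aggregation compute, the sketch below fits an ordinary least-squares line and sums amounts by calendar month. It illustrates the idea and is not a description of how either tool implements it. The quantity and amount values are the four sample rows from Step 1; the dated amounts are made up.

```python
from collections import defaultdict
from datetime import date

def fit_trend_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def monthly_totals(orders):
    """Sum amounts per (year, month) from (order_date, amount) pairs."""
    totals = defaultdict(float)
    for order_date, value in orders:
        totals[(order_date.year, order_date.month)] += value
    return dict(totals)

# Quantity and amount from the four sample rows in Step 1.
quantity = [2, 5, 1, 3]
amount = [540, 425, 1250, 270]
slope, intercept = fit_trend_line(quantity, amount)

# Hypothetical dated amounts spanning two months.
dated = [(date(2025, 4, 1), 540), (date(2025, 4, 2), 1250), (date(2025, 5, 3), 425)]
monthly = monthly_totals(dated)

print(round(slope, 2), round(intercept, 2))
print(monthly)
```

On these four rows the slope is negative because a single high-priced order has quantity 1. Four points are far too few to conclude anything, but the example shows why a trend line should be read alongside the scatter and, where relevant, split by category.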
26.10.4 Step 4 — Multivariate Exploration
- Power BI Decomposition Tree: Drag amount to Analyze and add category, store_id, segment to Explain by. Click + on a high-value node to drill into what is driving the figure.
- Power BI Key Influencers: Drag a target measure (amount) to Analyze and the categorical variables to Explain by. The visual surfaces the strongest drivers.
- Tableau small multiples: Build a grid of amount over time faceted by store_id and category.
These multivariate views reveal patterns that bivariate ones miss — for example, a category whose performance varies by store in a way that the overall category trend hides.
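The masking effect described here is easy to reproduce with a grouped mean. In the sketch below the rows are hypothetical and deliberately constructed so that the category-level trend is flat while the two stores move in opposite directions; `mean_by` is a helper written for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (category, store_id, month, amount).
rows = [
    ("Kitchen", "S01", 1, 100), ("Kitchen", "S01", 2, 140),
    ("Kitchen", "S02", 1, 140), ("Kitchen", "S02", 2, 100),
]

def mean_by(rows, key_fn):
    """Mean amount for each group defined by key_fn(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row[3])
    return {key: mean(values) for key, values in groups.items()}

# Bivariate view: category by month. The trend looks flat.
by_category = mean_by(rows, lambda r: (r[0], r[2]))

# Multivariate view: category by store by month. The stores diverge.
by_store = mean_by(rows, lambda r: (r[0], r[1], r[2]))

print(by_category)
print(by_store)
```

This is the numeric counterpart of the small-multiples grid: each `(category, store)` pair gets its own series instead of being averaged into one line.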
26.10.5 Step 5 — Detect Anomalies
- Use the box plots from Step 3 to identify outlier transactions. In Tableau’s box plot, the outliers are individual marks beyond the whiskers; click each to inspect.
- In Power BI, use Find Anomalies on a time-series line chart (Analytics pane) to flag unusual values automatically.
- Look for missingness patterns by building a heatmap of missing-value counts by date and store, with colour encoding the count.
- Check for implausible values: zero or negative amounts, future dates, quantities above a sensible maximum.
Investigate each anomaly. Some are data errors (fix them); some are genuine outliers (note them).
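The implausible-value and missingness checks in this step can be written as explicit rules. The sketch below is one way to do it; `MAX_QUANTITY`, the reference date `today`, and the sample orders are all assumptions made for the example and should be replaced with values grounded in the business context.

```python
from collections import Counter
from datetime import date

MAX_QUANTITY = 50  # hypothetical ceiling; set it from domain knowledge

def implausible_reasons(order, today):
    """Return the rule violations for one order (empty list if none)."""
    reasons = []
    if order["amount"] is not None and order["amount"] <= 0:
        reasons.append("zero or negative amount")
    if order["order_date"] > today:
        reasons.append("future date")
    if order["quantity"] > MAX_QUANTITY:
        reasons.append("quantity above maximum")
    return reasons

orders = [
    {"order_id": "O-1", "order_date": date(2025, 4, 1), "store_id": "S01", "quantity": 2, "amount": 540.0},
    {"order_id": "O-2", "order_date": date(2025, 4, 1), "store_id": "S02", "quantity": 1, "amount": 0.0},
    {"order_id": "O-3", "order_date": date(2031, 1, 1), "store_id": "S01", "quantity": 400, "amount": 90.0},
    {"order_id": "O-4", "order_date": date(2025, 4, 2), "store_id": "S03", "quantity": 3, "amount": None},
]
today = date(2026, 3, 31)  # fixed so the example is reproducible

# Orders that break at least one rule, with the reasons attached.
flagged = {}
for order in orders:
    reasons = implausible_reasons(order, today)
    if reasons:
        flagged[order["order_id"]] = reasons

# Missing amounts counted per (date, store): the input to a missingness heatmap.
missing_amounts = Counter(
    (o["order_date"], o["store_id"]) for o in orders if o["amount"] is None
)

print(flagged)
print(missing_amounts)
```

Keeping the reasons alongside each flagged order makes the follow-up easier: a zero amount may be a refund or a promotion, and the rule only identifies which records to look at.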
26.10.6 Step 6 — AI-Assisted Insights
- Power BI Smart Narrative: Right-click on a visual → Summarize; Power BI generates a plain-language summary identifying highest, lowest, biggest changes.
- Power BI Q&A: Type natural-language questions like “top 5 stores by amount last quarter” and receive auto-built visuals.
- Tableau Explain Data: Select a mark on any chart → Explain Data; Tableau identifies likely explanatory variables for that point.
Treat these features as accelerators, not as substitutes for analyst judgement. Verify any auto-generated finding against the underlying data before reporting it.
26.10.7 Step 7 — Document the Findings
| Finding | Where Detected | Hypothesis to Investigate |
|---|---|---|
| Premium-segment baskets are 3× the average basket value | Box plot, segment vs amount | Is segment definition correct, or is segment a proxy for store location? |
| Store S04 declines mid-year while others grow | Small multiple of stores by month | Is S04 affected by a local event the data does not capture? |
| UPI payment method dominates Standard segment but not Premium | Cross-tab heatmap | Should the firm push UPI adoption in Premium? |
| 3% of orders have zero amount | Anomaly detection | Are these refunds wrongly captured as orders, or genuine free-promo events? |
The findings summary is the deliverable that travels back to the business. Each finding becomes a hypothesis the team can investigate with more focused analysis.
26.10.8 Connect to the Visualisation Layer
A good EDA does more than reveal patterns to the analyst. It also tells the analyst which charts the dashboard should carry:
- The strong segment-by-amount difference suggests the executive dashboard should facet revenue by segment.
- The S04 anomaly suggests a per-store dashboard with explicit period-over-period indicators.
- The payment-method-by-segment pattern suggests a heatmap on the marketing dashboard.
Every insight from EDA is a candidate for a dashboard visual. The discipline of recording the EDA findings is also the discipline of designing the dashboards that follow.
Power BI file (yuvijen-eda.pbix), Tableau workbook (yuvijen-eda.twbx), the source sales_eda.csv, and screen recordings of the seven-step workflow in both tools will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why EDA Matters | Before any chart, dashboard, or model is built, the analyst must look at the data |
| Exploratory Data Analysis | Iterative visual hypothesis-generating examination of a dataset before formal modelling |
| The Seven-Step Workflow | |
| Understand the Dataset | Read the schema, documentation, and source; confirm each variable means what it says |
| Univariate Profile | Distribution, centre, dispersion, missingness, and value frequencies for each variable individually |
| Bivariate Exploration | Pairs of variables: scatter, grouped bar, box plot by category |
| Multivariate Exploration | Three or more variables together: small multiples, faceted plots, correlation matrices |
| Detect Anomalies | Outliers, missing patterns, constant variables, duplicates, implausible values, schema drift |
| Generate Hypotheses | Translate observations into questions for formal analysis |
| Document Findings | Short narrative record of what was examined and what was found |
| Univariate Techniques | |
| Numerical Summary Stats | Mean, median, mode, standard deviation, quartiles, range |
| Histogram | Distribution shape for a numerical variable |
| Box Plot | Centre, spread, and outliers for a numerical variable |
| Density Plot | Smoothed distribution shape for a numerical variable |
| Q-Q Plot | Distributional check against normality or another reference distribution |
| Categorical Frequency Table | Counts and proportions of each category value |
| Bar Chart | Comparison of category counts |
| Pareto Chart | Bar chart sorted to highlight dominant categories with cumulative line |
| Time-Series Line Chart | Default for any time-stamped variable |
| Calendar Heatmap | Day-of-week and month-of-year patterns rendered as a coloured grid |
| Bivariate and Multivariate | |
| Scatter Plot | Default for two numerical variables; add trend line for the relationship shape |
| Grouped Box Plot | Distribution of a numerical variable across categories |
| Cross-Tab Heatmap | Two categorical variables rendered as a heatmap or stacked bar |
| Small Multiples | Grid of charts faceted by a third variable; the multivariate workhorse |
| Pair Plot | Grid of pairwise scatter plots covering many numeric variables |
| Parallel Coordinates | High-dimensional comparison of records or groups across many variables |
| Correlation Heatmap | Coloured matrix of pairwise correlations across many numeric variables |
| Anomaly Detection | |
| Outlier | Single observation far from the bulk of the distribution |
| Missing Pattern | Field missing for some segments but not others, suggesting upstream cause |
| Constant Variable | Field that takes the same value for almost every record; useless and easy to miss |
| Duplicate | Pair of records that should be the same entity |
| Implausible Values | Negative ages, future dates, percentages above 100 |
| Schema Drift | Distribution changing abruptly between periods, indicating pipeline issue |
| Power BI EDA Features | |
| Power BI Quick Insights | Algorithmic exploration that surfaces patterns automatically |
| Power BI Decomposition Tree | Interactive drill into a measure across multiple dimensions |
| Power BI Smart Narrative | Plain-language auto-summary of a selected visual |
| Power BI Q&A | Natural-language questions producing auto-built visuals |
| Power BI Key Influencers | Identifies variables most associated with a target outcome |
| Tableau EDA Features | |
| Tableau Show Me | Recommends chart types automatically based on selected variables |
| Tableau Explain Data | Statistical method to explain why a particular mark is high or low |
| Tableau Summary Card | Descriptive statistics for any selection in a Tableau worksheet |
| Tableau Forecast | Quick exponential-smoothing forecast added from the Analytics pane |
| Common Pitfalls | |
| Skipping EDA | Pitfall of going to modelling or dashboarding without examining the data |
| Univariate-Only EDA | Pitfall of looking only at single-variable summaries while ignoring relationships |
| Confirmation Bias | Pitfall of looking only for patterns that match an existing hypothesis |
| Over-Aggregating | Pitfall of summarising until the structure EDA was meant to find is averaged away |
| No Documentation | Pitfall of conducting EDA in throwaway notebooks; the next analyst repeats it |
| Ignoring Anomalies | Pitfall of treating outliers as noise to be removed rather than signal to be investigated |
| Premature Modelling | Pitfall of fitting models on data the analyst has not yet understood |