4  Data Preparation and Transformation in R

Data preparation and transformation are crucial steps in data analysis, ensuring that raw data is cleaned, structured, and formatted for further analysis (Wickham, 2014). In R, several functions and packages help perform these tasks efficiently. This chapter covers essential data preparation techniques, including data import, cleaning, transformation, and manipulation.

4.1 Data Manipulations and Transformations

Data Handling in R

Data Handling in R refers to the importing, managing, transforming, and cleaning of data before analysis. R provides powerful functions and packages such as readr, dplyr, tidyr, and data.table to efficiently handle large datasets.

4.1.1 Importing Data into R

Importing data into R is a crucial step in data analysis. R offers various functions for loading data from different file types.

  • R supports importing data from various sources, including **Excel, CSV, databases (SQL), JSON, and APIs**.

Import a CSV file from a website

  • Paste the CSV file's URL inside the read.csv() function, as sketched below.
  • Sample CSV file website link
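
A minimal sketch of this workflow; the URL below is hypothetical, so substitute the sample link above:

Code
# Read a CSV file directly from a website
url <- "https://example.com/sample-data.csv"  # hypothetical link
webdata <- read.csv(url)

# Display the first few rows
head(webdata)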

Import a CSV file from a local folder

Sample CSV data file - Download the CSV dataset from this link.

Copy the file path (or, if the working directory is set, just the file name) of the dataset and paste it inside read.csv("file name").

  • To copy the path of a file, right-click on the file and select Copy path, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
# Read CSV file
Indiaagriculture = read.csv("Agri production india.csv")

# Display the first few rows
head(Indiaagriculture)

Import an Excel file from a local folder

readxl: To read Excel files in R, you need to install the readxl package.

  • The readxl package allows users to import Excel spreadsheets (.xls and .xlsx) into R for further analysis.
  • Supports both .xls (Excel 97-2003) and .xlsx (Excel 2007+) formats.
  • Reads specific sheets, ranges, and named regions.
Install readxl package
Code
install.packages("readxl")

Sample data file

Download the excel dataset from this link.

Copy the file name (or full file path) of the dataset and paste it inside the function read_excel("file_name").

  • To copy the path of a file, right-click on the file and select Copy path, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
library(readxl)
# Read the first sheet of the Excel workbook
Europeanagriculture = read_excel("Europeanagriculture.xlsx", sheet = 1)
head(Europeanagriculture)

Import data from an SQL database

  • First, download the SQL database file from this link.
  • Save the SQL file in your directory.
  • Install the required packages shown below:

  • install.packages("DBI") - database interface package
  • install.packages("RSQLite") - SQLite database driver (for MySQL, use RMariaDB)
  • install.packages("dplyr") - for data manipulation

Code
# Install the required packages for SQL
install.packages("DBI")
install.packages("RSQLite")
install.packages("dplyr")
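
With the packages installed, a minimal sketch of the import workflow follows; the database file name (agridata.sqlite) and table name (crops) are hypothetical placeholders:

Code
library(DBI)
library(RSQLite)

# Connect to the SQLite database file (hypothetical name)
con <- dbConnect(RSQLite::SQLite(), "agridata.sqlite")

# List the tables available in the database
dbListTables(con)

# Pull a table into R with an SQL query (hypothetical table name)
crops <- dbGetQuery(con, "SELECT * FROM crops")
head(crops)

# Close the connection when finished
dbDisconnect(con)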

4.1.2 Data file creation in R

Use the data.frame() function to create a dataset manually in R, as sketched below.
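
A minimal sketch; the column names and values are hypothetical (note the missing value and the duplicate row, which the next sections make use of):

Code
# Create a data frame manually
students <- data.frame(
  Name  = c("Asha", "Ben", "Chen", "Asha"),
  Age   = c(21, 23, 22, 21),
  Score = c(88, NA, 91, 88)
)
students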


4.1.3 Viewing and Exploring Data

Explore the Data

Checking Data Structure
  • View() - To view the entire table in a window
  • head() - View the first few rows
  • str() - Check structure of dataset
  • summary() - Summary statistics
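
A quick sketch using the hypothetical students data frame created above:

Code
head(students)     # first few rows
str(students)      # structure: column names, types, sample values
summary(students)  # summary statistics for each column
# View(students)   # opens the full table in a viewer window
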
Checking Missing Values

Use is.na() with sum() or colSums() to count missing values.

Checking Duplicates

Use duplicated() to flag rows that repeat earlier rows. Both checks are sketched below.
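
Code
# Count missing values overall and per column
sum(is.na(students))
colSums(is.na(students))

# Flag and count duplicate rows
duplicated(students)
sum(duplicated(students))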

4.1.4 Handling missing values and imputations

Replace missing values with mean

Numeric missing values can be imputed with the column mean so that no rows are lost.

Remove missing values

Alternatively, na.omit() drops every row that contains at least one NA. Both options are sketched below.
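
Code
# Impute missing Score values with the column mean (hypothetical column)
students$Score[is.na(students$Score)] <- mean(students$Score, na.rm = TRUE)

# Or drop all rows containing any missing value
clean <- na.omit(students)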

Remove Duplicates

Duplicate rows can be dropped by negating duplicated(), keeping only the first occurrence, as sketched below.
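
Code
# Keep only the first occurrence of each row
clean <- clean[!duplicated(clean), ]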

Changing Data Types

Columns can be converted between types with coercion functions such as as.numeric(), as.character(), and as.factor(), as sketched below.
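
Code
# Convert column types (hypothetical columns)
clean$Age  <- as.numeric(clean$Age)
clean$Name <- as.factor(clean$Name)
str(clean)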

4.1.5 Data Manipulation with dplyr

The dplyr package provides functions for filtering, selecting, modifying, and restructuring data.

Selecting Specific Columns

  • Install the dplyr package
Code
install.packages("dplyr")
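
A short sketch with the hypothetical students data:

Code
library(dplyr)

# Keep only the Name and Score columns
selected <- select(students, Name, Score)
head(selected)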

Removing Columns
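
A minus sign in front of a column name drops it; a minimal sketch:

Code
# Drop the Age column
no_age <- select(students, -Age)
head(no_age)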

Filtering Data
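
A minimal sketch; the threshold is hypothetical:

Code
# Keep only rows that meet a condition
high_scores <- filter(students, Score > 88)
high_scores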

Sorting Data
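
A minimal sketch with arrange():

Code
# Sort rows by Score in descending order
sorted <- arrange(students, desc(Score))
head(sorted)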

4.1.6 Reshaping Data with tidyr

Converting Wide Data to Long Format

The pivot_longer() function gathers several columns into name-value pairs, as sketched below with hypothetical yield data.
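
Code
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(Region = c("North", "South"),
                   `2021` = c(10, 12),
                   `2022` = c(14, 15),
                   check.names = FALSE)

# Wide -> long: one row per Region-Year combination
long <- pivot_longer(wide, cols = c("2021", "2022"),
                     names_to = "Year", values_to = "Yield")
long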

Converting Long Data to Wide Format

The pivot_wider() function spreads name-value pairs back into separate columns, reversing the step above.
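
Code
# Long -> wide: back to one column per year
wide_again <- pivot_wider(long, names_from = Year, values_from = Yield)
wide_again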


4.1.7 Exporting Data from R

Export the data file as a CSV file to a local folder

The write.csv() function in R is used to export a dataset into a Comma-Separated Values (CSV) file, making it easy to save, share, and analyze data in other tools like Excel, Python, or SQL.

write.csv(data, "folder path/filename.csv")

To save the CSV in a specific folder, provide the full path; otherwise the file is written to the current working directory.
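
A minimal sketch, exporting the hypothetical clean data from earlier:

Code
# Export without the automatic row-number column
write.csv(clean, "cleanstudentsdata.csv", row.names = FALSE)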

Export a data file as an Excel file to a local folder

writexl: Writing Data to Excel Files

The writexl package provides an easy way to export data from R into an Excel file without requiring external dependencies.

Key Features:

  • Writes .xlsx files quickly.
  • Preserves column types and formats.
Install writexl package
Code
# Install writexl package
install.packages("writexl")
  • Export the clean data file as an .xlsx file.
Code
library(writexl)
# Write data to an Excel file
write_xlsx(clean, "cleanstudentsdata.xlsx")

To check the list of files present in the directory (or project)

Code
# Check the list of files in the directory
list.files()

4.2 Normalization and Standardization in R

Normalization and standardization are essential techniques in data preprocessing to scale numeric variables, ensuring that they contribute equally to statistical analysis and machine learning models. Since different variables may have varying ranges, applying these transformations helps prevent bias towards larger values and enhances the efficiency of algorithms.

4.2.1 Normalization

Normalization (also called min-max scaling) is a technique that rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. This method ensures that all features have the same scale, preventing variables with large ranges from dominating those with smaller ranges.

Formula for Min-Max Normalization

\[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]

where:

  • \(X'\) is the normalized value.
  • \(X\) is the original value.
  • \(X_{min}\) and \(X_{max}\) are the minimum and maximum values in the dataset.

When to Use Normalization?

  • When data follows a non-Gaussian (skewed) distribution.
  • When features need to be scaled between 0 and 1 (e.g., for machine learning).
  • When working with neural networks and distance-based algorithms (e.g., k-NN, SVM).
Implementation of Normalization in R
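
A minimal sketch, writing the min-max formula as a function and applying it to a hypothetical numeric vector:

Code
# Min-max normalization function
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply to a hypothetical numeric vector
income <- c(25000, 40000, 55000, 90000)
normalize(income)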

🔹 Output: The values are now between 0 and 1, ensuring equal contribution to models.


4.2.2 Standardization

Standardization (also known as z-score normalization) transforms data by centering it around the mean (0) and scaling it based on standard deviation. Unlike min-max normalization, it does not bound values to a fixed range.

Formula for Z-score Standardization

\[ X' = \frac{X - \mu}{\sigma} \]

where:

  • \(X'\) is the standardized value.
  • \(X\) is the original value.
  • \(\mu\) is the mean of the dataset.
  • \(\sigma\) is the standard deviation of the dataset.

When to Use Standardization?

  • When data follows a normal (Gaussian) distribution.
  • When using linear regression, logistic regression, PCA, and clustering.
  • When handling outliers, as it is less sensitive to extreme values than normalization.
Implementation of Standardization in R
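
A minimal sketch using the built-in scale() function on the hypothetical income vector from above:

Code
# Z-score standardization: subtract the mean, divide by the standard deviation
income_std <- scale(income)
income_std

# Verify: mean approximately 0, standard deviation 1
mean(income_std)
sd(income_std)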

🔹 Output: The transformed data now has zero mean (0) and a standard deviation of 1, making it suitable for statistical modeling.


4.2.3 Key Differences Between Normalization and Standardization

| Feature | Normalization | Standardization |
|---|---|---|
| Definition | Rescales values to a fixed range (typically 0 to 1). | Transforms values to have a mean of 0 and a standard deviation of 1. |
| Formula | \(X' = \frac{X - X_{min}}{X_{max} - X_{min}}\) | \(X' = \frac{X - \mu}{\sigma}\) |
| Effect on data | Retains the original distribution shape but compresses values into a fixed range. | Centers the data at 0 and rescales it; the distribution shape is unchanged. |
| Range | Typically [0,1] (min-max scaling). | Centered around 0, without fixed bounds. |
| Best used when | Data does not follow a normal distribution. | Data follows (or approximately follows) a normal distribution. |
| Sensitive to outliers? | Yes - a single extreme value distorts the min and max. | Less so - the mean and standard deviation are less distorted by a single extreme value. |
| Example dataset | Heights of individuals in cm (e.g., 150-200 cm). | Exam scores on different scales (e.g., 0-100 vs. 0-10). |
| Common applications | Neural networks, k-NN, distance-based models (SVM, clustering). | Linear regression, logistic regression, PCA. |

When to Use Normalization vs. Standardization?
  • Use Normalization when:
    • Your data has varying scales and does not follow a normal distribution.
    • You are using distance-based models like KNN, K-Means, Neural Networks.
    • Features have different units (e.g., age in years, salary in dollars).
  • Use Standardization when:
    • Your data follows a Gaussian (Normal) distribution.
    • You are using models that assume normality, like Linear Regression, PCA, Logistic Regression.
    • You need a dataset where outliers have less impact.

4.2.4 Practical Applications of Normalization and Standardization

| Domain | Use Case | Method | Reason for Preference |
|---|---|---|---|
| Machine learning | Neural networks | Normalization | Keeps feature values within the (0,1) range, improving model convergence. |
| Machine learning | PCA | Standardization | PCA is driven by variance; z-score scaling gives features comparable weight. |
| Machine learning | k-NN algorithm | Normalization | k-NN is distance-based; scaling prevents one feature from dominating others. |
| Healthcare | Medical imaging | Normalization | Pixel intensities are scaled to (0,1) for visualization and deep learning models. |
| Finance | Stock prediction | Standardization | Stock prices vary widely; standardization makes comparisons meaningful. |
| E-commerce | Purchase data | Normalization | Scales transaction amounts and frequencies for better customer segmentation. |
| Social media analytics | Sentiment analysis | Standardization | Engagement and word-frequency metrics often follow a roughly normal distribution. |
| Supply chain | Demand forecasting | Standardization | Balances seasonal fluctuations for better forecasting accuracy. |
| Image processing | Object detection | Normalization | Scaled pixel values improve computational efficiency. |
| NLP | Word embeddings | Standardization | Standardized word-frequency features improve embedding quality and training efficiency. |

4.3 Creating and Managing Dummy Variables

Dummy variables are binary (0/1) indicators used in regression models, classification tasks, and machine learning to represent categorical data numerically. Many statistical and machine learning algorithms require numerical inputs, making dummy variables essential for handling categorical variables.

4.3.1 Why Are Dummy Variables Needed?

  • Some models, like linear regression, do not work directly with categorical data.
  • Variables such as Gender (Male/Female) or Customer Type (New/Returning) need to be converted into numbers.
  • Dummy variables help in representing categorical effects in statistical and predictive models.

4.3.2 Creating Dummy Variables in R

Using ifelse() for Binary Categories

If a variable has two categories (e.g., Male/Female), we can create a dummy variable using ifelse().
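
A minimal sketch with a hypothetical Gender column:

Code
# Hypothetical data with a binary categorical variable
data <- data.frame(Gender = c("Male", "Female", "Female", "Male"))

# 1 if Male, 0 otherwise
data$GenderDummy <- ifelse(data$Gender == "Male", 1, 0)
data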

Using model.matrix() for Multiple Categories

For categorical variables with more than two categories, one-hot encoding is used to create multiple dummy variables.
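
A sketch with a hypothetical Department column; the call matches the note below:

Code
# Hypothetical data with a multi-level categorical variable
data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# One-hot encode: one 0/1 column per department, no intercept
dummies <- model.matrix(~ Department - 1, data)
dummies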

Note: The -1 in model.matrix(~ Department - 1, data) removes the intercept, ensuring that each category is represented separately.


4.3.3 Managing Dummy Variables

When working with dummy variables, consider the following:

  • Avoid the Dummy Variable Trap: In regression models, one dummy variable should be dropped to avoid multicollinearity.
  • Check for Redundant Variables: If one category can be predicted from the others, remove it.
  • Scaling Dummy Variables: If using models like KNN or clustering, scaling may be required.

To remove one dummy variable in regression analysis:
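
One common approach, sketched with the hypothetical data from above: keeping the intercept in model.matrix() makes R drop the first factor level automatically, so that category serves as the reference.

Code
data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# With the intercept kept, the first level becomes the reference category
dummies_ref <- model.matrix(~ Department, data)
dummies_ref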


4.3.4 Applications of Dummy Variables

| Application Domain | Example | Use Case |
|---|---|---|
| Regression Analysis | Car Type (Sedan/SUV/Truck) | Convert car types into dummy variables for predicting car prices. |
| Customer Segmentation | Customer Type (New/Returning) | Categorize customers into segments for personalized marketing strategies. |
| Fraud Detection | Payment Method (Credit/Debit) | Detect potential fraud by analyzing payment method patterns. |
| Healthcare Analytics | Disease Presence (Yes/No) | Transform categorical disease labels into binary format for machine learning models. |

  • Dummy variables convert categorical data into numerical format for better processing.

  • Avoid multicollinearity by removing one category in regression models.

  • For large datasets, consider using the fastDummies package for efficient dummy variable creation (see the sketch after this list).

  • Using dummy variables effectively helps improve statistical modeling and machine learning performance, ensuring categorical data can be used in predictive models.
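
A sketch of the fastDummies approach mentioned above; the data is hypothetical:

Code
library(fastDummies)

data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# One dummy column per category; remove_first_dummy avoids the dummy variable trap
dummy_cols(data, select_columns = "Department", remove_first_dummy = TRUE)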


Summary

| Concept | Description |
|---|---|
| Data Manipulations and Transformations | Importing, cleaning, reshaping, and exporting data in R. |
| Importing Data into R | Load data from CSV, Excel, and SQL sources with read.csv(), read_excel(), and the DBI package. |
| Data file creation in R | Create datasets manually with the data.frame() function. |
| Viewing and Exploring Data | Inspect datasets with View(), head(), str(), and summary(); check for missing values and duplicates. |
| Handling missing values and imputations | Impute missing values (e.g., with the column mean) or remove them with na.omit(). |
| Data Manipulation with `dplyr` | The dplyr package provides functions for filtering, selecting, modifying, and restructuring data. |
| Reshaping Data with `tidyr` | Convert between wide and long formats with pivot_longer() and pivot_wider(). |
| Exporting Data from R | Write data out with write.csv() and write_xlsx(). |
| Normalization and Standardization in R | Scaling techniques that let numeric variables contribute equally to analysis. |
| Normalization | Min-max scaling rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. |
| Standardization | Z-score scaling centers data around the mean (0) and scales by the standard deviation. |
| Normalization vs Standardization | Normalization bounds values to a fixed range; standardization centers them without fixed bounds. |
| Applications of Normalization and Standardization | Domain-specific choices, from neural networks (normalization) to PCA (standardization). |
| Creating and Managing Dummy Variables | Binary (0/1) indicators that represent categorical data numerically. |
| Why Are Dummy Variables Needed? | Some models, like linear regression, do not work directly with categorical data. |
| Creating Dummy Variables in R | Use ifelse() for binary categories and model.matrix() for multiple categories. |
| Managing Dummy Variables | Drop one dummy variable in regression models to avoid multicollinearity. |
| Applications of Dummy Variables | Regression, customer segmentation, fraud detection, and healthcare analytics. |