4  Data Preparation and Transformation in R

Data preparation and transformation are crucial steps in data analysis, ensuring that raw data is cleaned, structured, and formatted for further analysis (Wickham, 2014). In R, several functions and packages help perform these tasks efficiently. This chapter covers essential data preparation techniques, including data import, cleaning, transformation, and manipulation.

4.1 Data Manipulations and Transformations

Data Handling in R

Data Handling in R refers to the importing, managing, transforming, and cleaning of data before analysis. R provides powerful functions and packages such as readr, dplyr, tidyr, and data.table to efficiently handle large datasets.

4.1.1 Importing Data into R

Importing data into R is a crucial step in data analysis. R offers various functions for loading data from different file types.

  • R supports importing data from various sources, including **Excel, CSV, databases (SQL), JSON, and APIs**.

Import a CSV file from a website

  • Paste the CSV file's URL inside the read.csv() function, as sketched below.
  • Sample CSV file website link
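
A minimal sketch of this workflow; the URL below is hypothetical, so substitute the sample link above:

Code
# Read a CSV file directly from a website
url <- "https://example.com/sample-data.csv"  # hypothetical link
webdata <- read.csv(url)

# Display the first few rows
head(webdata)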

Import a CSV file from a local folder

Sample CSV data file - Download the CSV dataset from this link.

Copy the file path (or, if the working directory is set, just the file name) of the dataset and paste it inside read.csv("file name").

  • To copy the path of a file, right-click on the file and select Copy path, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
# Read CSV file
Indiaagriculture = read.csv("Agri production india.csv")

# Display the first few rows
head(Indiaagriculture)

Import an Excel file from a local folder

readxl: To read Excel files in R, you need to install the readxl package.

  • The readxl package allows users to import Excel spreadsheets (.xls and .xlsx) into R for further analysis.
  • Supports both .xls (Excel 97-2003) and .xlsx (Excel 2007+) formats.
  • Reads specific sheets, ranges, and named regions.
Install readxl package
Code
install.packages("readxl")

Sample data file

Download the excel dataset from this link.

Copy the file name (or full file path) of the dataset and paste it inside the function read_excel("file_name").

  • To copy the path of a file, right-click on the file and select Copy path, or use the keyboard shortcut:
  • Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
library(readxl)
# Read the first sheet of the Excel workbook
Europeanagriculture = read_excel("Europeanagriculture.xlsx", sheet = 1)
head(Europeanagriculture)

Import data from an SQL database

  • First, download the SQL database file from this link.
  • Save the SQL file in your directory.
  • Install the required packages shown below:

  • install.packages("DBI") - database interface package
  • install.packages("RSQLite") - SQLite database driver (for MySQL, use RMariaDB)
  • install.packages("dplyr") - for data manipulation

Code
# Install the required packages for SQL
install.packages("DBI")
install.packages("RSQLite")
install.packages("dplyr")
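
With the packages installed, a minimal sketch of the import workflow follows; the database file name (agridata.sqlite) and table name (crops) are hypothetical placeholders:

Code
library(DBI)
library(RSQLite)

# Connect to the SQLite database file (hypothetical name)
con <- dbConnect(RSQLite::SQLite(), "agridata.sqlite")

# List the tables available in the database
dbListTables(con)

# Pull a table into R with an SQL query (hypothetical table name)
crops <- dbGetQuery(con, "SELECT * FROM crops")
head(crops)

# Close the connection when finished
dbDisconnect(con)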

4.1.2 Data file creation in R

Use the data.frame() function to create a dataset manually in R, as sketched below.
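
A minimal sketch; the column names and values are hypothetical (note the missing value and the duplicate row, which the next sections make use of):

Code
# Create a data frame manually
students <- data.frame(
  Name  = c("Asha", "Ben", "Chen", "Asha"),
  Age   = c(21, 23, 22, 21),
  Score = c(88, NA, 91, 88)
)
students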


4.1.3 Viewing and Exploring Data

Explore the Data

Checking Data Structure
  • View() - To view the entire table in a window
  • head() - View the first few rows
  • str() - Check structure of dataset
  • summary() - Summary statistics
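
A quick sketch using the hypothetical students data frame created above:

Code
head(students)     # first few rows
str(students)      # structure: column names, types, sample values
summary(students)  # summary statistics for each column
# View(students)   # opens the full table in a viewer window
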
Checking Missing Values

Use is.na() with sum() or colSums() to count missing values.

Checking Duplicates

Use duplicated() to flag rows that repeat earlier rows. Both checks are sketched below.
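
Code
# Count missing values overall and per column
sum(is.na(students))
colSums(is.na(students))

# Flag and count duplicate rows
duplicated(students)
sum(duplicated(students))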

4.1.4 Handling missing values and imputations

Replace missing values with mean

Numeric missing values can be imputed with the column mean so that no rows are lost.

Remove missing values

Alternatively, na.omit() drops every row that contains at least one NA. Both options are sketched below.
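
Code
# Impute missing Score values with the column mean (hypothetical column)
students$Score[is.na(students$Score)] <- mean(students$Score, na.rm = TRUE)

# Or drop all rows containing any missing value
clean <- na.omit(students)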

Remove Duplicates

Duplicate rows can be dropped by negating duplicated(), keeping only the first occurrence, as sketched below.
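
Code
# Keep only the first occurrence of each row
clean <- clean[!duplicated(clean), ]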

Changing Data Types

Columns can be converted between types with coercion functions such as as.numeric(), as.character(), and as.factor(), as sketched below.
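
Code
# Convert column types (hypothetical columns)
clean$Age  <- as.numeric(clean$Age)
clean$Name <- as.factor(clean$Name)
str(clean)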

4.1.5 Data Manipulation with dplyr

The dplyr package provides functions for filtering, selecting, modifying, and restructuring data.

Selecting Specific Columns

  • Install the dplyr package
Code
install.packages("dplyr")
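
A short sketch with the hypothetical students data:

Code
library(dplyr)

# Keep only the Name and Score columns
selected <- select(students, Name, Score)
head(selected)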

Removing Columns
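
A minus sign in front of a column name drops it; a minimal sketch:

Code
# Drop the Age column
no_age <- select(students, -Age)
head(no_age)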

Filtering Data
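
A minimal sketch; the threshold is hypothetical:

Code
# Keep only rows that meet a condition
high_scores <- filter(students, Score > 88)
high_scores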

Sorting Data
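
A minimal sketch with arrange():

Code
# Sort rows by Score in descending order
sorted <- arrange(students, desc(Score))
head(sorted)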

4.1.6 Reshaping Data with tidyr

Converting Wide Data to Long Format

The pivot_longer() function gathers several columns into name-value pairs, as sketched below with hypothetical yield data.
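
Code
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(Region = c("North", "South"),
                   `2021` = c(10, 12),
                   `2022` = c(14, 15),
                   check.names = FALSE)

# Wide -> long: one row per Region-Year combination
long <- pivot_longer(wide, cols = c("2021", "2022"),
                     names_to = "Year", values_to = "Yield")
long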

Converting Long Data to Wide Format

The pivot_wider() function spreads name-value pairs back into separate columns, reversing the step above.
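
Code
# Long -> wide: back to one column per year
wide_again <- pivot_wider(long, names_from = Year, values_from = Yield)
wide_again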


4.1.7 Exporting Data from R

Export the data file as a CSV file to a local folder

The write.csv() function in R is used to export a dataset into a Comma-Separated Values (CSV) file, making it easy to save, share, and analyze data in other tools like Excel, Python, or SQL.

write.csv(data, "folder path/filename.csv")

To save the CSV in a specific folder, provide the full path; otherwise the file is written to the current working directory.
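
A minimal sketch, exporting the hypothetical clean data from earlier:

Code
# Export without the automatic row-number column
write.csv(clean, "cleanstudentsdata.csv", row.names = FALSE)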

Export a data file as an Excel file to a local folder

writexl: Writing Data to Excel Files

The writexl package provides an easy way to export data from R into an Excel file without requiring external dependencies.

Key Features:

  • Writes .xlsx files quickly.
  • Preserves column types and formats.
Install writexl package
Code
# Install writexl package
install.packages("writexl")
  • Export the clean data file as an .xlsx file.
Code
library(writexl)
# Write data to an Excel file
write_xlsx(clean, "cleanstudentsdata.xlsx")

To check the list of files present in the directory (or project)

Code
# Check the list of files in the directory
list.files()

4.2 Normalization and Standardization in R

Normalization and standardization are essential techniques in data preprocessing to scale numeric variables, ensuring that they contribute equally to statistical analysis and machine learning models. Since different variables may have varying ranges, applying these transformations helps prevent bias towards larger values and enhances the efficiency of algorithms.

4.2.1 Normalization

Normalization (also called min-max scaling) is a technique that rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. This method ensures that all features have the same scale, preventing variables with large ranges from dominating those with smaller ranges.

Formula for Min-Max Normalization

\[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]

where:

  • \(X'\) is the normalized value.
  • \(X\) is the original value.
  • \(X_{min}\) and \(X_{max}\) are the minimum and maximum values in the dataset.

When to Use Normalization?

  • When data follows a non-Gaussian (skewed) distribution.
  • When features need to be scaled between 0 and 1 (e.g., for machine learning).
  • When working with neural networks and distance-based algorithms (e.g., k-NN, SVM).
Implementation of Normalization in R
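
A minimal sketch, writing the min-max formula as a function and applying it to a hypothetical numeric vector:

Code
# Min-max normalization function
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply to a hypothetical numeric vector
income <- c(25000, 40000, 55000, 90000)
normalize(income)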

🔹 Output: The values are now between 0 and 1, ensuring equal contribution to models.


4.2.2 Standardization

Standardization (also known as z-score normalization) transforms data by centering it around the mean (0) and scaling it based on standard deviation. Unlike min-max normalization, it does not bound values to a fixed range.

Formula for Z-score Standardization

\[ X' = \frac{X - \mu}{\sigma} \]

where:

  • \(X'\) is the standardized value.
  • \(X\) is the original value.
  • \(\mu\) is the mean of the dataset.
  • \(\sigma\) is the standard deviation of the dataset.

When to Use Standardization?

  • When data follows a normal (Gaussian) distribution.
  • When using linear regression, logistic regression, PCA, and clustering.
  • When handling outliers, as it is less sensitive to extreme values than normalization.
Implementation of Standardization in R
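
A minimal sketch using the built-in scale() function on the hypothetical income vector from above:

Code
# Z-score standardization: subtract the mean, divide by the standard deviation
income_std <- scale(income)
income_std

# Verify: mean approximately 0, standard deviation 1
mean(income_std)
sd(income_std)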

🔹 Output: The transformed data now has zero mean (0) and a standard deviation of 1, making it suitable for statistical modeling.


4.2.3 Key Differences Between Normalization and Standardization

| Feature | Normalization | Standardization |
|---|---|---|
| Definition | Rescales values to a fixed range (typically 0 to 1). | Transforms values to have a mean of 0 and a standard deviation of 1. |
| Formula | \(X' = \frac{X - X_{min}}{X_{max} - X_{min}}\) | \(X' = \frac{X - \mu}{\sigma}\) |
| Effect on data | Retains the original distribution shape but compresses values into a fixed range. | Centers the data at 0 and rescales it; the distribution shape is unchanged. |
| Range | Typically [0,1] (min-max scaling). | Centered around 0, without fixed bounds. |
| Best used when | Data does not follow a normal distribution. | Data follows (or approximately follows) a normal distribution. |
| Sensitive to outliers? | Yes - a single extreme value distorts the min and max. | Less so - the mean and standard deviation are less distorted by a single extreme value. |
| Example dataset | Heights of individuals in cm (e.g., 150-200 cm). | Exam scores on different scales (e.g., 0-100 vs. 0-10). |
| Common applications | Neural networks, k-NN, distance-based models (SVM, clustering). | Linear regression, logistic regression, PCA. |

When to Use Normalization vs. Standardization?
  • Use Normalization when:
    • Your data has varying scales and does not follow a normal distribution.
    • You are using distance-based models like KNN, K-Means, Neural Networks.
    • Features have different units (e.g., age in years, salary in dollars).
  • Use Standardization when:
    • Your data follows a Gaussian (Normal) distribution.
    • You are using models that assume normality, like Linear Regression, PCA, Logistic Regression.
    • You need a dataset where outliers have less impact.

4.2.4 Practical Applications of Normalization and Standardization

| Domain | Use Case | Method | Reason for Preference |
|---|---|---|---|
| Machine learning | Neural networks | Normalization | Keeps feature values within the (0,1) range, improving model convergence. |
| Machine learning | PCA | Standardization | PCA is driven by variance; z-score scaling gives features comparable weight. |
| Machine learning | k-NN algorithm | Normalization | k-NN is distance-based; scaling prevents one feature from dominating others. |
| Healthcare | Medical imaging | Normalization | Pixel intensities are scaled to (0,1) for visualization and deep learning models. |
| Finance | Stock prediction | Standardization | Stock prices vary widely; standardization makes comparisons meaningful. |
| E-commerce | Purchase data | Normalization | Scales transaction amounts and frequencies for better customer segmentation. |
| Social media analytics | Sentiment analysis | Standardization | Engagement and word-frequency metrics often follow a roughly normal distribution. |
| Supply chain | Demand forecasting | Standardization | Balances seasonal fluctuations for better forecasting accuracy. |
| Image processing | Object detection | Normalization | Scaled pixel values improve computational efficiency. |
| NLP | Word embeddings | Standardization | Standardized word-frequency features improve embedding quality and training efficiency. |

4.3 Creating and Managing Dummy Variables

Dummy variables are binary (0/1) indicators used in regression models, classification tasks, and machine learning to represent categorical data numerically. Many statistical and machine learning algorithms require numerical inputs, making dummy variables essential for handling categorical variables.

4.3.1 Why Are Dummy Variables Needed?

  • Some models, like linear regression, do not work directly with categorical data.
  • Variables such as Gender (Male/Female) or Customer Type (New/Returning) need to be converted into numbers.
  • Dummy variables help in representing categorical effects in statistical and predictive models.

4.3.2 Creating Dummy Variables in R

Using ifelse() for Binary Categories

If a variable has two categories (e.g., Male/Female), we can create a dummy variable using ifelse().
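
A minimal sketch with a hypothetical Gender column:

Code
# Hypothetical data with a binary categorical variable
data <- data.frame(Gender = c("Male", "Female", "Female", "Male"))

# 1 if Male, 0 otherwise
data$GenderDummy <- ifelse(data$Gender == "Male", 1, 0)
data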

Using model.matrix() for Multiple Categories

For categorical variables with more than two categories, one-hot encoding is used to create multiple dummy variables.
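
A sketch with a hypothetical Department column; the call matches the note below:

Code
# Hypothetical data with a multi-level categorical variable
data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# One-hot encode: one 0/1 column per department, no intercept
dummies <- model.matrix(~ Department - 1, data)
dummies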

Note: The -1 in model.matrix(~ Department - 1, data) removes the intercept, ensuring that each category is represented separately.


4.3.3 Managing Dummy Variables

When working with dummy variables, consider the following:

  • Avoid the Dummy Variable Trap: In regression models, one dummy variable should be dropped to avoid multicollinearity.
  • Check for Redundant Variables: If one category can be predicted from the others, remove it.
  • Scaling Dummy Variables: If using models like KNN or clustering, scaling may be required.

To remove one dummy variable in regression analysis:
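
One common approach, sketched with the hypothetical data from above: keeping the intercept in model.matrix() makes R drop the first factor level automatically, so that category serves as the reference.

Code
data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# With the intercept kept, the first level becomes the reference category
dummies_ref <- model.matrix(~ Department, data)
dummies_ref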


4.3.4 Applications of Dummy Variables

| Application Domain | Example | Use Case |
|---|---|---|
| Regression Analysis | Car Type (Sedan/SUV/Truck) | Convert car types into dummy variables for predicting car prices. |
| Customer Segmentation | Customer Type (New/Returning) | Categorize customers into segments for personalized marketing strategies. |
| Fraud Detection | Payment Method (Credit/Debit) | Detect potential fraud by analyzing payment method patterns. |
| Healthcare Analytics | Disease Presence (Yes/No) | Transform categorical disease labels into binary format for machine learning models. |

  • Dummy variables convert categorical data into numerical format for better processing.

  • Avoid multicollinearity by removing one category in regression models.

  • For large datasets, consider using the fastDummies package for efficient dummy variable creation (see the sketch after this list).

  • Using dummy variables effectively helps improve statistical modeling and machine learning performance, ensuring categorical data can be used in predictive models.
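
A sketch of the fastDummies approach mentioned above; the data is hypothetical:

Code
library(fastDummies)

data <- data.frame(Department = c("HR", "IT", "Sales", "IT"))

# One dummy column per category; remove_first_dummy avoids the dummy variable trap
dummy_cols(data, select_columns = "Department", remove_first_dummy = TRUE)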


Summary

| Concept | Description |
|---|---|
| Data Manipulations and Transformations | Importing, cleaning, reshaping, and exporting data in R. |
| Importing Data into R | Load data from CSV, Excel, and SQL sources with read.csv(), read_excel(), and the DBI package. |
| Data file creation in R | Create datasets manually with the data.frame() function. |
| Viewing and Exploring Data | Inspect datasets with View(), head(), str(), and summary(); check for missing values and duplicates. |
| Handling missing values and imputations | Impute missing values (e.g., with the column mean) or remove them with na.omit(). |
| Data Manipulation with `dplyr` | The dplyr package provides functions for filtering, selecting, modifying, and restructuring data. |
| Reshaping Data with `tidyr` | Convert between wide and long formats with pivot_longer() and pivot_wider(). |
| Exporting Data from R | Write data out with write.csv() and write_xlsx(). |
| Normalization and Standardization in R | Scaling techniques that let numeric variables contribute equally to analysis. |
| Normalization | Min-max scaling rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. |
| Standardization | Z-score scaling centers data around the mean (0) and scales by the standard deviation. |
| Normalization vs Standardization | Normalization bounds values to a fixed range; standardization centers them without fixed bounds. |
| Applications of Normalization and Standardization | Domain-specific choices, from neural networks (normalization) to PCA (standardization). |
| Creating and Managing Dummy Variables | Binary (0/1) indicators that represent categorical data numerically. |
| Why Are Dummy Variables Needed? | Some models, like linear regression, do not work directly with categorical data. |
| Creating Dummy Variables in R | Use ifelse() for binary categories and model.matrix() for multiple categories. |
| Managing Dummy Variables | Drop one dummy variable in regression models to avoid multicollinearity. |
| Applications of Dummy Variables | Regression, customer segmentation, fraud detection, and healthcare analytics. |