4 Data Preparation and Transformation in R
Data preparation and transformation are crucial steps in data analysis, ensuring that raw data is cleaned, structured, and formatted for further analysis (Wickham, 2014). In R, several functions and packages help perform these tasks efficiently. This chapter covers essential data preparation techniques, including data import, cleaning, transformation, and manipulation.
4.1 Data manipulations and transformations
Data Handling in R
Data Handling in R refers to the importing, managing, transforming, and cleaning of data before analysis. R provides powerful functions and packages such as readr, dplyr, tidyr, and data.table to efficiently handle large datasets.
4.1.1 Importing Data into R
Importing data into R is a crucial step in data analysis. R offers various functions for loading data from different file types.
- R supports importing data from various sources, including **Excel, CSV, databases (SQL), JSON, and APIs**.
Import a CSV file from a website
- Paste the CSV file's web link inside the read.csv() function.
- Sample csv file website link
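A minimal sketch of reading a CSV file straight from a web address (the URL below is a placeholder; substitute the actual sample link):

```r
# Read a CSV file directly from a URL (hypothetical address)
mydata <- read.csv("https://example.com/sample.csv")

# Preview the first few rows
head(mydata)
```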
Import a CSV file from a local folder
Sample csv data file - Download the csv dataset from this link.
Copy the file path (or, if the working directory is set, just the file name) of the dataset and paste it inside read.csv("file name").
- To copy the path of a file, right-click on the file and select copy path name, or use the keyboard shortcut:
  - Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
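A minimal sketch, assuming the downloaded file is named sampledata.csv (hypothetical name and path):

```r
# If the file is in the working directory, the file name alone is enough
mydata <- read.csv("sampledata.csv")

# Otherwise, supply the full path (use forward slashes, even on Windows)
mydata <- read.csv("C:/Users/yourname/Downloads/sampledata.csv")

head(mydata)
```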
Import an Excel file from a local folder
readxl: To read Excel files in R, you need to install the readxl package.
- The readxl package allows users to import Excel spreadsheets (.xls and .xlsx) into R for further analysis.
- Supports both .xls (Excel 97-2003) and .xlsx (Excel 2007+) formats.
- Reads specific sheets, ranges, and named regions.
Install readxl package
Code
install.packages("readxl")

Sample data file
Download the excel dataset from this link.
Copy the file name (or full file path) of the dataset and paste it inside read_excel("file_name").
- To copy the path of a file, right-click on the file and select copy path name, or use the keyboard shortcut:
  - Windows: Ctrl+Shift+C; Mac: Cmd+Opt+C
Code
library(readxl)
Europeanagriculture = read_excel("Europeanagriculture.xlsx", sheet = 1)
head(Europeanagriculture)

Import Data from SQL database
- First, download the SQL database file from this link.
- Save the SQL file in your directory.
- Install the required packages shown below:
  - install.packages("DBI") - database interface package
  - install.packages("RSQLite") - SQLite database driver (for MySQL, use RMariaDB)
  - install.packages("dplyr") - for data manipulation
Code
Install the required packages for sql
install.packages("DBI")
install.packages("RSQLite")
install.packages("dplyr")

4.1.2 Data file creation in R
Use the data.frame() function to create a dataset directly in R.
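A small sketch of building a data frame from scratch (column names and values are illustrative):

```r
# Create a data frame with three columns
students <- data.frame(
  Name  = c("Asha", "Ben", "Carla"),
  Age   = c(21, 23, 22),
  Score = c(88, 75, 91)
)

students
```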
4.1.3 Viewing and Exploring Data
Explore the Data
Checking Data Structure
- View() - view the entire table in a window
- head() - view the first few rows
- str() - check the structure of the dataset
- summary() - summary statistics
Checking Missing Values
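Base R can flag and count missing values; a sketch, assuming a hypothetical data frame df:

```r
# TRUE/FALSE matrix marking missing entries
is.na(df)

# Number of missing values per column
colSums(is.na(df))

# Total number of missing values in the dataset
sum(is.na(df))
```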
Checking Duplicates
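Duplicated rows can be detected with duplicated(); a sketch on the same hypothetical df:

```r
# Logical vector flagging repeated rows
duplicated(df)

# Show the duplicated rows themselves
df[duplicated(df), ]

# Count the duplicates
sum(duplicated(df))
```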
4.1.4 Handling missing values and imputations
Replace missing values with mean
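A sketch of mean imputation, assuming df has a numeric column named Score (hypothetical):

```r
# Replace NAs in the Score column with the column mean
df$Score[is.na(df$Score)] <- mean(df$Score, na.rm = TRUE)
```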
Remove missing values
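Rows containing missing values can be dropped; df and Score are hypothetical names:

```r
# Drop every row that contains at least one NA
clean <- na.omit(df)

# Or drop only rows with NA in one specific column
clean <- df[!is.na(df$Score), ]
```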
Remove Duplicates
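A sketch of removing repeated rows from a hypothetical df:

```r
# Keep only the first occurrence of each row
clean <- df[!duplicated(df), ]

# unique() does the same for whole rows
clean <- unique(df)
```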
Changing Data Types
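Column types can be converted with the as.*() family; the column names below are hypothetical:

```r
# Convert between common types
df$Age    <- as.numeric(df$Age)     # character/factor to numeric
df$Gender <- as.factor(df$Gender)   # character to factor
df$ID     <- as.character(df$ID)    # numeric to character

# Verify the new structure
str(df)
```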
4.1.5 Data Manipulation with dplyr
The dplyr package provides functions for filtering, selecting, modifying, and restructuring data.
Selecting Specific Columns
- Install dplyr package
Code
install.packages("dplyr")

Removing Columns
Filtering Data
Sorting Data
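The operations under the four headings above can be sketched together, assuming a hypothetical data frame df with Name, Age, and Score columns:

```r
library(dplyr)

# Selecting specific columns
df %>% select(Name, Score)

# Removing a column (the minus sign drops it)
df %>% select(-Age)

# Filtering rows by a condition
df %>% filter(Score > 80)

# Sorting rows (desc() for descending order)
df %>% arrange(desc(Score))
```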
4.1.6 Reshaping Data with tidyr
Converting Wide Data to Long Format
Converting long data to wide format
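Both conversions can be sketched with tidyr's pivot functions on a small made-up dataset:

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(Country = c("A", "B"),
                   Y2020 = c(10, 20),
                   Y2021 = c(15, 25))

# Wide to long: gather the year columns into name/value pairs
long <- pivot_longer(wide, cols = starts_with("Y"),
                     names_to = "Year", values_to = "Value")

# Long back to wide
pivot_wider(long, names_from = "Year", values_from = "Value")
```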
4.1.7 Exporting Data from R
Export the data file as a csv file to local folder
The write.csv() function in R is used to export a dataset into a Comma-Separated Values (CSV) file, making it easy to save, share, and analyze data in other tools like Excel, Python, or SQL.
write.csv(data, "folder path/filename.csv")
To save the CSV in a specific folder, provide the full path.
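A minimal sketch, assuming a data frame named data; row.names = FALSE stops R from writing row numbers as an extra column:

```r
# Export to the working directory
write.csv(data, "mydata.csv", row.names = FALSE)

# Export to a specific folder (hypothetical path)
write.csv(data, "C:/Users/yourname/Documents/mydata.csv", row.names = FALSE)
```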
Export a data file as an Excel file to a local folder
writexl: Writing Data to Excel Files
The writexl package provides an easy way to export data from R into an Excel file without requiring external dependencies.
Key Features:
- Writes .xlsx files quickly.
- Preserves column types and formats.
Install writexl package
- Export the clean data file as an .xlsx file.
Code
library(writexl)
# Write data to an Excel file
write_xlsx(clean, "cleanstudentsdata.xlsx")

To check the list of files present in the directory (or project)
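The built-in list.files() function does this:

```r
# List files in the current working directory
list.files()

# Restrict the listing to a pattern, e.g. Excel files
list.files(pattern = "\\.xlsx$")
```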
4.2 Normalization and Standardization in R
Normalization and standardization are essential techniques in data preprocessing to scale numeric variables, ensuring that they contribute equally to statistical analysis and machine learning models. Since different variables may have varying ranges, applying these transformations helps prevent bias towards larger values and enhances the efficiency of algorithms.
4.2.1 Normalization
Normalization (also called min-max scaling) is a technique that rescales numeric variables into a fixed range, usually [0,1] or [-1,1]. This method ensures that all features have the same scale, preventing variables with large ranges from dominating those with smaller ranges.
Formula for Min-Max Normalization
\[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]
where:
- \(X'\) is the normalized value.
- \(X\) is the original value.
- \(X_{min}\) and \(X_{max}\) are the minimum and maximum values in the dataset.
When to Use Normalization?
- When data follows a non-Gaussian (skewed) distribution.
- When features need to be scaled between 0 and 1 (e.g., for machine learning).
- When working with neural networks and distance-based algorithms (e.g., k-NN, SVM).
Implementation of Normalization in R
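A minimal sketch: the min-max formula above written as a custom function and applied to a made-up numeric vector:

```r
# Min-max normalization function
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

scores <- c(50, 60, 70, 80, 100)
normalize(scores)
# 0.0 0.2 0.4 0.6 1.0 - smallest value maps to 0, largest to 1
```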
🔹 Output: The values are now between 0 and 1, ensuring equal contribution to models.
4.2.2 Standardization
Standardization (also known as z-score normalization) transforms data by centering it around the mean (0) and scaling it based on standard deviation. Unlike min-max normalization, it does not bound values to a fixed range.
Formula for Z-score Standardization
\[ X' = \frac{X - \mu}{\sigma} \]
where:
- \(X'\) is the standardized value.
- \(X\) is the original value.
- \(\mu\) is the mean of the dataset.
- \(\sigma\) is the standard deviation of the dataset.
When to Use Standardization?
- When data follows a normal (Gaussian) distribution.
- When using linear regression, logistic regression, PCA, and clustering.
- When handling outliers, as it is less sensitive to extreme values than normalization.
Implementation of Standardization in R
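A minimal sketch using base R's scale() function, which applies z-score standardization by default:

```r
scores <- c(50, 60, 70, 80, 100)
z <- scale(scores)   # subtracts the mean, divides by the standard deviation

round(mean(z), 10)   # approximately 0
sd(as.numeric(z))    # 1
```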
🔹 Output: The transformed data now has zero mean (0) and a standard deviation of 1, making it suitable for statistical modeling.
4.2.3 Key Differences Between Normalization and Standardization
| Feature | Normalization | Standardization |
|---|---|---|
| Definition | Rescales values to a fixed range (typically 0 to 1). | Transforms values to have a mean of 0 and a standard deviation of 1. |
| Formula | \[ X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}} \] | \[ X' = \frac{X - \mu}{\sigma} \] |
| Effect on Data | Retains original distribution but compresses values into a fixed range. | Adjusts the data to a standard normal distribution (bell-shaped). |
| Range | Typically [0,1] (Min-Max Scaling). | Centered around 0, without fixed bounds. |
| Best Used When | Data does not follow a normal distribution. | Data follows (or approximately follows) a normal distribution. |
| Sensitive to Outliers? | Yes – a single extreme value can distort the entire scale. | Less so – extreme values still shift the mean and standard deviation, but the effect is not amplified by a fixed bound. |
| Example Dataset | Heights of individuals in cm (e.g., 150 cm – 200 cm). | Exam scores that vary in scale (e.g., 0-100 vs. 0-10). |
| Common Applications | Neural Networks, KNN, Distance-based models (SVM, Clustering). | Linear Regression, Logistic Regression, PCA. |
When to Use Normalization vs. Standardization?
- Use Normalization when:
  - Your data has varying scales and does not follow a normal distribution.
  - You are using distance-based models like KNN, K-Means, or Neural Networks.
  - Features have different units (e.g., age in years, salary in dollars).
- Use Standardization when:
  - Your data follows a Gaussian (Normal) distribution.
  - You are using models that assume normality, such as Linear Regression, PCA, or Logistic Regression.
  - You need a dataset where outliers have less impact.
4.2.4 Practical Applications of Normalization and Standardization
| Domain | Use Case | Method | Reason for Preference |
|---|---|---|---|
| Machine Learning | Neural Networks | Normalization | Keeps feature values within (0,1) range, improving model convergence. |
| Machine Learning | PCA | Standardization | PCA assumes normally distributed data; z-score scaling ensures proper feature weights. |
| Machine Learning | k-NN Algorithm | Normalization | KNN is distance-based; scaling prevents one feature from dominating others. |
| Healthcare | Medical Imaging | Normalization | Pixel intensities are scaled (0 to 1) for better visualization and deep learning models. |
| Finance | Stock Prediction | Standardization | Stock prices vary widely; standardization makes comparisons meaningful. |
| E-commerce | Purchases Data | Normalization | Standardizes transaction amounts and frequency for better customer segmentation. |
| Social Media Analytics | Sentiment Analysis | Standardization | Engagement and word frequency metrics often follow a normal distribution. |
| Supply Chain | Demand Forecast | Standardization | Ensures seasonal fluctuations are balanced for forecasting accuracy. |
| Image Processing | Object Detection | Normalization | Image pixel values are scaled, improving computational efficiency. |
| NLP | Word Embeddings | Standardization | Normalizing word frequency enhances embedding quality and training efficiency. |
4.3 Creating and Managing Dummy Variables
Dummy variables are binary (0/1) indicators used in regression models, classification tasks, and machine learning to represent categorical data numerically. Many statistical and machine learning algorithms require numerical inputs, making dummy variables essential for handling categorical variables.
4.3.1 Why Are Dummy Variables Needed?
- Some models, like linear regression, do not work directly with categorical data.
- Variables such as Gender (Male/Female) or Customer Type (New/Returning) need to be converted into numbers.
- Dummy variables help in representing categorical effects in statistical and predictive models.
4.3.2 Creating Dummy Variables in R
Using ifelse() for Binary Categories
If a variable has two categories (e.g., Male/Female), we can create a dummy variable using ifelse().
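A sketch on a small made-up dataset:

```r
# Hypothetical data frame with a Gender column
data <- data.frame(Gender = c("Male", "Female", "Female", "Male"))

# 1 for Male, 0 for Female
data$Gender_Male <- ifelse(data$Gender == "Male", 1, 0)
data
```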
Using model.matrix() for Multiple Categories
For categorical variables with more than two categories, one-hot encoding is used to create multiple dummy variables.
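A sketch using a hypothetical three-level Department variable:

```r
# Hypothetical data frame with a three-level categorical variable
data <- data.frame(Department = c("HR", "IT", "Finance", "IT"))

# One-hot encoding: one 0/1 column per category
dummies <- model.matrix(~ Department - 1, data)
dummies
```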
Note: The -1 in model.matrix(~ Department - 1, data) removes the intercept, ensuring that each category is represented separately.
4.3.3 Managing Dummy Variables
When working with dummy variables, consider the following:
- Avoid the Dummy Variable Trap: In regression models, one dummy variable should be dropped to avoid multicollinearity.
- Check for Redundant Variables: If one category can be predicted from the others, remove it.
- Scaling Dummy Variables: If using models like KNN or clustering, scaling may be required.
To remove one dummy variable in regression analysis:
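One way to sketch this: keeping the intercept in model.matrix() (no -1) makes R drop the first factor level as the baseline, which is also what lm() does automatically with factor predictors. The data below is made up:

```r
data <- data.frame(Department = c("HR", "IT", "Finance", "IT"),
                   Salary = c(50, 60, 55, 65))

# With the intercept kept, the first factor level (alphabetically,
# Finance) becomes the baseline and gets no dummy column
model.matrix(~ Department, data)

# lm() applies the same treatment coding, avoiding the dummy variable trap
lm(Salary ~ Department, data = data)
```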
4.3.4 Applications of Dummy Variables
| Application Domain | Example | Use Case |
|---|---|---|
| Regression Analysis | Car Type (Sedan/SUV/Truck) | Convert car types into dummy variables for predicting car prices. |
| Customer Segmentation | Customer Type (New/Returning) | Categorize customers into segments for personalized marketing strategies. |
| Fraud Detection | Payment Method (Credit/Debit) | Detect potential fraud by analyzing payment method patterns. |
| Healthcare Analytics | Disease Presence (Yes/No) | Transform categorical disease labels into binary format for machine learning models. |
- Dummy variables convert categorical data into numerical format for better processing.
- Avoid multicollinearity by removing one category in regression models.
- For large datasets, consider using the fastDummies package for efficient dummy variable creation.

Using dummy variables effectively helps improve statistical modeling and machine learning performance, ensuring categorical data can be used in predictive models.
Summary
| Concept | Description |
|---|---|
| Data manipulations and transformations | |
| Importing Data into R | Importing data into R is a crucial step in data analysis |
| Data file creation in R | use data.frame() function |
| Viewing and Exploring Data | Key concept under Data manipulations and transformations |
| Handling missing values and imputations | Key concept under Data manipulations and transformations |
| Data Manipulation with `dplyr` | The dplyr package provides functions for filtering, selecting, modifying, and restructuring data |
| Reshaping Data with `tidyr` | Key concept under Data manipulations and transformations |
| Exporting Data from R | Key concept under Data manipulations and transformations |
| Normalization and Standardization in R | |
| Normalization | Normalization (also called min-max scaling) is a technique that rescales numeric variables into a fixed range, usually [0,1] or [-1,1] |
| Standardization | Standardization (also known as z-score normalization) transforms data by centering it around the mean (0) and scaling it based on standard deviation |
| Normalization vs Standardization | Normalization rescales values to a fixed range; standardization centers values on the mean with unit standard deviation |
| Applications of Norm and Stand | Use cases across machine learning, healthcare, finance, and other domains |
| Creating and Managing Dummy Variables | |
| Why Are Dummy Variables Needed? | Some models, like linear regression, do not work directly with categorical data |
| Creating Dummy Variables in R | Key concept under Creating and Managing Dummy Variables |
| Managing Dummy Variables | Avoid the dummy variable trap: in regression models, one dummy variable should be dropped to avoid multicollinearity |
| Applications of Dummy Variables | Examples in regression analysis, customer segmentation, fraud detection, and healthcare analytics |