10 Regression Models
Regression models are a fundamental part of supervised learning, used to predict continuous numerical values based on input variables. These models identify relationships between independent variables (features) and a dependent variable (target) to make predictions.
10.1 Regression Models
Regression is widely applied in fields such as finance, economics, healthcare, marketing, and agriculture for forecasting, trend analysis, and decision-making (Khaki & Wang, 2019).
10.1.1 Summary of Regression Models
| Model | Use Case | Advantages | Limitations |
|---|---|---|---|
| Linear Regression | House price prediction, salary estimation | Simple, easy to interpret | Assumes linearity, sensitive to outliers |
| Nonlinear Regression | Population growth, disease spread | Models complex relationships | Harder to interpret, computationally expensive |
| Multiple Regression | Predicting demand based on multiple factors | Captures multiple influences | Overfitting risk, requires careful variable selection |
| Polynomial Regression | Economic cycles, trajectory prediction | Fits curved trends | Overfitting with high-degree polynomials |
| Quantile Regression | Risk modeling, income distribution | Robust to outliers | Computationally intensive |
Regression models form the backbone of predictive analytics, enabling accurate forecasting and decision-making in various domains, including business, healthcare, finance, and agriculture.
10.2 Linear Regression
Linear regression is the simplest and most commonly used regression model. It establishes a linear relationship between independent variables (X) and a dependent variable (Y) using a straight-line equation:
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
where:
- β₀ = Intercept (constant term)
- β₁ = Slope (coefficient)
- X = Independent variable
- ε = Error term (residual)
Example Applications
- Predicting house prices based on square footage and location.
- Forecasting sales revenue using advertising spend.
- Estimating employee salaries based on experience and education.
Advantages
- Easy to interpret and implement.
- Works well for data with a linear relationship.
- Computationally efficient.
Limitations
- Assumes a linear relationship, which may not always be the case.
- Sensitive to outliers.
10.2.1 Example: Linear Regression
Problem Statement
A real estate company wants to predict house prices based on square footage. The company collected a sample of 30 houses with their respective sizes (in square feet) and prices (in $1000s).
Sample Dataset
Below is the dataset containing 30 observations.
| House Size (sq.ft) | Price ($1000s) |
|---|---|
| 1500 | 300 |
| 1800 | 340 |
| 2100 | 400 |
| 2500 | 450 |
| 1300 | 260 |
| 1700 | 320 |
| 2200 | 420 |
| 2700 | 480 |
| 1600 | 310 |
| 1400 | 280 |
| 1900 | 360 |
| 2300 | 430 |
| 2800 | 490 |
| 2900 | 510 |
| 2000 | 370 |
| 2400 | 440 |
| 3000 | 520 |
| 2600 | 460 |
| 3100 | 530 |
| 3200 | 550 |
| 3300 | 570 |
| 3400 | 590 |
| 3500 | 610 |
| 3600 | 630 |
| 3700 | 650 |
| 3800 | 670 |
| 3900 | 690 |
| 4000 | 710 |
| 4100 | 730 |
| 4200 | 750 |
10.2.2 Performing Linear Regression in R
The linear regression model is built using the lm() function in R. The goal is to fit the model:
\[ \text{Price} = \beta_0 + \beta_1 \times \text{Size} \]
R Code Implementation
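A minimal sketch follows, assuming the 30 observations from Section 10.2.1 are entered into a data frame (the names `houses`, `Size`, and `Price` are illustrative):

```r
# Enter the 30-house sample from Section 10.2.1
houses <- data.frame(
  Size  = c(1500, 1800, 2100, 2500, 1300, 1700, 2200, 2700, 1600, 1400,
            1900, 2300, 2800, 2900, 2000, 2400, 3000, 2600, 3100, 3200,
            3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200),
  Price = c(300, 340, 400, 450, 260, 320, 420, 480, 310, 280,
            360, 430, 490, 510, 370, 440, 520, 460, 530, 550,
            570, 590, 610, 630, 650, 670, 690, 710, 730, 750)
)

# Fit Price = b0 + b1 * Size by ordinary least squares
model <- lm(Price ~ Size, data = houses)

# Coefficients, R-squared, and p-values
summary(model)

# Predicted price (in $1000s) for a 2500 sq.ft house
predict(model, newdata = data.frame(Size = 2500))
```

The summary() output reports the intercept and slope discussed in the next subsection, along with their standard errors and significance tests.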
10.2.3 Interpretation of Results
R Output Explanation
- Intercept (\(\beta_0\) = 46.6585): Represents the baseline price when the house size is zero. Although a house with zero square footage is unrealistic, this value is necessary for the linear equation.
- Slope (\(\beta_1\) = 0.16267): Indicates that for each additional 1 sq.ft, the house price increases by $162.67.
The linear regression equation based on the output is:
\[ \text{Price} = 46.6585 + 0.16267 \times \text{Size} \]
Example Calculation
For a 2500 sq.ft house:
\[ \text{Price} = 46.6585 + (0.16267 \times 2500) \]
\[ = 46.6585 + 406.675 \]
\[ = 453.33 \]
Since the prices are in $1000s, the predicted price for a 2500 sq.ft house is $453,330.
Note:
- The intercept (\(\beta_0\) = 46.6585) is a mathematical reference point but may not have a direct real-world meaning.
- The slope (\(\beta_1\) = 0.16267) tells us that for every extra 1000 sq.ft, the price increases by approximately $162,670.
- This model allows us to estimate house prices based on size, assuming all other factors remain constant.
- You can use this equation to predict house prices for any given size.
10.2.4 Performing Linear Regression in Excel
Steps to Perform Regression in Excel
- Enter the Data:
  - Open Excel and enter the Size in column A and Price in column B.
- Use the Data Analysis Tool:
  - Go to Data → Data Analysis → Regression.
  - Select the Input Y Range (Price column).
  - Select the Input X Range (Size column).
  - Click OK.
- Interpret the Results:
  - Intercept (\(\beta_0\)): Represents the base price when size is 0.
  - Slope (\(\beta_1\)): Represents the increase in price per square foot.
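Alternatively, the same estimates can be read off directly with Excel's built-in worksheet functions (the cell ranges assume the 30 observations occupy rows 2 through 31):

```text
Slope (β1):      =SLOPE(B2:B31, A2:A31)
Intercept (β0):  =INTERCEPT(B2:B31, A2:A31)
R-squared:       =RSQ(B2:B31, A2:A31)
```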
10.2.5 Performing Linear Regression in SPSS
Load Data from Excel into SPSS
- Open SPSS.
- Click on File → Open → Data.
- Select Excel (.xls, .xlsx) as the file type and browse to your Excel file.
- Ensure the “Read variable names from the first row of data” option is checked.
- Click Open to import the data.
Run the Regression Analysis
- Click on Analyze → Regression → Linear.
- In the Linear Regression dialog box:
- Move Price to the Dependent variable box.
- Move Size to the Independent(s) variable box.
- Click OK to run the regression.
Interpret the Results
- Intercept (\(\beta_0\)): Represents the base price when the house size is zero.
- Slope (\(\beta_1\)): Represents the additional price per square foot.
- The R-squared value indicates how well the model explains the variation in house prices.
- The p-value for Size determines whether the relationship between size and price is statistically significant.
10.3 Multiple Regression
10.3.1 Overview
Multiple regression extends linear regression by incorporating multiple independent variables to predict a single dependent variable. The equation is:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \varepsilon \]
where:
- X₁, X₂, …, Xₙ are the independent variables.
- β₀, β₁, …, βₙ are the coefficients.
Example Applications
- Predicting a car’s fuel efficiency based on weight, horsepower, and engine size.
- Estimating student performance based on study hours, attendance, and parental education.
- Forecasting demand for a product using advertising spend, economic indicators, and competitor pricing.
Advantages
- Captures multiple factors affecting the target variable.
- Provides a more comprehensive predictive model.
Limitations
- Higher complexity increases the risk of overfitting.
- Requires careful selection of variables to avoid multicollinearity.
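The first application above (fuel efficiency from weight, horsepower, and engine size) can be sketched in R with the built-in mtcars dataset, which records exactly these variables as mpg, wt (weight in 1000 lbs), hp, and disp:

```r
# Fuel efficiency as a function of weight, horsepower, and engine displacement
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Coefficients: heavier, more powerful cars get fewer miles per gallon
summary(fit)
```

Checking the variance inflation factors of the predictors (e.g. with car::vif()) is a common follow-up, since weight, horsepower, and displacement are correlated and multicollinearity is a stated limitation of this model.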
10.4 Polynomial Regression
10.4.1 Overview
Polynomial regression extends linear regression by adding higher-degree polynomial terms to model curved relationships.
The equation is:
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_n X^n + \varepsilon \]
where:
- X², X³, …, Xⁿ are polynomial terms capturing the curvature in data.
Example Applications
- Predicting the trajectory of a projectile in physics.
- Modeling economic cycles where trends fluctuate over time.
- Fitting complex growth curves in biology and medicine.
Advantages
- Provides better accuracy than linear regression for non-linear data.
- Captures curved trends that linear models miss.
Limitations
- Prone to overfitting if the polynomial degree is too high.
- More complex than simple linear regression.
Example: Polynomial Regression
A real estate company wants to predict house prices based on square footage. The company believes that price does not increase linearly with size, but follows a non-linear pattern. To capture this relationship, they decide to use Polynomial Regression.
Sample Dataset
Below is a dataset containing 30 observations.
| House Size (sq.ft) | Price ($1000s) |
|---|---|
| 1500 | 300 |
| 1800 | 340 |
| 2100 | 400 |
| 2500 | 450 |
| 1300 | 260 |
| 1700 | 320 |
| 2200 | 420 |
| 2700 | 480 |
| 1600 | 310 |
| 1400 | 280 |
| 1900 | 360 |
| 2300 | 430 |
| 2800 | 490 |
| 2900 | 510 |
| 2000 | 370 |
| 2400 | 440 |
| 3000 | 520 |
| 2600 | 460 |
| 3100 | 530 |
| 3200 | 550 |
| 3300 | 570 |
| 3400 | 590 |
| 3500 | 610 |
| 3600 | 630 |
| 3700 | 650 |
| 3800 | 670 |
| 3900 | 690 |
| 4000 | 710 |
| 4100 | 730 |
| 4200 | 750 |
10.4.2 Performing Polynomial Regression in R
Polynomial regression is useful when linear models do not capture the pattern in the data.
R Code Implementation
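A minimal sketch follows, re-entering the same 30-house sample as a data frame (names are illustrative); I(Size^2) adds the quadratic term inside the model formula:

```r
# The 30-house sample from the table above
houses <- data.frame(
  Size  = c(1500, 1800, 2100, 2500, 1300, 1700, 2200, 2700, 1600, 1400,
            1900, 2300, 2800, 2900, 2000, 2400, 3000, 2600, 3100, 3200,
            3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200),
  Price = c(300, 340, 400, 450, 260, 320, 420, 480, 310, 280,
            360, 430, 490, 510, 370, 440, 520, 460, 530, 550,
            570, 590, 610, 630, 650, 670, 690, 710, 730, 750)
)

# Fit Price = b0 + b1*Size + b2*Size^2; I() protects the power in the formula
model_poly <- lm(Price ~ Size + I(Size^2), data = houses)

# Coefficients for the intercept, linear, and quadratic terms
summary(model_poly)

# Predicted price (in $1000s) for a 2500 sq.ft house
predict(model_poly, newdata = data.frame(Size = 2500))
```

An equivalent formula is lm(Price ~ poly(Size, 2, raw = TRUE), data = houses), which generalizes easily to higher degrees.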
10.4.3 Interpretation of Results
R Output Explanation
The polynomial regression model captures the curved relationship between house size and price. The inclusion of a quadratic term (\(\text{Size}^2\)) allows the model to account for non-linearity in the data.
Regression Equation
The fitted polynomial regression equation from the R output is:
\[ \text{Price} = 114.46 + 0.10794 \times \text{Size} + 9.952 \times 10^{-6} \times \text{Size}^2 \]
Interpretation of Coefficients
- Intercept (\(\beta_0 = 114.46\)): Represents the estimated base price when the house size is zero. While a size of zero is unrealistic, this value serves as a mathematical reference point.
- Linear Term (\(\beta_1 = 0.10794\)): On its own, this term adds about $107.94 per additional square foot; the quadratic term then adjusts this rate as Size grows.
- Quadratic Term (\(\beta_2 = 9.952 \times 10^{-6}\)): The small positive second-degree term adds gentle upward curvature, so the price per square foot rises slightly for larger houses. Because the curvature is mild, the fit stays close to the straight-line model of Section 10.2.
Example Calculation for a 2500 sq.ft House
Using the regression equation:
\[ \text{Price} = 114.46 + (0.10794 \times 2500) + (9.952 \times 10^{-6} \times 2500^2) \]
\[ = 114.46 + 269.84 + 62.20 \]
\[ = 446.50 \]
Thus, the predicted house price for a 2500 sq.ft home is approximately $446,500.
10.4.4 Performing Polynomial Regression in Excel
Steps to Perform Regression in Excel
- Enter the Data:
  - Open Excel and enter the Size in column A and Price in column B.
- Create an Additional Column for the Polynomial Term:
  - In column C, compute Size² using =A2^2 and fill the formula down.
- Use the Data Analysis Tool:
  - Go to Data → Data Analysis → Regression.
  - Select the Input Y Range (Price column).
  - Select the Input X Range (Size and Size² columns; the Regression tool requires the X variables to be in adjacent columns, so rearrange the columns if necessary).
  - Click OK.
- Interpret the Results:
  - Intercept (\(\beta_0\)): Represents the base price when size is 0.
  - Size Coefficient (\(\beta_1\)): The linear term.
  - Size² Coefficient (\(\beta_2\)): Captures the non-linearity.
10.4.5 Performing Polynomial Regression in SPSS
Load Data from Excel into SPSS
- Open SPSS.
- Click on File → Open → Data.
- Select Excel (.xls, .xlsx) as the file type and browse to your Excel file.
- Ensure the “Read variable names from the first row of data” option is checked.
- Click Open to import the data.
Run the Regression Analysis
- Click on Analyze → Regression → Linear.
- In the Linear Regression dialog box:
- Move Price to the Dependent variable box.
- Move Size and Size² to the Independent(s) variable box (create the Size² variable first via Transform → Compute Variable, e.g. SizeSq = Size * Size).
- Click OK to run the regression.
Interpret the Results
- Intercept (\(\beta_0\)): Represents the base price.
- Size (\(\beta_1\)): Linear effect.
- Size² (\(\beta_2\)): Captures non-linearity in house prices.
10.5 Nonlinear Regression
10.5.1 Overview
Nonlinear regression is used when the relationship between X and Y is not linear. It models complex patterns using curved functions such as exponential, logarithmic, and power functions.
Common nonlinear regression equations:
- Exponential: Y = a * e^(bX)
- Logarithmic: Y = a + b * log(X)
- Power: Y = a * X^b
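The exponential form can be fitted in R with nls(), which estimates a and b by nonlinear least squares. A minimal sketch on synthetic data (the true values a = 2 and b = 0.8 are chosen purely for illustration):

```r
set.seed(42)

# Synthetic data from Y = a * e^(bX) with a = 2, b = 0.8, plus noise
x <- seq(0, 5, length.out = 50)
y <- 2 * exp(0.8 * x) + rnorm(50, sd = 2)

# Nonlinear least squares; nls() requires starting values for the parameters
fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.5))

coef(fit)  # estimates should land close to a = 2 and b = 0.8
```

Unlike lm(), nls() iterates from the starting values, so a poor choice of start can prevent convergence; plotting the data first usually suggests reasonable guesses.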
Example Applications
- Modeling population growth using an exponential function.
- Predicting disease spread in epidemiology.
- Analyzing chemical reaction rates in physics and chemistry.
Advantages
- Can model more complex relationships than linear regression.
- Provides better accuracy when data is not linearly distributed.
Limitations
- More complex to interpret and implement.
- Requires more computational power.
10.6 Quantile Regression
10.6.1 Overview
Quantile regression estimates conditional quantiles of the dependent variable instead of just predicting the mean (as in linear regression). This makes it more robust to outliers and useful for heterogeneous distributions.
Instead of:
Y = β₀ + β₁X + ε
Quantile regression models different quantiles (e.g., 25th, 50th, 75th percentiles) by minimizing an asymmetric loss function.
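The asymmetric (pinball) loss can be minimized directly with base R's optim(); the sketch below, on synthetic heteroscedastic data (all names and numbers are illustrative), fits the median and 90th-percentile lines:

```r
# Pinball (check) loss for quantile tau: weights positive and
# negative residuals asymmetrically
pinball <- function(beta, x, y, tau) {
  r <- y - (beta[1] + beta[2] * x)
  sum(ifelse(r >= 0, tau * r, (tau - 1) * r))
}

set.seed(1)
x <- runif(200, 0, 10)
y <- 50 + 5 * x + rnorm(200, sd = 1 + x)  # noise spread grows with x

start <- coef(lm(y ~ x))  # least-squares fit as a starting point

# Median (tau = 0.5) and 90th-percentile (tau = 0.9) regression lines
fit50 <- optim(start, pinball, x = x, y = y, tau = 0.5)
fit90 <- optim(start, pinball, x = x, y = y, tau = 0.9)

fit50$par  # intercept and slope of the median line
fit90$par  # the tau = 0.9 line should sit above the median line
```

In practice the quantreg package's rq() function performs this fit with proper standard errors; the optim() version is shown only to make the loss function concrete.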
Example Applications
- House price estimation for different market segments (luxury, mid-range, affordable).
- Income distribution modeling across various economic groups.
- Risk assessment in finance by predicting high-loss scenarios.
Advantages
- More robust to outliers than standard linear regression.
- Suitable for modeling skewed and heterogeneous data.
Limitations
- Computationally more complex.
- Harder to interpret compared to standard linear regression.
Summary
Regression models predict continuous numerical values from input variables. Linear regression fits a straight-line relationship between a single predictor and the target; multiple regression extends this to several predictors; polynomial regression adds higher-degree terms to capture curved trends; nonlinear regression models exponential, logarithmic, and power relationships; and quantile regression estimates conditional quantiles rather than the mean, making it robust to outliers and skewed data. The table in Section 10.1.1 compares the use cases, advantages, and limitations of each model.