Linear regression is a foundational technique in statistical modeling and machine learning. It predicts a continuous response variable based on one or multiple predictor variables. In this detailed guide, we’ll explore how to perform and interpret linear regression in R.
1. Introduction to Linear Regression
Linear regression models the linear relationship between a dependent variable and one or more independent variables: it predicts the outcome of the dependent variable from the values of the independent variables. The model can be expressed as:

Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
Where:
- Y is the response or dependent variable.
- X1,X2,…Xp are the predictor or independent variables.
- β0 is the intercept, and β1,β2,…βp are coefficients.
- ϵ represents the error term.
2. Setting Up the Environment in R
Before diving into the analysis, ensure you have R and RStudio (optional, but recommended) installed.
Step 1: Installing Necessary Packages
install.packages("ggplot2")
install.packages("car")
Step 2: Loading Packages
library(ggplot2)
library(car)
3. Basic Linear Regression in R
For illustration, let’s use R’s built-in dataset, mtcars.
Step 1: Understanding the Data
View the first few rows using:
head(mtcars)
For this example, we’ll predict mpg (miles per gallon) based on the car’s weight (wt).
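Since ggplot2 is already loaded, a quick scatterplot with a fitted line is a good way to eyeball the relationship before modeling (a minimal sketch):
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # add the least-squares line
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")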
Step 2: Fitting the Model
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
The summary() function provides detailed information about the model, including coefficients, R-squared values, and significance levels.
4. Interpreting the Results
- Coefficients: These values indicate the change in the response variable for a one-unit change in the predictor. For instance, in our example, the coefficient for wt tells us how much mpg changes for every one-unit increase in weight.
- R-squared: This value represents the proportion of the variance in the dependent variable that’s predictable from the independent variables. Higher R-squared values denote a better fit, but it’s crucial not to rely on it alone.
- p-value: This tests the null hypothesis that a coefficient is equal to zero (no effect). A low p-value indicates that the predictor has a statistically significant relationship with the response variable.
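If you want these quantities programmatically rather than reading them off the printed summary, they can be extracted from the fitted model (a base-R sketch):
model_summary <- summary(model)
model_summary$coefficients  # estimates, standard errors, t-values, p-values
model_summary$r.squared     # proportion of variance explained
confint(model)              # 95% confidence intervals for the coefficients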
5. Multiple Linear Regression
If you want to predict mpg based on multiple predictors, say wt and hp (horsepower), the process remains largely the same:
model_multi <- lm(mpg ~ wt + hp, data = mtcars)
summary(model_multi)
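Since the two models are nested, one way to judge whether hp earns its place is to compare them directly (a sketch using base R):
anova(model, model_multi)  # F-test for the improvement from adding hp
AIC(model, model_multi)    # lower AIC suggests a better fit/complexity trade-off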
6. Checking Assumptions
Linear regression relies on several assumptions. Violating these can impact the reliability of your results.
1. Linearity: The relationship between predictors and response should be linear. Check this using scatterplots or residual plots:
plot(mtcars$wt, mtcars$mpg)
residualPlots(model)

2. Independence: The residuals (errors) should be independent. Time series data might violate this.
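The car package loaded earlier provides the Durbin-Watson test as one formal check for autocorrelated residuals (a sketch; most meaningful when observations have a natural order):
durbinWatsonTest(model)  # statistics near 2 suggest uncorrelated residuals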
3. Homoscedasticity: The variance of residuals should be constant. Residual plots can also help diagnose this.
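Again via the car package, a score test for non-constant variance offers a formal complement to the visual check (a sketch):
ncvTest(model)  # a small p-value suggests heteroscedasticity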
4. Normality: Residuals should be approximately normally distributed. You can use a Q-Q plot to check:
qqPlot(model, main="Q-Q Plot")
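As a formal complement to the Q-Q plot, a Shapiro-Wilk test can be run on the residuals (a sketch; note it has limited power with small samples):
shapiro.test(residuals(model))  # a small p-value suggests non-normal residuals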

7. Diagnostics and Model Improvement
If assumptions are violated, consider the following remedies (sketched in code after the list):
- Transforming Variables: For instance, taking the log or square root.
- Adding Interaction Terms: If the effect of one variable depends on another.
- Using Polynomial Regression: If the relationship is curvilinear.
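Each remedy is a small change to the model formula; illustrative sketches, not tuned for mtcars specifically:
model_log  <- lm(mpg ~ log(wt), data = mtcars)       # log-transformed predictor
model_int  <- lm(mpg ~ wt * hp, data = mtcars)       # wt, hp, and their interaction
model_poly <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # quadratic term in wt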
8. Predicting New Data
With your model in place, predict new data using:
new_data <- data.frame(wt = c(2.5, 3.5))
predictions <- predict(model, new_data)
print(predictions)
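By default, predict() returns point predictions; asking for an interval quantifies the uncertainty around each new observation (a sketch):
predict(model, new_data, interval = "prediction", level = 0.95)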
9. Conclusion
Linear regression is a powerful tool in the hands of data analysts and scientists. R provides comprehensive support for fitting, diagnosing, and interpreting linear models. Remember always to check model assumptions, consider the real-world implications of your models, and approach the results with a critical mind.