How to Perform Linear Regression in R

Linear regression is a foundational technique in statistical modeling and machine learning. It predicts a continuous response variable from one or more predictor variables. In this detailed guide, we’ll explore how to fit, check, and interpret linear regression models in R.

1. Introduction to Linear Regression

Linear regression assumes a linear relationship between the dependent variable and the independent variables, and uses the values of the independent variables to predict the dependent variable. The model can be expressed as:

  Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ

where:
  • Y is the response or dependent variable.
  • X1, X2, …, Xp are the predictor or independent variables.
  • β0 is the intercept, and β1, β2, …, βp are the coefficients.
  • ϵ represents the error term.

2. Setting Up the Environment in R

Before diving into the analysis, ensure you have R and RStudio (optional, but recommended) installed.

Step 1: Installing Necessary Packages
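
Everything up to the diagnostics section runs on base R. The only add-on function this guide calls is qqPlot(), which lives in the car package, so a minimal sketch of this step, assuming car is the only extra package you need, could be:

```r
# Assumption: car (which provides qqPlot()) is the only add-on package required.
# Install it once per machine, and only if it is not already present.
if (!requireNamespace("car", quietly = TRUE)) {
  install.packages("car")
}
```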


Step 2: Loading Packages
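
Packages must be loaded in every new R session before their functions can be called by name. A minimal sketch, again assuming car is the only add-on package in play:

```r
# Assumption: car has already been installed on this machine.
library(car)

# Confirm that the function used later in this guide is now available.
print(exists("qqPlot"))
```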


3. Basic Linear Regression in R

For illustration, let’s use R’s built-in dataset, mtcars.

Step 1: Understanding the Data

View the first few rows using:
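
One way to do this in base R is head(), which prints the first six rows by default. The dim() and str() calls are optional extras added here for orientation:

```r
# First six rows of the built-in mtcars data frame.
print(head(mtcars))

# Optional: number of rows and columns (32 cars, 11 variables).
print(dim(mtcars))

# Optional: types of the columns used in this guide.
str(mtcars[, c("mpg", "wt", "hp")])
```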


For this example, we’ll predict mpg (miles per gallon) based on the car’s weight (wt).

Step 2: Fitting the Model

model <- lm(mpg ~ wt, data = mtcars)

Calling summary(model) on the fitted object prints detailed information about the model, including the coefficient estimates with their standard errors and p-values, the R-squared values, and the overall F-test.
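
As a short sketch, the fit and its summary can be run together, and individual numbers can be pulled out of the summary object instead of being read off the printout. The variable name model_summary is introduced here for illustration:

```r
# Fit mpg as a linear function of weight.
model <- lm(mpg ~ wt, data = mtcars)

# Full report: coefficients, standard errors, t-tests, R-squared, F-test.
model_summary <- summary(model)
print(model_summary)

# Individual pieces, extracted programmatically.
print(coef(model))               # intercept about 37.29, slope for wt about -5.34
print(model_summary$r.squared)   # about 0.75
```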

4. Interpreting the Results

  • Coefficients: Each coefficient is the expected change in the response for a one-unit increase in that predictor, holding any other predictors constant. In our example, wt is measured in units of 1,000 lbs, so the coefficient for wt (about -5.34) means that mpg is expected to drop by roughly 5.3 for every additional 1,000 lbs of weight.
  • R-squared: This value is the proportion of the variance in the dependent variable that is explained by the independent variables. Higher values denote a closer fit, but R-squared never decreases when predictors are added, so don’t rely on it alone; adjusted R-squared is the fairer figure when comparing models of different sizes.
  • p-value: Each coefficient’s p-value tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (conventionally below 0.05) is evidence that the predictor has a statistically significant relationship with the response variable.
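
The three quantities above can be extracted from the fitted model directly. The following is a minimal sketch; confint() is an addition not mentioned in the text, included because a confidence interval conveys the uncertainty around a coefficient more fully than a p-value alone:

```r
model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(model)

# Coefficient table: estimate, standard error, t value, p-value for each term.
coef_table <- coef(model_summary)
slope   <- coef_table["wt", "Estimate"]
slope_p <- coef_table["wt", "Pr(>|t|)"]

# 95% confidence intervals for the intercept and slope.
ci <- confint(model, level = 0.95)

# R-squared and adjusted R-squared.
r2     <- model_summary$r.squared
adj_r2 <- model_summary$adj.r.squared

print(coef_table)
print(ci)
print(c(r.squared = r2, adj.r.squared = adj_r2))
```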

5. Multiple Linear Regression

If you want to predict mpg based on multiple predictors, say wt and hp (horsepower), the process remains largely the same:

model_multi <- lm(mpg ~ wt + hp, data = mtcars)
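
To judge whether the extra predictor earns its place, one reasonable approach (a sketch, not the only option) is to compare adjusted R-squared and to run an F-test between the nested models with anova():

```r
model       <- lm(mpg ~ wt, data = mtcars)
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

print(summary(model_multi))

# Adjusted R-squared penalizes additional predictors.
adj_simple <- summary(model)$adj.r.squared
adj_multi  <- summary(model_multi)$adj.r.squared
print(c(simple = adj_simple, multiple = adj_multi))

# F-test: does adding hp significantly reduce the residual sum of squares?
comparison <- anova(model, model_multi)
print(comparison)
```

Each coefficient in model_multi is now interpreted with the other predictor held constant, which is why the wt coefficient differs from the one in the single-predictor model.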

6. Checking Assumptions

Linear regression relies on several assumptions. Violating these can impact the reliability of your results.

1. Linearity: The relationship between predictors and response should be linear. Check this using scatterplots or residual plots:

plot(mtcars$wt, mtcars$mpg)
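
The scatterplot can be extended with the fitted line, and the residual plot mentioned above is available through the plot() method for lm objects. A short sketch using base graphics only:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Scatterplot with the fitted regression line overlaid.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(model, col = "red")

# Residuals vs fitted values: a curved pattern suggests non-linearity.
plot(model, which = 1)
```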

2. Independence: The residuals (errors) should be independent. Time series data might violate this.
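
For data that do have a natural order, lag-1 autocorrelation in the residuals can be summarized with the Durbin-Watson statistic, computed by hand below. The rows of mtcars are not ordered in time, so this sketch only illustrates the mechanics:

```r
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)

# Durbin-Watson statistic: ranges from 0 to 4, with values near 2
# indicating little lag-1 autocorrelation in the residuals.
dw <- sum(diff(res)^2) / sum(res^2)
print(dw)

# Autocorrelation of the residuals in row order.
acf(res, main = "ACF of residuals")
```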

3. Homoscedasticity: The variance of residuals should be constant. Residual plots can also help diagnose this.
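
A scale-location plot is one common residual plot for this purpose. The formal test at the end is an optional addition that assumes the car package is installed:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Scale-location plot: sqrt(|standardized residuals|) against fitted values.
# A roughly flat trend suggests constant variance.
plot(model, which = 3)

# The quantity on the y-axis, computed explicitly.
spread <- sqrt(abs(rstandard(model)))

# Optional: score test for non-constant variance from the car package.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::ncvTest(model))
}
```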

4. Normality: Residuals should be approximately normally distributed. You can check this with a Q-Q plot, for example using qqPlot() from the car package:

qqPlot(model, main="Q-Q Plot")
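
If the car package is not available, base R offers an equivalent plot. The Shapiro-Wilk test below is an optional addition, and with small samples such as mtcars it has limited power, so treat it as a complement to the plot:

```r
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)

# Base R Q-Q plot of the residuals with a reference line.
qqnorm(res, main = "Q-Q Plot (base R)")
qqline(res, col = "red")

# Shapiro-Wilk test: a small p-value suggests departure from normality.
sw <- shapiro.test(res)
print(sw)
```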

7. Diagnostics and Model Improvement

If assumptions are violated, consider:

  • Transforming Variables: For instance, taking the log or square root.
  • Adding Interaction Terms: If the effect of one variable depends on another.
  • Using Polynomial Regression: If the relationship is curvilinear.
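
Each of these options is a small change to the model formula. The sketch below applies them to mtcars; the model names and the choice of a log response and a degree-2 polynomial are illustrative assumptions, not recommendations for this dataset:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Transformation: model log(mpg) instead of mpg.
model_log <- lm(log(mpg) ~ wt, data = mtcars)

# Interaction: wt * hp expands to wt + hp + wt:hp.
model_inter <- lm(mpg ~ wt * hp, data = mtcars)

# Polynomial: quadratic in wt (orthogonal polynomial terms).
model_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)

# Compare adjusted R-squared for the models that share the same response.
# model_log is left out because its response is on a different scale.
adj <- sapply(list(linear = model, interaction = model_inter,
                   quadratic = model_poly),
              function(m) summary(m)$adj.r.squared)
print(adj)
```

After changing the model, rerun the diagnostic plots to confirm that the violation has actually been reduced.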

8. Predicting New Data

With your model in place, predict new data using:

new_data <- data.frame(wt = c(2.5, 3.5))
predictions <- predict(model, new_data)
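
A point prediction says nothing about its own uncertainty. As a sketch, predict() can also return prediction intervals through its interval argument; the name pred_int is introduced here for illustration:

```r
model <- lm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = c(2.5, 3.5))

# Point predictions: about 23.9 and 18.6 mpg.
predictions <- predict(model, new_data)
print(predictions)

# 95% prediction intervals for individual new cars (columns: fit, lwr, upr).
pred_int <- predict(model, new_data, interval = "prediction", level = 0.95)
print(pred_int)
```

The column names in new_data must match the predictor names used in the model formula, here wt.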

9. Conclusion

Linear regression is a powerful tool in the hands of data analysts and scientists, and R provides comprehensive support for fitting, diagnosing, and interpreting linear models. Always check the model assumptions, consider the real-world implications of your models, and approach the results with a critical mind.
