# How to Create a Scatterplot with a Regression Line in R

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. In this article, we will focus on the basics of creating a scatterplot with a regression line in R, using both base R and the popular ggplot2 package.

## Introduction to Scatterplots and Regression Lines

A scatterplot uses Cartesian coordinates to display the values of (typically) two variables for a set of data, one variable on each axis. Each point in the plot corresponds to one observation, that is, one row in the data set. Scatterplots are used to visualize the relationship between two quantitative variables, and they are especially useful for spotting trends in data.

The regression line, also known as the line of best fit, shows the predicted values of the dependent variable (Y) as a function of the independent variable (X). When added to a scatterplot, the regression line helps us judge whether the two variables are linearly related. Its slope gives the expected change in Y for a one-unit increase in X. The strength of the relationship is a separate matter: it depends on how tightly the points cluster around the line, not on how steep the line is.

## Using Built-in Data in R

In this guide, we will use the built-in mtcars data set in R, which contains various car attributes for 32 models. You can preview the first six rows of the data using the head function:

    head(mtcars)
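
Before plotting, it can also help to confirm the size of the data set and look at the two columns used in this guide. The following is a minimal sketch using only base R functions; the variable name cols is just an illustrative choice.

```r
# mtcars ships with base R (datasets package), so nothing needs to be installed
dim(mtcars)                        # number of rows (cars) and columns (variables)

# Keep only the two columns plotted in this guide
cols <- mtcars[, c("hp", "mpg")]

summary(cols)                      # min, quartiles, mean, and max for each column
anyNA(cols)                        # TRUE would mean missing values need handling first
```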

## Creating Scatterplots and Regression Lines in Base R

Let’s create a scatterplot for the mtcars data set, showing mpg (miles per gallon) as a function of hp (horsepower), and then add a regression line.

### Scatterplot

To create a scatterplot in base R, we use the plot function:

    plot(mpg ~ hp, data = mtcars, main = "Scatterplot of MPG vs HP",
         xlab = "Horsepower", ylab = "Miles Per Gallon")

Here, xlab and ylab are used to provide labels for the x-axis and y-axis, respectively, while main is used to provide a title for the plot.

### Regression Line

Next, we create a linear model using the lm function and add a regression line to the scatterplot using the abline function:


    # Fit a linear model of mpg as a function of hp
    model <- lm(mpg ~ hp, data = mtcars)
    plot(mpg ~ hp, data = mtcars, main = "Scatterplot of MPG vs HP",
         xlab = "Horsepower", ylab = "Miles Per Gallon")
    # Add the fitted line to the existing plot
    abline(model, col = "red")

In the lm function, the tilde character (~) signifies “as a function of”. The abline function then adds a line to the plot based on the coefficients in the linear model. The line is colored red for visibility.
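
The line drawn by abline comes from two numbers stored in the fitted model: the intercept and the slope. As a rough sketch, they can be inspected and used for prediction as shown here. This assumes the model is fitted with the data argument, which is what lets predict accept new values. The name new_cars and its horsepower values are hypothetical.

```r
# Fit mpg as a linear function of hp
model <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope of the fitted line
coefs <- coef(model)
print(coefs)

# Predicted mpg for hypothetical cars with 100 and 200 horsepower
new_cars <- data.frame(hp = c(100, 200))
preds <- predict(model, newdata = new_cars)
print(preds)

# The predictions are just intercept + slope * hp
manual <- coefs[["(Intercept)"]] + coefs[["hp"]] * new_cars$hp
all.equal(unname(preds), manual)
```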

## Creating Scatterplots and Regression Lines with ggplot2

While base R is sufficient for creating scatterplots and regression lines, the ggplot2 package allows for more flexibility and customization. To use ggplot2, you first need to install and load it into your R environment:

    # Install once; this line can be skipped if ggplot2 is already installed
    install.packages("ggplot2")
    library(ggplot2)

### Scatterplot

The syntax of ggplot2 involves initializing a ggplot object and adding layers to it. To create a scatterplot:

    ggplot(mtcars, aes(x = hp, y = mpg)) +
      geom_point() +
      labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of MPG vs HP")

Here, aes is used to assign the hp variable to the x-axis and mpg to the y-axis. geom_point then creates the scatterplot. labs is used to provide labels for the x-axis, y-axis, and the plot title.

### Regression Line

To add a regression line, we use the geom_smooth function with the method argument set to "lm":

    ggplot(mtcars, aes(x = hp, y = mpg)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE, color = "red") +
      labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of MPG vs HP with Regression Line")

In geom_smooth, method = "lm" indicates that a linear model should be fitted, se = FALSE removes the shaded confidence interval around the line, and color = "red" makes the line red for visibility.
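
If the uncertainty around the fitted line is of interest, the confidence band can be left on instead. The following is a sketch that assumes ggplot2 is already installed. Here formula = y ~ x is spelled out only to avoid the informational message that recent ggplot2 versions print, and the object name p is arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # se = TRUE draws a shaded band; level sets its confidence level (0.95 is the default)
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95, color = "red") +
  labs(x = "Horsepower", y = "Miles Per Gallon",
       title = "MPG vs HP with 95% Confidence Band")

print(p)  # assigning to p does not draw anything; print() renders the plot
```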

## Interpreting the Plot

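With the scatterplot and regression line complete, you can start interpreting the plot. The scatterplot gives you a general idea of the relationship between the variables, and the regression line summarizes its direction: here the line slopes downward, so cars with more horsepower tend to get fewer miles per gallon. The slope tells you how much Y is expected to change per unit of X, but its steepness depends on the units of measurement and does not measure the strength of the relationship. Strength is reflected in how closely the points cluster around the line, which the correlation coefficient or R-squared summarizes. Finally, keep in mind that correlation does not imply causation; additional analysis would be necessary to determine any causal relationships.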
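
To put numbers on what the plot shows, the slope and the correlation can be computed directly. The sketch below also refits the model with horsepower rescaled to units of 100 hp, to show how the slope responds to a change of units while the correlation stays the same. The names cars100 and hp100 are illustrative assumptions, not part of mtcars.

```r
model <- lm(mpg ~ hp, data = mtcars)

slope <- coef(model)[["hp"]]             # change in mpg per additional 1 hp
r     <- cor(mtcars$hp, mtcars$mpg)      # Pearson correlation, between -1 and 1
r2    <- summary(model)$r.squared        # share of variance in mpg explained by hp

# Same data, but horsepower measured in hundreds
cars100  <- transform(mtcars, hp100 = hp / 100)
model100 <- lm(mpg ~ hp100, data = cars100)
slope100 <- coef(model100)[["hp100"]]    # change in mpg per additional 100 hp
r100     <- cor(cars100$hp100, cars100$mpg)

c(slope = slope, slope100 = slope100)    # slope100 is 100 times slope
c(r = r, r100 = r100)                    # the two correlations are identical
all.equal(r^2, r2)                       # with one predictor, R-squared equals r squared
```

Because a change of units alone makes the fitted line 100 times steeper, the correlation or R-squared is the safer guide to how strong the relationship is.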

## Conclusion

Scatterplots with regression lines are powerful tools for exploring the relationship between two quantitative variables. Both base R and ggplot2 offer robust functionalities for creating these plots, with ggplot2 offering additional flexibility and customization options.
