Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. In this article, we will focus on the basics of creating a scatterplot with a regression line in R, using both base R and the popular ggplot2 package.
Introduction to Scatterplots and Regression Lines
A scatterplot is a plot that uses Cartesian coordinates to display the values of two variables, typically, for a set of data. Each observation (or point) in the plot corresponds to one row in the data set. Scatterplots are used to visualize the relationship between two quantitative variables, and they are especially useful for spotting trends in data.
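The regression line, also known as the line of best fit, plots the predicted values of the dependent variable (Y) as a function of the independent variable (X). When added to a scatterplot, the regression line helps us judge whether the two variables have a linear relationship. The slope gives the direction of the relationship and the expected change in Y for a one-unit increase in X. The strength of the relationship is reflected in how tightly the points cluster around the line (summarized by the correlation coefficient or R-squared), not in how steep the line is.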
Using Built-in Data in R
In this guide, we will use the built-in mtcars data set in R, which contains various car attributes for 32 models. You can take a look at the first few rows of the data with head(mtcars).
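As a quick orientation, the following sketch shows one way to inspect the data before plotting. It only uses functions from base R; restricting the output to the mpg and hp columns is a choice made here because those are the two variables this guide plots.

```r
# mtcars ships with base R (datasets package), so nothing needs to be installed.

# Number of rows (cars) and columns (attributes)
dim(mtcars)

# The two columns used in this guide; cols is just an illustrative helper name
cols <- c("mpg", "hp")

# First few rows of those columns
head(mtcars[, cols])

# Minimum, quartiles, mean, and maximum of each column
summary(mtcars[, cols])
```

Checking the range of each variable up front makes it easier to spot axis or labeling mistakes in the plots that follow.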
Creating Scatterplots and Regression Lines in Base R
Let’s create a scatterplot for the mtcars data set, showing mpg (miles per gallon) as a function of hp (horsepower), and then add a regression line.
To create a scatterplot in base R, we use the plot function:
plot(mpg ~ hp, data = mtcars, main = "Scatterplot of MPG vs HP", xlab = "Horsepower", ylab = "Miles Per Gallon")
The xlab and ylab arguments provide labels for the x-axis and y-axis, respectively, while main provides a title for the plot.
Next, we create a linear model using the lm function and add a regression line to the scatterplot using the abline function:
# Fit a simple linear regression of mpg on hp
model <- lm(mpg ~ hp, data = mtcars)
# Redraw the scatterplot, then overlay the fitted line
plot(mpg ~ hp, data = mtcars, main = "Scatterplot of MPG vs HP", xlab = "Horsepower", ylab = "Miles Per Gallon")
abline(model, col = "red")
In the lm function, the tilde character (~) signifies “as a function of”. The abline function then adds a line to the plot based on the coefficients in the linear model. The line is colored red for visibility.
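To make the link between the fitted model and the drawn line concrete, here is a minimal sketch that pulls the intercept and slope out of the model and passes them to abline explicitly. The variable names (fit, intercept, slope, r_squared) are illustrative choices, not part of the original example.

```r
# Fit the same simple linear regression of mpg on hp
fit <- lm(mpg ~ hp, data = mtcars)

# coef() returns a named vector with "(Intercept)" and "hp"
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["hp"]]

# R-squared: share of the variance in mpg explained by the line
r_squared <- summary(fit)$r.squared

# Draw the scatterplot and add the line from its two coefficients;
# this should trace the same line as abline(fit)
plot(mpg ~ hp, data = mtcars,
     main = "Scatterplot of MPG vs HP",
     xlab = "Horsepower", ylab = "Miles Per Gallon")
abline(a = intercept, b = slope, col = "red")
```

Printing slope shows the estimated change in miles per gallon for each additional unit of horsepower, and r_squared gives a rough sense of how well a straight line describes the data.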
Creating Scatterplots and Regression Lines with ggplot2
While base R is sufficient for creating scatterplots and regression lines, the ggplot2 package allows for more flexibility and customization. To use ggplot2, you first need to install it once with install.packages("ggplot2") and then load it into your R session with library(ggplot2).
The syntax of ggplot2 involves initializing a ggplot object and adding layers to it. To create a scatterplot:
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of MPG vs HP")
Here, aes assigns the hp variable to the x-axis and mpg to the y-axis, geom_point draws the points of the scatterplot, and labs provides labels for the x-axis, y-axis, and the plot title.
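The layered syntax may be easier to see when the plot is built one step at a time. The sketch below assumes ggplot2 is already installed; the object name p is an arbitrary choice for illustration.

```r
library(ggplot2)  # assumes the package has been installed

# Step 1: initialize the plot with the data and the axis mappings.
# At this point there are no layers, so printing p would show empty axes.
p <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a layer of points, which turns it into a scatterplot
p <- p + geom_point()

# Step 3: add axis labels and a title (labels are not a layer)
p <- p + labs(x = "Horsepower", y = "Miles Per Gallon",
              title = "Scatterplot of MPG vs HP")

# Printing the object draws the plot
print(p)
```

Storing the plot in a variable is optional, but it makes it easy to reuse the same base plot and add further layers later.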
To add a regression line, we use the geom_smooth function with the method argument set to "lm":
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") + labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of MPG vs HP with Regression Line")
Here, method = "lm" indicates that a linear model should be used, se = FALSE removes the shaded confidence interval around the line, and color = "red" makes the line red for visibility.
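If the confidence band is of interest, a variation along these lines keeps it. This is a sketch rather than a recommendation: level = 0.95 is the geom_smooth default and is spelled out only for clarity, formula = y ~ x is stated explicitly to avoid the informational message ggplot2 otherwise prints, and p_band is an illustrative name.

```r
library(ggplot2)  # assumes the package has been installed

p_band <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # se = TRUE draws a shaded confidence band around the fitted line
  geom_smooth(method = "lm", formula = y ~ x,
              se = TRUE, level = 0.95, color = "red") +
  labs(x = "Horsepower", y = "Miles Per Gallon",
       title = "Scatterplot of MPG vs HP with Regression Line")

print(p_band)
```

The band describes uncertainty about the fitted line itself, not the range in which individual cars are expected to fall.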
Interpreting the Plot
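With the scatterplot and regression line complete, you can start interpreting the plot. The scatterplot gives you a general idea of the relationship between the variables, and the regression line summarizes its direction: a downward-sloping line, as in the mpg versus hp plot, indicates that one variable tends to decrease as the other increases. The slope itself depends on the units of the variables, so a steeper line does not by itself mean a stronger relationship; strength is better judged by how closely the points follow the line, for example through the correlation coefficient or R-squared. Finally, keep in mind that correlation does not imply causation, and additional analysis would be necessary to determine any causal relationships.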
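When reading the slope, it helps to remember that its size depends on the units of the variables, whereas the correlation does not. The following sketch illustrates this by refitting the model with horsepower expressed in hundreds; the rescaled column hp100 and the other variable names are hypothetical, introduced only for this comparison.

```r
# Original fit: hp measured in single units of horsepower
fit_hp <- lm(mpg ~ hp, data = mtcars)

# Hypothetical rescaling: the same variable measured in hundreds of horsepower
mtcars_scaled <- transform(mtcars, hp100 = hp / 100)
fit_hp100 <- lm(mpg ~ hp100, data = mtcars_scaled)

slope_hp    <- coef(fit_hp)[["hp"]]
slope_hp100 <- coef(fit_hp100)[["hp100"]]  # 100 times slope_hp: a "steeper" line

# The correlation is unchanged by the rescaling
r_hp    <- cor(mtcars$mpg, mtcars$hp)
r_hp100 <- cor(mtcars_scaled$mpg, mtcars_scaled$hp100)

c(slope_hp = slope_hp, slope_hp100 = slope_hp100)
c(r_hp = r_hp, r_hp100 = r_hp100)
```

The two fits describe exactly the same data, so the change in steepness reflects only the choice of units.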
Scatterplots with regression lines are powerful tools for exploring the relationship between two quantitative variables. Both base R and ggplot2 offer robust functionality for creating these plots, with ggplot2 offering additional flexibility and customization options.