How to Create a Scatter Plot in R

One of the most common and useful types of graphics in statistical analysis is the scatter plot. In this article, we’ll discuss how to create a scatter plot in R using both base R and ggplot2, one of the most popular packages in R for data visualization.

What is a Scatter Plot?

A scatter plot uses Cartesian coordinates to display the values of two variables, one on each axis, with one point per observation. Plotting the variables against each other lets you see whether a relationship or correlation exists between them. Note that a scatter plot shows association only: a clear pattern suggests that the variables move together, but it does not by itself establish that one variable causes changes in the other.

Scatter plots are particularly useful for investigating the strength and direction of relationships between variables, detecting outliers, and identifying trends.
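As a rough numeric companion to what a scatter plot shows, the strength and direction of a linear relationship can be summarized with the Pearson correlation coefficient. The sketch below uses base R's cor() on the built-in mtcars dataset; the variable name r is only illustrative.

```r
# Pearson correlation between horsepower and fuel economy in mtcars
r <- cor(mtcars$hp, mtcars$mpg)

# The sign gives the direction of the linear relationship,
# the magnitude (between 0 and 1) gives its strength
print(r)
```

A negative value indicates that mpg tends to fall as hp rises. The coefficient only captures linear association, so it complements a scatter plot rather than replacing it.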

Creating a Simple Scatter Plot in R

The most basic way to create a scatter plot in R is by using the plot() function. Let’s use the built-in mtcars dataset in R, which contains various car attributes, to create a scatter plot showing the relationship between horsepower (hp) and miles per gallon (mpg).

# Load the data
data(mtcars)

# Create a scatter plot
plot(mtcars$hp, mtcars$mpg, 
     main = "Horsepower vs. Miles per Gallon",
     xlab = "Horsepower", 
     ylab = "Miles per Gallon")

The plot() function in this case takes two arguments: the x and y variables. You can add a title to the plot and labels to the axes using the main, xlab, and ylab parameters, respectively.
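The same plot can also be written with plot()'s formula interface, which some readers find easier to read. This is a minimal sketch of that alternative: the formula is written as y ~ x, and the column names are looked up in the data frame passed as data.

```r
# Formula interface: mpg on the y-axis, hp on the x-axis,
# with both columns taken from the mtcars data frame
plot(mpg ~ hp, data = mtcars,
     main = "Horsepower vs. Miles per Gallon",
     xlab = "Horsepower",
     ylab = "Miles per Gallon")
```

Note the reversed order compared with plot(x, y): in a formula the response (y) comes first.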

Changing the Color and Shape of Points

You can customize the appearance of the points in your scatter plot. The col parameter sets the color of the points, and the pch parameter sets their shape.

plot(mtcars$hp, mtcars$mpg, 
     main = "Horsepower vs. Miles per Gallon",
     xlab = "Horsepower", 
     ylab = "Miles per Gallon",
     col = "blue",
     pch = 19)

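Integer values of pch from 0 to 25 select R's built-in plotting symbols, each value representing a different symbol. Symbols 21 to 25 can additionally be filled with a background color set through the bg parameter. pch also accepts a single character, such as pch = "+", which is then drawn as the point.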
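Both col and pch also accept a vector with one value per point, which makes it possible to distinguish groups. The following sketch colors the points by number of cylinders (the cyl column of mtcars); the names cyl_group, group_cols, and group_pch, as well as the particular colors and symbols, are illustrative choices.

```r
# Treat the number of cylinders as a grouping factor (levels: 4, 6, 8)
cyl_group <- factor(mtcars$cyl)

# One color and one plotting symbol per factor level
group_cols <- c("blue", "darkgreen", "red")
group_pch  <- c(15, 17, 19)

# Indexing by a factor uses its integer codes, so every point
# receives the color and symbol belonging to its group
plot(mtcars$hp, mtcars$mpg,
     main = "Horsepower vs. Miles per Gallon",
     xlab = "Horsepower",
     ylab = "Miles per Gallon",
     col = group_cols[cyl_group],
     pch = group_pch[cyl_group])

# Legend linking each color/symbol pair to its cylinder count
legend("topright", legend = levels(cyl_group),
       col = group_cols, pch = group_pch, title = "Cylinders")
```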

Adding a Regression Line

To visualize the relationship between the two variables more clearly, you can add a regression line (or a line of best fit) to your scatter plot. You can do this using the abline() function in conjunction with the lm() function, which fits a linear model to the data.

# Create a scatter plot
plot(mtcars$hp, mtcars$mpg, 
     main = "Horsepower vs. Miles per Gallon",
     xlab = "Horsepower", 
     ylab = "Miles per Gallon",
     col = "blue",
     pch = 19)

# Add a regression line
abline(lm(mpg ~ hp, data = mtcars), col = "red")

The lm() function here is used to fit a linear model to the data, and the resulting model is passed to the abline() function, which adds a straight line to the plot.
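It can be useful to store the fitted model in a variable first, so that the line drawn on the plot and the numbers behind it come from the same object. The sketch below assumes the same mtcars variables as above; the name fit is illustrative.

```r
# Fit the linear model once and keep the result
fit <- lm(mpg ~ hp, data = mtcars)

# Named vector holding the intercept and the slope for hp
print(coef(fit))

# Draw the scatter plot, then add the fitted line from the stored model
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower",
     ylab = "Miles per Gallon",
     col = "blue",
     pch = 19)
abline(fit, col = "red", lwd = 2)
```

abline() reads the intercept and slope from the model object, so the red line is exactly the fitted equation. A fuller report, including standard errors and R-squared, is available from summary(fit).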

Creating Scatter Plots with ggplot2

First, you need to install and load the ggplot2 package:

# Install ggplot2 (only needed once per R installation)
install.packages("ggplot2")

# Load ggplot2
library(ggplot2)

Once you have ggplot2 installed and loaded, you can create a scatter plot using the geom_point() function.

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Horsepower vs. Miles per Gallon",
       x = "Horsepower", 
       y = "Miles per Gallon")

In the ggplot() function, the first argument specifies the dataset, and the aes() function defines the aesthetic mappings, that is, which columns of the dataset are mapped to visual properties such as the x and y positions. The geom_point() function then adds a layer of points, which is what turns the plot into a scatter plot.
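Because ggplot2 builds plots in layers, a plot can be stored in a variable and extended step by step. The following minimal sketch illustrates this; the names p and p_points are illustrative.

```r
library(ggplot2)

# Base object: data and aesthetic mappings only, no layers yet
p <- ggplot(mtcars, aes(x = hp, y = mpg))

# Adding a layer with + returns a new plot object; p itself is unchanged
p_points <- p + geom_point()

# Printing the object is what actually draws the plot
print(p_points)
```

At the console, typing the object's name prints it automatically. Inside functions, loops, or scripts run with source(), an explicit print() is needed.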

Customizing Scatter Plots in ggplot2

You can also customize the color, shape, and size of points in ggplot2:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue", 
             shape = 19, 
             size = 3) +
  labs(title = "Horsepower vs. Miles per Gallon",
       x = "Horsepower", 
       y = "Miles per Gallon")
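The example above sets color to one fixed value for all points. ggplot2 can also map color to a variable by placing it inside aes(), in which case a legend is generated automatically. The sketch below colors points by number of cylinders; converting cyl with factor() and the name p_cyl are illustrative choices.

```r
library(ggplot2)

# color inside aes() is mapped from the data: one color per cylinder count.
# factor() makes ggplot2 treat cyl as discrete groups rather than a
# continuous scale.
p_cyl <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Horsepower vs. Miles per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon",
       color = "Cylinders")

print(p_cyl)
```

The distinction matters: an argument given outside aes(), such as color = "blue", is a constant, while the same argument inside aes() refers to a column of the data.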

Adding a Regression Line in ggplot2

Just like in base R, you can add a regression line to your scatter plot in ggplot2 using the geom_smooth() function. By default, geom_smooth() draws a smoothed conditional mean; choosing a linear model as the method makes it draw a straight regression line instead:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue", 
             shape = 19, 
             size = 3) +
  geom_smooth(method = lm, 
              se = FALSE, 
              color = "red") +
  labs(title = "Horsepower vs. Miles per Gallon",
       x = "Horsepower", 
       y = "Miles per Gallon")

Here, the method = lm argument specifies that a linear model should be fitted to the data, and se = FALSE suppresses the shaded confidence band that geom_smooth() would otherwise draw around the line.
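To show the uncertainty around the fitted line instead of hiding it, the band can be switched on with se = TRUE. The sketch below is one way to do this; it states the formula and the confidence level explicitly (y ~ x and 0.95 are the values geom_smooth() would use for a linear model anyway), and the name p_band is illustrative.

```r
library(ggplot2)

p_band <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # se = TRUE draws a shaded confidence band around the fitted line;
  # level sets the confidence level of that band
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = TRUE,
              level = 0.95,
              color = "red") +
  labs(title = "Horsepower vs. Miles per Gallon",
       x = "Horsepower",
       y = "Miles per Gallon")

print(p_band)
```

Writing formula = y ~ x explicitly also avoids the informational message that recent ggplot2 versions print when the formula is left to its default.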

Conclusion

Scatter plots are essential tools in statistical and data analysis, as they can illustrate the relationship between two variables. This article has shown you how to create scatter plots in R using both base R and ggplot2. With these tools in your data analysis toolkit, you can start exploring the relationships within your own datasets.
