# How to Create Scatter Plots by Group in R

Scatter plots are one of the most effective ways to visualize the relationship between two numeric variables. They allow us to observe trends, patterns, and potential outliers. Often the data also contain a categorical variable that divides the observations into groups. Plotting by group adds a third dimension to the analysis and can reveal patterns that a single pooled plot hides.

This article will guide you through the process of creating scatter plots by group in R, using both base R functions and the popular data visualization package ggplot2.

## Understanding Scatter Plots

A scatter plot is a diagram in which each observation in the data set is represented by a dot. The position of a dot along the x-axis and y-axis gives the values of the two variables for that observation. Scatter plots can show a variety of information, including:

1. How one variable changes as another changes (an association, not necessarily a causal effect).
2. The direction of the relationship between variables.
3. The strength of the relationship between variables.
4. Outlier points.

## Using Built-in Data in R

For the sake of simplicity, this guide will utilize the built-in mtcars data set in R. This data set comprises fuel consumption data (mpg – miles per gallon) and ten aspects of automobile design and performance for 32 automobiles.

You can take a peek at the first six rows of the data using the head function:

head(mtcars)
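Before plotting, it can also help to confirm how the grouping variable is stored. The sketch below is an addition to the walkthrough: it checks the dimensions of mtcars and counts the cars in each transmission group, where am is coded 0 for automatic and 1 for manual. The name transmission_counts is illustrative.

```r
# Dimensions: 32 cars (rows) and 11 variables (columns)
dim(mtcars)

# Structure of the three columns used in this guide
str(mtcars[, c("mpg", "hp", "am")])

# Number of cars per transmission type (0 = automatic, 1 = manual)
transmission_counts <- table(mtcars$am)
print(transmission_counts)
```

With the standard mtcars data this reports 19 automatic and 13 manual cars, and str shows that am is stored as a number rather than as a factor.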

## Creating Scatter Plots in Base R

Let’s start by making a simple scatter plot in base R, without grouping. Suppose we want to plot mpg (miles per gallon) against hp (horsepower). We would use the plot function as follows:

plot(mtcars$mpg ~ mtcars$hp, xlab = "Horsepower", ylab = "Miles Per Gallon", main = "Scatterplot of MPG vs HP")

Here, xlab, ylab, and main are used to provide labels for the x-axis, y-axis, and the plot title, respectively.
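As a variation on the call above, the formula interface also accepts a data argument, which avoids repeating mtcars$. The sketch below additionally overlays a least-squares line and computes the Pearson correlation as a rough numeric summary of direction and strength. The trend line and the names fit and r are additions for illustration, not part of the original example.

```r
# Same plot, using the formula interface with a data argument
plot(mpg ~ hp, data = mtcars,
     xlab = "Horsepower", ylab = "Miles Per Gallon",
     main = "Scatterplot of MPG vs HP")

# Overlay a straight trend line fitted by least squares
fit <- lm(mpg ~ hp, data = mtcars)
abline(fit, lty = 2)

# Pearson correlation: the sign gives direction, the magnitude gives strength
r <- cor(mtcars$hp, mtcars$mpg)
print(round(r, 2))
```

A negative value of r here indicates that mpg tends to fall as horsepower rises, which matches the downward slope of the dashed line.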

## Grouping Scatter Plots in Base R

Now, let’s suppose we want to distinguish between cars with automatic and manual transmissions (represented by the am variable in the data, where 0 is automatic and 1 is manual). Here, we can use the ifelse function in R, which takes the following form: ifelse(test, yes, no). It works element by element: wherever test is TRUE, the corresponding value of yes is returned; wherever test is FALSE, the value of no is returned.
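To see how ifelse behaves on a whole vector at once, here is a minimal sketch on a small made-up vector. The names am_codes and point_colors are illustrative and are not columns of mtcars.

```r
# A made-up vector of transmission codes (0 = automatic, 1 = manual)
am_codes <- c(0, 1, 1, 0)

# ifelse() evaluates the test for every element and picks a color for each
point_colors <- ifelse(am_codes == 0, "red", "blue")
print(point_colors)
# returns "red", "blue", "blue", "red"
```

The result has one color per input element, which is what the col argument of plot needs.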

colors <- ifelse(mtcars$am == 0, "red", "blue")
plot(mtcars$mpg ~ mtcars$hp, col = colors, pch = 19, xlab = "Horsepower", ylab = "Miles Per Gallon", main = "Scatterplot of MPG vs HP by Transmission")
legend("topright", legend = c("Automatic", "Manual"), col = c("red", "blue"), pch = 19)

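In this code, we’ve assigned a different color to each group, with “red” for automatic cars (am equal to 0) and “blue” for manual cars (am equal to 1). Setting pch = 19 draws solid circles, which makes the colors easier to tell apart. The legend function adds a legend to the plot; its labels and colors must be listed in the same order as the mapping used for the points.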
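The ifelse approach only handles two groups. For a variable with more levels, one possible sketch is to index a color vector by the factor's level codes. The example below groups by the number of cylinders (cyl) instead of transmission; the color choices and the names cyl_groups, group_colors, and point_colors are assumptions made for illustration.

```r
# Treat the number of cylinders (4, 6, or 8) as a grouping factor
cyl_groups <- factor(mtcars$cyl)

# One color per factor level, in the same order as levels(cyl_groups)
group_colors <- c("red", "blue", "darkgreen")

# as.integer() on a factor returns the level index (1, 2, 3),
# which picks the matching color for every car
point_colors <- group_colors[as.integer(cyl_groups)]

plot(mpg ~ hp, data = mtcars, col = point_colors, pch = 19,
     xlab = "Horsepower", ylab = "Miles Per Gallon",
     main = "Scatterplot of MPG vs HP by Cylinders")
legend("topright", legend = levels(cyl_groups),
       col = group_colors, pch = 19, title = "Cylinders")
```

Because the legend reuses levels(cyl_groups) and group_colors directly, the legend entries stay in step with the point colors if the palette is changed later.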

## Creating Scatter Plots with ggplot2

While base R offers solid plotting capabilities, ggplot2 is a widely used package that provides more advanced and aesthetically pleasing graphics.

First, ensure that the ggplot2 package is installed and loaded into your workspace:

install.packages("ggplot2")
library(ggplot2)
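Installation only needs to happen once per machine, whereas library must be called in every new session. A common pattern, sketched here as one option rather than a requirement, is to install only when the package is missing:

```r
# Install ggplot2 only when it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading is still needed once per R session
library(ggplot2)
```

This keeps a script re-runnable without downloading the package on every run.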

The syntax for ggplot2 can be somewhat complex, but it’s incredibly flexible once you get the hang of it. Here’s how you can create the same scatter plot as above, but with ggplot2:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles Per Gallon", color = "Transmission") +
  ggtitle("Scatterplot of MPG vs HP by Transmission") +
  scale_color_manual(values = c("red", "blue"), labels = c("Automatic", "Manual"))

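The aes function is used to map variables to visual properties (aesthetics) of the graph. Here, we’ve mapped hp to the x-axis, mpg to the y-axis, and am to the color of the points. Because am is stored as a numeric 0/1 column, the factor function is used to treat it as a categorical variable; without it, ggplot2 would treat am as continuous, which does not work with the discrete scale_color_manual used here.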
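In the plot above, the legend labels are matched to the factor levels by position, so swapping their order would silently mislabel the groups. One way to guard against that, sketched below, is to store the labels in the factor itself and pass named colors. The copy mtcars_labelled, the column transmission, and the object p_labelled are illustrative names introduced here.

```r
library(ggplot2)

# Work on a copy so the built-in data set stays unchanged
mtcars_labelled <- mtcars
mtcars_labelled$transmission <- factor(mtcars_labelled$am,
                                       levels = c(0, 1),
                                       labels = c("Automatic", "Manual"))

# Named color values are matched to factor levels by name, not by position
p_labelled <- ggplot(mtcars_labelled,
                     aes(x = hp, y = mpg, color = transmission)) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles Per Gallon", color = "Transmission") +
  ggtitle("Scatterplot of MPG vs HP by Transmission") +
  scale_color_manual(values = c(Automatic = "red", Manual = "blue"))

print(p_labelled)
```

The resulting figure should look the same as the earlier one; the difference is that the group names now travel with the data rather than living only in the scale.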

geom_point is the layer that actually creates the scatter plot. labs is used for labels, and ggtitle for the title. scale_color_manual is used to manually specify the colors and labels for the different groups.
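Because a ggplot2 figure is built from layers, further layers can be stacked on the same mapping. As an illustration that goes slightly beyond the original example, the sketch below adds geom_smooth to draw one linear trend line per transmission group; the object name p_trend is hypothetical.

```r
library(ggplot2)

p_trend <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) +
  geom_point() +
  # One straight-line fit per group; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Horsepower", y = "Miles Per Gallon", color = "Transmission") +
  scale_color_manual(values = c("red", "blue"),
                     labels = c("Automatic", "Manual"))

print(p_trend)
```

Since the color mapping is declared in the top-level aes call, the smoothing layer inherits it and fits each group separately.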

## Conclusion

Scatter plots are powerful tools for visualizing the relationship between two numeric variables. Creating scatter plots by group in R, whether using base R or the ggplot2 package, allows you to add another dimension to your plots, potentially revealing more complex patterns and insights in your data.
