Scatter plots are one of the most effective ways to visualize the relationship between two numeric variables. They allow us to observe trends, patterns, and potential outliers. In many instances, we also have categorical variables that divide data into groups. Being able to create scatter plots by group adds an additional dimension to our analysis and can provide valuable insights.
This article will guide you through the process of creating scatter plots by group in R, using both base R functions and the popular data visualization package ggplot2.
Understanding Scatter Plots
A scatter plot is a diagram in which each observation in the data set is represented by a dot. The dot's position along the x-axis and y-axis gives that observation's values for the two variables. Scatter plots can show a variety of information, including:
- Whether one variable tends to change as the other changes (an association, not necessarily a causal effect).
- The direction of the relationship between variables.
- The strength of the relationship between variables.
- Outlier points.
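The direction and strength listed above can also be summarized with a single number. As a minimal sketch, assuming the built-in mtcars data set that ships with R, the Pearson correlation between horsepower and fuel economy indicates direction through its sign and strength of the linear relationship through its magnitude. The variable names r and direction are only illustrative.

```r
# Pearson correlation between horsepower and miles per gallon
r <- cor(mtcars$hp, mtcars$mpg)

# The sign gives the direction; the absolute value gives the linear strength
direction <- if (r < 0) "negative" else "positive"

print(round(r, 2))
print(direction)
```

A correlation only captures linear association, so it complements the scatter plot rather than replacing it.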
Using Built-in Data in R
For the sake of simplicity, this guide will utilize the built-in mtcars data set in R. This data set comprises fuel consumption data (mpg, miles per gallon) and ten aspects of automobile design and performance for 32 automobiles.
You can take a peek at the data using the head function:
head(mtcars)
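Before plotting by group, it can help to check the grouping variable itself. The following sketch, which assumes the three columns used later in this article (mpg, hp, and am), inspects their types and counts how many cars fall into each transmission group. The name am_counts is only illustrative.

```r
# Structure of the two numeric variables and the grouping variable
str(mtcars[, c("mpg", "hp", "am")])

# Number of cars per transmission type (am: 0 = automatic, 1 = manual)
am_counts <- table(mtcars$am)
print(am_counts)
```

Note that am is stored as a number rather than a factor, which matters when mapping it to colors later on.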
Creating Scatter Plots in Base R
Let’s start by making a simple scatter plot in base R, without grouping. Suppose we want to plot mpg (miles per gallon) against hp (horsepower). We would use the plot function as follows:
plot(mpg ~ hp, data = mtcars,
     xlab = "Horsepower", ylab = "Miles Per Gallon",
     main = "Scatterplot of MPG vs HP")

Here, xlab, ylab, and main are used to provide labels for the x-axis, y-axis, and the plot title, respectively.
Grouping Scatter Plots in Base R
Now, let’s suppose we want to distinguish between cars with automatic and manual transmissions (represented by the am variable in the data, where 0 means automatic and 1 means manual). Here, we can utilize the ifelse function in R, which takes the following form: ifelse(test, yes, no). The function is vectorized: for each element of test, it returns the corresponding value of yes where test is TRUE and the corresponding value of no where test is FALSE.
# Red for automatic (am == 0), blue for manual (am == 1)
colors <- ifelse(mtcars$am == 0, "red", "blue")
plot(mpg ~ hp, data = mtcars, col = colors, pch = 19,
     xlab = "Horsepower", ylab = "Miles Per Gallon",
     main = "Scatterplot of MPG vs HP by Transmission")
legend("topright", legend = c("Automatic", "Manual"), col = c("red", "blue"), pch = 19)

In this code, we’ve assigned a different color to each group, with “red” for automatic cars and “blue” for manual cars. Setting pch = 19 draws the points as solid circles, and the legend function adds a legend to the plot.
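The ifelse approach is convenient for two groups. For a grouping variable with more levels, one common base R idiom is to index a vector of colors by a factor. The sketch below assumes grouping by cyl (number of cylinders), a column of mtcars that is not otherwise used in this article. The color choices and the names cyl_groups, palette_cols, and point_cols are arbitrary.

```r
# Treat the number of cylinders as a categorical grouping variable
cyl_groups <- factor(mtcars$cyl)

# One color per factor level; the specific colors are arbitrary choices
palette_cols <- c("red", "blue", "darkgreen")

# Indexing by a factor uses its integer codes, so each car gets its group's color
point_cols <- palette_cols[cyl_groups]

plot(mpg ~ hp, data = mtcars, col = point_cols, pch = 19,
     xlab = "Horsepower", ylab = "Miles Per Gallon",
     main = "Scatterplot of MPG vs HP by Cylinders")
legend("topright", legend = levels(cyl_groups), col = palette_cols,
       pch = 19, title = "Cylinders")
```

Because the legend is built from levels(cyl_groups) and the same color vector, the legend entries stay in step with the plotted colors.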
Creating Scatter Plots with ggplot2
While base R offers solid plotting capabilities, ggplot2 is a widely used package that builds plots from composable layers and produces polished graphics with sensible defaults.
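First, ensure that the ggplot2 package is installed and loaded into your workspace. Installation is needed only once per R installation, while library must be called in every new session:
install.packages("ggplot2")  # run once to install the package
library(ggplot2)             # run in each session to load it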
The syntax for ggplot2 can be somewhat complex, but it’s incredibly flexible once you get the hang of it. Here’s how you can create the same scatter plot as above, but with ggplot2:
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am))) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles Per Gallon", color = "Transmission") +
  ggtitle("Scatterplot of MPG vs HP by Transmission") +
  scale_color_manual(values = c("red", "blue"), labels = c("Automatic", "Manual"))

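The aes function is used to map variables to visual properties (aesthetics) of the graph. Here, we’ve mapped hp to the x-axis, mpg to the y-axis, and am to the color of the points. The factor function is used to treat am as a categorical variable. Without it, ggplot2 would treat the numeric am column as continuous, which does not work with a discrete scale such as scale_color_manual.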
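geom_point is the layer that actually creates the scatter plot. labs is used for the axis and legend labels, and ggtitle for the title. scale_color_manual is used to manually specify the colors and labels for the different groups. Unnamed values are matched to the factor levels in order, so "red" and "Automatic" go with am = 0, and "blue" and "Manual" go with am = 1.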
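Another way to separate groups in ggplot2 is to give each group its own panel. The following is a minimal sketch, assuming ggplot2 is installed. It creates a labeled factor column (the hypothetical name transmission) in a copy of the data and uses facet_wrap to draw one panel per transmission type, while keeping the default color scale.

```r
library(ggplot2)

# Copy the data and add a labeled factor so panels and legend show readable names
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# Color by group and also split the plot into one panel per group
p <- ggplot(cars, aes(x = hp, y = mpg, color = transmission)) +
  geom_point() +
  facet_wrap(~ transmission) +
  labs(x = "Horsepower", y = "Miles Per Gallon", color = "Transmission")

print(p)
```

Labeling the factor up front means the group names appear correctly in both the panel titles and the legend, without a separate labels argument on the scale.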
Conclusion
Scatter plots are powerful tools for visualizing the relationship between two numeric variables. Creating scatter plots by group in R, whether using base R or the ggplot2 package, allows you to add another dimension to your plots, potentially revealing more complex patterns and insights in your data.