How to Overlay a Normal Curve on a Histogram in R

Visualizing data distributions is a key task in exploratory data analysis, and histograms and Normal curves are widely used for it. A histogram splits the range of the data into bins, usually of equal width, and shows how many data points fall within each bin. A Normal curve is the bell-shaped density of the Normal (or Gaussian) distribution, a continuous probability distribution defined by its mean and standard deviation. Overlaying a Normal curve on a histogram gives helpful context for judging whether the data are roughly Normally distributed.

In this article, we will create a histogram and overlay a Normal curve on it using both base R and the ggplot2 package.

Using Built-in Data in R

To keep things simple, this tutorial will use the built-in mtcars dataset in R. This dataset provides various attributes of 32 car models, including miles per gallon (mpg), number of cylinders (cyl), and horsepower (hp).

Let’s take a look at the first few rows of the dataset:

head(mtcars)
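Since the rest of the tutorial works only with the mpg column, it can help to look at that column on its own first. A minimal sketch using standard base R functions (the names mpg_summary and n_obs are illustrative and do not appear elsewhere in this article):

```r
# Structure of the data frame: 32 observations of 11 numeric variables
str(mtcars)

# Minimum, quartiles, median, mean, and maximum of the mpg column
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Number of observations in the mpg column
n_obs <- length(mtcars$mpg)
print(n_obs)
```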

Overlaying a Normal Curve on a Histogram in Base R

Creating histograms and Normal curves in base R involves using a combination of the hist, dnorm, mean, and sd functions.

Creating a Histogram

First, let’s create a histogram for the mpg column. The hist function returns a list of values which we will use later, so we need to save the output:

hist_data <- hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon", ylab = "Frequency", col = "lightblue", border = "black")

In this code snippet, the main, xlab, and ylab parameters are used to set the title of the histogram, the x-axis label, and the y-axis label, respectively. The col parameter is used to set the color of the bars, and border sets the color of the border around the bars.
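The object saved in hist_data can be inspected directly, which makes the later scaling step easier to follow. The sketch below looks at three of its components; bin_width is a name introduced here for illustration only:

```r
# Draw the histogram and keep the returned object
hist_data <- hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon")

# Bin boundaries, bin midpoints, and the number of observations per bin
print(hist_data$breaks)
print(hist_data$mids)
print(hist_data$counts)

# With equal-width bins, the gap between the first two breaks is the bin width
bin_width <- diff(hist_data$breaks)[1]
print(bin_width)
```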

Overlaying a Normal Curve

To overlay a Normal curve, we first need to calculate the mean (mean) and standard deviation (sd) of the data. The dnorm function is then used to generate the y-coordinates of the Normal curve based on these values:

mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)
curve_density <- dnorm(hist_data$mids, mean = mean_mpg, sd = sd_mpg)
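To make clear what dnorm computes, the Normal density formula can be written out by hand and compared with the built-in function. This is only a sanity-check sketch; normal_density and x_values are hypothetical names chosen for the example:

```r
mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)

# Normal density written out explicitly:
# 1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
normal_density <- function(x, mean, sd) {
  1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
}

x_values <- c(15, 20, 25)  # arbitrary mpg values chosen for illustration
manual <- normal_density(x_values, mean_mpg, sd_mpg)
builtin <- dnorm(x_values, mean = mean_mpg, sd = sd_mpg)

# TRUE when the two results agree up to floating-point tolerance
same <- isTRUE(all.equal(manual, builtin))
print(same)
```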

Finally, we add the Normal curve to the histogram using the lines function. Because the histogram shows counts while dnorm returns densities, we rescale the curve by multiplying the density by the bin width and the number of observations, which gives approximately the expected count in each bin:

curve_height <- curve_density * diff(hist_data$breaks[1:2]) * length(mtcars$mpg)
lines(hist_data$mids, curve_height, col = "darkblue", lwd = 2)

The col and lwd parameters in the lines function set the color and line width of the Normal curve, respectively.
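Because the curve is evaluated only at the bin midpoints, it is drawn as a handful of straight segments. One possible refinement, sketched here with illustrative names (x_grid, bin_width, smooth_height) that are not part of the code above, evaluates dnorm on a fine grid so the overlay looks smooth:

```r
hist_data <- hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon",
                  ylab = "Frequency", col = "lightblue", border = "black")

mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)

# Fine grid of x values across the plotted range
x_grid <- seq(min(hist_data$breaks), max(hist_data$breaks), length.out = 200)

# Rescale density to counts: density * bin width * number of observations
bin_width <- diff(hist_data$breaks[1:2])
smooth_height <- dnorm(x_grid, mean = mean_mpg, sd = sd_mpg) * bin_width * length(mtcars$mpg)

# Add the smooth curve to the existing histogram
lines(x_grid, smooth_height, col = "darkblue", lwd = 2)
```

If a density scale is acceptable, calling hist with freq = FALSE plots densities instead of counts, and the dnorm values can then be drawn without any rescaling.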

Overlaying a Normal Curve on a Histogram with ggplot2

While base R provides the necessary functionality, the ggplot2 package can create more aesthetically pleasing and customizable graphics. To use ggplot2, you first need to install and load it into your R environment:

install.packages("ggplot2")
library(ggplot2)
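Running install.packages every time the script is executed is unnecessary once the package is present. A common pattern, shown here as a suggestion and not as part of the original workflow, installs ggplot2 only when it is missing:

```r
# Install ggplot2 only if it is not already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```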

Creating a Histogram

The ggplot function initializes a ggplot object, and the geom_histogram function adds a histogram layer:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "lightblue", bins = 30) +
  labs(title = "Histogram of MPG", x = "Miles Per Gallon", y = "Density")

The aes function maps the mpg variable to the x-axis, and y = after_stat(density) puts density rather than frequency on the y-axis, so the histogram is on the same scale as a probability density curve. Older code writes this as y = ..density.., a notation that recent ggplot2 releases have deprecated in favor of after_stat. The colour, fill, and bins parameters in geom_histogram set the border color, fill color, and number of bins, respectively.

Overlaying a Normal Curve

The stat_function function overlays the Normal curve, and geom_density adds a kernel density estimate of the data for comparison:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "lightblue", bins = 30) +
  geom_density(colour = "darkblue", lwd = 1.5) +
  stat_function(fun = dnorm, args = list(mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)), colour = "red", lwd = 1.5) +
  labs(title = "Histogram of MPG with Normal Curve", x = "Miles Per Gallon", y = "Density")

geom_density adds a kernel density estimate computed from the mpg data (dark blue), and stat_function adds the theoretical Normal density with the sample mean and standard deviation of mpg (red). Comparing the two curves shows where the data depart from the Normal shape. The colour and lwd parameters set the color and line width of the curves.
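The base R section kept counts on the y-axis and rescaled the curve instead. A rough ggplot2 equivalent multiplies the Normal density by the number of observations and the bin width. It assumes a fixed bin width: the bw value of 2.5 below is an arbitrary choice for illustration, and bw and scaled_normal are hypothetical names introduced for the example.

```r
library(ggplot2)

bw <- 2.5                      # assumed bin width, chosen for illustration
n <- nrow(mtcars)              # number of observations
mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)

# Normal density rescaled to approximate expected counts per bin
scaled_normal <- function(x) dnorm(x, mean = mean_mpg, sd = sd_mpg) * n * bw

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = bw, colour = "black", fill = "lightblue") +
  stat_function(fun = scaled_normal, colour = "red", lwd = 1.5) +
  labs(title = "Histogram of MPG with Normal Curve", x = "Miles Per Gallon", y = "Frequency")

print(p)
```

Because the scaling depends on the bin width, the same bw value has to be used in both geom_histogram and the rescaled density function.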

Conclusion

Overlaying a Normal curve on a histogram is a common task when exploring data distributions. Base R needs only the hist, dnorm, and lines functions, while ggplot2 builds the same plot from layers and offers more customization options.
