# How to Overlay a Normal Curve on a Histogram in R

Visualizing data distributions is a key task in exploratory data analysis, and histograms and Normal curves are widely used for this purpose. A histogram splits the range of the data into bins, usually of equal width, and shows the frequency of data points within each bin. A Normal curve is the bell-shaped density of the Normal (Gaussian) distribution, a continuous probability distribution for a real-valued random variable. Overlaying a Normal curve on a histogram gives helpful context for understanding the data distribution and for assessing whether it roughly follows a Normal distribution.

In this article, we will create a histogram and overlay a Normal curve on it using both base R and the ggplot2 package. In the ggplot2 example we also add a kernel density estimate of the data, which makes it easier to see where the data departs from a Normal distribution.

## Using Built-in Data in R

To keep things simple, this tutorial will use the built-in mtcars dataset in R. This dataset provides various attributes of 32 car models, including miles per gallon (mpg), number of cylinders (cyl), and horsepower (hp).

Let’s take a look at the first few rows of the dataset:

head(mtcars)
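Since the rest of the tutorial works only with the mpg column, it can help to look at that column on its own first. The following is a minimal sketch using base R only; the names n_obs and mpg_range are illustrative and do not appear elsewhere in the article.

```r
# Inspect the column used throughout this tutorial.
str(mtcars$mpg)        # structure: a numeric vector
summary(mtcars$mpg)    # minimum, quartiles, mean, and maximum

# Illustrative helper values (names are assumptions, not from the article).
n_obs <- length(mtcars$mpg)     # number of observations
mpg_range <- range(mtcars$mpg)  # smallest and largest mpg value
```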

## Overlaying a Normal Curve on a Histogram in Base R

Overlaying a Normal curve on a histogram in base R uses a combination of the hist, mean, sd, dnorm, and lines functions.
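Before plotting anything, it may be useful to see what dnorm actually returns. The sketch below evaluates the Normal density at a few points and compares the peak height with the closed-form value 1 / (sd * sqrt(2 * pi)). All variable names here are illustrative.

```r
# dnorm(x, mean, sd) evaluates the Normal density at x.
peak <- dnorm(0, mean = 0, sd = 1)   # height of the standard Normal at its mean
peak_formula <- 1 / sqrt(2 * pi)     # closed-form value of the same height

# dnorm is vectorized over x, and the density is symmetric around the mean.
x <- c(-1, 0, 1)
heights <- dnorm(x, mean = 0, sd = 1)

# The same idea with the sample mean and standard deviation of mpg.
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)
height_at_mean <- dnorm(mpg_mean, mean = mpg_mean, sd = mpg_sd)
```

Note that these values are densities, not counts, which is why the curve has to be rescaled before it can be drawn over a frequency histogram.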

### Creating a Histogram

First, let’s create a histogram for the mpg column. The hist function returns a list of values which we will use later, so we need to save the output:

hist_data <- hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon", ylab = "Frequency", col = "lightblue", border = "black")

In this code snippet, the main, xlab, and ylab parameters are used to set the title of the histogram, the x-axis label, and the y-axis label, respectively. The col parameter is used to set the color of the bars, and border sets the color of the border around the bars.

### Overlaying a Normal Curve

To overlay a Normal curve, we first need to calculate the mean (mean) and standard deviation (sd) of the data. The dnorm function is then used to generate the y-coordinates of the Normal curve based on these values:

mean_mpg <- mean(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)
curve_density <- dnorm(hist_data$mids, mean = mean_mpg, sd = sd_mpg)
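The object returned by hist carries the pieces used in this section. As a rough sketch, the components can be inspected without drawing anything by passing plot = FALSE. The names h and bin_width are illustrative, and the bin-width line assumes equal-width bins, which is what hist produces with its default settings.

```r
# plot = FALSE computes the bins without opening a graphics device.
h <- hist(mtcars$mpg, plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations in each bin
h$mids     # bin midpoints, the x-values at which dnorm is evaluated
h$density  # counts rescaled so that the bars have total area 1

# With equal-width bins, the spacing of the edges is the bin width.
bin_width <- diff(h$breaks)[1]

# Relationships between the components.
total_count <- sum(h$counts)                   # equals the number of observations
total_area <- sum(h$density * diff(h$breaks))  # equals 1
```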

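Finally, we add the Normal curve to the histogram using the lines function. Because dnorm returns densities while the histogram shows counts, we need to rescale the y-coordinates of the curve to match the histogram: the expected count in a bin is approximately the density multiplied by the bin width and the number of observations. The bin width is taken as the distance between the first two midpoints, which is valid when all bins have the same width: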

curve_height <- curve_density * diff(hist_data$mids[1:2]) * length(mtcars$mpg)
lines(hist_data$mids, curve_height, col = "darkblue", lwd = 2)

The col and lwd parameters in the lines function set the color and line width of the Normal curve, respectively.

## Overlaying a Normal Curve on a Histogram with ggplot2

While base R provides the necessary functionality, the ggplot2 package can create more aesthetically pleasing and customizable graphics. To use ggplot2, you first need to install it (once) and load it into your R environment:

install.packages("ggplot2")
library(ggplot2)

### Creating a Histogram

The ggplot function initializes a ggplot object, and the geom_histogram function adds a histogram layer:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "lightblue", bins = 30) +
  labs(title = "Histogram of MPG with Normal Curve", x = "Miles Per Gallon", y = "Density")

The aes function maps the mpg variable to the x-axis, and y = after_stat(density) sets the y-axis to represent density rather than frequency. Older code writes this as y = ..density.., a notation that ggplot2 has deprecated since version 3.4.0. The colour, fill, and bins parameters in geom_histogram set the border color, fill color, and number of bins, respectively.

### Overlaying a Normal Curve

The stat_function function overlays the Normal curve, and geom_density adds a kernel density estimate of the data for comparison:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "lightblue", bins = 30) +
  geom_density(colour = "darkblue", lwd = 1.5) +
  stat_function(fun = dnorm, args = list(mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)), colour = "red", lwd = 1.5) +
  labs(title = "Histogram of MPG with Normal Curve", x = "Miles Per Gallon", y = "Density")

geom_density adds a kernel density estimate computed from the mpg data (the dark blue curve), and stat_function adds the theoretical Normal density based on the sample mean and standard deviation of the mpg data (the red curve). Comparing the two curves shows where the data departs from a Normal shape. The colour and lwd parameters set the color and line width of the curves.
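If you would rather keep counts on the y-axis, as in the base R example, one possible approach is to rescale the Normal density by the number of observations and the bin width before passing it to stat_function. The sketch below is only one way to do this. The bin width of 2.5 and the helper name normal_counts are assumptions made for illustration, and the plot is only drawn if ggplot2 is installed.

```r
bin_width <- 2.5              # hypothetical bin width chosen for this sketch
n <- nrow(mtcars)             # number of observations
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# Expected count per bin under the fitted Normal:
# density * number of observations * bin width.
normal_counts <- function(x) {
  dnorm(x, mean = mpg_mean, sd = mpg_sd) * n * bin_width
}

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = bin_width, colour = "black", fill = "lightblue") +
    stat_function(fun = normal_counts, colour = "red", lwd = 1.5) +
    labs(title = "Histogram of MPG (Counts) with Normal Curve",
         x = "Miles Per Gallon", y = "Frequency")
  print(p)
}
```

Setting binwidth explicitly, rather than bins, keeps the scaling factor in normal_counts in step with the histogram. If you change one, change the other.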

## Conclusion

Overlaying a Normal curve on a histogram is a common task when exploring data distributions. Both base R and ggplot2 offer robust functionality to create these plots, with ggplot2 offering more customization options.
