In statistics, a histogram is an efficient graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. When we construct a histogram, it’s quite common to normalize it to give a relative frequency histogram, which shows the proportion of the dataset that falls within each bin.

This article will guide you through the process of creating a relative frequency histogram in R, using both the base R package and the ggplot2 package. We will also cover some advanced options for customization, enabling you to create highly detailed and useful plots.

## 1. Understanding Histograms and Relative Frequencies

In a standard histogram, the y-axis represents the absolute frequency of data points within each bin. Bins are defined by dividing the range of data into equal intervals.

In a relative frequency histogram, by contrast, the y-axis represents the proportion of all observations that fall within each bin, so the bar heights sum to 1. (In a density histogram, each proportion is additionally divided by the bin width, so the bar areas sum to 1 instead.) Relative frequencies are especially helpful when you want to compare distributions with differing numbers of observations.
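The difference between the two normalizations can be checked numerically. The following is a minimal sketch in base R; the sample `x`, the seed, and the variable names are arbitrary choices made for illustration.

```r
# Illustrative sample; the seed and sample size are arbitrary
set.seed(42)
x <- rnorm(200)

# Compute bin counts without drawing anything
h <- hist(x, plot = FALSE)

# Relative frequency: share of observations in each bin
rel_freq <- h$counts / sum(h$counts)

# Density: relative frequency divided by bin width
bin_width <- diff(h$breaks)
dens <- rel_freq / bin_width

sum(rel_freq)               # heights of the relative frequency bars sum to 1
sum(dens * bin_width)       # areas of the density bars sum to 1
all.equal(dens, h$density)  # same values as the density computed by hist()
```

Because the default bins all have the same width, the two scales differ only by a constant factor, which is why the shapes of the two histograms look identical.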

## 2. Creating a Relative Frequency Histogram in Base R

Creating a relative frequency histogram in base R involves a two-step process: generating the histogram and then modifying it to display relative frequencies. Here’s how to do it:

```
# Generate a sample dataset (the seed makes the result reproducible)
set.seed(123)
data <- rnorm(1000)
# Compute the histogram without drawing it
h <- hist(data, plot = FALSE)
# Transform counts into relative frequencies
h$counts <- h$counts / sum(h$counts)
# Plot the relative frequency histogram (freq = TRUE draws h$counts)
plot(h, freq = TRUE, main = "Relative Frequency Histogram", xlab = "Bins", ylab = "Relative Frequency")
```

Here’s what happens in this code:

- `rnorm(1000)` draws 1000 random values from a standard normal distribution and stores them in `data`; `set.seed(123)` makes the draw reproducible.
- `hist(data, plot = FALSE)` computes the histogram (breaks, counts, and densities) without plotting it and stores the result in `h`.
- `h$counts <- h$counts / sum(h$counts)` replaces the absolute frequencies stored in `h$counts` with relative frequencies.
- `plot(h, freq = TRUE, main = "Relative Frequency Histogram", xlab = "Bins", ylab = "Relative Frequency")` plots the relative frequency histogram.

The `freq = TRUE` argument in `plot()` tells R to draw the values stored in `h$counts`, which now hold the relative frequencies. With `freq = FALSE`, `plot()` would draw `h$density` instead, so the bars would show densities (whose areas, not heights, sum to 1) and the change made to `h$counts` would have no visible effect.
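The base R steps can also be wrapped in a small reusable function. This is only a sketch: `rel_freq_hist()` is a hypothetical helper name, not a base R function, and it assumes equal-width bins, which is what `hist()` produces by default.

```r
# Hypothetical helper (not part of base R): draws a relative frequency
# histogram of x and invisibly returns the modified histogram object.
rel_freq_hist <- function(x, breaks = "Sturges", ...) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  # Replace absolute counts with proportions of the total
  h$counts <- h$counts / sum(h$counts)
  # freq = TRUE makes plot() draw h$counts rather than h$density
  plot(h, freq = TRUE, ylab = "Relative Frequency", ...)
  invisible(h)
}

# Usage with an arbitrary illustrative sample
set.seed(1)
sample_values <- rnorm(500)
h_rel <- rel_freq_hist(sample_values, breaks = 20,
                       main = "Relative Frequency Histogram", xlab = "Value")
sum(h_rel$counts)  # bar heights sum to 1
```

Returning the histogram object invisibly keeps the call quiet at the console while still letting you reuse the breaks and relative frequencies afterwards.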

## 3. Creating a Relative Frequency Histogram with ggplot2

ggplot2 is a powerful package for creating high-quality plots in R. It offers a more direct way of creating a relative frequency histogram: the histogram layer computes per-bin statistics such as `count` and `density`, and these can be mapped straight to the y-axis.

First, ensure that you’ve installed and loaded the ggplot2 package:

```
install.packages("ggplot2")
library(ggplot2)
```

Then, you can create a relative frequency histogram as follows:

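```
# Generate a data frame from the dataset
df <- data.frame(data)
# Create a relative frequency histogram:
# each bar's height is its count divided by the total count
ggplot(df, aes(x = data)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 30, color = "black", fill = "skyblue") +
  labs(title = "Relative Frequency Histogram", x = "Bins", y = "Relative Frequency")
```

In this code:

- `data.frame(data)` wraps the `data` vector in a data frame with a single column named `data`.
- `ggplot(df, aes(x = data))` initializes the ggplot2 object and maps the `data` column to the x-axis.
- `geom_histogram(aes(y = after_stat(count / sum(count))), bins = 30, color = "black", fill = "skyblue")` adds a histogram layer. `count` is the number of observations the layer computes for each bin, and `after_stat()` (available since ggplot2 3.3.0) lets the y aesthetic use it, so each bar's height is its share of all observations. The `bins` argument sets the number of bins, and `color` and `fill` define the outline and fill colors of the bars.
- `labs(title = "Relative Frequency Histogram", x = "Bins", y = "Relative Frequency")` adds labels to the plot.

Note that the older `..density..` notation, now written `after_stat(density)`, additionally divides each proportion by the bin width. It produces a density histogram, in which the bar areas rather than the bar heights sum to 1.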
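To confirm which quantity a ggplot2 histogram actually draws, the computed bar heights can be inspected with `layer_data()`. The sketch below assumes ggplot2 3.3.0 or later (for `after_stat()`); the sample and the object names are illustrative.

```r
library(ggplot2)

# Illustrative sample; the seed is arbitrary
set.seed(123)
df_check <- data.frame(value = rnorm(1000))

# Relative frequency scale: count in each bin divided by the total count
p_rel <- ggplot(df_check, aes(x = value)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 30)

# Density scale: after_stat(density) is the current spelling of ..density..
p_dens <- ggplot(df_check, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

# Extract the data frame that each histogram layer draws from
bars_rel <- layer_data(p_rel)
bars_dens <- layer_data(p_dens)

sum(bars_rel$y)                                       # heights sum to 1
sum(bars_dens$y * (bars_dens$xmax - bars_dens$xmin))  # areas sum to 1
```

This normalization divides by the total count in the whole layer, so it suits a single ungrouped histogram; grouped or faceted plots would need a per-group total instead.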

## 4. Advanced Customization of Relative Frequency Histograms

Both base R and ggplot2 provide numerous options for customizing histograms, such as modifying colors, bin widths, and adding statistical overlays. Here are some examples:

### 4.1 Adding a Density Curve to a ggplot2 Histogram

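You can add a density curve to the histogram to visualize the estimated probability density function of the data. Because `geom_density()` draws on the density scale, the histogram has to use that scale as well, so the y aesthetic is mapped to the computed `density` (written `after_stat(density)`; older code uses `..density..`) and the axis is labeled accordingly:

```
# Both layers use the density scale, so bar areas and the curve's area each equal 1
ggplot(df, aes(x = data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue", alpha = 0.5) +
  geom_density(color = "red") +
  labs(title = "Histogram with Density Curve", x = "Bins", y = "Density")
```

In this code, `geom_density(color = "red")` adds a kernel density estimate to the plot in red, and `alpha = 0.5` makes the bars semi-transparent so the curve stays visible.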
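A comparable overlay can be drawn in base R. This sketch assumes a density-scaled histogram (`freq = FALSE`) so that the bars and the kernel density estimate returned by `density()` share the same y-axis; the sample and the `kde` name are illustrative.

```r
# Illustrative sample; the seed is arbitrary
set.seed(123)
x <- rnorm(1000)

# Density-scaled histogram: bar areas sum to 1
hist(x, freq = FALSE, col = "skyblue", border = "black",
     main = "Histogram with Density Curve", xlab = "Value")

# Kernel density estimate with the default bandwidth
kde <- density(x)

# Draw the estimate on top of the existing histogram
lines(kde, col = "red", lwd = 2)
```

If the curve's peak is cut off by the plot region, passing a larger `ylim` to `hist()` leaves room for it.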

### 4.2 Adding Labels to Bars in a Base R Histogram

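In base R, you can add relative frequency labels to each bar:

```
h <- hist(data, plot = FALSE)
h$counts <- h$counts / sum(h$counts)
# freq = TRUE draws h$counts; the extra headroom in ylim keeps the top label visible
plot(h, freq = TRUE, ylim = c(0, max(h$counts) * 1.1), main = "Relative Frequency Histogram", xlab = "Bins", ylab = "Relative Frequency")
# Add labels to bars
text(h$mids, h$counts, labels = round(h$counts, 2), pos = 3, cex = 0.8)
```

The `text()` function adds text to the plot. `h$mids` and `h$counts` specify the x and y coordinates for the labels, `labels = round(h$counts, 2)` defines the labels as the rounded relative frequencies, `pos = 3` places the labels above the bars, and `cex = 0.8` scales the text to 80% of the default size. The plot uses `freq = TRUE` so that the bar heights are the relative frequencies in `h$counts` and therefore line up with the labels; with `freq = FALSE` the bars would be drawn at density heights and the labels would no longer sit on top of them.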
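As an alternative to a separate `text()` call, the `plot()` method for histogram objects accepts a `labels` argument, which can be a character vector drawn above the bars. The sketch below uses it to show percentages; the sample and the `pct_labels` name are illustrative.

```r
# Illustrative sample; the seed is arbitrary
set.seed(123)
x <- rnorm(1000)

# Histogram object with counts converted to relative frequencies
h <- hist(x, plot = FALSE)
h$counts <- h$counts / sum(h$counts)

# One label per bar, formatted as a percentage
pct_labels <- paste0(round(100 * h$counts, 1), "%")

# labels = <character vector> draws the given strings above the bars
plot(h, freq = TRUE, labels = pct_labels,
     ylim = c(0, max(h$counts) * 1.1),
     main = "Relative Frequency Histogram", xlab = "Value",
     ylab = "Relative Frequency")
```

This keeps the labels tied to the plot call itself, at the cost of less control over their position and size than `text()` offers.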

In conclusion, whether you’re using base R or ggplot2, you can effectively create and customize relative frequency histograms in R. These histograms are excellent tools for understanding the distribution and density of your data, providing valuable insights for your data analysis.