How to Create a Pareto Chart in R

A Pareto chart combines bars and a line graph: individual values are shown as bars in descending order, and the cumulative total is shown by the line. Pareto charts are based on the Pareto Principle, also known as the 80/20 rule, which states that roughly 80% of the effects come from 20% of the causes. This type of chart is often used in quality control to identify the most critical issues or causes.

R, being a versatile language for statistical analysis, provides numerous ways to create a Pareto chart. In this article, we will explore two approaches: creating a Pareto chart using base R functions, and using the ggplot2 package.

Creating a Pareto Chart Using Base R

In this section, we will create a Pareto chart using only the base R functions. Here are the steps involved:

1. Creating a Dataset: First, we create a simple dataset.
# Create a dataset
set.seed(123)
category <- LETTERS[1:10]
frequency <- sample(100:200, 10)
df <- data.frame(category, frequency)

In this dataset, category represents different causes or issues, and frequency represents the number of occurrences of each cause.
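In practice, the frequencies are often not given directly but have to be counted from raw records. The following is a minimal sketch of that counting step, assuming a made-up vector of defect labels (the names defects, counts and raw_df are illustrative and do not appear elsewhere in this article):

```r
# Hypothetical raw log: one entry per observed defect (made-up values)
defects <- c("Scratch", "Dent", "Scratch", "Crack", "Scratch", "Dent", "Scratch")

# table() counts the occurrences of each distinct value
counts <- table(defects)

# Build a data frame with the same column names used in this article
raw_df <- data.frame(category = names(counts),
                     frequency = as.vector(counts))
print(raw_df)
```

The resulting data frame has the same shape as df, so the sorting and cumulative steps apply to it unchanged.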

2. Sorting and Cumulative Frequency Calculation: The next step is to sort the data in descending order of frequency and calculate the cumulative frequency.
# Sort the data and calculate the cumulative frequency
df <- df[order(-df$frequency),]
df$cumulative_frequency <- cumsum(df$frequency)

3. Calculating the Cumulative Percentage: Now, we need to calculate the cumulative percentage of the frequencies.
# Calculate the cumulative percentage
df$cumulative_percentage <- df$cumulative_frequency / sum(df$frequency) * 100
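The cumulative percentage can also be used directly to list the categories that account for the bulk of the occurrences. The following is a minimal sketch, assuming an 80% cutoff; the names threshold, cutoff_index and vital_few are illustrative choices, and the dataset is rebuilt so the snippet runs on its own:

```r
# Rebuild the example dataset so the snippet is self-contained
set.seed(123)
df <- data.frame(category = LETTERS[1:10],
                 frequency = sample(100:200, 10))

# Sort in descending order and compute the cumulative percentage
df <- df[order(-df$frequency), ]
df$cumulative_percentage <- cumsum(df$frequency) / sum(df$frequency) * 100

# Assumed cutoff: the conventional 80% from the Pareto Principle
threshold <- 80

# Index of the first row at which the cumulative percentage reaches the cutoff
cutoff_index <- which(df$cumulative_percentage >= threshold)[1]

# Categories up to and including that row
vital_few <- df$category[seq_len(cutoff_index)]
print(vital_few)
```

Because this sample draws every frequency from a narrow range, many categories are needed to reach the cutoff; real defect counts are usually far more skewed, so the list tends to be shorter.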

4. Creating the Pareto Chart: Finally, we create the Pareto chart using the barplot() and lines() functions.

# Create the Pareto chart
par(mar = c(5, 4, 4, 5) + 0.1)  # widen the right margin for the secondary axis
bp <- barplot(df$frequency, names.arg = df$category, las=2, col="skyblue",
              main="Pareto Chart", xlab="Category", ylab="Frequency")
usr <- par("usr")  # x-range of the bar plot, reused so the line sits over the bars
par(new=TRUE)
plot(bp, df$cumulative_percentage, type="o", col="red", axes=FALSE, ann=FALSE,
     xlim=usr[1:2], xaxs="i", ylim=c(0, 100))
axis(side=4)
mtext(side=4, line=3, 'Cumulative Percentage')

In this code, barplot() creates a bar plot of frequencies and returns the midpoints of the bars, par(new=TRUE) allows us to draw another plot on top of the current one, plot() draws the cumulative percentage as a line at those midpoints, and axis() and mtext() add a secondary y-axis for the cumulative percentage.

Creating a Pareto Chart Using ggplot2

While the base R functions provide a straightforward way to create a Pareto chart, the ggplot2 package offers a more flexible and powerful way to create and customize the chart. Here are the steps involved:

1. Installing and Loading the ggplot2 Package: The first step is to install and load the ggplot2 package.
# Install (only needed once)
install.packages("ggplot2")
# Load
library(ggplot2)

2. Creating a Dataset, Sorting and Cumulative Frequency Calculation: The steps are similar to the base R approach.
# Create a dataset
set.seed(123)
category <- LETTERS[1:10]
frequency <- sample(100:200, 10)
df <- data.frame(category, frequency)

# Sort the data and calculate the cumulative frequency
df <- df[order(-df$frequency),]
df$cumulative_frequency <- cumsum(df$frequency)

# Calculate the cumulative percentage
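df$cumulative_percentage <- df$cumulative_frequency / sum(df$frequency) * 100

3. Creating the Pareto Chart: Finally, we create the Pareto chart using the ggplot() and geom_bar() functions for the bar plot, and geom_line() and geom_point() for the line plot. Two adjustments are needed first. ggplot2 orders a character x-axis alphabetically, so we convert category to a factor whose levels follow the sorted order. We also rescale the cumulative percentage so that 100% lines up with the tallest bar, which is the mapping that the secondary axis reverses.
# Keep the bars in descending order of frequency
df$category <- factor(df$category, levels = df$category)
# Factor that maps the 0-100 percentage scale onto the frequency axis
scale_factor <- max(df$frequency) / 100

# Create the Pareto chart
ggplot(df, aes(x = category)) +
  geom_bar(aes(y = frequency), stat="identity", fill="skyblue") +
  geom_line(aes(y = cumulative_percentage * scale_factor, group = 1), colour="red") +
  geom_point(aes(y = cumulative_percentage * scale_factor), colour="red") +
  scale_y_continuous(sec.axis = sec_axis(~ . / scale_factor, name = "Cumulative Percentage")) +
  labs(title="Pareto Chart", x="Category", y="Frequency") +
  theme_minimal()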

In this code, geom_bar() creates the bar plot, geom_line() and geom_point() create the line plot, scale_y_continuous() adds a secondary y-axis for the cumulative percentage, labs() adds the title and axis labels, and theme_minimal() sets the theme of the plot.
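Both approaches repeat the same preparation steps (sorting, cumulative frequency, cumulative percentage), so they can be collected in one place. The following is a sketch of such a helper, not part of base R or ggplot2; the function name pareto_table and its freq_col argument are hypothetical, and the small dataset is made up for illustration:

```r
# Hypothetical helper: returns the data sorted by descending frequency,
# with cumulative frequency and cumulative percentage columns added.
# `data` is a data frame; `freq_col` names its numeric count column.
pareto_table <- function(data, freq_col = "frequency") {
  # Sort rows by descending frequency
  out <- data[order(-data[[freq_col]]), , drop = FALSE]
  # Running total and its share of the grand total, in percent
  out$cumulative_frequency <- cumsum(out[[freq_col]])
  out$cumulative_percentage <- out$cumulative_frequency / sum(out[[freq_col]]) * 100
  out
}

# Usage with a small made-up dataset
example <- data.frame(category = c("A", "B", "C"),
                      frequency = c(10, 60, 30))
result <- pareto_table(example)
print(result)
```

The returned data frame has the same columns as df in the steps above, so it can be passed to either the base R or the ggplot2 plotting code.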

Conclusion

A Pareto chart is a helpful tool in quality control and business decision-making, allowing us to focus on the most critical issues. This article demonstrated two methods of creating a Pareto chart in R, one using base R functions and the other using the ggplot2 package. Each method has its advantages: the base R approach is straightforward and requires no additional packages, while the ggplot2 approach provides more control over the appearance of the chart. Choose the method that suits your needs best. Remember, the essence of the Pareto chart is its principle: to prioritize the few significant over the many insignificant.
