Categorical data appears in nearly every data science and research workflow. Grouping and summarizing observations by category reveals how a dataset is structured. This article explores how to plot categorical data in R.
What is Categorical Data?
Before we dive into plotting, it’s crucial to understand what categorical data is. Categorical data, also known as qualitative data, groups information into various categories or levels. These levels do not have a numerical or quantitative value but instead denote a characteristic or quality. For instance, the color of a car (red, blue, green, etc.) is categorical, as is a person’s profession (teacher, engineer, doctor, etc.).
Categorical data can further be broken down into two types:
- Nominal data: This type of categorical data has no order or hierarchy. For example, the color of a car is nominal because there is no ranking between colors.
- Ordinal data: This type of data has an implied order. A common example is rating a restaurant on a scale from 1 to 5. Here, 5 is better than 1, indicating an inherent order.
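In R, both kinds of categorical data are typically stored as factors. The sketch below uses made-up values (car_color and rating are illustrative names, not part of any dataset used later) to show how an unordered factor and an ordered factor differ.

```r
# Nominal: an unordered factor; the levels have no ranking
car_color <- factor(c("red", "blue", "green", "blue"))
levels(car_color)      # levels are stored alphabetically by default
is.ordered(car_color)  # FALSE

# Ordinal: an ordered factor; levels = 1:5 fixes the ranking explicitly
rating <- factor(c(3, 5, 1, 4), levels = 1:5, ordered = TRUE)
is.ordered(rating)     # TRUE
rating[2] > rating[3]  # TRUE: a rating of 5 ranks above a rating of 1
```

Comparison operators such as > only make sense for ordered factors; applied to an unordered factor they return NA with a warning.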
For this tutorial, we’ll use the mtcars dataset, a built-in dataset in R that comprises various car attributes.
Let’s convert some of the numerical data into categorical data to illustrate our examples better. We’ll create a new categorical variable, ‘mpg_level’, derived from the ‘mpg’ variable. This new variable will categorize the miles per gallon (mpg) into three levels.
library(dplyr)

# Add a categorical variable that bins mpg into three levels
mtcars <- mtcars %>%
  mutate(mpg_level = case_when(
    mpg <= 20            ~ 'Low',
    mpg > 20 & mpg <= 30 ~ 'Medium',
    mpg > 30             ~ 'High'
  ))
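It is worth checking how many cars land in each bin before plotting. As a sketch, the same binning can also be done with base R's cut(), which returns a factor whose levels follow the order given in labels. The name mpg_bins is introduced here purely for illustration.

```r
# Bin mpg with the same thresholds: (-Inf, 20], (20, 30], (30, Inf)
mpg_bins <- cut(
  mtcars$mpg,
  breaks = c(-Inf, 20, 30, Inf),
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE  # treat the bins as ordinal
)

# Count the cars in each bin
table(mpg_bins)
```

With the built-in mtcars data this gives 18 Low, 10 Medium, and 4 High cars, so the High group is small and its summaries should be read with care.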
A bar plot is one of the most common ways to visualize categorical data. It displays the category levels on one axis and the count of records in each category on the other.
Let’s create a basic bar plot for the ‘mpg_level’ variable:
library(ggplot2)

# geom_bar() counts the rows in each mpg_level for us
ggplot(mtcars, aes(x = mpg_level)) +
  geom_bar() +
  labs(title = "Basic Bar Plot",
       x = "Miles Per Gallon Level",
       y = "Count")
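When a grouping variable such as mpg_level is stored as text, ggplot2 arranges the bars alphabetically (High, Low, Medium). One possible refinement, sketched below, is to convert the variable to a factor with explicit levels and to split each bar by a second categorical variable. The copy named cars and the choice of cyl as the second variable are assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2)

# Work on a copy so the built-in dataset is left untouched
cars <- mtcars %>%
  mutate(
    mpg_level = case_when(
      mpg <= 20 ~ "Low",
      mpg <= 30 ~ "Medium",
      TRUE      ~ "High"
    ),
    # Explicit levels control the order of the bars
    mpg_level = factor(mpg_level, levels = c("Low", "Medium", "High")),
    cyl = factor(cyl)
  )

p_bar <- ggplot(cars, aes(x = mpg_level, fill = cyl)) +
  geom_bar(position = "dodge") +  # side-by-side bars per cylinder count
  labs(title = "Cars by MPG Level and Cylinders",
       x = "Miles Per Gallon Level",
       y = "Count",
       fill = "Cylinders")
p_bar
```

Using position = "dodge" places the bars next to each other; omitting it stacks them, which emphasizes the total per level instead.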
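A box plot, or box-and-whisker plot, shows how a quantitative variable is distributed within each level of a categorical variable, which makes it easy to compare groups. Each box marks the median and the first and third quartiles. By default, the whiskers in ggplot2 extend to the most extreme values within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual outlier point.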
# Compare the distribution of horsepower across mpg levels
ggplot(mtcars, aes(x = mpg_level, y = hp)) +
  geom_boxplot() +
  labs(title = "Box Plot",
       x = "Miles Per Gallon Level",
       y = "Horsepower")
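The statistics a box plot draws can also be computed directly, which helps when reading the figure. The following is a minimal sketch that recreates the grouping variable with cut() so the snippet runs on its own; hp_summary and its column names are illustrative.

```r
library(dplyr)

hp_summary <- mtcars %>%
  mutate(mpg_level = cut(mpg, breaks = c(-Inf, 20, 30, Inf),
                         labels = c("Low", "Medium", "High"))) %>%
  group_by(mpg_level) %>%
  summarise(
    count     = n(),
    q1        = quantile(hp, 0.25),  # lower edge of the box
    hp_median = median(hp),          # line inside the box
    q3        = quantile(hp, 0.75),  # upper edge of the box
    .groups   = "drop"
  )
hp_summary
```

Printing the group sizes alongside the quartiles is a useful habit, because a box drawn from only a handful of observations looks just as authoritative as one drawn from many.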
Violin plots combine features of box plots and kernel density plots. The width of each violin at a given value reflects the estimated density of the data there, so the shape reveals features such as skew or multiple peaks that a box plot hides.
# Each violin shows a mirrored density estimate of hp for one mpg level
ggplot(mtcars, aes(x = mpg_level, y = hp)) +
  geom_violin() +
  labs(title = "Violin Plot",
       x = "Miles Per Gallon Level",
       y = "Horsepower")
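A common variation, shown here only as a sketch, is to draw a narrow box plot inside each violin so that the median and quartiles are visible alongside the density shape. The copy named cars, the fill color, and the box width are arbitrary choices for this illustration.

```r
library(dplyr)
library(ggplot2)

# Recreate the grouping variable on a copy so the snippet runs on its own
cars <- mtcars %>%
  mutate(mpg_level = cut(mpg, breaks = c(-Inf, 20, 30, Inf),
                         labels = c("Low", "Medium", "High")))

p_violin <- ggplot(cars, aes(x = mpg_level, y = hp)) +
  geom_violin(fill = "grey85") +
  geom_boxplot(width = 0.1) +  # narrow box plot drawn inside each violin
  labs(title = "Violin Plot with Box Plot",
       x = "Miles Per Gallon Level",
       y = "Horsepower")
p_violin
```

Layer order matters: the box plot is added after the violin so that it is drawn on top.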
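Pie charts represent each category's share of the whole by dividing a circle into proportional segments. ggplot2 has no dedicated pie geom, so a pie chart is built as a single stacked bar drawn in polar coordinates.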
# Count cars per level, draw one stacked bar, then bend it into a circle
mtcars %>%
  count(mpg_level) %>%
  ggplot(aes(x = "", y = n, fill = mpg_level)) +
  geom_col(width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Pie Chart",
       x = "",
       y = "Count",
       fill = "Miles Per Gallon Level")
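Pie charts are easier to read when each slice is labeled with its share. The sketch below is one way to do that, assuming percentage labels formatted with sprintf() and placed at the middle of each slice; pie_data and label are illustrative names, and theme_void() is used to hide the axes.

```r
library(dplyr)
library(ggplot2)

# Count cars per level and build a percentage label for each slice
pie_data <- mtcars %>%
  mutate(mpg_level = cut(mpg, breaks = c(-Inf, 20, 30, Inf),
                         labels = c("Low", "Medium", "High"))) %>%
  count(mpg_level) %>%
  mutate(label = sprintf("%.1f%%", 100 * n / sum(n)))

p_pie <- ggplot(pie_data, aes(x = "", y = n, fill = mpg_level)) +
  geom_col(width = 1) +
  # position_stack(vjust = 0.5) centers each label within its slice
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart with Percentages",
       fill = "Miles Per Gallon Level") +
  theme_void()
p_pie
```

When the categories are numerous or their shares are close, a bar plot of the same counts is usually easier to compare than a pie chart.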
Visualizing categorical data is crucial in the data analysis process, offering insights that raw data fails to communicate. By plotting categorical data, we can make these insights tangible, meaningful, and ready for interpretation. While the techniques we’ve covered are essential, there are many more ways to visualize categorical data in R.
Remember, each dataset is unique and may require different techniques for visualization. The trick lies in understanding your data and choosing the plots that best represent the insights you seek.