The summary() function in the R programming language is a versatile function that gives you a quick and insightful glimpse into the central tendencies and variations of your data. It is an essential tool in the data analysis phase, and every data scientist or analyst working with R should be familiar with it.
This comprehensive guide will discuss how to use the summary() function in R. We’ll start with the basic syntax and structure, progress into various applications and examples, and also discuss some common issues and potential solutions.
Basic Syntax of the summary() Function
The primary syntax of the summary() function in R is as follows:
summary(object, ...)
The function’s parameters are:
- object: This can be any R object, such as a vector, matrix, data frame, or even a model. summary() is a generic function, so the method that runs, and therefore the kind of summary returned, depends on the class of the object.
- ...: These are additional arguments passed on to the method for that class, so which ones are accepted also depends on the class of the object.
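As a brief illustration of how the extra arguments vary by class, the sketch below passes digits to the method for numeric vectors and maxsum to the method for factors. Both arguments belong to base R's summary() methods; the vectors x and f are made up for this example and do not come from the article.

```r
# Hypothetical example data
x <- c(1.23456, 2.34567, 3.45678)
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# digits controls how many significant digits are shown for a numeric summary
summary(x, digits = 3)

# maxsum limits how many levels are listed; the remaining levels are pooled
# into a single "(Other)" entry
s <- summary(f, maxsum = 3)
s
```

With maxsum = 3, the two most frequent levels are kept and the rest are counted together, which helps when a factor has many rare levels.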
Basic Usage of the summary() Function
Let’s start by looking at how the summary() function works with basic data types, like vectors and matrices.
Example 1 – Numerical Vector
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
summary(numbers)
In this example, the function computes and returns the minimum value, the first quartile (25th percentile), the median (50th percentile), the mean, the third quartile (75th percentile), and the maximum value of the vector.
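The result is a named vector, so individual statistics can be pulled out for later use. A minimal sketch, reusing the numbers vector from the example (the variable name s is only illustrative):

```r
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
s <- summary(numbers)
s
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    3.25    5.50    5.50    7.75   10.00

# Individual statistics can be extracted by name
s[["Median"]]   # 5.5
s[["3rd Qu."]]  # 7.75
```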
Example 2 – Categorical Vector
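animals <- factor(c("cat", "dog", "cat", "bird", "dog", "cat", "bird", "bird"))
summary(animals)
When used with a factor, summary() returns the number of observations in each level. A plain character vector is not tabulated: summary() reports only its length, class, and mode, so convert it with factor() first.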
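The difference between a character vector and a factor is easy to miss, so here is a short sketch that runs summary() on the same values stored both ways. The names char_summary and factor_summary are only illustrative.

```r
animals <- c("cat", "dog", "cat", "bird", "dog", "cat", "bird", "bird")

# Character vector: only length, class, and mode are reported
char_summary <- summary(animals)
char_summary

# Factor: one count per level, with levels sorted alphabetically by default
factor_summary <- summary(factor(animals))
factor_summary
# bird  cat  dog
#    3    3    2
```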
Example 3 – Matrix
mat <- matrix(1:10, nrow = 5, ncol = 2)
summary(mat)
In the case of a matrix, summary() applies to each column of the matrix separately.
Using summary() with Data Frames
One of the most common uses of the summary() function is with data frames, which allows us to get a summary of each column in the data frame.
Example 4 – Data Frame
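data_frame <- data.frame(
  Numbers = c(1, 2, 3, 4, 5),
  Animals = factor(c("cat", "dog", "cat", "bird", "dog"))
)
summary(data_frame)
In this example, summary() provides a six-number summary (minimum, first quartile, median, mean, third quartile, and maximum) for the “Numbers” column, and a count per level for the “Animals” column. The “Animals” column is created as a factor because, since R 4.0.0, data.frame() keeps strings as character vectors by default, and summary() reports only the length, class, and mode of a character column.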
Using summary() with Models
The summary() function is also used in R to provide diagnostic information about model objects, such as linear models and generalized linear models.
Example 5 – Linear Model
data(mtcars)
model <- lm(mpg ~ cyl, data = mtcars)
summary(model)
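In this example, summary() provides a full summary of the linear regression model, including the distribution of the residuals, the coefficient table (estimates, standard errors, t-values, and p-values), the residual standard error, the multiple and adjusted R-squared, and the F-statistic.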
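Beyond the printed report, the object returned by summary() for a linear model stores each statistic as a component that can be reused in code. A minimal sketch, using the same model on the built-in mtcars data (the name model_summary is only illustrative):

```r
model <- lm(mpg ~ cyl, data = mtcars)
model_summary <- summary(model)

# Coefficient table: estimate, standard error, t value, and p-value per term
coef(model_summary)

# Single statistics are stored as named components
model_summary$r.squared      # multiple R-squared
model_summary$adj.r.squared  # adjusted R-squared
model_summary$sigma          # residual standard error
model_summary$fstatistic     # F value with its degrees of freedom
```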
Customizing the summary() Function
One of the powerful features of R is the ability to create custom functions. By creating a custom summary function, you can choose what kind of information you want to get from your data.
Example 6 – Custom Summary Function
custom_summary <- function(x) {
  list(
    "mean" = mean(x),
    "median" = median(x),
    "standard deviation" = sd(x),
    "variance" = var(x)
  )
}
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
custom_summary(numbers)
In this example, we create a custom summary function that returns the mean, median, standard deviation, and variance of a numerical vector.
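A custom function like this can also be applied to several columns at once. The sketch below is one possible approach, not the only one: it repeats the definition of custom_summary so that it runs on its own, then uses sapply() on two columns of the built-in mtcars data. The choice of columns and the name result are assumptions made for the example.

```r
custom_summary <- function(x) {
  list(
    "mean" = mean(x),
    "median" = median(x),
    "standard deviation" = sd(x),
    "variance" = var(x)
  )
}

# Apply the function to each selected column; unlist() turns each list into a
# named numeric vector so that sapply() can bind the results into a matrix
result <- sapply(mtcars[c("mpg", "hp")], function(col) unlist(custom_summary(col)))
result
```

The result has one row per statistic and one column per variable, which makes the columns easy to compare side by side.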
Common Issues and Solutions
While the summary() function is a powerful tool, it’s also worth noting some common issues that can arise when using it:
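- Missing Values: For numeric data, summary() computes its statistics from the non-missing values and reports the number of missing values (NA) in an extra NA's entry. The na.rm = TRUE argument is not needed by summary() itself; it matters for functions such as mean(), median(), sd(), and var(), which return NA when the input contains missing values. A custom summary function built on them should pass na.rm = TRUE where required.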
- Large Data: When dealing with large datasets, especially data frames with many columns, the printed output of summary() becomes long and hard to interpret. In such cases, consider summarizing a subset of columns or using more specific summary or visualization techniques for your data.
- Complex Objects: The output from summary() can be quite detailed when used with complex objects like models. Take the time to understand what the output means. For instance, the output for linear models includes important diagnostic measures that can guide the evaluation and interpretation of your model.
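To see how missing values show up in practice, the sketch below runs summary() and mean() on a small made-up vector containing one NA. The vector x is hypothetical and chosen only for illustration.

```r
# Hypothetical vector with one missing value
x <- c(4, 8, NA, 15, 16)

# summary() uses the non-missing values and adds an NA's count
s <- summary(x)
s
s[["NA's"]]            # 1

# mean() returns NA unless told to drop missing values
mean(x)                # NA
mean(x, na.rm = TRUE)  # 10.75
```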
In conclusion, the summary() function is a powerful tool in R that allows quick and insightful descriptive analysis of various R objects. Whether you’re doing preliminary data exploration or performing post-model diagnostics, the summary() function offers a handy and swift way to understand your data and models.