How to Apply the Central Limit Theorem in R

The Central Limit Theorem (CLT) is a fundamental result in statistics. It states that if you repeatedly draw random samples of size n from a population with mean µ and finite standard deviation σ, the distribution of the sample means will be approximately normal, with mean µ and standard deviation σ/√n, once n is sufficiently large. This holds whether the source population is normal or skewed. A common rule of thumb is n ≥ 30, although heavily skewed populations may need larger samples.

In this article, we will guide you through the steps to illustrate the Central Limit Theorem in the R programming language, a powerful tool for statistical analysis.

1. Simulating Data

We will begin by simulating a population of data. In this example, we will generate data from an exponential distribution using the rexp() function. We will use the set.seed() function to ensure that the random numbers generated are reproducible.

set.seed(123)  # For reproducibility
population <- rexp(10000, rate = 0.2)

This code generates 10,000 data points from an exponential distribution with a rate of 0.2. Note that the exponential distribution is strongly right-skewed, so this population is far from normal.
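As a quick sanity check, the mean and standard deviation of an exponential distribution are both 1/rate, so with rate = 0.2 both should come out close to 5. The following is a minimal sketch that recreates the population so it runs on its own; the variable name theoretical is introduced here for illustration only.

```r
set.seed(123)
population <- rexp(10000, rate = 0.2)

# For an exponential distribution, mean and sd both equal 1 / rate
theoretical <- 1 / 0.2

# Compare the simulated summary statistics with the theoretical value
c(mean = mean(population), sd = sd(population), theoretical = theoretical)

# In a right-skewed distribution the median sits below the mean
median(population) < mean(population)
```

If the simulated mean and standard deviation are both near 5 and the median is below the mean, the population behaves as expected for a skewed exponential distribution.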

2. Visualizing the Population Distribution

We can visualize the population distribution to confirm its shape. In this case, we expect an exponential distribution.

library(ggplot2)

ggplot(data.frame(population), aes(population)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill="blue", alpha=0.5) +  # after_stat() replaces the deprecated ..density.. notation
  theme_minimal() +
  labs(x = "Population data", y = "Density", title = "Distribution of Population")

The geom_histogram() function creates the histogram, and we use labs() to label the axes and the plot title.

3. Simulating Sample Means

Next, we will simulate the process of sampling from this population and computing the sample means. According to the CLT, the distribution of these sample means should approach a normal distribution as the sample size increases.

We will create a function that takes as input the population data, the sample size, and the number of samples, and returns a vector of sample means.

sample_means <- function(data, n, m) {
  means <- replicate(m, mean(sample(data, n, replace = TRUE)))
  return(means)
}

In this function, replicate() repeats the sampling and mean calculation m times, and sample() draws each sample of size n from the population data with replacement.
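To see what the function returns, it can help to run it on a tiny input first. The sketch below repeats the helper so it is self-contained and uses a made-up five-value vector (toy is a hypothetical name, not part of the original example).

```r
# Same helper as above, repeated so this snippet runs on its own
sample_means <- function(data, n, m) {
  replicate(m, mean(sample(data, n, replace = TRUE)))
}

set.seed(1)
toy <- c(2, 4, 6, 8, 10)   # hypothetical toy population
out <- sample_means(toy, n = 3, m = 4)

# One mean per sample, so the result has m = 4 elements
length(out)

# Every sample mean must lie within the range of the data
all(out >= min(toy) & out <= max(toy))
```

The result is a plain numeric vector, which is why it can be passed straight to data.frame() and ggplot() in the next step.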

4. Computing and Visualizing the Sample Means

We can now apply our function to compute the sample means. For instance, let’s take 1,000 samples of size 50 from our population and calculate the means.

set.seed(123)
sample_size <- 50
num_samples <- 1000
means <- sample_means(population, sample_size, num_samples)

Then, let’s visualize the distribution of these sample means.

ggplot(data.frame(means), aes(means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill="blue", alpha=0.5) +  # after_stat() replaces the deprecated ..density.. notation
  theme_minimal() +
  labs(x = "Sample means", y = "Density", title = "Distribution of Sample Means")

Consistent with the Central Limit Theorem, the distribution of sample means looks roughly normal, even though the population itself follows a skewed exponential distribution.
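The histogram gives a visual impression, but the CLT also makes a numerical prediction: the sample means should center on the population mean and have a standard deviation of about σ/√n. The sketch below checks this with the same seed and sizes as above and adds a base R normal Q-Q plot; the variable names n and predicted_se are introduced here for illustration.

```r
set.seed(123)
population <- rexp(10000, rate = 0.2)
n <- 50

# Draw 1,000 samples of size n with replacement and keep each sample mean
means <- replicate(1000, mean(sample(population, n, replace = TRUE)))

# CLT prediction for the standard deviation of the sample means
predicted_se <- sd(population) / sqrt(n)

c(mean_of_means = mean(means), population_mean = mean(population))
c(sd_of_means = sd(means), predicted_se = predicted_se)

# Points lying close to the reference line suggest approximate normality
qqnorm(means)
qqline(means)
```

Some curvature in the upper tail of the Q-Q plot would not be surprising at n = 50, since the underlying population is strongly skewed.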

5. Varying Sample Sizes

To further illustrate the CLT, we can repeat this process for several sample sizes. As the sample size increases, the distribution of the sample means should look increasingly normal and become narrower.

set.seed(123)
sample_sizes <- c(5, 30, 50, 100)
num_samples <- 1000

plot_list <- list()
for (i in seq_along(sample_sizes)) {
  means <- sample_means(population, sample_sizes[i], num_samples)
  
  p <- ggplot(data.frame(means), aes(means)) +
    geom_histogram(aes(y = after_stat(density)), bins = 50, fill="blue", alpha=0.5) +  # after_stat() replaces the deprecated ..density.. notation
    theme_minimal() +
    labs(x = "Sample means", y = "Density", title = paste("Distribution of Sample Means (n=", sample_sizes[i], ")", sep=""))
  
  plot_list[[i]] <- p
}

gridExtra::grid.arrange(grobs = plot_list, ncol = 2)

In this code, we loop over different sample sizes, compute the sample means, and create a histogram for each. The grid.arrange() function from the gridExtra package is used to arrange the multiple plots in a grid for easier comparison. You can install it using install.packages("gridExtra").
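The trend across the four panels can also be summarized with a single number per sample size. One option is the skewness of the sample means, which should shrink toward 0 as n grows (for an exponential population it is 2/√n in theory). The sketch below is one way to compute it, assuming a simple moment-based estimator; the skewness() helper is defined here and does not come from a package.

```r
set.seed(123)
population <- rexp(10000, rate = 0.2)

# Moment-based sample skewness (hypothetical helper, no extra packages needed)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

sample_sizes <- c(5, 30, 50, 100)

# For each sample size, simulate 1,000 sample means and measure their skewness
skew_by_n <- sapply(sample_sizes, function(n) {
  means <- replicate(1000, mean(sample(population, n, replace = TRUE)))
  skewness(means)
})

# Skewness near 0 indicates a more symmetric, more normal-looking distribution
data.frame(n = sample_sizes, skewness = round(skew_by_n, 2))
```

Because each value is estimated from only 1,000 simulated means, the decrease may not be perfectly monotone between neighboring sample sizes, but the gap between n = 5 and n = 100 should be clear.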

Conclusion

The Central Limit Theorem is one of the most important results in statistics because of what it says about the distribution of sample means. For any population with finite variance, whatever its shape, the distribution of sample means becomes closer to normal as the sample size increases. This principle underlies many statistical methods, including confidence intervals and hypothesis tests. Simulating it in R, as in the steps above, is a simple way to see the theorem at work and build intuition for it.
