How to Simulate & Plot a Bivariate Normal Distribution in R


A bivariate normal distribution is a two-dimensional normal distribution. It describes two random variables that are jointly normally distributed: each variable is normal on its own, and the dependence between them is fully described by their covariance (or, equivalently, their correlation). The bivariate normal distribution is a core concept in multivariate statistics and is widely used in machine learning, finance, and the natural sciences.

In R, the mvrnorm() function from the MASS package generates random samples from a multivariate normal distribution. This article walks through the steps to simulate and plot a bivariate normal distribution in R.

1. Install and Load Necessary Packages

We will use three packages: MASS for generating multivariate normal random numbers, ggplot2 for plotting, and reshape2 for reshaping the data. MASS is a recommended package that ships with most R installations, so it may already be available. Any package that is missing can be installed with the install.packages() function.

install.packages("MASS")
install.packages("ggplot2")
install.packages("reshape2")

After installing the necessary packages, load them into your R environment with the library() function.

library(MASS)
library(ggplot2)
library(reshape2)

2. Simulating a Bivariate Normal Distribution

We will simulate a bivariate normal distribution using the mvrnorm() function. This function generates random vectors from a multivariate normal distribution. The syntax of the function is mvrnorm(n, mu, Sigma), where:

  • n is the number of random vectors to generate,
  • mu is a vector of means,
  • Sigma is a positive-definite symmetric matrix specifying the covariance matrix of the variables.
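It is often easier to think about Sigma in terms of standard deviations and a correlation. The following is a minimal sketch of that construction. The names sd1, sd2, rho, cov12, and Sigma_example and their values are illustrative assumptions, not part of MASS. The last two lines check the two properties mvrnorm() expects of Sigma.

```r
# Illustrative values (assumptions chosen for this sketch)
sd1 <- 1      # standard deviation of variable 1
sd2 <- 2      # standard deviation of variable 2
rho <- 0.8    # correlation between the two variables

# The off-diagonal entries hold the covariance: rho * sd1 * sd2
cov12 <- rho * sd1 * sd2
Sigma_example <- matrix(c(sd1^2, cov12,
                          cov12, sd2^2), nrow = 2)

# Sanity checks: symmetric, and all eigenvalues strictly positive
is_symmetric <- isSymmetric(Sigma_example)
is_pos_def   <- all(eigen(Sigma_example, symmetric = TRUE)$values > 0)
```

Both checks should come back TRUE for any pair of positive standard deviations and a correlation strictly between -1 and 1.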

Here is an example of how to generate 1000 draws from a bivariate normal distribution:

set.seed(123)  # For reproducibility

# Parameters
n <- 1000
mu <- c(0, 0)  # Mean
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow=2)  # Covariance matrix

# Generate bivariate normal data
data <- mvrnorm(n, mu, Sigma)

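This generates 1000 pairs of random numbers from a bivariate normal distribution with mean vector mu and covariance matrix Sigma. The result is a matrix with 1000 rows and 2 columns, one column per variable. Because both variances in Sigma are 1, the off-diagonal value 0.8 is also the correlation between the two variables.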
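A quick way to confirm that the simulation behaves as intended is to compare the sample statistics with the parameters that were passed in. The sketch below repeats the simulation under the illustrative name sim so that it runs on its own. The estimates should be close to mu and Sigma but not identical, since the sample is random.

```r
library(MASS)

set.seed(123)  # For reproducibility
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)
sim <- mvrnorm(1000, mu, Sigma)  # 'sim' is an illustrative name

# Sample estimates: expect values near mu and Sigma
sample_means <- colMeans(sim)
sample_cov   <- cov(sim)
sample_cor   <- cor(sim[, 1], sim[, 2])

# With empirical = TRUE, mvrnorm() treats mu and Sigma as the
# empirical (sample) mean and covariance rather than the population ones
sim_exact <- mvrnorm(1000, mu, Sigma, empirical = TRUE)
cov_exact <- cov(sim_exact)
```

The empirical = TRUE option is useful when a demonstration needs a data set whose sample covariance equals Sigma up to floating-point error.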

3. Visualizing the Bivariate Normal Distribution

Once we have the simulated data, we can plot it using ggplot2 to visualize the bivariate normal distribution. A common way to do this is to create a scatter plot. Here’s an example:

# Create a data frame and set column names
df <- as.data.frame(data)
colnames(df) <- c("X1", "X2")

# Visualizing the Bivariate Normal Distribution with a scatter plot
ggplot(df, aes(X1, X2)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Variable 1", y = "Variable 2", title = "Scatter plot of Bivariate Normal Distribution")

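This code creates a scatter plot of the two variables. The geom_point() function adds the points to the plot, and alpha = 0.5 makes them semi-transparent, so regions where many points overlap appear darker. With a positive covariance of 0.8, the point cloud forms an ellipse tilted upward from left to right.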
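To make the elliptical shape explicit, ggplot2's stat_ellipse() can overlay a confidence ellipse on the scatter plot. The sketch below is one possible variation, not part of the original example. The names sim, df_sim, and p_ellipse and the 95% level are assumptions chosen for illustration. Setting type = "norm" asks for an ellipse based on a multivariate normal distribution.

```r
library(MASS)
library(ggplot2)

set.seed(123)  # For reproducibility
sim <- mvrnorm(1000, c(0, 0), matrix(c(1, 0.8, 0.8, 1), nrow = 2))
df_sim <- data.frame(X1 = sim[, 1], X2 = sim[, 2])

# Scatter plot with a 95% normal-theory ellipse drawn on top
p_ellipse <- ggplot(df_sim, aes(X1, X2)) +
  geom_point(alpha = 0.5) +
  stat_ellipse(type = "norm", level = 0.95, colour = "red") +
  theme_minimal() +
  labs(x = "Variable 1", y = "Variable 2",
       title = "Scatter plot with 95% ellipse")

p_ellipse
```

Roughly 95% of the simulated points should fall inside the ellipse.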

4. Creating a Contour Plot

While a scatter plot gives a general idea of how the points are distributed, a contour plot shows the estimated density directly: each contour line connects locations of equal density. Here’s how to create a contour plot:

# Estimate density
df_density <- kde2d(df$X1, df$X2, n = 100)

# Convert to data frame for ggplot
df_contour <- melt(df_density$z)
names(df_contour) <- c("Variable1", "Variable2", "Density")

# Add X1 and X2 to the data frame
df_contour$X1 <- df_density$x[df_contour$Variable1]
df_contour$X2 <- df_density$y[df_contour$Variable2]

# Create contour plot
ggplot(df_contour, aes(X1, X2, z = Density)) +
  geom_tile(aes(fill = Density)) +
  geom_contour(colour = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() +
  labs(x = "Variable 1", y = "Variable 2", fill = "Density",
       title = "Contour Plot of Bivariate Normal Distribution")

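In this code, the kde2d() function from MASS computes a two-dimensional kernel density estimate on a 100 × 100 grid. The melt() function from reshape2 converts the resulting density matrix into a long-format data frame that ggplot() can use, and the grid indices are then mapped back to the actual X1 and X2 coordinates. The geom_tile() function draws the colored tiles, and geom_contour() adds the contour lines.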
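For a quicker look, ggplot2 also provides geom_density_2d(), which, per the ggplot2 documentation, computes a 2D kernel density estimate with MASS::kde2d() and draws the contours in one step. The sketch below is an alternative to the manual approach above, not a replacement for it. The names sim, df_sim, and p_density are illustrative.

```r
library(MASS)
library(ggplot2)

set.seed(123)  # For reproducibility
sim <- mvrnorm(1000, c(0, 0), matrix(c(1, 0.8, 0.8, 1), nrow = 2))
df_sim <- data.frame(X1 = sim[, 1], X2 = sim[, 2])

# Contours are estimated from the raw points; no reshaping is needed
p_density <- ggplot(df_sim, aes(X1, X2)) +
  geom_point(alpha = 0.2) +
  geom_density_2d(colour = "red") +
  theme_minimal() +
  labs(x = "Variable 1", y = "Variable 2",
       title = "Density contours with geom_density_2d()")

p_density
```

The manual kde2d() route remains useful when the density values themselves are needed, for example to fill tiles or to reuse the grid elsewhere.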

Conclusion

Simulating and plotting a bivariate normal distribution in R takes only a few functions: mvrnorm() to generate the data, kde2d() to estimate the density, and ggplot2 to draw scatter and contour plots. The same workflow applies in data science, finance, machine learning, and other fields that work with correlated variables, and it extends to higher dimensions by supplying a longer mean vector and a larger covariance matrix.
