# How to Use fitdistr() in R

R is a statistical computing environment with an extensive range of packages, which makes it a versatile tool for data analysis. One such package is MASS, which includes the function fitdistr(). This function fits univariate distributions to data by maximum likelihood: given a sample and a distribution family, it finds the parameter values under which the observed data are most probable. This article covers the function's syntax, its common use cases, and a few points to watch when interpreting the results.

## 1. Introduction to fitdistr() Function

The fitdistr() function is part of the MASS package, whose name stands for “Modern Applied Statistics with S.” MASS is a recommended package that ships with standard R installations, so it is usually already available. If it is missing, install it with install.packages("MASS"). Either way, load it into your current R session with library(MASS) before calling fitdistr().

Here’s the basic syntax of the fitdistr() function:

```r
fitdistr(x, densfun, start, ...)
```

Here’s a brief explanation of the arguments:

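• x: A numeric vector of observations containing only finite values. These are the data points that the distribution is being fitted to.
• densfun: Either a character string naming a supported distribution (for example "normal", "gamma", or "Poisson") or a function that returns a density evaluated at its first argument.
• start: A named list giving the initial values of the parameters to be optimized over. The names must match the argument names of the density function. It can be omitted for some of the named distributions.
• ...: Additional arguments, passed either to densfun or to the optimization function optim() (for example, lower and upper bounds).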
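
To make the arguments concrete, here is a minimal sketch that passes densfun as a function rather than a name. It assumes simulated gamma data and follows the pattern of the example in the MASS documentation; the seed and the variable names x and fit are illustrative.

```r
library(MASS)

set.seed(123)                            # arbitrary seed, for reproducibility
x <- rgamma(100, shape = 5, rate = 0.1)  # simulated data, illustrative only

# densfun is given as a function, so start is required, and its names
# (shape, rate) must be argument names of dgamma().
# lower is passed on to optim() and keeps both parameters positive.
fit <- fitdistr(x, dgamma, start = list(shape = 1, rate = 0.1), lower = 0.001)

fit
```

The start values only need to be plausible. Poor values can slow the optimizer down or stop it from converging.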

The fitdistr() function returns an object of class “fitdistr”, which is a list with the following components:

• estimate: The estimated parameters, as a named vector.
• sd: The estimated standard errors of the parameter estimates.
• vcov: The estimated variance-covariance matrix of the parameter estimates.
• loglik: The log-likelihood at the estimated parameters.
• n: The number of observations.
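
The components can be read straight off the returned list. The sketch below assumes a small simulated sample; the seed, sample size, and variable names are illustrative.

```r
library(MASS)

set.seed(42)                        # arbitrary seed, for reproducibility
x <- rnorm(200, mean = 10, sd = 3)  # simulated data, illustrative only

fit <- fitdistr(x, "normal")

fit$estimate  # named vector of estimates: mean, sd
fit$sd        # standard errors, with the same names
fit$vcov      # 2 x 2 variance-covariance matrix
fit$loglik    # log-likelihood at the estimates
fit$n         # number of observations (200 here)

# MASS also provides extractor methods for fitdistr objects
coef(fit)     # same values as fit$estimate
logLik(fit)   # log-likelihood with degrees of freedom attached
```

The standard errors in fit$sd are the square roots of the diagonal of fit$vcov.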

## 2. Fitting a Normal Distribution

One of the most common applications of fitdistr() is to fit a normal distribution to a data set. This means estimating its two parameters, which fitdistr() reports as mean and sd. For the normal distribution the maximum likelihood estimates have a closed form, so fitdistr() computes them directly from the data instead of running an optimizer, and no start argument is needed.

Here’s an example:

```r
# Load the MASS package
library(MASS)

# Create a random data set
set.seed(123)
data <- rnorm(1000, mean=5, sd=2)

# Fit a normal distribution
fit <- fitdistr(data, "normal")

print(fit)
```

In this example, fitdistr() estimates the parameters of the normal distribution (mean and sd) that best fit the data. The printed output shows each estimate with its standard error in parentheses below it.
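
Because the normal fit is closed-form, the estimates can be reproduced by hand. The sketch below repeats the simulated data from the example above and checks them against the textbook maximum likelihood formulas; the names mle_mean and mle_sd are illustrative.

```r
library(MASS)

set.seed(123)
data <- rnorm(1000, mean = 5, sd = 2)
fit <- fitdistr(data, "normal")

# Maximum likelihood estimates for a normal sample
mle_mean <- mean(data)
mle_sd   <- sqrt(mean((data - mle_mean)^2))  # divides by n, not n - 1

# Both comparisons should return TRUE
all.equal(unname(fit$estimate["mean"]), mle_mean)
all.equal(unname(fit$estimate["sd"]), mle_sd)

# sd() divides by n - 1, so it is slightly larger than the ML estimate
sd(data) > mle_sd
```

With 1000 observations the difference between the two standard deviation formulas is negligible, but it is noticeable in small samples.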

## 3. Fitting Other Distributions

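Besides the normal distribution, fitdistr() can fit other distributions, such as the Poisson, exponential, gamma, and lognormal distributions. Whether you need to supply initial values depends on the distribution. The Poisson, exponential, and lognormal fits, like the normal fit, use closed-form estimates, and start must be omitted for them. The gamma fit is found by numerical optimization. If you omit start for "gamma", fitdistr() computes starting values from the data, but you can also provide your own.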

Here’s an example of fitting a gamma distribution:

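```r
# Create a random data set
set.seed(123)
data <- rgamma(1000, shape=2, scale=2)

# Fit a gamma distribution, starting the optimizer at shape = 1, scale = 1
# (the optimizer may warn about NaNs if it tries non-positive values)
fit <- fitdistr(data, "gamma", start=list(shape=1, scale=1))

print(fit)
```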

In this example, fitdistr() estimates the parameters of the gamma distribution (shape and scale) that best fit the data. The start argument provides initial estimates for these parameters, and its names determine which parameterization is fitted (scale here, rather than rate).
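
For the closed-form families, the call needs only the data and the distribution name. The following sketch assumes three small simulated samples; the variable names counts, waits, and sizes are illustrative.

```r
library(MASS)

set.seed(1)  # arbitrary seed, for reproducibility
counts <- rpois(500, lambda = 4)                 # count data
waits  <- rexp(500, rate = 0.5)                  # waiting times
sizes  <- rlnorm(500, meanlog = 1, sdlog = 0.5)  # positive, right-skewed data

fit_pois  <- fitdistr(counts, "Poisson")      # estimates lambda
fit_exp   <- fitdistr(waits, "exponential")   # estimates rate
fit_lnorm <- fitdistr(sizes, "lognormal")     # estimates meanlog and sdlog

# The closed-form estimates match the familiar formulas
all.equal(unname(fit_pois$estimate), mean(counts))
all.equal(unname(fit_exp$estimate), 1 / mean(waits))

fit_lnorm
```

Supplying start for any of these families stops with an error, because there is nothing for the optimizer to do.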

## 4. Visualizing Fitted Distributions

After fitting a distribution to your data with fitdistr(), you can visualize the fitted distribution by overlaying it on a histogram of your data.

Here’s an example:

```r
# Load required packages
library(MASS)
library(ggplot2)

# Create a random data set
set.seed(123)
data <- rnorm(1000, mean=5, sd=2)

# Fit a normal distribution
fit <- fitdistr(data, "normal")

# Create a histogram and overlay the fitted distribution.
# fit$estimate is a named vector, so select each parameter by name.
# (linewidth requires ggplot2 3.4.0 or later; older versions use size.)
ggplot(data.frame(data), aes(data)) +
  geom_histogram(aes(y=after_stat(density)), bins=30, fill="skyblue", color="black") +
  stat_function(fun=dnorm,
                args=list(mean=fit$estimate[["mean"]], sd=fit$estimate[["sd"]]),
                color="red", linewidth=1.2) +
  theme_minimal() +
  labs(x="Data", y="Density",
       title="Histogram with Fitted Normal Distribution")
```


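In this example, ggplot2 is used to create a histogram of the data on the density scale, and stat_function() is used to overlay the fitted normal density. The mean and standard deviation of the fitted distribution are taken by name from the estimate component of the fit object. Passing the whole estimate vector to both arguments would not work, because it contains both parameters.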
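
If ggplot2 is not available, the same overlay can be drawn with base graphics. This is a minimal sketch using the same simulated data; the plot styling is an arbitrary choice.

```r
library(MASS)

set.seed(123)
data <- rnorm(1000, mean = 5, sd = 2)
fit <- fitdistr(data, "normal")

# freq = FALSE puts the histogram on the density scale,
# so the fitted curve and the bars are directly comparable
hist(data, breaks = 30, freq = FALSE, col = "skyblue",
     main = "Histogram with Fitted Normal Distribution", xlab = "Data")

# curve() evaluates the expression over a grid of x values
curve(dnorm(x, mean = fit$estimate[["mean"]], sd = fit$estimate[["sd"]]),
      add = TRUE, col = "red", lwd = 2)
```

In both versions the histogram must be on the density scale. With raw counts on the y-axis, the fitted curve would appear as a flat line near zero.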

## 5. Comparing Multiple Distributions

In some cases, you might want to compare the fit of multiple distributions to your data. You can do this by fitting multiple distributions with fitdistr() and comparing the results.

One way to compare the fits is to use the Akaike Information Criterion (AIC), which is a measure of the relative quality of statistical models fitted to the same data. The AIC takes into account both the goodness of fit (through the log-likelihood) and the complexity of the model (through the number of parameters), with a lower AIC indicating a better trade-off between the two.

Here’s an example:

```r
# Create a random data set
set.seed(123)
data <- rexp(1000, rate=1)

# Fit an exponential distribution (closed form, no start needed)
fit_exp <- fitdistr(data, "exponential")
aic_exp <- AIC(fit_exp)

# Fit a gamma distribution
fit_gamma <- fitdistr(data, "gamma", start=list(shape=1, scale=1))
aic_gamma <- AIC(fit_gamma)

# Compare AIC values
print(paste("AIC for exponential fit:", aic_exp))
print(paste("AIC for gamma fit:", aic_gamma))
```

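In this example, fitdistr() is used to fit both an exponential and a gamma distribution to the data. The AIC of each fit is then calculated with AIC(), and the results are printed to the console. Because the exponential distribution is a gamma distribution with shape 1, the two fits should have similar log-likelihoods on exponential data, and the gamma model's extra parameter will typically not earn back its penalty.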
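
The AIC reported for a fitdistr object can be reproduced from the fit's components with the usual formula, AIC = -2 × log-likelihood + 2 × number of parameters. The sketch below assumes the same simulated exponential data; the names k and aic_by_hand are illustrative.

```r
library(MASS)

set.seed(123)
data <- rexp(1000, rate = 1)
fit_exp <- fitdistr(data, "exponential")

k <- length(fit_exp$estimate)                # number of fitted parameters (1)
aic_by_hand <- -2 * fit_exp$loglik + 2 * k   # AIC = -2 logLik + 2k

# Should return TRUE
all.equal(aic_by_hand, AIC(fit_exp))
```

This makes the trade-off explicit: each extra parameter must raise the log-likelihood by more than 1 to lower the AIC.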

## 6. Practical Applications of fitdistr() in Data Science

The fitdistr() function is useful in data science, particularly in statistical modeling and inference. A fitted distribution gives a compact, parametric description of the process that generated the data, which in turn informs the choice of statistical tests, the structure of models, and the simulations or predictions built on them.

In addition to modeling, fitdistr() supports goodness-of-fit testing. The function does not run such tests itself, but its parameter estimates can be passed to a test that compares the fitted distribution with the empirical distribution of the data to see whether they differ significantly.
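
One possible way to do this is the Kolmogorov-Smirnov test from base R, sketched below under the assumption of simulated normal data. Note that ks.test() expects the parameters to be specified in advance. Because they are estimated from the same data here, the p-value is only approximate and tends to be too large, so treat it as a rough check.

```r
library(MASS)

set.seed(123)
data <- rnorm(1000, mean = 5, sd = 2)
fit <- fitdistr(data, "normal")

# Compare the empirical distribution with the fitted normal CDF.
# Caveat: the parameters were estimated from the same data,
# so the p-value is approximate (typically conservative).
ks <- ks.test(data, "pnorm",
              mean = fit$estimate[["mean"]],
              sd   = fit$estimate[["sd"]])

ks$statistic  # largest gap between empirical and fitted CDFs
ks$p.value
```

A small p-value would suggest that the fitted distribution does not describe the data well. A large one only means that the test found no evidence against the fit.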

## Conclusion

The fitdistr() function from the MASS package lets you fit a range of univariate distributions to your data by maximum likelihood. It estimates the parameters of the chosen distribution that best explain your data, along with their standard errors, and it has applications in many areas, including data science, finance, and engineering.
