How to Calculate DFFITS in R

DFFITS is a diagnostic statistic used to evaluate the influence of individual data points on a linear regression model. When analyzing regression models, it’s critical to identify observations that have a disproportionate influence on the fitted model. DFFITS is one of the metrics you can use for this purpose.

In this article, we’ll explore what DFFITS is, why it’s important, and how to calculate and interpret it in R.

1. Understanding DFFITS

DFFITS is short for “difference in fits.” It measures how much the predicted value for an observation changes when that observation is removed from the dataset and the model is refitted, scaled by an estimate of the standard error of the fitted value.

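Formula for DFFITS:

$$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\text{MSE}_{(i)}\, h_{ii}}}$$

Where:

  • $\hat{y}_i$ is the predicted value for observation $i$ using the full model.
  • $\hat{y}_{i(i)}$ is the predicted value for observation $i$ when observation $i$ is removed and the model is refitted.
  • $h_{ii}$ is the leverage of the $i$th observation.
  • $\text{MSE}_{(i)}$ is the mean squared error of the model fitted without observation $i$.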
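To make the definition concrete, here is a minimal sketch that computes DFFITS by brute force: it refits the model once per observation and compares the result with R’s built-in dffits(). It assumes the denominator uses the residual standard error of each refitted model, which is the convention dffits() follows. The names fit, manual_dffits, and s_i are purely illustrative.

```r
# Full model on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)
h <- hatvalues(fit)        # leverages h_ii
yhat_full <- fitted(fit)   # predictions from the full model

manual_dffits <- numeric(n)
for (i in seq_len(n)) {
  # Refit the model without observation i
  fit_i <- lm(mpg ~ wt, data = mtcars[-i, ])
  # Predict the dropped observation from the reduced model
  yhat_i <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
  # Residual standard error of the reduced model (square root of its MSE)
  s_i <- summary(fit_i)$sigma
  manual_dffits[i] <- (yhat_full[i] - yhat_i) / (s_i * sqrt(h[i]))
}

# Compare with the built-in function, ignoring observation names
all.equal(manual_dffits, unname(dffits(fit)))
```

In practice you would never loop like this, since dffits() gets the same numbers from a single fit, but the loop shows exactly what “removing an observation and refitting” means.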

2. Data Preparation

We’ll use the built-in mtcars dataset for demonstration purposes.

data(mtcars)
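Before fitting anything, it can help to glance at the columns used later in this article. The sketch below is optional; the object name cars is just an illustrative choice.

```r
data(mtcars)

# Keep only the columns used in the models below
cars <- mtcars[, c("mpg", "wt", "hp")]

# Structure and size of the data
str(cars)

# Count missing values per column before computing diagnostics
colSums(is.na(cars))
```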

3. Simple Linear Regression and DFFITS

We’ll use the lm() function to fit a simple linear regression model, predicting mpg based on wt in the mtcars dataset.

simple_model <- lm(mpg ~ wt, data = mtcars)

Now, to calculate DFFITS, you can use the dffits() function in R.

dffits_simple <- dffits(simple_model)
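dffits() returns one value per observation, named after the rows of the data. As a small sketch of how you might inspect the result, the following ranks observations by absolute DFFITS; the variable names ranked and top5 are illustrative.

```r
simple_model <- lm(mpg ~ wt, data = mtcars)
dffits_simple <- dffits(simple_model)

# Rank observations by absolute DFFITS, largest first
ranked <- sort(abs(dffits_simple), decreasing = TRUE)
top5 <- names(head(ranked, 5))

# Signed DFFITS values for the five most influential observations
dffits_simple[top5]
```

The sign tells you the direction of the shift: a positive value means the fitted value is higher with the observation included than without it.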

4. Multiple Linear Regression and DFFITS

For a more complex example, let’s consider a multiple linear regression model with mpg as the dependent variable and wt and hp as independent variables.

multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

And similarly, calculating DFFITS:

dffits_multiple <- dffits(multiple_model)
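An observation’s influence depends on the model, so the values can shift once hp is added. One possible way to see this, sketched here with an illustrative data frame called comparison, is to put both sets of values side by side:

```r
simple_model <- lm(mpg ~ wt, data = mtcars)
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

# One row per car, one column per model
comparison <- data.frame(
  simple = dffits(simple_model),
  multiple = dffits(multiple_model)
)

# Change in absolute influence after adding hp
comparison$change <- abs(comparison$multiple) - abs(comparison$simple)

# Cars whose influence changed the most
head(comparison[order(-abs(comparison$change)), ], 5)
```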

5. Interpreting DFFITS

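A general rule of thumb is that an observation whose absolute DFFITS value is larger than

$$2\sqrt{\frac{k + 1}{n}}$$

is potentially influential, where k is the number of predictors (so k + 1 counts the intercept as well) and n is the sample size.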
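Here is a short sketch of applying such a cutoff in R. It assumes the threshold takes the form 2 * sqrt((k + 1) / n), with k + 1 counting the predictors plus the intercept; if your reference states the rule differently, only the threshold line needs to change. The names threshold and flagged are illustrative.

```r
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)
dffits_multiple <- dffits(multiple_model)

n <- nobs(multiple_model)                 # sample size
k <- length(coef(multiple_model)) - 1     # predictors, excluding the intercept

# Assumed form of the rule-of-thumb cutoff
threshold <- 2 * sqrt((k + 1) / n)

# Observations whose absolute DFFITS exceeds the cutoff
flagged <- dffits_multiple[abs(dffits_multiple) > threshold]
flagged
```

A flagged observation is a candidate for a closer look, not an automatic deletion: check whether it is a data error or a legitimate but unusual case.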

6. Visualizing DFFITS

To visualize the DFFITS values, you can draw a needle plot with one vertical line per observation:

plot(dffits_simple, type = 'h', main = 'DFFITS for Simple Linear Regression')
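The plot is easier to read with the cutoff drawn on it. The sketch below adds dashed reference lines and labels any points beyond them. It assumes a cutoff of 2 * sqrt((k + 1) / n), where k + 1 is the number of coefficients including the intercept; the axis labels and colours are arbitrary choices.

```r
simple_model <- lm(mpg ~ wt, data = mtcars)
dffits_simple <- dffits(simple_model)

n <- nobs(simple_model)
k <- length(coef(simple_model)) - 1   # predictors, excluding the intercept
threshold <- 2 * sqrt((k + 1) / n)    # assumed rule-of-thumb cutoff

plot(dffits_simple, type = "h",
     main = "DFFITS for Simple Linear Regression",
     xlab = "Observation index", ylab = "DFFITS")

# Dashed reference lines at plus and minus the cutoff
abline(h = c(-threshold, threshold), lty = 2, col = "red")

# Label any observations that fall beyond the cutoff
flagged <- which(abs(dffits_simple) > threshold)
if (length(flagged) > 0) {
  text(flagged, dffits_simple[flagged], labels = names(flagged),
       pos = 3, cex = 0.7)
}
```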

7. Comparison with Other Influence Metrics

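DFFITS is one of several related diagnostics, alongside Cook’s distance, leverage values, and studentized residuals. Leverage reflects only how unusual an observation’s predictor values are, and studentized residuals reflect only how poorly it is fitted; DFFITS and Cook’s distance combine the two. Because each highlights a different aspect, it is often useful to examine them together.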
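Base R provides functions for all of these, so they are easy to tabulate together. The following is a minimal sketch using cooks.distance(), hatvalues(), rstudent(), and influence.measures() from the stats package; the data frame name diagnostics is illustrative.

```r
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

# Collect several diagnostics in one table, one row per observation
diagnostics <- data.frame(
  dffits = dffits(multiple_model),
  cooks_d = cooks.distance(multiple_model),
  leverage = hatvalues(multiple_model),
  rstudent = rstudent(multiple_model)
)

# Observations with the largest absolute DFFITS
head(diagnostics[order(-abs(diagnostics$dffits)), ], 5)

# DFFITS is the externally studentized residual scaled by leverage
all.equal(
  diagnostics$dffits,
  diagnostics$rstudent * sqrt(diagnostics$leverage / (1 - diagnostics$leverage))
)

# influence.measures() bundles these diagnostics and marks notable cases
summary(influence.measures(multiple_model))
```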

8. Conclusion

Understanding DFFITS is essential for anyone who aims to create reliable and robust regression models. It’s a powerful tool for identifying observations that unduly influence the model, and knowing how to calculate and interpret it is crucial for effective statistical modeling.
