DFFITS is a diagnostic statistic used to evaluate the influence of individual data points on a linear regression model. When analyzing regression models, it’s critical to identify observations that have a disproportionate influence on the fitted model. DFFITS is one of the metrics you can use for this purpose.
In this article, we’ll explore what DFFITS is, why it’s important, and how to calculate and interpret it in R.
1. Understanding DFFITS
The term DFFITS is an acronym for “Difference in Fits.” It measures how much the predicted value for a specific observation changes when the observation is removed from the dataset and the model is refitted.
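Formula for DFFITS:
$$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\text{MSE}_{(i)} \, h_{ii}}}$$
where:
- $\hat{y}_i$ is the predicted value for observation $i$ using the full model.
- $\hat{y}_{i(i)}$ is the predicted value for observation $i$ when observation $i$ is removed and the model is refitted.
- $h_{ii}$ is the leverage of the $i$th observation.
- $\text{MSE}_{(i)}$ is the Mean Squared Error of the model refitted without observation $i$. This leave-one-out estimate, rather than the MSE of the full model, is the scaling that R’s dffits() function applies.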
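The definition can be checked by brute force. The sketch below is an illustration rather than how R computes the statistic internally: it drops each row of mtcars in turn, refits the model, and scales the change in the fitted value by the leave-one-out MSE and the leverage. The names fit, fit_minus_i, and manual_dffits are chosen here for illustration.

```r
# Fit the full model on the built-in mtcars data.
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)

# Brute-force DFFITS: drop each row, refit, and compare predictions.
manual_dffits <- sapply(seq_len(n), function(i) {
  fit_minus_i <- lm(mpg ~ wt, data = mtcars[-i, ])
  yhat_full   <- fitted(fit)[i]
  yhat_loo    <- predict(fit_minus_i, newdata = mtcars[i, , drop = FALSE])
  mse_loo     <- summary(fit_minus_i)$sigma^2  # MSE of the refitted model
  h_ii        <- hatvalues(fit)[i]             # leverage from the full model
  (yhat_full - yhat_loo) / sqrt(mse_loo * h_ii)
})

# Compare with the built-in function (agreement up to floating-point error).
all.equal(unname(manual_dffits), unname(dffits(fit)))
```

Refitting the model n times is only practical for small datasets; dffits() obtains the same values from a single fit.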
2. Data Preparation
We’ll use the built-in mtcars dataset for demonstration purposes.
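A quick look at the data before modelling might go as follows. This is a minimal sketch; the name cars_subset is introduced here for illustration, and the three columns are the ones this article’s models use.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs downloading.
data(mtcars)

# Keep only the columns used by the models in this article.
cars_subset <- mtcars[, c("mpg", "wt", "hp")]

str(cars_subset)      # structure: column types and first few values
summary(cars_subset)  # ranges and quartiles for each variable

# Check for missing values: lm() drops incomplete rows by default,
# which would change how many DFFITS values come back.
anyNA(cars_subset)
```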
3. Simple Linear Regression and DFFITS
We’ll use the lm() function to fit a simple linear regression model, predicting mpg based on wt in the mtcars dataset:
simple_model <- lm(mpg ~ wt, data = mtcars)
Now, to calculate DFFITS, you can use the dffits() function in R:
dffits_simple <- dffits(simple_model)
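The result is a named numeric vector with one entry per observation. One way to inspect it, sketched here with the same model refitted so that the snippet runs on its own, is to sort by absolute value:

```r
simple_model  <- lm(mpg ~ wt, data = mtcars)
dffits_simple <- dffits(simple_model)

# One DFFITS value per row of the data, named by car model.
length(dffits_simple) == nrow(mtcars)

# The five observations whose removal shifts their own fitted value the most.
top5 <- head(sort(abs(dffits_simple), decreasing = TRUE), 5)
top5
```

Sorting on the absolute value matters because DFFITS is signed: a negative value means the observation pulls its own fitted value down, a positive value means it pulls the fitted value up.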
4. Multiple Linear Regression and DFFITS
For a more complex example, let’s consider a multiple linear regression model with mpg as the dependent variable and wt and hp as independent variables.
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)
And similarly, calculating DFFITS:
dffits_multiple <- dffits(multiple_model)
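Because DFFITS depends on the model, the same observation can look more or less influential once hp is added. A small sketch for placing the two sets of values side by side (the data frame name comparison is illustrative):

```r
simple_model   <- lm(mpg ~ wt, data = mtcars)
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

# Row names are carried over from the named vectors returned by dffits().
comparison <- data.frame(
  simple   = dffits(simple_model),
  multiple = dffits(multiple_model)
)

# Rows ordered by absolute DFFITS under the multiple regression model.
head(comparison[order(-abs(comparison$multiple)), ], 5)
```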
5. Interpreting DFFITS
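A general rule of thumb is that an observation is influential when the absolute value of its DFFITS exceeds $2\sqrt{(k+1)/n}$, where k is the number of predictors and n is the sample size. Here k + 1 counts the estimated coefficients, including the intercept; some texts write the same cutoff as $2\sqrt{p/n}$ with p coefficients.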
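Applying a cutoff in R could look like the sketch below. It assumes the commonly cited size-adjusted threshold of 2 * sqrt(p / n), where p is the number of estimated coefficients (the k predictors plus the intercept); other conventions exist, so treat the flagged rows as candidates for a closer look rather than as a verdict.

```r
multiple_model  <- lm(mpg ~ wt + hp, data = mtcars)
dffits_multiple <- dffits(multiple_model)

n <- nrow(model.frame(multiple_model))  # observations actually used in the fit
p <- length(coef(multiple_model))       # coefficients incl. intercept (k + 1)

# Assumed rule-of-thumb cutoff; not a formal test.
threshold <- 2 * sqrt(p / n)

# Compare on the absolute value, since DFFITS can be negative.
flagged <- dffits_multiple[abs(dffits_multiple) > threshold]
flagged
```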
6. Visualizing DFFITS
To visualize the DFFITS values, you can use an index plot, which draws one vertical line per observation (type = 'h'):
plot(dffits_simple, type = 'h', main = 'DFFITS for Simple Linear Regression')
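The plot is easier to read with reference lines and labels. The following sketch adds both, again assuming the 2 * sqrt(p / n) rule of thumb for the cutoff (p being the number of coefficients including the intercept):

```r
simple_model  <- lm(mpg ~ wt, data = mtcars)
dffits_simple <- dffits(simple_model)

# Assumed rule-of-thumb cutoff.
threshold <- 2 * sqrt(length(coef(simple_model)) / nrow(mtcars))

plot(dffits_simple, type = "h",
     main = "DFFITS for Simple Linear Regression",
     xlab = "Observation index", ylab = "DFFITS")
abline(h = 0)
abline(h = c(-threshold, threshold), lty = 2, col = "red")

# Label the observations that fall outside the dashed lines, if any.
flagged <- which(abs(dffits_simple) > threshold)
if (length(flagged) > 0) {
  text(flagged, dffits_simple[flagged], labels = names(flagged),
       pos = 3, cex = 0.7)
}
```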
7. Comparison with Other Influence Metrics
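DFFITS is just one of several influence metrics; others include Cook’s Distance, leverage values, and studentized residuals. Leverage flags observations with unusual predictor values, studentized residuals flag observations the model fits poorly, and DFFITS and Cook’s Distance combine both aspects. Because each captures something different, it’s often beneficial to use them in conjunction.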
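Base R provides functions for all of these diagnostics, so they can be collected in one table. The sketch below (the data frame name diagnostics is illustrative) also checks a useful identity: DFFITS equals the externally studentized residual multiplied by sqrt(h / (1 - h)), where h is the leverage.

```r
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

diagnostics <- data.frame(
  dffits   = dffits(multiple_model),
  cooks_d  = cooks.distance(multiple_model),
  leverage = hatvalues(multiple_model),
  rstudent = rstudent(multiple_model)  # externally studentized residuals
)

# Observations ranked by absolute DFFITS, with the other metrics alongside.
head(diagnostics[order(-abs(diagnostics$dffits)), ], 5)

# DFFITS is the studentized residual rescaled by leverage.
rescaled <- with(diagnostics, rstudent * sqrt(leverage / (1 - leverage)))
all.equal(diagnostics$dffits, rescaled)

# influence.measures() reports these together with DFBETAS and the
# covariance ratio, and marks observations it considers noteworthy.
summary(influence.measures(multiple_model))
```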
Understanding DFFITS is essential for anyone who aims to create reliable and robust regression models. It’s a powerful tool for identifying observations that unduly influence the model, and knowing how to calculate and interpret it is crucial for effective statistical modeling.