
The Variance Inflation Factor (VIF) is a measure of multicollinearity among the predictors in a regression model. It provides an index of how much the variance of an estimated regression coefficient is increased because of multicollinearity.
In this article, we’ll discuss VIF and its implications, and show how to calculate it in Python.
Understanding Variance Inflation Factor (VIF)
VIF is a measure that allows us to quantify the severity of multicollinearity in an ordinary least squares regression analysis. Multicollinearity refers to the situation where the independent variables (predictors) in a regression model are highly correlated. This high correlation can lead to unstable estimates of the regression coefficients, which can make it difficult to determine the individual contributions of the predictors to the response.
The Variance Inflation Factor quantifies how much the variance of an estimated regression coefficient is inflated compared with what it would be if the predictors were not linearly related.
A VIF of 1 indicates that there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of its estimated regression coefficient is not inflated at all. On the other hand, a VIF greater than 1 indicates the presence of multicollinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of multicollinearity.
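Concretely, the VIF for the kth predictor is obtained by regressing that predictor on all of the other predictors and computing
VIF_k = 1 / (1 - R_k^2)
where R_k^2 is the R-squared of this auxiliary regression. If the kth predictor is unrelated to the others, R_k^2 is 0 and the VIF is 1; as the predictor becomes closer to a linear combination of the others, R_k^2 approaches 1 and the VIF grows without bound.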
Loading and Preparing the Data
For this tutorial, we’ll use the ‘mtcars’ dataset, a well-known dataset that statsmodels can fetch from the Rdatasets collection. It records fuel consumption and ten aspects of automobile design and performance for 32 automobiles.
import statsmodels.api as sm
# Load the dataset (get_rdataset fetches it from the Rdatasets
# collection on the web, so an internet connection is required)
data = sm.datasets.get_rdataset('mtcars').data
print(data.head())
Fitting a Regression Model
To illustrate how to calculate the VIF, we’ll fit an Ordinary Least Squares (OLS) regression model to our data, with miles per gallon (mpg) as the dependent variable and displacement (disp), horsepower (hp), and weight (wt) as the independent variables.
# Define the dependent variable
y = data['mpg']
# Define the independent variables
X = data[['disp', 'hp', 'wt']]
# Add a constant to the independent variables matrix
X = sm.add_constant(X)
# Fit the OLS model
model = sm.OLS(y, X).fit()
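Before computing VIFs, it can be useful to glance at the pairwise correlations among the predictors, since strong pairwise correlations are one symptom of multicollinearity (VIF goes further by also capturing relationships involving several predictors at once). A minimal check using the objects defined above:
# Pairwise correlations among the predictors; values near 1 or -1 suggest redundancy
print(data[['disp', 'hp', 'wt']].corr())
# The fitted model's summary can also hint at multicollinearity,
# e.g. through large standard errors on individual coefficients
print(model.summary())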
Calculating VIF
We can use the variance_inflation_factor function from the statsmodels library to calculate the VIF for each predictor variable. To use this function, we need to specify the matrix of predictors and the index of the variable for which we want to compute the VIF.
Here’s how to calculate the VIF for all our predictors:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Get variables for which to compute VIF and add intercept term
X = sm.add_constant(data[['disp', 'hp', 'wt']])
# Compute and view VIF for each column of X
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# The VIF reported for the 'const' term is not of interest and can be ignored
print(vif)
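As a sanity check, you can reproduce one of these values by hand. The VIF that statsmodels reports for a predictor equals 1 / (1 - R^2) from the auxiliary regression of that predictor on the remaining columns of X (including the constant). A minimal sketch for wt:
# Auxiliary regression: wt on the other predictors (plus a constant)
aux = sm.OLS(data['wt'], sm.add_constant(data[['disp', 'hp']])).fit()
# 1 / (1 - R^2) should match the VIF reported for 'wt' above
print(1.0 / (1.0 - aux.rsquared))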
Interpreting VIF
The output gives us the VIF for each variable. A VIF close to 1 indicates that the variable is not correlated with the other predictors, and hence the variance of its coefficient estimate is not inflated. A VIF greater than 1 suggests the presence of multicollinearity.
As a rule of thumb, a VIF above 5 indicates high multicollinearity between that variable and the others, and a VIF above 10 indicates very high multicollinearity.
If the VIF is high, we have two main ways of fixing multicollinearity:
- Removing variables: The easiest way to address multicollinearity is simply to remove one of the variables that’s providing redundant information (see the sketch after this list).
- Combining variables: In some cases, it may be most appropriate to combine the correlated variables into one. For example, if two variables are highly correlated, you can create their average or principal component to use in your model.
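To make the first remedy concrete, here is a minimal sketch that drops one predictor and recomputes the VIFs. It drops disp purely for illustration; in practice, drop whichever variable your own output flags with the highest VIF:
# Recompute VIFs after removing one of the redundant predictors (here, 'disp')
X_reduced = sm.add_constant(data[['hp', 'wt']])
vif_reduced = pd.DataFrame()
vif_reduced["variables"] = X_reduced.columns
vif_reduced["VIF"] = [variance_inflation_factor(X_reduced.values, i)
                      for i in range(X_reduced.shape[1])]
print(vif_reduced)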
Conclusion
In this article, we explored the Variance Inflation Factor (VIF), a measure of multicollinearity in regression analysis. Understanding and addressing multicollinearity is crucial because it inflates the variances of the estimated regression coefficients, making model results less reliable and harder to interpret. By correctly calculating and interpreting VIF in Python, you can help ensure that your regression models are reliable and your conclusions are valid.