How to Test for Multicollinearity in Python

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a regression model are highly correlated. In other words, one predictor variable can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. It can also result in overfitting, where your model will perform well on your training data but poorly on unseen data.

In this article, we’ll explore how to test for multicollinearity in Python. We’ll make use of several libraries and techniques to do so, including pandas for data handling, NumPy for numerical operations, matplotlib and seaborn for visualization, and statsmodels and sklearn for statistical modeling.

Prerequisites

Before we start, make sure you have the required libraries installed. If not, you can install them using pip:

pip install pandas numpy matplotlib seaborn statsmodels scikit-learn

Loading and Exploring the Data

We’ll use the Boston Housing dataset as an example for this tutorial. It ships with older versions of the sklearn library (load_boston was removed in scikit-learn 1.2, so this code requires scikit-learn < 1.2). Let’s load the data and explore it:

from sklearn.datasets import load_boston
import pandas as pd

# Load the dataset (note: load_boston was removed in scikit-learn 1.2)
boston = load_boston()

# Create a DataFrame
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Display the first few rows
print(df.head())

This dataset contains 13 features, such as the average number of rooms per dwelling (RM) and the per capita crime rate by town (CRIM). The target a regression model would predict is the median value of owner-occupied homes (MEDV); for the multicollinearity checks below we only need the 13 predictors.

Testing for Multicollinearity

There are several ways to test for multicollinearity, and we’ll explore some of the most common methods: correlation matrix, variance inflation factor (VIF), and condition index.

Correlation Matrix

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.

Let’s create a correlation matrix for our dataset using pandas and visualize it using seaborn:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
# (use the full correlation range from -1 to 1 so strong correlations stand out)
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

In the heatmap, the closer a cell is to red, the stronger the positive correlation between the two variables, and the closer it is to blue, the stronger the negative correlation. If two predictors have a correlation close to 1 or -1, they are likely to cause multicollinearity in your regression model.
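Beyond eyeballing the heatmap, you can list the pairs of predictors whose absolute correlation exceeds a threshold. The snippet below is a small sketch that reuses the corr and mask objects from above; the 0.8 cutoff is an illustrative choice, not a hard rule:

# List predictor pairs with |correlation| above a chosen threshold (0.8 is illustrative)
threshold = 0.8
lower = corr.where(~mask)            # keep only the lower triangle (no diagonal, no duplicates)
pairs = lower.stack()                # long format: (variable_1, variable_2) -> correlation
print(pairs[pairs.abs() > threshold])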

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. The VIF for the i-th predictor is 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination obtained by regressing that predictor on all of the other predictors.
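To make the formula concrete, here is a small sketch that computes the VIF for a single predictor by hand with scikit-learn’s LinearRegression: regress that column on the remaining predictors, take the resulting R², and apply 1 / (1 - R²). The choice of the RM column here is arbitrary:

from sklearn.linear_model import LinearRegression

# Regress one predictor (RM, chosen arbitrarily) on all the other predictors
target_col = "RM"
X_others = df.drop(columns=[target_col])
y_col = df[target_col]

# R^2 of the auxiliary regression
r_squared = LinearRegression().fit(X_others, y_col).score(X_others, y_col)

# VIF = 1 / (1 - R^2)
vif_rm = 1 / (1 - r_squared)
print(f"VIF for {target_col}: {vif_rm:.2f}")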

Now let’s calculate the VIF for every predictor in our dataset using statsmodels:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant (intercept) term for the VIF calculation,
# using assign so the original DataFrame is left untouched
X = df.assign(CONSTANT=1)

# Calculate the VIF for each column
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# The constant's VIF is not meaningful, so drop it from the output
print(vif[vif["variables"] != "CONSTANT"])

As a rule of thumb, a VIF above 5 indicates moderate to high multicollinearity, and a VIF above 10 is usually considered severe.
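If you only want to see the problematic predictors, you can filter the table on your chosen cutoff; the value of 5 below is just the rule of thumb above, not a strict requirement:

# Predictors whose VIF exceeds the chosen cutoff (ignoring the constant term)
high_vif = vif[(vif["VIF"] > 5) & (vif["variables"] != "CONSTANT")]
print(high_vif)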

Condition Index

The condition index is based on the eigenvalues of the cross-product matrix X'X of the standardized design matrix. For each eigenvalue, the condition index is the square root of the largest eigenvalue divided by that eigenvalue. When the predictors are nearly linearly dependent, some eigenvalues become very small relative to the largest, the condition indices grow large, and the estimated regression coefficients become very sensitive to small changes in the data.

Let’s calculate the condition index for our dataset:

from numpy import linalg as LA

# Standardize the predictors (work from the original DataFrame, without the constant)
X = (df - df.mean()) / df.std()

# Calculate the eigenvalues of the dot product of the dataframe
eigenvalues = LA.eigvals(np.dot(X.T, X))

# Compute condition index
condition_index = np.sqrt(eigenvalues.max() / eigenvalues)

print(condition_index)

A condition index above 30 indicates severe multicollinearity.
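As a quick cross-check, NumPy can compute the 2-norm condition number of the standardized design matrix directly; it equals the largest condition index above (the square root of the largest eigenvalue divided by the smallest):

# Cross-check: np.linalg.cond gives the ratio of the largest to the smallest
# singular value of X, which matches the largest condition index above
print(np.linalg.cond(X.values))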

Dealing with Multicollinearity

There are several ways to handle multicollinearity once detected. Here are a few:

  1. Remove one of the correlated variables: if two predictors are highly correlated, they carry largely the same information, so consider dropping one of them.
  2. Combine the correlated variables: for instance, two measures of the same quantity in different units can be merged into a single variable, or a group of correlated predictors can be summarized with a technique such as principal component analysis.
  3. Apply regularization methods: Ridge regression and Lasso regression add a penalty term to the loss function during training, which stabilizes the coefficient estimates in the presence of multicollinearity (see the sketch after this list).
  4. Increase the sample size: adding more data can sometimes reduce the impact of multicollinearity on the coefficient estimates.
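
As a brief illustration of point 3, here is a minimal Ridge regression sketch on the Boston predictors. It assumes the target MEDV is available as boston.target from the earlier load, and the penalty strength alpha=1.0 is an arbitrary illustrative choice:

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Target: median home value (MEDV) from the earlier load_boston() call
y = boston.target

# Ridge adds an L2 penalty that shrinks the coefficients of correlated predictors,
# stabilizing the estimates; alpha controls the penalty strength and should be
# tuned (e.g. with cross-validation) in practice
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(df, y)

print(ridge_model.named_steps["ridge"].coef_)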

Remember that multicollinearity isn’t always a problem that needs to be fixed. If your goal is to predict, not to interpret the coefficients, then multicollinearity might not be a problem. In fact, removing a variable could lead to worse predictions.

Conclusion

In this article, we’ve covered what multicollinearity is, how to detect it in Python, and some ways to handle it. We saw how to use the correlation matrix, VIF, and condition index as tools for diagnosing multicollinearity.

Remember, not all correlation is bad, and multicollinearity may not affect the accuracy of your predictions. It’s more of an issue if you’re trying to understand which features are contributing the most to your predictions, as it can affect the interpretability of your model. So use your judgement, and happy coding!
