How to Create a Residual Plot in Python

Introduction

Residual plots are a vital part of regression analysis because they help validate the assumptions of a model. A residual is the difference between the observed value and the predicted value of a data point. For a well-specified model, the residuals scatter randomly around zero across the whole range of predicted values, with no apparent pattern.
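As a minimal illustration of that definition, the sketch below computes residuals for a handful of made-up numbers. The arrays observed and predicted are hypothetical and not taken from any dataset used in this article.

```python
import numpy as np

# Hypothetical observed values and model predictions (made-up numbers)
observed = np.array([3.0, 5.0, 7.5, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 9.5])

# Residual = observed value minus predicted value
residuals = observed - predicted

# A positive residual means the model under-predicted that point,
# a negative residual means it over-predicted
for obs, pred, res in zip(observed, predicted, residuals):
    print(f"observed={obs:.1f} predicted={pred:.1f} residual={res:+.1f}")
```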

If the residuals do show a pattern, that’s a signal that our model isn’t capturing some aspect of the data: maybe the relationship isn’t linear, there’s a variable we didn’t consider, or the spread of the errors changes across the range of the data. By visualizing the residuals, we can spot these problems and improve our model.

In Python, we can create residual plots with various libraries, including Matplotlib, Seaborn, and StatsModels. This article will guide you through the process of creating a residual plot in Python using these popular libraries.

Prerequisites

Before we proceed, ensure you have the following Python packages installed:

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • scikit-learn (imported as sklearn)
  • statsmodels

If not, you can install them using pip:

pip install numpy pandas matplotlib seaborn scikit-learn statsmodels

Data Preparation

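First, let’s prepare some sample data. We will use the Boston Housing dataset from sklearn.datasets. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below runs only on older releases: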

from sklearn import datasets

# Load Boston housing dataset
boston = datasets.load_boston()
print(boston.DESCR)

This dataset has 506 instances, 13 numerical/categorical attributes, and a target variable MEDV, which is the median value of owner-occupied homes in $1000s. We’ll use the RM variable (average number of rooms per dwelling) to predict MEDV.

Linear Regression Model

Let’s fit a linear regression model using sklearn’s LinearRegression:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Prepare DataFrame
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target

# Feature and target
X = boston_df[['RM']]
y = boston_df['MEDV']

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
lm = LinearRegression()
lm.fit(X_train, y_train)

Once we have fitted the model, we can predict values for the test set and calculate the residuals:

# Predicting values
y_pred = lm.predict(X_test)

# Calculating residuals
residuals = y_test - y_pred
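Because load_boston() is not available in every scikit-learn release, here is a self-contained sketch of the same fit-predict-subtract steps on synthetic data. The variables rooms and price are hypothetical stand-ins for RM and MEDV, and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: a roughly linear relationship plus noise)
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=200)                       # stand-in for RM
price = 9.0 * rooms - 34.0 + rng.normal(0, 3, size=200)   # stand-in for MEDV

# scikit-learn expects a 2-D feature matrix
X = rooms.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the test set
lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)

# Residuals: observed minus predicted, one per test point
residuals = y_test - y_pred
print("number of residuals:", residuals.shape[0])
print("mean residual:", residuals.mean())
```

With data generated this way the residuals should average close to zero, since the linear model matches how the data was produced.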

Creating a Residual Plot

Using Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Here is how we create a residual plot using Matplotlib:

import matplotlib.pyplot as plt

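# Scatter plot of residuals against predicted values
plt.scatter(y_pred, residuals, alpha=0.5)

# Horizontal reference line at zero
plt.axhline(y=0, color='red', linestyle='--')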

# Title and labels
plt.title('Residual Plot')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')

# Show the plot
plt.show()

Using Seaborn

Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. Here is how we create a residual plot using Seaborn:

import seaborn as sns

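# Residual plot: residplot() fits its own regression of y on x,
# so pass the raw feature and target rather than precomputed residuals
sns.residplot(x=X_test['RM'], y=y_test, lowess=True)

# Show the plot
plt.show()

Seaborn’s residplot() function regresses y on x (a simple linear fit by default; the order parameter requests a polynomial fit) and plots the residuals of that fit around a horizontal line at zero. With lowess=True, it also draws a LOWESS smoother through the residuals, which makes any remaining trend easier to see. Because Seaborn does the fitting itself on the data you pass in, this plot does not depend on the lm model trained above.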
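To see what residplot() actually draws, the following sketch compares the plotted points with residuals computed by hand using np.polyfit. The data is synthetic, and reading the points back through ax.collections[0] is an assumption about how Seaborn lays out the axes (the first collection being the scatter), not a documented interface.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up linear data with noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

# residplot() regresses y on x and scatters the residuals of that fit
ax = sns.residplot(x=x, y=y)

# Points drawn on the axes (assumed to be the first scatter collection)
plotted = np.asarray(ax.collections[0].get_offsets())

# Residuals of a straight-line least-squares fit, computed by hand
slope, intercept = np.polyfit(x, y, deg=1)
manual = y - (slope * x + intercept)

print("plotted residuals match manual residuals:",
      np.allclose(plotted[:, 1], manual))
plt.close(ax.figure)
```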

Using StatsModels

StatsModels is a Python library built specifically for statistics. It is built on top of NumPy, SciPy, and matplotlib. Here’s how you can create a residual plot using StatsModels:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Define the model
model = smf.ols(formula='MEDV ~ RM', data=boston_df).fit()

# Plot residuals
sm.graphics.plot_regress_exog(model, 'RM', fig=plt.figure(figsize=(12,8)))
plt.show()

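The plot_regress_exog() function draws a 2×2 grid of four plots for the chosen regressor: the observed and fitted values against RM, the residuals against RM, a partial regression plot, and a CCPR (component and component-plus-residual) plot. The second of these is the residual plot.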
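If you want the numbers behind the plot, a fitted StatsModels OLS result exposes them as the resid and fittedvalues attributes. The sketch below uses a synthetic DataFrame whose RM and MEDV columns are made-up stand-ins, not the Boston data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical coefficients, for illustration only)
rng = np.random.default_rng(1)
df = pd.DataFrame({"RM": rng.uniform(4, 9, size=150)})
df["MEDV"] = 9.0 * df["RM"] - 34.0 + rng.normal(0, 3, size=150)

# Fit the same kind of formula model as above
model = smf.ols(formula="MEDV ~ RM", data=df).fit()

# Fitted values and residuals are available directly on the result object
fitted = model.fittedvalues
resid = model.resid

# resid is simply observed minus fitted
print("resid equals MEDV - fitted:", np.allclose(resid, df["MEDV"] - fitted))

# With an intercept in the model, in-sample residuals average to zero
print("mean in-sample residual:", resid.mean())
```

These two series can be passed straight to plt.scatter(fitted, resid) to draw the residual plot by hand.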

Interpretation

When interpreting the residual plot, we look for patterns.

  1. If the residuals are randomly dispersed around the horizontal zero line, the plot gives no evidence against a linear model for the data.
  2. If you can see a systematic pattern, like a curve or a U shape, this is an indication that the model isn’t capturing a non-linear relationship. Adding a transformed or polynomial term, or switching to a non-linear model, may help.
  3. If the residuals have roughly the same spread across the range of predicted values, that’s good. If they fan out as the predicted value increases, the error variance is not constant (heteroscedasticity), which means the model’s predictions are less reliable for higher values.
  4. If you see clusters of points, that suggests that your data might be split into groups that should be modeled separately.
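The curve and fan-out patterns described above can be reproduced on synthetic data. The sketch below fits a straight line to two made-up datasets and computes two rough numeric indicators; these are illustrative stand-ins for inspecting the plot by eye, not formal statistical tests, and the helper linear_residuals is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 400)

def linear_residuals(x, y):
    """Fit y = slope*x + intercept by least squares; return (fitted, residuals)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    return fitted, y - fitted

# Case 1: the true relationship is quadratic, so a line leaves a U shape
y_curved = 0.5 * x**2 + rng.normal(0, 1.0, size=x.size)
_, res_curved = linear_residuals(x, y_curved)
# Correlation of residuals with a centred squared term: near 0 if no curvature
curvature = np.corrcoef((x - x.mean()) ** 2, res_curved)[0, 1]

# Case 2: the relationship is linear but the noise grows with x (fan shape)
y_fan = 2.0 * x + rng.normal(0, 1.0, size=x.size) * (0.2 + 0.5 * x)
fitted_fan, res_fan = linear_residuals(x, y_fan)
# Compare residual spread in the upper vs lower half of the fitted values
order = np.argsort(fitted_fan)
low, high = np.array_split(res_fan[order], 2)
spread_ratio = high.std() / low.std()

print(f"curvature indicator: {curvature:.2f}")
print(f"spread ratio (upper half / lower half): {spread_ratio:.2f}")
```

A curvature indicator far from zero corresponds to the U shape in point 2, and a spread ratio well above 1 corresponds to the fan-out in point 3.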

Conclusion

In this article, we’ve looked at how to create a residual plot in Python using Matplotlib, Seaborn, and StatsModels. Creating a residual plot is a crucial step in validating the assumptions of your regression model, and it can reveal valuable insights about your data and the performance of your model. Remember, in a good model, residuals should be randomly scattered around zero for all predicted values, showing no apparent pattern. If there is a pattern in the residuals, consider revisiting your model to better capture the underlying structure of your data.
