How to Calculate Adjusted R-Squared in Python

Introduction

When building a regression model, one of the most common metrics for evaluating goodness of fit is R-squared. It tells us the proportion of the variance in the dependent variable that is explained by the independent variables. For an ordinary least squares model with an intercept, it ranges from 0 to 1, and a higher value indicates a better fit.
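
As a quick refresher, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal sketch with made-up numbers, purely for illustration:

import numpy as np

# Made-up observations and predictions, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(f'R-squared: {1 - ss_res / ss_tot:.4f}')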

However, R-squared has a well-known weakness: it never decreases when more predictors are added to the model. It either stays the same or increases, even when the new predictors add no real explanatory power. This is where the Adjusted R-squared comes into play. It adjusts the statistic for the number of independent variables in the model, so, unlike R-squared, it will decrease if unnecessary predictors are included.
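
The adjustment uses the sample size n and the number of predictors p. A minimal sketch of the standard formula (the helper name adjusted_r_squared is ours, not from any library):

def adjusted_r_squared(r_squared, n, p):
    # Standard adjustment: each additional predictor costs a degree of freedom
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# The same raw R-squared is penalized more heavily as predictors pile up
print(adjusted_r_squared(0.75, n=100, p=3))   # ~0.7422
print(adjusted_r_squared(0.75, n=100, p=30))  # ~0.6413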

In this article, we will walk through how to calculate the Adjusted R-squared in Python using the statsmodels library.

Data Preparation

We’ll use the California Housing dataset from sklearn.datasets to illustrate the calculation of the Adjusted R-squared (the classic Boston Housing dataset was removed in scikit-learn 1.2, so it is no longer available there):

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California housing dataset
housing = fetch_california_housing()

# Prepare DataFrame with the target column 'MedHouseVal' (median house value)
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['MedHouseVal'] = housing.target

Fitting a Linear Regression Model

Let’s start by fitting a multiple linear regression model using the Ordinary Least Squares (OLS) method from the statsmodels library:

import statsmodels.api as sm

# Define independent and dependent variables
X = housing_df.drop('MedHouseVal', axis=1)
y = housing_df['MedHouseVal']

# Add a constant to the independent variables
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

Here, we’re using all the features in the dataset (excluding the target ‘MedHouseVal’) as our independent variables.
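
Before pulling out any single number, it is often worth inspecting the full regression output; the header of the statsmodels summary table reports both R-squared and Adj. R-squared:

# The summary header shows R-squared and Adj. R-squared side by side
print(model.summary())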

Calculating Adjusted R-Squared

The statsmodels library makes it easy to retrieve the Adjusted R-squared directly from the fitted model:

# Print out the Adjusted R-squared
print(f'Adjusted R-squared: {model.rsquared_adj:.4f}')

The .rsquared_adj attribute of the fitted model gives us the Adjusted R-squared value. This is a better guide to how well the model generalizes, because it penalizes the inclusion of predictors that add little explanatory power.
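
To connect this attribute back to the formula shown earlier, you can reproduce the value by hand from the fitted model; in statsmodels, model.nobs is the number of observations and model.df_model is the number of predictors (the constant is excluded):

n = int(model.nobs)      # number of observations
p = int(model.df_model)  # number of predictors, excluding the constant

adj_r2_manual = 1 - (1 - model.rsquared) * (n - 1) / (n - p - 1)
print(f'Manual:      {adj_r2_manual:.4f}')
print(f'statsmodels: {model.rsquared_adj:.4f}')  # the two should match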

Interpretation

The Adjusted R-squared compensates for the addition of variables and only increases if the new predictor enhances the model more than expected by chance. It decreases when a predictor improves the model less than expected by chance. The Adjusted R-squared is always less than or equal to R-squared.

A model with more predictors can appear to fit better simply because it is more complex. The Adjusted R-squared helps you maintain parsimony: it encourages you to keep your models simple.
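
You can watch this penalty in action by appending a predictor that is pure noise to the design matrix X from earlier. R-squared will tick up slightly, while the Adjusted R-squared will typically fall (the column name 'NOISE' and the seed are arbitrary):

import numpy as np

# Add a column of random noise to the predictors used above
rng = np.random.default_rng(0)
X_noisy = X.copy()
X_noisy['NOISE'] = rng.normal(size=len(X_noisy))

noisy_model = sm.OLS(y, X_noisy).fit()
print(f'R-squared:          {model.rsquared:.4f} -> {noisy_model.rsquared:.4f}')
print(f'Adjusted R-squared: {model.rsquared_adj:.4f} -> {noisy_model.rsquared_adj:.4f}')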

Note that while a higher Adjusted R-squared indicates a better fit, it does not always mean a better model. No single statistic can tell you everything about a model’s quality; it’s essential to evaluate it with a range of metrics and diagnostic plots.

Conclusion

The Adjusted R-squared is a valuable tool for measuring the goodness of fit of a regression model and for discouraging overfitting, since it adjusts for the number of predictors in the model. It provides a more realistic picture of the model’s performance, especially when comparing models with different numbers of predictors.

In Python, calculating the Adjusted R-squared is simple with the statsmodels library. However, it’s crucial to interpret it in the context of your model and to remember that it is only one of many metrics used to evaluate performance.
