How to Calculate R-Squared in Python


Introduction

R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model. In other words, it provides a measure of how well the observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

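For an ordinary least squares model with an intercept, evaluated on the data it was fitted to, R-squared ranges from 0 to 1 (often quoted as 0% to 100%). An R-squared of 1 indicates that the model accounts for all of the variation in the dependent variable. Conversely, an R-squared of 0 indicates that the model explains none of the variability of the response data around its mean. On held-out data, or for a model fitted without an intercept, R-squared can also be negative, which means the model predicts worse than simply using the mean of the response.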

In this article, we’ll guide you through how to calculate R-squared in Python using the scikit-learn and statsmodels libraries.

Data Preparation

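For our demonstration, we’ll use the Boston Housing dataset from the sklearn datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older release:

from sklearn import datasets
import pandas as pd

# Load Boston housing dataset
# (requires scikit-learn < 1.2; load_boston was removed in 1.2)
boston = datasets.load_boston()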

# Prepare DataFrame
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

Fitting a Linear Regression Model

Let’s fit a linear regression model to our data. We’ll use the ‘RM’ feature (average number of rooms per dwelling) to predict ‘MEDV’ (median value of owner-occupied homes, in thousands of dollars).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features and target
X = boston_df[['RM']]
y = boston_df['MEDV']

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
lm = LinearRegression()
lm.fit(X_train, y_train)

Calculating R-Squared using Scikit-Learn

Once we have fitted the model, we can calculate R-squared using the .score() method from the LinearRegression object:

# Calculate R-squared
r_squared = lm.score(X_test, y_test)

print(f'R-squared: {r_squared}')

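The .score() method returns the R-squared of the model’s predictions on the data and labels passed to it, here the held-out test set. Because this is out-of-sample data, the value can differ from the training-set R-squared and can even be negative for a poorly fitting model.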
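The same number can also be computed from predictions with sklearn.metrics.r2_score, which is convenient when the predictions do not come from a scikit-learn estimator. The sketch below checks that the two routes agree. It uses small synthetic data so that it runs on its own; the data and the names X_demo, y_demo and lm_demo are assumptions for illustration, not part of the Boston example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(3, 9, size=(200, 1))
y_demo = 9.0 * X_demo[:, 0] - 34.0 + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

lm_demo = LinearRegression().fit(X_tr, y_tr)

# Route 1: the estimator's own score() method
score_r2 = lm_demo.score(X_te, y_te)

# Route 2: r2_score on explicit predictions (true values first)
metric_r2 = r2_score(y_te, lm_demo.predict(X_te))

print(f'score():    {score_r2}')
print(f'r2_score(): {metric_r2}')
```

Note the argument order of r2_score: the observed values come first and the predictions second. Swapping them generally gives a different result.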

Calculating R-Squared using StatsModels

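We can also calculate R-squared using the statsmodels library. Unlike scikit-learn’s LinearRegression, sm.OLS does not add a constant (intercept) term automatically, so we need to add it ourselves. Note that this example fits the model on the full dataset rather than the training split, so its R-squared is an in-sample value and is not expected to match the test-set score from scikit-learn.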

import statsmodels.api as sm

# Add a constant to the independent variables
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print out the R-squared
print(f'R-squared: {model.rsquared}')

sm.OLS() sets up a linear regression model to be estimated by Ordinary Least Squares, and the .fit() method estimates it on our data. The fitted results object exposes R-squared through its .rsquared attribute, which is read as a property rather than called as a method.

Interpretation

R-squared is a statistical measure that tells us the percentage of the response variable variation that is explained by a linear model. Or:

  • R-squared = Explained variation / Total variation

For a least squares model with an intercept, evaluated on its training data, R-squared lies between 0 and 100%:

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.
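To make the ratio concrete, R-squared can be computed by hand as 1 minus the residual sum of squares divided by the total sum of squares, which equals explained variation over total variation for a least squares fit with an intercept. The helper r_squared below is a hypothetical name used for illustration, and the three-point data is made up to keep the arithmetic easy to follow.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_obs = [1.0, 2.0, 3.0]

perfect = r_squared(y_obs, [1.0, 2.0, 3.0])      # 1.0: every point predicted exactly
baseline = r_squared(y_obs, [2.0, 2.0, 2.0])     # 0.0: always predicting the mean
worse = r_squared(y_obs, [3.0, 2.0, 1.0])        # -3.0: worse than predicting the mean

print(perfect, baseline, worse)
```

The third case shows that predictions which are worse than the mean push the value below zero, which can happen when a model is scored on data it was not fitted to.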

In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline:

  • R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.
  • R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!
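The second caveat can be illustrated with a straight line fitted to data that is actually curved. In the sketch below, the noise-free quadratic data is an assumption chosen for illustration: R-squared comes out high, yet the residuals show a systematic pattern that a residual plot would reveal immediately.

```python
import numpy as np

# Curved relationship with no noise (hypothetical data for illustration)
x = np.linspace(0, 10, 50)
y = x ** 2

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R-squared of the straight-line fit
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(f'R-squared: {r2:.3f}')

# Residual signs at the left end, the middle and the right end
mid = len(x) // 2
print(residuals[0] > 0, residuals[mid] < 0, residuals[-1] > 0)
```

The residuals are positive at both ends and negative in the middle, so the line systematically underpredicts at the extremes and overpredicts in between, even though R-squared on its own looks reassuring.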

Conclusion

R-squared is a useful statistical measure for determining the goodness of fit of a regression model. In Python, it can be computed easily using libraries such as scikit-learn and statsmodels. However, it is just one of the metrics to evaluate model performance and it is crucial to use other validation techniques and metrics to get a holistic view of your model’s performance.
