### Introduction

R-squared, also known as the coefficient of determination, is a statistical measure of the proportion of the variance in a dependent variable that is explained by the independent variable or variables in a regression model. In other words, it measures how well the model reproduces the observed outcomes, expressed as the share of the total variation in those outcomes that the model explains.

R-squared is usually reported on a scale from 0 to 1 (equivalently, 0% to 100%). A value of 1 indicates that all variation in the dependent variable is explained by the independent variable(s). Conversely, a value of 0 indicates that the model explains none of the variability of the response data around its mean. On data the model was not fitted to, R-squared can also be negative, which means the model predicts worse than simply using the mean.
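R-squared can fall below zero when predictions are scored against data the model was not fitted to. A minimal sketch with made-up numbers (the arrays below are illustrative and not taken from any dataset), using `r2_score` from scikit-learn:

```python
from sklearn.metrics import r2_score

# Illustrative values only: predictions that run opposite to the truth
y_true = [1.0, 2.0, 3.0]
y_pred = [3.0, 2.0, 1.0]

# SS_res = 8 and SS_tot = 2, so R-squared = 1 - 8/2 = -3
r2_negative = r2_score(y_true, y_pred)
print(f'R-squared: {r2_negative}')
```

Here the predictions do worse than always guessing the mean of `y_true`, which is what a negative value signals.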

In this article, we’ll guide you through how to calculate R-squared in Python using the `scikit-learn` and `statsmodels` libraries.

### Data Preparation

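For our demonstration, we’ll use the Boston Housing dataset from the `sklearn` datasets. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below requires an older release: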

```
from sklearn import datasets
import pandas as pd
# Load Boston housing dataset
boston = datasets.load_boston()
# Prepare DataFrame
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target
```

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

### Fitting a Linear Regression Model

Let’s fit a linear regression model to our data. We’ll use the `RM` feature (average number of rooms per dwelling) to predict `MEDV` (median value of owner-occupied homes, in thousands of dollars).

```
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Features and target
X = boston_df[['RM']]
y = boston_df['MEDV']
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
lm = LinearRegression()
lm.fit(X_train, y_train)
```

### Calculating R-Squared using Scikit-Learn

Once we have fitted the model, we can calculate R-squared using the `.score()` method of the `LinearRegression` object:

```
# Calculate R-squared
r_squared = lm.score(X_test, y_test)
print(f'R-squared: {r_squared}')
```

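The `.score()` method returns the R-squared of the model’s predictions on the given test data and labels. Because it is evaluated on the held-out test set here, it measures how well the model generalizes rather than how well it fits the training data.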
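To see what `.score()` computes, it may help to compare it with `r2_score` from `sklearn.metrics` applied to the model’s own predictions. The sketch below uses synthetic data (a made-up linear relationship with noise) rather than the housing data, so the `_demo` names and the numbers in the data-generating line are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on x, plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(3, 9, size=(200, 1))
y_demo = 9.0 * X_demo[:, 0] - 35.0 + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
lm_demo = LinearRegression().fit(X_tr, y_tr)

# .score() should agree with r2_score computed on the predictions
score_r2 = lm_demo.score(X_te, y_te)
metric_r2 = r2_score(y_te, lm_demo.predict(X_te))
print(f'.score(): {score_r2:.4f}, r2_score: {metric_r2:.4f}')
```

Using `r2_score` directly is handy when the predictions come from somewhere other than a scikit-learn estimator.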

### Calculating R-Squared using StatsModels

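We can also calculate R-squared using the `statsmodels` library. Unlike scikit-learn, `statsmodels` doesn’t automatically add a constant (the intercept term) to our data, so we need to add it manually. Note that the example below fits the model on the full dataset rather than the training split, so the R-squared it reports is an in-sample value and will not match the test-set score from scikit-learn.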

```
import statsmodels.api as sm
# Add a constant (intercept column) to the independent variables;
# use a new name so the original X is not overwritten
X_const = sm.add_constant(X)
# Fit the model
model = sm.OLS(y, X_const).fit()
# Print out the R-squared
print(f'R-squared: {model.rsquared}')
```

The `sm.OLS()` call sets up a linear regression model to be estimated by Ordinary Least Squares, and the `.fit()` method estimates it on our data. Once we have the fitted model, we can get the R-squared from its `.rsquared` attribute.

### Interpretation

R-squared tells us the percentage of the variation in the response variable that is explained by a linear model. As a formula:

- R-squared = Explained variation / Total variation
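For ordinary least squares with an intercept, this ratio is equivalent to 1 minus the residual sum of squares divided by the total sum of squares. A minimal sketch of that calculation with NumPy, on made-up numbers (the helper name `r_squared_manual` is hypothetical, not part of any library):

```python
import numpy as np

def r_squared_manual(y_true, y_pred):
    """Compute R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_obs = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]

# SS_res = 1 and SS_tot = 20, so R-squared = 1 - 1/20 = 0.95
r2_manual = r_squared_manual(y_obs, y_hat)
print(f'R-squared: {r2_manual:.2f}')
```

Perfect predictions give 1, and always predicting the mean of the observed values gives 0.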

For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared is always between 0 and 100%:

- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data. However, this guideline comes with important caveats:

- R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.
- R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!
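The second caveat can be illustrated by fitting a straight line to data that is actually quadratic. In this made-up example, R-squared comes out high, yet the residuals show a clear pattern (positive at both ends, negative in the middle), which is the kind of problem a residual plot would reveal:

```python
import numpy as np

# Illustrative data: an exactly quadratic relationship
x = np.arange(11, dtype=float)
y = x ** 2

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# R-squared of the straight-line fit
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_line = 1.0 - ss_res / ss_tot

print(f'R-squared: {r2_line:.3f}')
print('Residuals:', np.round(residuals, 1))
```

A high score here says little about whether a straight line is the right model, since the systematic curvature in the residuals shows it is not.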

### Conclusion

R-squared is a useful statistical measure for assessing the goodness of fit of a regression model. In Python, it can be computed easily using libraries such as `scikit-learn` and `statsmodels`. However, it is just one of several metrics for evaluating model performance, and it is crucial to use other validation techniques and metrics to get a holistic view of your model’s performance.