The SST (Total Sum of Squares), SSR (Regression Sum of Squares), and SSE (Sum of Squared Errors) are critical metrics in regression analysis. They are used to evaluate the goodness-of-fit of a regression model, with SST being the total variability in the dataset, SSR representing the part of the variability explained by the regression model, and SSE being the part of the variability that the model failed to capture.
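In symbols, for observed values $y_i$, model predictions $\hat{y}_i$, and the mean $\bar{y}$ of the observed values, these quantities are defined as:

$$
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$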

In this article, we will walk through how to calculate these metrics in Python. We will be using the `numpy` and `sklearn` libraries, so make sure you have them installed.

**Step 1: Import Necessary Libraries**

First, you’ll need to import the libraries necessary for this task.

```
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

**Step 2: Generate or Import Your Data**

We’ll create a simple dataset for illustration purposes.

```
# Generate dataset
np.random.seed(0) # for reproducibility
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
```

In this code, we are generating data for a simple linear regression, with the true relationship being y = 4 + 3x + noise.

**Step 3: Split Your Data into Training and Testing Sets**

Next, split your data into a training set and a test set.

```
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

**Step 4: Fit Your Regression Model**

You can now fit your regression model using the training data.

```
# Create a LinearRegression instance
model = LinearRegression()
# Fit the model
model.fit(X_train, y_train)
```
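As an optional sanity check (not part of the original steps), you can inspect the fitted intercept and slope. The sketch below refits the model on the full toy dataset for simplicity; with this synthetic data the estimates should land near the true values of 4 and 3.

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the same toy dataset used earlier
np.random.seed(0)
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

model = LinearRegression().fit(X, y)
# The estimates should be reasonably close to the true 4 and 3 (noise perturbs them)
print(model.intercept_, model.coef_)
```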

**Step 5: Make Predictions**

After the model has been trained, you can make predictions on the test set.

```
# Make predictions
y_pred = model.predict(X_test)
```

**Step 6: Calculate SSE, SSR, and SST**

Now, we can calculate the SSE, SSR, and SST.

```
# Calculate the mean y
y_mean = np.mean(y_test)
# Calculate SSE
sse = np.sum((y_test - y_pred) ** 2)
# Calculate SSR
ssr = np.sum((y_pred - y_mean) ** 2)
# Calculate SST
sst = np.sum((y_test - y_mean) ** 2)
print(f"SSE: {sse}")
print(f"SSR: {ssr}")
print(f"SST: {sst}")
```

SSE, SSR, and SST have the following interpretations:

- SSE: This is the sum of the squares of the prediction errors, which measures the discrepancy between the data points and the estimation model. A smaller SSE indicates a better fit of the model to the data.
- SSR: This measures the amount of variability in the response variable that is explained by the model. A larger SSR indicates a better fit of the model.
- SST: This measures the total variability in the response variable.

Remember: in a good regression model, SSE should be small relative to SST, so that SSR accounts for most of the total variability.

One important property of these metrics is the decomposition SST = SSR + SSE. Note that this identity holds exactly for ordinary least squares with an intercept only when it is evaluated on the same data the model was fitted on; on a held-out test set, as in Step 6 above, it generally holds only approximately. This decomposition is the basis for the coefficient of determination, R-squared (R²) = SSR / SST = 1 − SSE / SST, a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the model.
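The decomposition and its connection to R² can be verified numerically. The sketch below refits the tutorial's model and evaluates it on the full dataset, where the OLS identity holds exactly, and compares a hand-computed R² against `sklearn`'s `r2_score`:

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same toy data as earlier in the article
np.random.seed(0)
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Fit and evaluate on the same data so SST = SSR + SSE holds exactly
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
y_mean = np.mean(y)

sse = np.sum((y - y_pred) ** 2)
ssr = np.sum((y_pred - y_mean) ** 2)
sst = np.sum((y - y_mean) ** 2)

print(np.isclose(sst, ssr + sse))                      # True: the decomposition holds
print(np.isclose(1 - sse / sst, r2_score(y, y_pred)))  # True: R² = 1 - SSE/SST
```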

We hope this article provides a comprehensive understanding of how to calculate SST, SSR, and SSE in Python. Understanding these concepts is essential in interpreting the results of your regression model, and in assessing its quality and goodness-of-fit.