How to Calculate Studentized Residuals in Python

Residuals play a vital role in evaluating the goodness of fit of linear regression models. Raw residuals give a first impression of model performance, but they are expressed in the units of the response and their variance differs from point to point. Standardizing or ‘studentizing’ the residuals puts them on a common scale, which makes them particularly useful for detecting outliers and heteroscedasticity (unequal scatter) in the residuals.

In this article, we will look at externally studentized residuals (often simply called studentized residuals), discuss their advantages, and provide a step-by-step guide on how to calculate them in Python using libraries such as numpy, statsmodels, and scipy.

Understanding Studentized Residuals

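Studentized residuals are a type of standardized residual. A standardized (internally studentized) residual is the raw residual divided by an estimate of its standard deviation, computed from the full model fit. An externally studentized residual takes this a step further: the standard deviation is estimated from a fit that excludes the observation in question, so a single extreme point does not distort the scale it is measured against.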

The externally studentized residual for the ith observation is computed as follows:

Studentized Residual_i = Residual_i / sqrt(MSE_(i) * (1 - h_ii))

Here, Residual_i is the ith raw residual, MSE_(i) is the mean squared error of the model refit with the ith observation removed, and h_ii is the leverage of the ith observation (the ith diagonal element of the hat matrix). Because MSE_(i) excludes the ith observation, an outlying point cannot inflate its own scale estimate, which makes it easier to see whether that point is an outlier.
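
To make the definition concrete, here is a minimal sketch that uses only numpy. It assumes the usual definition of the externally studentized residual, in which the raw residual is divided by sqrt(MSE_(i) * (1 - h_ii)), with h_ii the leverage of observation i. The synthetic data, the seed, and the variable names are illustrative choices. The loop literally refits the model without each observation, and the last lines compute the same quantity from a single fit using a standard leave-one-out identity.

```python
import numpy as np

# Synthetic data (illustrative only; seed and sample size are arbitrary)
rng = np.random.default_rng(0)
n = 30
x = rng.random(n)
y = 1 + 2 * x + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
p = X.shape[1]

# Full-data fit: raw residuals and leverage (diagonal of the hat matrix)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Brute force: refit without observation i to obtain MSE_(i)
brute_force = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sse_i = np.sum((y[keep] - X[keep] @ beta_i) ** 2)
    mse_i = sse_i / (n - 1 - p)
    brute_force[i] = resid[i] / np.sqrt(mse_i * (1 - leverage[i]))

# Shortcut: the same values from the full fit alone
mse_shortcut = (np.sum(resid**2) - resid**2 / (1 - leverage)) / (n - p - 1)
shortcut = resid / np.sqrt(mse_shortcut * (1 - leverage))

print(np.allclose(brute_force, shortcut))
```

The two results should agree to floating-point precision, which is why the step-by-step calculation never needs to refit the model.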

Benefits of Studentized Residuals

Studentized residuals are more informative than raw residuals because each one is scaled by an estimate of its own standard deviation, so values are comparable across observations. This makes them especially valuable in identifying outliers, which can unduly influence the fitted model and leave it inaccurate or unstable.

Another advantage of studentized residuals is their ability to reveal heteroscedasticity (unequal scatter) in the residuals. Heteroscedasticity is a violation of one of the assumptions of linear regression, and identifying this can help improve the model.
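
As a rough illustration of the heteroscedasticity point, the sketch below builds a dataset whose noise grows with the predictor and compares the spread of the studentized residuals in the lower and upper halves of the fitted values. The helper externally_studentized, the data, and the half-split comparison are assumptions made for this example; in practice a plot of studentized residuals against fitted values is the more common check.

```python
import numpy as np

def externally_studentized(X, y):
    """Return (studentized residuals, fitted values) for an OLS fit.

    X must already contain the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    # Diagonal of the hat matrix without forming the full n-by-n matrix
    leverage = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    mse_i = (np.sum(resid**2) - resid**2 / (1 - leverage)) / (n - p - 1)
    return resid / np.sqrt(mse_i * (1 - leverage)), fitted

# Synthetic data: the noise standard deviation increases with x
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0, 1, n)
noise_sd = 0.2 + 2.0 * x
y = 1 + 2 * x + noise_sd * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])

t, fitted = externally_studentized(X, y)

# Compare the spread in the lower and upper halves of the fitted values
low = t[fitted <= np.median(fitted)]
high = t[fitted > np.median(fitted)]
print(low.std(), high.std())
```

With constant noise the two spreads would be similar. Here the upper half should be visibly more scattered, which is the pattern that signals heteroscedasticity.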

Calculating Studentized Residuals in Python

In Python, we can calculate Studentized residuals by using a combination of numpy, statsmodels, and scipy libraries.

Step 1: Import Required Libraries

First, import the required libraries:

import numpy as np
import statsmodels.api as sm
import scipy.stats as stats

Step 2: Create or Load Dataset

For simplicity, let’s create a simple linear regression problem with one predictor variable and one response variable:

# predictor variable
X = np.random.rand(100, 1)

# response variable
Y = 1 + 2*X + np.random.randn(100, 1)

Here, Y is a linear function of X with some added Gaussian noise.

Step 3: Fit a Linear Regression Model

Next, we’ll fit a linear regression model to our data using statsmodels:

# add constant to predictor variables
X = sm.add_constant(X)

# fit the model
model = sm.OLS(Y, X).fit()

Step 4: Calculate Raw Residuals

The raw residuals are the actual values of Y minus the predicted values of Y. statsmodels stores them in the resid attribute of the fitted model:

# calculate raw residuals
raw_residuals = model.resid

Step 5: Calculate MSE with the ith Observation Removed

The mean squared error with the ith observation removed (MSE_(i)) does not require refitting the model once per observation. It can be obtained from the full fit using the leverage of each observation and the sum of squared residuals:

# calculate leverage
hat_matrix = X.dot(np.linalg.inv(X.T.dot(X))).dot(X.T)
leverage = np.diag(hat_matrix)

# calculate MSE with the ith observation removed
# (X.shape[1] counts the constant, so the degrees of freedom are n - p - 1)
mse_i = (np.sum(raw_residuals**2) - raw_residuals**2 / (1 - leverage)) / (len(X) - X.shape[1] - 1)
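
Forming the full hat matrix needs memory proportional to n squared, which becomes expensive for large datasets. A minimal sketch of one alternative, assuming X has full column rank: take a thin QR decomposition of X and sum the squares of each row of Q. The data and variable names below are illustrative.

```python
import numpy as np

# Illustrative design matrix: intercept plus one random predictor
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.random(n)])

# Reference: diagonal of the full n-by-n hat matrix
hat_matrix = X @ np.linalg.inv(X.T @ X) @ X.T
leverage_full = np.diag(hat_matrix)

# Thin QR: X = QR with Q of shape (n, p), so H = Q Q^T
# and h_ii is the squared norm of row i of Q
Q, _ = np.linalg.qr(X)
leverage_qr = np.sum(Q**2, axis=1)

print(np.allclose(leverage_full, leverage_qr))
print(leverage_qr.sum())  # the trace of H equals the number of columns of X
```

The leverages always sum to the number of model parameters (here 2, for the intercept and the slope), which is a quick way to check the calculation.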

Step 6: Calculate Studentized Residuals

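Finally, we’ll calculate the studentized residuals by dividing each raw residual by its leave-one-out standard error, which includes the leverage factor (1 - h_ii):

# calculate studentized residuals
studentized_residuals = raw_residuals / np.sqrt(mse_i * (1 - leverage))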

Step 7: Check for Outliers

The studentized residuals can help us identify outliers in our data. As a common rule of thumb, any data point with a studentized residual greater than 3 or less than -3 is flagged as a potential outlier and deserves a closer look:

# identify outliers
outliers = np.absolute(studentized_residuals) > 3
print(outliers)
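
The cutoff of 3 is a convention rather than a derived value. Under the usual regression assumptions, an externally studentized residual follows a t distribution with n - p - 1 degrees of freedom, so scipy can supply a cutoff for a chosen significance level. The sketch below is one possible approach, not the only one: it assumes n = 100 observations and p = 2 parameters as in the example above, an illustrative alpha of 0.05, and a Bonferroni correction to account for testing every observation.

```python
import scipy.stats as stats

n, p = 100, 2   # observations and parameters (intercept + slope), as in the example
alpha = 0.05    # illustrative significance level

df = n - p - 1  # degrees of freedom of the externally studentized residual

# Two-sided cutoff for a single, pre-specified observation
pointwise_cutoff = stats.t.ppf(1 - alpha / 2, df)

# Bonferroni-adjusted cutoff when all n observations are screened at once
bonferroni_cutoff = stats.t.ppf(1 - alpha / (2 * n), df)

print(pointwise_cutoff, bonferroni_cutoff)
```

The adjusted cutoff is the stricter of the two, and observations whose absolute studentized residual exceeds it are stronger outlier candidates than those that merely pass the rule of thumb.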

Conclusion

Studentized residuals are a powerful tool for evaluating the performance of a linear regression model, identifying outliers, and detecting heteroscedasticity. Understanding how to calculate and interpret studentized residuals can help improve the robustness and reliability of your regression analysis.

By walking through the process of calculating studentized residuals in Python, we have seen how libraries like numpy, statsmodels, and scipy can streamline the process. By taking advantage of these libraries and understanding the underlying principles, you can effectively assess and refine your linear regression models.
