
The Akaike Information Criterion (AIC) is a measure used for comparing and selecting among statistical models that have been fitted to the same data. It balances the complexity of a model against how well the model fits the data. In this guide, we will learn how to calculate the AIC of regression models in Python.
Part 1: Understanding the Akaike Information Criterion (AIC)
Before jumping into Python code, it’s crucial to understand the theoretical underpinnings of AIC.
The Akaike Information Criterion (AIC) is a method for scoring and selecting a model. Named after the statistician Hirotugu Akaike, the AIC not only rewards the goodness of fit but also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is a problem when a model is too complex and captures the noise along with the underlying structure in the data.
The formula for AIC is:
AIC = 2k - 2ln(L)
Where:
- k is the number of estimated parameters in the model.
- L is the maximum value of the likelihood function for the model, so ln(L) is the maximized log-likelihood.
When comparing models fitted to the same data, the one with the lowest AIC is usually considered the best.
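As a quick illustration with made-up numbers, suppose model A has 3 estimated parameters and a maximized log-likelihood of -120, while model B has 6 parameters and a log-likelihood of -118. Plugging these hypothetical values into the formula shows that model B's extra parameters do not improve the fit enough to offset the penalty:
# Hypothetical values, for illustration only
k_a, ll_a = 3, -120.0
k_b, ll_b = 6, -118.0
aic_a = 2*k_a - 2*ll_a  # 2*3 - 2*(-120) = 246
aic_b = 2*k_b - 2*ll_b  # 2*6 - 2*(-118) = 248
print('Model A AIC:', aic_a)  # lower AIC, so model A is preferred
print('Model B AIC:', aic_b)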
Part 2: Computing AIC in Python
With a firm grasp of the theory behind AIC, we can now delve into how to compute it in Python. We’ll use a dataset, fit a regression model to it, and then calculate the AIC for this model.
Step 1: Import Necessary Libraries
First, we import the necessary Python libraries.
import numpy as np
import pandas as pd
import statsmodels.api as sm
Step 2: Load and Preprocess Data
We will load and preprocess our data. Here, we assume that we have a dataset ‘data.csv’ with two independent variables ‘X1’ and ‘X2’ and one dependent variable ‘Y’. We load it using pandas and separate the independent and dependent variables.
# Load the dataset
data = pd.read_csv('data.csv')
# Split independent and dependent variables
X = data[['X1', 'X2']]
Y = data['Y']
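If you do not have a 'data.csv' file on hand, you can substitute a small synthetic dataset. The sketch below is purely illustrative (the coefficients and noise level are arbitrary), but it produces variables with the same names, so the rest of the guide runs unchanged.
# Optional: build a synthetic dataset instead of reading data.csv
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2*x1 - 1.5*x2 + rng.normal(scale=0.5, size=n)  # arbitrary relationship plus noise
data = pd.DataFrame({'X1': x1, 'X2': x2, 'Y': y})
X = data[['X1', 'X2']]
Y = data['Y']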
Step 3: Create and Fit the Model
Next, we’ll create an Ordinary Least Squares (OLS) model with ‘Y’ as the dependent variable and ‘X1’ and ‘X2’ as the independent variables. We also add a constant to the independent variables.
# Add constant to independent variables
X = sm.add_constant(X)
# Create the model
model = sm.OLS(Y, X)
# Fit the model
results = model.fit()
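Before computing the AIC, it can be helpful to inspect the fit. The summary table printed by statsmodels includes the coefficient estimates along with the AIC itself.
# Inspect the fitted model (the summary also reports the AIC)
print(results.summary())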
Step 4: Calculate AIC
Statsmodels provides a straightforward way to compute AIC from the fitted model: the .aic attribute of the results object gives the AIC value.
aic = results.aic
print('AIC: ', aic)
If you want to calculate AIC manually, you can use the formula mentioned earlier. First, get the number of parameters (k) and the maximized log-likelihood (LL).
k = len(results.params)  # number of estimated parameters, including the constant
LL = results.llf         # maximized log-likelihood of the fitted model
# Calculate AIC
aic_manual = 2*k - 2*LL
print('Manually computed AIC: ', aic_manual)
The value obtained from the manual calculation should match the one given by the results.aic attribute.
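A quick check confirms that the two values agree:
# Compare the manual value with the statsmodels attribute
print(np.isclose(results.aic, aic_manual))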
Conclusion
In this guide, we have explained how to calculate the Akaike Information Criterion (AIC) in Python. AIC is an essential tool in model selection, helping to strike a balance between the complexity of a model and how well it fits the data. By comparing the AIC values of different models, we can select the one that best suits our data, while avoiding overfitting.
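As a closing illustration of such a comparison, here is a minimal sketch that reuses the data from Part 2, fits a reduced model with only 'X1', and keeps whichever candidate has the lower AIC:
# Fit a simpler candidate model that drops X2
X_reduced = sm.add_constant(data[['X1']])
results_reduced = sm.OLS(Y, X_reduced).fit()
print('Full model AIC:   ', results.aic)
print('Reduced model AIC:', results_reduced.aic)
# Prefer the candidate with the lower AIC
best = 'full' if results.aic < results_reduced.aic else 'reduced'
print('Preferred model:', best)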