Model selection plays an essential role in statistical modeling, especially when we have multiple predictor variables. One of the challenges that data analysts and researchers face is determining the subset of predictors that should be included in the final model. In R, the `regsubsets()` function from the `leaps` package provides a comprehensive solution for this challenge. This article will provide a deep understanding of the usage, interpretation, and nuances of the `regsubsets()` function for model selection.
Overview:
- Introduction to Model Selection
- Basics of `regsubsets()`
- Exploring and Interpreting Output
- Criteria for Model Selection
- Plotting and Visualization
- Predictions and Validation
- Conclusion
1. Introduction to Model Selection
When we have multiple predictors, it’s not always clear which combination of variables will produce the best model. By “best,” we might mean the most predictive, the simplest, or some combination of the two. Selecting the optimal model involves considering:
- Underfitting: Too few predictors can miss out on significant relationships.
- Overfitting: Too many predictors can lead to a model that performs well on training data but poorly on new data.
Various techniques, like forward selection, backward elimination, and best subset selection, help in model selection.
2. Basics of regsubsets()
To use `regsubsets()`, you first need to install and load the `leaps` package:
install.packages("leaps")
library(leaps)
The basic usage is:
results <- regsubsets(formula, data, nbest, method)
Where:
- `formula`: A symbolic description of the model.
- `data`: The dataset being used.
- `nbest`: How many of the best models of each size to report. Default is 1.
- `method`: Can be `"exhaustive"`, `"forward"`, or `"backward"`.
Example:
library(leaps)
data(mtcars)
results <- regsubsets(mpg ~ ., data = mtcars)
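By default, `regsubsets()` only considers subsets of up to eight variables; the `nvmax` argument raises that cap. As a sketch, you might compare an exhaustive (best subset) search against forward stepwise selection on the same data:

```r
library(leaps)
data(mtcars)

# Exhaustive (best subset) search over all 10 predictors of mpg;
# nvmax = 10 lets every subset size be considered
exh <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "exhaustive")

# Forward stepwise search on the same data, for comparison
fwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
```

With only a handful of predictors the exhaustive search is fast; forward or backward methods become useful when the predictor count makes enumerating all subsets impractical.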
3. Exploring and Interpreting Output
After running `regsubsets()`, you can explore the results with:
summary(results)
This output includes:
- Which predictors are included in the best model of each size: Indicated by asterisks.
- Residual Sum of Squares (RSS)
- Adjusted R-squared
- Cp (Mallows's Cp)
- Bayesian Information Criterion (BIC)
These statistics help decide the optimal number of predictors.
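These statistics are stored as components of the summary object, so you can inspect them directly rather than reading them off the printed table. A short sketch:

```r
library(leaps)
data(mtcars)
results <- regsubsets(mpg ~ ., data = mtcars)
s <- summary(results)

s$which   # logical matrix: which predictors enter the best model of each size
s$rss     # residual sum of squares for each model size
s$adjr2   # adjusted R-squared
s$cp      # Mallows's Cp
s$bic     # BIC
```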
4. Criteria for Model Selection
When selecting a model, various criteria can be used:
- Adjusted R-squared: Accounts for the number of predictors in the model. Higher values are preferred.
- Mallows's Cp: A model with Cp close to the number of predictors plus one (for the intercept) is preferred.
- BIC: Lower values indicate better models.
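Because these criteria live in the summary object, the best model size under each one can be found programmatically. A common shortcut is to take the extreme value of each criterion (for Cp, the minimum is often used in practice, though strictly the rule is Cp near the parameter count):

```r
library(leaps)
data(mtcars)
s <- summary(regsubsets(mpg ~ ., data = mtcars))

which.max(s$adjr2)  # model size with the highest adjusted R-squared
which.min(s$cp)     # model size with the smallest Mallows's Cp
which.min(s$bic)    # model size with the lowest BIC
```

The criteria will not always agree; BIC penalizes model size more heavily than adjusted R-squared, so it tends to pick smaller models.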
5. Plotting and Visualization
Visualization can aid in understanding which model size is optimal. Use the `plot()` function to visualize statistics for different model sizes:
plot(results, scale = "r2")

You can replace `"r2"` with `"adjr2"`, `"Cp"`, or `"bic"` to plot other criteria.
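Beyond the built-in `plot()` method, it is often helpful to plot a criterion directly against model size with base graphics and mark the optimum; a minimal sketch:

```r
library(leaps)
data(mtcars)
s <- summary(regsubsets(mpg ~ ., data = mtcars))

# BIC against number of predictors, with the minimum highlighted in red
plot(s$bic, type = "b", xlab = "Number of predictors", ylab = "BIC")
points(which.min(s$bic), min(s$bic), col = "red", pch = 19)
```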
6. Predictions and Validation
Once you decide on an optimal model size, you can extract the coefficients and use them for predictions:
coefficients(results, id = 4) # Extract coefficients for the best 4-variable model.
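Note that `leaps` does not ship a `predict()` method for `regsubsets` objects, so predictions on new data are usually made by hand: build the model matrix for the new data and multiply it by the extracted coefficients. A sketch of such a helper (the name `predict_regsubsets` is our own, not part of the package):

```r
library(leaps)
data(mtcars)
results <- regsubsets(mpg ~ ., data = mtcars)

# Hypothetical helper: predict from the best id-variable model
predict_regsubsets <- function(object, newdata, id, formula) {
  mat   <- model.matrix(formula, newdata)  # design matrix for the new data
  coefi <- coef(object, id = id)           # coefficients of the chosen model
  mat[, names(coefi)] %*% coefi            # fitted values
}

preds <- predict_regsubsets(results, mtcars, id = 4, formula = mpg ~ .)
```

For honest validation, fit `regsubsets()` on a training set and evaluate these predictions on held-out data rather than on the data used for selection.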
7. Conclusion
The `regsubsets()` function in R is a powerful tool for tackling the challenges of model selection when faced with multiple predictor variables. By examining various criteria, both statistical and graphical, you can make an informed decision about which variables to include in your final model. It's essential, however, to also consider domain knowledge and the purpose of the model when making final decisions. Remember that while `regsubsets()` offers a structured approach to model selection, the final judgment always lies in the hands of the analyst.