
In this post, you will learn how to do feature selection with SelectKBest in scikit-learn.
Why Do We Do Feature Selection?
Some of the reasons for doing feature selection are –
1. A more interpretable model
2. Faster training and prediction
3. Less storage for the model and data
How to do Feature Selection with SelectKBest?
The SelectKBest method selects the k features with the highest scores, as computed by a scoring function. For regression problems we use scoring functions such as f_regression, and for classification problems we use chi2 or f_classif.
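For reference, all of these scorers live in sklearn.feature_selection. The mutual-information variants shown below are not used in this post, but they are drop-in alternatives from the same module:
from sklearn.feature_selection import (
    SelectKBest,
    f_regression, mutual_info_regression,   # regression scorers
    chi2, f_classif, mutual_info_classif,   # classification scorers
)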
SelectKBest for Regression –
Let’s first look at a regression problem.
# imports
from sklearn import datasets
import pandas as pd

# regression dataset
housing = datasets.fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target
X_housing

y_housing
output -
array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
# split data into training and test set
from sklearn.model_selection import train_test_split
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42)
Let’s say that we want to keep only the 4 most informative features out of the 8 features in this dataset.
from sklearn.feature_selection import SelectKBest, f_regression
select_reg = SelectKBest(k=4, score_func=f_regression)
select_reg.fit(X_train_housing, y_train_housing)
X_train_housing_new = select_reg.transform(X_train_housing)
X_train_housing_new.shape
output -
(16512, 4)
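Under the hood, the fitted selector stores each feature’s score in its scores_ attribute (and the matching p-values in pvalues_), so you can inspect what drove the selection. A quick way to view them, using the select_reg fitted above:
# per-feature F-scores from the fitted selector, highest first
scores = pd.Series(select_reg.scores_, index=X_train_housing.columns)
print(scores.sort_values(ascending=False))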
To see which features were kept by SelectKBest, we can use the get_support() method, which returns a boolean mask over the columns.
kept_features = pd.DataFrame({'columns': X_train_housing.columns,
                              'Kept': select_reg.get_support()})
kept_features

To get a new dataframe with only the selected columns, you can use the following code.
new_df = X_train_housing.iloc[:, select_reg.get_support()]
new_df

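One step that is easy to miss: the selector fitted on the training set should also be applied to the test set, so both keep the same 4 columns. A short sketch with the select_reg from above:
# apply the fitted selector to the test set as well
X_test_housing_new = select_reg.transform(X_test_housing)
X_test_housing_new.shape
output -
(4128, 4)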
SelectKBest for Classification –
# classification dataset
cancer = datasets.load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target
# split data into training and test set
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42)
Let’s say we want to keep only the 10 most informative features. Note that chi2 requires non-negative feature values, which is the case for this dataset.
from sklearn.feature_selection import SelectKBest, chi2
select_class = SelectKBest(k=10, score_func=chi2)
select_class.fit(X_train_cancer, y_train_cancer)
X_train_cancer_new = select_class.transform(X_train_cancer)
print("Num Features before:", X_train_cancer.shape[1])
print("Num Features after:", X_train_cancer_new.shape[1])
output -
Num Features before: 30
Num Features after: 10
new_cancer_df = X_train_cancer.iloc[:, select_class.get_support()]
new_cancer_df

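As a closing sketch (not part of the walkthrough above), SelectKBest also fits naturally into a scikit-learn Pipeline, so the selection is re-fit on each training fold and applied automatically at prediction time. MinMaxScaler is chosen here because it keeps the features non-negative, which chi2 requires:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# scale to [0, 1] (stays non-negative for chi2), select 10 features, then classify
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('select', SelectKBest(score_func=chi2, k=10)),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train_cancer, y_train_cancer)
print(pipe.score(X_test_cancer, y_test_cancer))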