Feature Selection with SelectKBest in Scikit-Learn


In this post, you will learn how to do feature selection with SelectKBest in scikit-learn.

Why Do We Do Feature Selection?

Some of the reasons for doing feature selection are:

1. A more interpretable model

2. Faster training and prediction

3. Less storage needed for the model and data

How to Do Feature Selection with SelectKBest?

The SelectKBest method selects the features with the k highest scores, as computed by a scoring function. For regression problems we use scoring functions like f_regression, and for classification problems we use chi2 or f_classif (note that chi2 requires non-negative feature values).
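The pattern is the same for any scoring function: construct the selector with k and score_func, fit it on the training data, and transform. Here is a minimal sketch using f_classif, which is not demonstrated in the examples below; the iris dataset and variable names are just for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# toy classification data, just for illustration
X, y = load_iris(return_X_y=True)

# keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (150, 2)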

SelectKBest for Regression –

Let’s first look at the regression problems.

from sklearn import datasets
import pandas as pd

# regression dataset
housing = datasets.fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target  # median house value, in units of $100,000
y_housing

output -
array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
# split data into training and test set
from sklearn.model_selection import train_test_split

X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42)

Let’s say we want to keep only the 4 most informative features out of the 8 in this dataset.

from sklearn.feature_selection import SelectKBest, f_regression

# keep the 4 features with the highest F-scores
select_reg = SelectKBest(k=4, score_func=f_regression)
select_reg.fit(X_train_housing, y_train_housing)
X_train_housing_new = select_reg.transform(X_train_housing)
X_train_housing_new.shape

output -
(16512, 4)
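As a quick sanity check, we can compare a model trained on all 8 features against one trained on the 4 selected features. This is a minimal sketch; LinearRegression is just one choice of estimator here, and the variable names follow the code above.

from sklearn.linear_model import LinearRegression

# apply the selector fitted on the training set to the test set
X_test_housing_new = select_reg.transform(X_test_housing)

# fit one model on all 8 features and one on the 4 selected features
lr_all = LinearRegression().fit(X_train_housing, y_train_housing)
lr_sel = LinearRegression().fit(X_train_housing_new, y_train_housing)

# compare R^2 on the held-out test set
print("R^2, all features:", lr_all.score(X_test_housing, y_test_housing))
print("R^2, 4 features:", lr_sel.score(X_test_housing_new, y_test_housing))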

To see which features were kept by SelectKBest, we can use the get_support() method, which returns a boolean mask over the columns.

kept_features = pd.DataFrame({'column': X_train_housing.columns,
                              'kept': select_reg.get_support()})
kept_features

To get a DataFrame containing only the selected columns, you can use the boolean mask directly with iloc.

new_df = X_train_housing.iloc[:,select_reg.get_support()]
new_df
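In recent scikit-learn versions, the selector can also report the selected column names directly; a short sketch, assuming the fitted select_reg from above:

# column names of the selected features (requires scikit-learn >= 1.0);
# works without arguments because the selector was fitted on a DataFrame
select_reg.get_feature_names_out()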

SelectKBest for Classification –

# classification dataset
cancer = datasets.load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target

# split data in training and test set
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42)

Let’s say we want to keep only the 10 most informative features. Since all of the features in this dataset are non-negative, chi2 is a valid choice of scoring function.

from sklearn.feature_selection import SelectKBest, chi2

select_class = SelectKBest(k=10, score_func=chi2)
select_class.fit(X_train_cancer, y_train_cancer)
X_train_cancer_new = select_class.transform(X_train_cancer)
print("Num Features before:", X_train_cancer.shape[1])
print("Num Features after:", X_train_cancer_new.shape[1])

output -
Num Features before: 30
Num Features after: 10

new_cancer_df = X_train_cancer.iloc[:, select_class.get_support()]
new_cancer_df
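To see how informative each feature is, we can inspect the scores_ attribute of the fitted selector; a short sketch using the fitted select_class from above:

# chi-squared score for each of the 30 original features,
# sorted from most informative to least
scores = pd.Series(select_class.scores_, index=X_train_cancer.columns)
scores.sort_values(ascending=False)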
