Feature Selection with Recursive Feature Elimination (RFECV)

How does recursive feature elimination (RFE) work?

In recursive feature elimination, we train a model repeatedly, and on each iteration we remove the least important feature, as determined by the model's coef_ or feature_importances_ attribute. We keep repeating this process until removing another feature would make performance worse. At the end, we are left with only the most important features.
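As a rough sketch, a single elimination round looks like this (Xt, yt, and worst are just illustrative names; scikit-learn's RFE/RFECV automate the full loop):

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Toy data just for this sketch (the article's dataset is loaded properly below)
data = load_breast_cancer()
Xt = pd.DataFrame(data.data, columns=data.feature_names)
yt = data.target

# One elimination round: fit, find the weakest feature, drop it
rf = RandomForestClassifier(random_state=42).fit(Xt, yt)
worst = Xt.columns[np.argmin(rf.feature_importances_)]
Xt = Xt.drop(columns=[worst])
# Full RFE repeats this until performance stops improving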

How to do recursive feature elimination?

Let’s read a dataset to work with.

import pandas as pd
from sklearn import datasets

# classification dataset
cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target

Now let’s apply recursive feature elimination with cross-validation (RFECV) in scikit-learn.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# create a random forest model
rf = RandomForestClassifier(random_state=42)

# Recursively eliminate features with cross validation
rfecv = RFECV(estimator=rf, cv=5, scoring='accuracy')
rfecv.fit(X, y)
X_new = rfecv.transform(X)
print("Num Features Before:", X.shape[1])
print("Num Features After:", X_new.shape[1])

Output:
Num Features Before: 30
Num Features After: 16
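RFECV also records the cross-validated score at each feature count, so we can check how many features it settled on and plot the curve if we like. A short sketch (the cv_results_ key below assumes scikit-learn >= 1.0; older versions exposed grid_scores_ instead):

import matplotlib.pyplot as plt

print("Optimal number of features:", rfecv.n_features_)

# Mean CV accuracy for each number of features tried (scikit-learn >= 1.0)
scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(scores) + 1), scores)
plt.xlabel("Number of features selected")
plt.ylabel("Mean cross-validated accuracy")
plt.show()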

To see which features were kept, we can use the support_ attribute, which is a boolean mask over the columns.

features_kept = pd.DataFrame({'columns': X.columns,
                              'Kept': rfecv.support_})
features_kept
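If we only want the names of the kept features, indexing the columns with the same mask works too:

# Just the names of the selected features
print(list(X.columns[rfecv.support_]))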

We can also create a new DataFrame containing only the selected features.

X_new_df = X.iloc[:, rfecv.support_]
X_new_df.head()
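As a quick sanity check (optional, but worth knowing), this boolean-mask selection should hold exactly the same values that transform() returned earlier:

import numpy as np

# X_new (from rfecv.transform) and X_new_df select the same columns
assert np.array_equal(X_new, X_new_df.values)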

To see each feature's ranking from best (1) to worst, we can use the ranking_ attribute. Note that all selected features share rank 1.

rfecv.ranking_
Output:
array([ 1,  1,  1,  1,  4,  3,  1,  1, 14, 15,  1, 13,  6,  1,  9,  8,  7,
       11, 10, 12,  1,  1,  1,  1,  1,  1,  1,  1,  2,  5])
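To read this more easily, we can pair each rank with its feature name (features_ranked is just an illustrative name):

# Features sorted from selected (rank 1) to first eliminated
features_ranked = pd.DataFrame({'columns': X.columns,
                                'Rank': rfecv.ranking_})
features_ranked.sort_values('Rank')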
