How to Select Important Features of a Random Forest Model?


In our previous post, we learned how to identify the important features of a random forest model. In this post, we will learn how to select those features and train a model on them.

Selecting Important Features of a Random Forest Model –

Sometimes we may not want to include all the available features in a model. Maybe we want to reduce the model’s variance, or we want to improve interpretability by keeping only the most important features.

In Scikit-Learn, we can use a simple two-stage workflow to create a model with fewer features. In the first stage, we train a random forest model on all the features and use it to identify the most important ones. In the second stage, we build a new feature matrix that includes only those features and train a new model on it. The SelectFromModel class from sklearn.feature_selection handles the selection step: it fits the estimator we pass to it and keeps only the features whose importance is greater than or equal to a threshold value.
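To make the threshold rule concrete, here is a minimal sketch on a synthetic dataset. The data from make_classification, the variable names, and the 0.04 threshold are illustrative assumptions, not part of the walkthrough that follows. It checks that the features SelectFromModel keeps are the ones whose importance is at least the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic data used only for illustration (an assumption, not the post's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# the selector fits its own copy of the forest on all 12 features
demo_selector = SelectFromModel(
    RandomForestClassifier(random_state=0), threshold=0.04
)
demo_selector.fit(X_demo, y_demo)

# importances of the fitted forest; for a random forest they sum to 1
importances = demo_selector.estimator_.feature_importances_

# a feature is kept when its importance is >= the threshold
manual_mask = importances >= demo_selector.threshold_

print(np.round(importances, 3))
print(manual_mask)
print(demo_selector.get_support())  # same boolean mask as manual_mask
```

Because the importances of a random forest sum to 1, a fixed threshold such as 0.04 means something different for 12 features than for 100. SelectFromModel also accepts string thresholds such as "mean" or "median", which scale with the number of features.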

Let’s read a dataset to illustrate it.

import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()

Split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now let’s build a random forest model, select the important features, and then train a new model using only those features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# create a random forest classifier
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# create object that selects features with importance greater 
# than or equal to a threshold. Please choose your own threshold.
selector = SelectFromModel(rf, threshold=0.04)

# fit the selector on the training set and build the reduced feature matrices
# (SelectFromModel fits its own clone of rf, so rf itself is not fitted yet)
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)

# train random forest using only the most important features
rf.fit(X_train_new, y_train)

# make predictions on test set
y_pred = rf.predict(X_test_new)

# measure accuracy
accuracy_score(y_test, y_pred)
# output
# 0.956140350877193

# number of features before feature selection
X_train.shape[1]
# output
# 30

# number of features after feature selection
X_train_new.shape[1]
# output
# 9
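
It is often useful to see which features were kept and how the reduced model compares with one trained on all features. The sketch below is one way to do that, under a stated assumption: it uses scikit-learn's bundled copy of the breast cancer data (load_breast_cancer) instead of the CSV above, so the column names and exact scores may differ from the ones shown in this post. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# assumption: scikit-learn's bundled breast cancer data, not the CSV used above
data = load_breast_cancer(as_frame=True)
X_all, y_all = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# fit the selector on the training set only
sel = SelectFromModel(RandomForestClassifier(random_state=42), threshold=0.04)
sel.fit(X_tr, y_tr)

# get_support() returns a boolean mask over the original columns
selected_names = list(X_tr.columns[sel.get_support()])
print(selected_names)

# baseline: random forest trained on all features
full_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
full_acc = accuracy_score(y_te, full_model.predict(X_te))

# reduced: random forest trained only on the selected features
reduced_model = RandomForestClassifier(random_state=42)
reduced_model.fit(sel.transform(X_tr), y_tr)
reduced_acc = accuracy_score(y_te, reduced_model.predict(sel.transform(X_te)))

print(f"all features: {full_acc:.3f}, selected features: {reduced_acc:.3f}")
```

Comparing the two scores shows how much accuracy, if any, is traded for the smaller feature set. Fitting the selector on the training split only keeps information from the test set out of the selection step.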
