In a random forest, each decision tree is trained on a bootstrapped sample of the observations. This means that for every tree there is a separate subset of observations that was not used to train it. These are called out-of-bag (OOB) observations, and we can use them as a built-in test set to evaluate the performance of our random forest.
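To make the idea concrete, here is a small illustrative numpy sketch (not part of the example below): a bootstrap sample draws rows with replacement, and the rows that are never drawn, roughly a third of them, form that tree's OOB set.

import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of observations

# a bootstrap sample: n row indices drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)

# rows never drawn are "out-of-bag" for this tree
oob_mask = ~np.isin(np.arange(n), bootstrap_idx)
print(oob_mask.mean())  # roughly 0.37, i.e. about a third of the rows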
For every observation, the learning algorithm compares the observation's true value with the prediction obtained from the subset of trees that were not trained on that observation. Aggregating these comparisons yields a single overall score that measures the random forest's performance. OOB score estimation is therefore an alternative to cross-validation.
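To see what "alternative to cross-validation" means in practice, here is a rough sketch, assuming scikit-learn's built-in breast cancer data rather than the CSV we load below: the OOB score and a cross-validated accuracy usually land close to each other, but the OOB estimate needs only a single fit.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(random_state=42, oob_score=True)
forest.fit(X_bc, y_bc)

# OOB estimate comes "for free" from the bagging process
print(forest.oob_score_)

# 5-fold cross-validation gives a comparable estimate,
# but requires training the forest five more times
print(cross_val_score(forest, X_bc, y_bc, cv=5).mean())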
In Scikit-learn, we can get the OOB score of a random forest by setting oob_score=True when creating the estimator (e.g. RandomForestClassifier). After fitting, the score is available in the oob_score_ attribute.
Let’s read in a dataset to illustrate this.
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()
Next, split the data into a training set and a test set.
from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Now train a random forest classifier with oob_score=True.
from sklearn.ensemble import RandomForestClassifier

# create a random forest object with OOB scoring enabled
rf = RandomForestClassifier(random_state=42, oob_score=True)

# train it on the training set
rf.fit(X_train, y_train)

# get the oob score
rf.oob_score_
# output
0.9516483516483516
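As a quick sanity check, reusing the train/test split from above, we can compare the OOB estimate with the accuracy on the held-out test set; if the OOB score is doing its job, the two numbers should be close.

# accuracy on the held-out test set, for comparison with the OOB score
rf.score(X_test, y_test)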
Related Posts –
- A Gentle Introduction to Random Forest in Machine Learning
- How to Train a Random Forest Regressor in Sklearn?
- How to Identify Important Features of a Random Forest Model?
- How to Select Important Features of a Random Forest Model?