A Gentle Introduction to Random Forest in Machine Learning

What is Random Forest in Machine Learning?

Random Forest is an ensemble method that trains multiple Decision Trees, each on a different random subset of the training set. To make a prediction, we collect the prediction from each individual tree and output the class that gets the most votes.

How does the Random Forest algorithm work in Machine Learning?

Step 1 – Create a Bootstrapped dataset

To create a Random Forest, we first create a bootstrapped dataset that is the same size as the original dataset: we randomly select samples from the original dataset, and we are allowed to pick the same sample more than once.
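
To make this concrete, here is a minimal sketch of bootstrap sampling with pandas. The small DataFrame and its column names are purely hypothetical example data.

import pandas as pd

# Hypothetical original training data
df = pd.DataFrame({
    "feature1": [5.1, 4.9, 6.3, 5.8, 7.1],
    "feature2": [3.5, 3.0, 2.9, 2.7, 3.0],
    "label":    [0, 0, 1, 1, 1],
})

# Sample rows with replacement until the bootstrapped set
# is the same size as the original dataset
bootstrapped = df.sample(n=len(df), replace=True, random_state=42)
print(bootstrapped)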

Step 2 – Create a Decision Tree

Next, we build a Decision Tree using the bootstrapped dataset, but we only consider a random subset of features at each split.

Suppose we have 4 features. Instead of considering all 4 features to decide how to split the root node, we randomly select only 2 of them, say feature1 and feature2. We then check how well each of these candidate features separates the samples and pick the one that performs best; say it is feature1. At the next node, we again randomly select 2 features out of the remaining 3 to split the data, and we build the rest of the tree the same way, considering only a random subset of features at each step.
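
As a rough illustration, here is how a random subset of features might be drawn for a single split. The feature names and the subset size of 2 mirror the example above and are purely illustrative.

import numpy as np

rng = np.random.default_rng(42)
features = ["feature1", "feature2", "feature3", "feature4"]

# At each split, consider only a random subset (here 2 of the 4 features)
candidates = rng.choice(features, size=2, replace=False)
print(candidates)  # prints 2 randomly chosen feature names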

Step 3 – Repeat step 1 and step 2 to build more Decision Trees

In step 3, we repeat steps 1 and 2 to build more decision trees, each time making a new bootstrapped dataset and selecting only a random subset of features at each step. By default, we do this 100 times. Building trees this way gives us a wide variety of decision trees, and this variety is what makes a Random Forest more effective than an individual decision tree.
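
To make the loop explicit, here is a simplified, hypothetical sketch of growing a forest by hand, assuming NumPy arrays X and y and using scikit-learn's DecisionTreeClassifier as the base learner. A real RandomForestClassifier handles all of this internally.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, random_state=42):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of the rows (with replacement)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: a tree that considers a random subset of features at each split
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees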

Step 4 – Making Predictions

Now, when we want to make a prediction, we run the data down the first decision tree and record its prediction. Then we run the same data down the second tree and record its prediction. We do this for every decision tree. After running the data through all the trees in the Random Forest, we pick the class that receives the most votes.
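
Continuing the hypothetical sketch above, majority voting over the individual trees could look like this (it assumes the grow_forest helper and NumPy input):

import numpy as np

def predict_forest(trees, X_new):
    # Collect one prediction per tree: shape (n_trees, n_samples)
    all_preds = np.array([tree.predict(X_new) for tree in trees])
    votes = []
    for col in all_preds.T:
        # For each sample, pick the class that receives the most votes
        values, counts = np.unique(col, return_counts=True)
        votes.append(values[np.argmax(counts)])
    return np.array(votes)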

What are Out-of-Bag Samples?

When we build a Random Forest, typically about one-third of the original samples do not end up in a given bootstrapped dataset. These left-out samples are called out-of-bag samples. Since the out-of-bag data was not used to build that tree, we can use it to test our model. The proportion of out-of-bag samples that are incorrectly classified is called the out-of-bag error.
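
In scikit-learn, you can ask for the out-of-bag score directly when building the forest. The sketch below assumes the X_train and y_train created in the training section further down; the out-of-bag error is simply 1 minus the reported score.

from sklearn.ensemble import RandomForestClassifier

# request out-of-bag evaluation while training
rf_oob = RandomForestClassifier(oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)

# accuracy on the out-of-bag samples
print(rf_oob.oob_score_)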

How to Train a Random Forest Classifier Model in Sklearn?

Let’s read a dataset to work with.

import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()

Next, split the data into training and test sets.

from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now, train a Random Forest Model and measure the accuracy.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# create a Random Forest Model
rf = RandomForestClassifier(random_state=42)
# train it on the training data
rf.fit(X_train, y_train)
# make predictions on the test set
y_pred = rf.predict(X_test)
# measure accuracy
accuracy_score(y_test, y_pred)
# output
0.9649122807017544
