What is a Decision Tree and How Does it Work?
A decision tree asks a series of yes/no questions to make a decision. In general, each node of a decision tree makes a statement and then chooses a branch based on whether that statement is true or false.
When a decision tree classifies things into categories, it is called a classification tree. When it predicts numerical values, it is called a regression tree.
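To make the idea concrete, here is a minimal hand-written sketch of a tiny classification tree expressed as nested yes/no questions. The feature names (`age`, `exercises`) and the threshold are made up for illustration and do not come from any dataset.

```python
def classify(age, exercises):
    """A hand-written decision tree with two yes/no questions.

    The features and the threshold are hypothetical; they only illustrate
    how a statement is tested and a branch is followed.
    """
    # Root node: is the statement "age < 30" true?
    if age < 30:
        return "low risk"
    # Internal node: is the statement "exercises regularly" true?
    if exercises:
        return "low risk"
    # Both statements were false, so we end up in the last leaf.
    return "high risk"


print(classify(25, False))  # low risk
print(classify(50, True))   # low risk
print(classify(50, False))  # high risk
```

A trained decision tree has the same shape; the difference is that the questions and thresholds are learned from the training data instead of being written by hand.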
To build a decision tree, we first have to decide which feature to use to split the data into two parts. To do that, we select a feature from our training set and build a decision tree using only this feature to see how well it performs the task.
We repeat this process with all the other features to check how well each of them performs.
Now we need a way to evaluate each of these decision trees. To do so, we can use an impurity measure such as Gini impurity (also called the Gini index) or entropy; information gain is the reduction in entropy that a split achieves.
Let’s see how Gini Impurity works.
Gini Impurity –
The Gini impurity measures the impurity of a node. A node is pure (Gini = 0) if all training instances it applies to belong to the same class.
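To calculate the Gini impurity of each of these decision trees, we start by calculating the Gini impurity of a single leaf.
The Gini impurity of a leaf can be calculated using the formula
$G = 1 - \sum_{k=1}^{K} p_k^2$
where $p_k$ is the fraction of the training instances in the leaf that belong to class $k$. For two classes, this is 1 minus the squared probability of "Yes" minus the squared probability of "No".
We apply this formula to the left leaf and, similarly, to the right leaf.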
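To calculate the total Gini impurity of the split, we use the weighted average of the Gini impurities of the two leaves, because the leaves generally do not contain the same number of instances.
The formula for total Gini impurity is
$G_{\text{total}} = \frac{n_{\text{left}}}{n} G_{\text{left}} + \frac{n_{\text{right}}}{n} G_{\text{right}}$
where $G_{\text{left}}$ and $G_{\text{right}}$ are the Gini impurities of the left and right leaves, $n_{\text{left}}$ and $n_{\text{right}}$ are the numbers of instances they contain, and $n = n_{\text{left}} + n_{\text{right}}$.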
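As a rough illustration of the leaf and total Gini calculations described above, here is a small sketch in plain Python. The function names `gini_leaf` and `gini_split` and the class counts are hypothetical, chosen only for this example.

```python
def gini_leaf(counts):
    """Gini impurity of one leaf, given the number of instances per class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # 1 minus the sum of squared class fractions
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(left_counts, right_counts):
    """Weighted average of the two leaf impurities, weighted by leaf size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_leaf(left_counts) + (n_right / n) * gini_leaf(right_counts)


# Hypothetical class counts [yes, no] in each leaf
left = [6, 0]   # pure leaf: every instance belongs to the same class
right = [1, 1]  # evenly mixed leaf

print(gini_leaf(left))          # 0.0
print(gini_leaf(right))         # 0.5
print(gini_split(left, right))  # 0.125
```

The larger, pure leaf dominates the weighted average, which is why the total impurity (0.125) sits much closer to 0 than to 0.5.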
Once we have calculated the Gini impurity of the decision tree built on one particular feature, we can do the same for all the other features.
In the end, we will have one Gini impurity value for each feature. We select the feature with the lowest Gini impurity to split the data into two parts at the top of the decision tree.
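The feature selection step can be sketched as follows. This is a simplified illustration that assumes every feature is already a yes/no value; the toy rows, the feature names (`has_fever`, `has_cough`), and the helper names are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after splitting on one yes/no feature."""
    left = [y for row, y in zip(rows, labels) if row[feature]]
    right = [y for row, y in zip(rows, labels) if not row[feature]]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy training set
rows = [
    {"has_fever": True, "has_cough": True},
    {"has_fever": True, "has_cough": False},
    {"has_fever": True, "has_cough": True},
    {"has_fever": False, "has_cough": True},
    {"has_fever": False, "has_cough": False},
    {"has_fever": False, "has_cough": False},
]
labels = ["sick", "sick", "sick", "healthy", "healthy", "healthy"]

# One impurity value per feature; the lowest one wins the top split
scores = {feature: split_impurity(rows, labels, feature) for feature in rows[0]}
best_feature = min(scores, key=scores.get)

print(scores)
print(best_feature)  # has_fever
```

Here `has_fever` separates the two classes perfectly, so its weighted impurity is 0 and it is chosen for the top of the tree.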
We then repeat the same process on each part to grow the tree. We stop once we reach the maximum depth of the tree, which is defined by the max_depth hyperparameter, or when we cannot find a split that reduces impurity.
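The growing procedure and its two stopping rules can be sketched as a short recursive function. This is a simplified illustration that assumes yes/no features only, not scikit-learn's actual implementation (which also searches thresholds on numeric features); `build_tree`, `predict`, and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def build_tree(rows, labels, max_depth, depth=0):
    """Recursively grow a tree; returns a nested dict or a class label (leaf)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rule 1: the maximum depth has been reached.
    if depth >= max_depth:
        return majority
    best = None  # (weighted impurity, feature)
    for feature in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[feature]]
        right = [y for r, y in zip(rows, labels) if not r[feature]]
        if not left or not right:
            continue  # this feature does not separate the instances
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature)
    # Stopping rule 2: no split reduces the impurity of this node.
    if best is None or best[0] >= gini(labels):
        return majority
    feature = best[1]
    yes = [(r, y) for r, y in zip(rows, labels) if r[feature]]
    no = [(r, y) for r, y in zip(rows, labels) if not r[feature]]
    return {
        "feature": feature,
        "yes": build_tree([r for r, _ in yes], [y for _, y in yes], max_depth, depth + 1),
        "no": build_tree([r for r, _ in no], [y for _, y in no], max_depth, depth + 1),
    }


def predict(tree, row):
    """Follow the yes/no answers down the tree until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] else tree["no"]
    return tree


# Hypothetical toy data: "sick" only when both symptoms are present
rows = [
    {"has_fever": True, "has_cough": True},
    {"has_fever": True, "has_cough": True},
    {"has_fever": True, "has_cough": False},
    {"has_fever": False, "has_cough": True},
    {"has_fever": False, "has_cough": False},
    {"has_fever": False, "has_cough": False},
]
labels = ["sick", "sick", "healthy", "healthy", "healthy", "healthy"]

shallow = build_tree(rows, labels, max_depth=1)
deep = build_tree(rows, labels, max_depth=2)
print(shallow)
print(deep)
print(predict(deep, {"has_fever": True, "has_cough": False}))  # healthy
```

With `max_depth=1` the tree stops after a single split and misclassifies the fever-without-cough case, while `max_depth=2` lets it add the second question it needs.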
How to Train a Decision Tree Classifier in sklearn?
Let’s read a dataset to work with.
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()
Now split the data into a training and a test set.
from sklearn.model_selection import train_test_split

# separate the features from the target column
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now, let’s train a decision tree classifier.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# create a Decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# fit it on the training data
clf.fit(X_train, y_train)

# predict on the test set
y_pred = clf.predict(X_test)

# measure accuracy
accuracy_score(y_test, y_pred)

# output
0.9473684210526315