
In this post, you will learn –
1. What is a Confusion Matrix?
2. Train a Model
3. How to Plot a Confusion Matrix in Python?
4. How to Interpret a Confusion Matrix?
- True Negative (TN)
- True Positive (TP)
- False Negative (FN)
- False Positive (FP)
1. What is a Confusion Matrix?
A confusion matrix helps us understand the performance of a classifier using a table. The rows of the confusion matrix represent the actual labels and the columns represent the predicted labels, or vice versa.
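For example, here is a minimal sketch showing the table that scikit-learn's confusion_matrix function produces for a binary problem. The two label lists below are made up purely for illustration:
from sklearn.metrics import confusion_matrix

# made-up actual and predicted labels, purely for illustration
y_actual = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

# rows = actual labels, columns = predicted labels
print(confusion_matrix(y_actual, y_predicted))
# [[3 1]
#  [1 3]]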
2. Train a Model
Read Data –
Let’s first read the data and try to understand the overall goal of our model. What are we trying to achieve?
# setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
pd.options.display.max_columns = 999
%load_ext nb_black
%matplotlib inline
# read data in pandas dataframe
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv"
df = pd.read_csv(url)
df.head()

We have a dataset related to breast cancer patients. In this dataset, some patients are sick and some are healthy, and our job is to build a model that can tell the difference using the data. We want our model to predict which patient is sick and which is healthy as accurately as possible.
Our target column is the diagnosis column.
df["diagnosis"].value_counts(normalize=True).round(2)

diag_prop = (
    df["diagnosis"]
    .value_counts(normalize=True)
    .round(2)
    .rename_axis("diagnosis")        # name the index before resetting it
    .reset_index(name="proportion")  # works across pandas versions
)
fig = go.Figure()
fig.add_trace(go.Bar(x=diag_prop["diagnosis"], y=diag_prop["proportion"]))
fig.update_layout(
    title="Healthy(B) vs Sick(M)", xaxis_title="Diagnosed", yaxis_title="Proportion"
)
fig.show()

B – B stands for Benign, which means the cell is non-cancerous. The patient is not sick.
M – M stands for Malignant, which means the cell is cancerous. The patient is sick.
So, 63% of the patients in our dataset are healthy and 37% of them are sick.
Let’s map Malignant and Benign to 1s and 0s to reduce confusion. All the observations related to sick patients (M) will be 1, which is our positive class, and all healthy patients (B) will be 0, which is our negative class.
values = {"B": 0, "M": 1}
df["diagnosis"] = df["diagnosis"].map(values)
To build the model, we will use the Random Forest classifier. If you want, you can use any other classification algorithm. For simplicity, we will not compare this algorithm's performance with other algorithms, nor do any kind of hyperparameter tuning. We will look into these in upcoming posts, so make sure to subscribe to our blog.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
# split the data into training and test set
X = df.drop("diagnosis", axis=1).copy()
y = df["diagnosis"].copy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=26
)
# initiate an rf classifier using a pipeline
clf = make_pipeline(
    SimpleImputer(strategy="mean"), RandomForestClassifier(random_state=26)
)
# train the classifier on training data
clf.fit(X_train, y_train)
# make predictions on test data
pred = clf.predict(X_test)
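Before plotting anything, it can help to sanity-check the model with a single overall accuracy number. This is a quick check, not a full evaluation; accuracy_score is scikit-learn's standard helper for it:
from sklearn.metrics import accuracy_score

# fraction of test samples classified correctly
# (from the counts discussed below: (111 + 55) / 171 ≈ 0.97)
print(accuracy_score(y_test, pred))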
3. How to Plot a Confusion Matrix in Python?
We have trained our model and made predictions on the test set. Now, we can plot the confusion matrix to understand the performance of this model.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# create confusion matrix from predictions
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test, pred, labels=clf.classes_, ax=ax, colorbar=False
)
plt.show()
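If you just want the underlying counts rather than a plot, the same numbers are available as a NumPy array from the confusion_matrix function imported above:
# raw confusion matrix counts (rows = actual labels, columns = predicted labels)
cm = confusion_matrix(y_test, pred)
print(cm)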

If you want, you can also customize the labels to make the plot easier to understand by passing them to the display_labels parameter.
# create confusion matrix from predictions
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_test,
    pred,
    display_labels=["Healthy", "Sick"],
    ax=ax,
    colorbar=False,
)
plt.savefig("cm_plot", dpi=300) # save the plot
plt.show()

You can also create a confusion matrix using the from_estimator method instead of from_predictions. Here, you do not have to make the predictions separately before plotting the confusion matrix; scikit-learn will take care of that.
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(
    clf, X_test, y_test, display_labels=["Healthy", "Sick"], ax=ax, colorbar=False
)
plt.show()

4. How to Interpret a Confusion Matrix?
Now, let’s understand how to interpret a confusion matrix.

The rows of the confusion matrix represent the actual labels and the columns represent the predicted labels. The diagonal from top-left to bottom-right (the green boxes) shows the correctly classified samples, and the red boxes show the incorrectly classified samples.
1. True Negative (TN) –

If your model accurately predicts the negative class, then it is a True Negative.
To make things easier, you can split the term into two parts. The second part (Negative) says what the model predicted, and the first part (True/False) says whether that prediction matches reality: True means the model predicted the class correctly, and False means the model made a mistake.
Here, the model predicted that 111 patients are negative, i.e. they are healthy, and in reality these patients are healthy. So, the prediction of the negative class is correct, i.e. a True Negative.
2. True Positive (TP) –

If your model accurately predicts the positive class, then it is a True Positive.
Again, you can break the term into two parts: the second part is what the model predicted, and the first part says whether that prediction matches reality.
Here, the model predicted that 55 patients are positive, i.e. they are sick, and in reality these patients are sick. So, the prediction of the positive class is correct, i.e. a True Positive.
3. False Negative (FN) –

If your model incorrectly predicts the negative class, then it is a False Negative (FN).
In this example, the model predicted that 2 patients are negative, i.e. they are healthy, but in reality these patients are sick. So, the prediction of the negative class is incorrect, i.e. a False Negative.
4. False Positive (FP) –

If your model incorrectly predicts the positive class, then it is a False Positive (FP).
Here, the model predicted that 3 patients are positive, i.e. they are sick, but in reality these patients are healthy. So, the prediction of the positive class is incorrect, i.e. a False Positive.
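To pull these four numbers out of the matrix programmatically, here is a small sketch using scikit-learn's documented ravel() ordering for binary labels (TN, FP, FN, TP):
# unpack the four cells of the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# with the model above: TN=111, FP=3, FN=2, TP=55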
Comparing different classifiers using confusion matrices is easy when there is a big gap in performance between them. However, it becomes difficult when they have similar performance. So, in our next post, we will learn how to use Precision and Recall to compare different classifiers, and also the trade-off between them.
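As a small preview, both metrics follow directly from the four counts unpacked above: precision = TP / (TP + FP) and recall = TP / (TP + FN).
# precision: of all patients predicted sick, how many really are sick?
precision = tp / (tp + fp)  # 55 / (55 + 3) ≈ 0.95
# recall: of all truly sick patients, how many did the model catch?
recall = tp / (tp + fn)  # 55 / (55 + 2) ≈ 0.96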
I hope you liked this post. If you found it helpful, please share it with others and subscribe to our blog below.
Related Posts –
1. What is Precision, Recall and the Trade-off?