What are Precision, Recall and the Trade-off?

In the previous post, we learned about the Confusion Matrix. In this post, we will learn how to use Precision and Recall to compare the performance of classifiers. If you have not read the previous post, please read it first, because Precision and Recall are both defined in terms of the confusion matrix. Assuming you have read it, let’s get started.

In this post, you will learn –

1 . What is Precision ?

2 . What is Recall ?

3 . What is Precision/Recall Trade-off ?

4 . When to use Precision and When to use Recall ?

Related Post –

1 . Confusion Matrix – How to plot and Interpret Confusion Matrix.

Read Data –

import pandas as pd
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt

%matplotlib inline
# read data in pandas dataframe
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv"
df = pd.read_csv(url)
values = {"B": 0, "M": 1}
df["diagnosis"] = df["diagnosis"].map(values)

Here, we have data about patients, of whom 37% are sick (malignant, encoded as 1) and 63% are healthy (benign, encoded as 0). Our job is to build a model that predicts, as accurately as possible, which patients are sick and which are healthy.

Train A Model –

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# split the data into training and test set
X = df.drop("diagnosis", axis=1).copy()
y = df["diagnosis"].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=26
)

# initiate an rf classifier using a pipeline
clf = make_pipeline(
    SimpleImputer(strategy="mean"), RandomForestClassifier(random_state=26)
)

# train the classifier on training data
clf.fit(X_train, y_train)

# make predictions on test data
pred = clf.predict(X_test)

Plot a confusion Matrix –

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

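# create confusion matrix from predictions
cm = confusion_matrix(y_test, pred)
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Healthy", "Sick"])
disp.plot(ax=ax)  # draw the matrix on the axes created above
plt.savefig("cm_plot", dpi=300)  # save the plot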

1 . What is Precision ?

Precision is the accuracy of the positive predictions. It tells you, out of all the positive predictions the model made, how many were actually correct.

Precision = True Positive / (True Positive + False Positive)

Precision = 55 / (55 + 3) = 94.83 %

from sklearn.metrics import precision_score
precision_score(y_test, pred)
out - 0.9482758620689655

One thing to note about precision is that it only looks at the positive predictions we made and whether they were correct. Suppose we have 10 sick patients and the model predicts that only 2 of them are sick, and both really are sick. Then this model has 100% precision (2 / (2 + 0) = 1).

Now, the question is – Is this a good model?

Even though the model has 100% precision, it might not be a good model, because it focuses only on the 2 patients it identified correctly and neglects all the other sick patients. Precision says: if I made a positive prediction and it was correct, I am done. I don’t care what happened to the other 8 patients; that is none of my business. I am only interested in the correctness of my positive predictions. That is not a good thing here, so we also have another metric called recall.

2 . What is Recall ?

Recall is the ratio of positive instances that are correctly detected by the classifier. It tells you, out of all the actual positives (sick patients), how many the model identified correctly.

Recall = True Positive / (True Positive + False Negative )

Recall = 55 / (55 + 2) = 96.49 %

from sklearn.metrics import recall_score
recall_score(y_test, pred)
out - 0.9649122807017544
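As a quick sanity check, both formulas can be applied by hand to the counts used above (55 true positives, 3 false positives, 2 false negatives). This is only a minimal sketch; the helper names `precision` and `recall` are made up for this illustration and are not part of scikit-learn.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the classifier found."""
    return tp / (tp + fn)


# Counts taken from the worked formulas in this post
tp, fp, fn = 55, 3, 2

model_precision = precision(tp, fp)
model_recall = recall(tp, fn)

print(f"precision = {model_precision:.4f}")
print(f"recall    = {model_recall:.4f}")
```

The two values should agree with what `precision_score` and `recall_score` returned for the test set.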

Here, recall cares about correctly identifying all 10 sick patients. If we correctly predicted only 2 of them, the recall is 2 / (2 + 8) = 0.2, which says that out of 10 sick patients the classifier was able to identify only 2. That is not a very good result. Recall looks at every member of the positive class (sick patients) and asks how many of them we were able to identify correctly.
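The 10-patient example can be reproduced with plain Python lists to see the two metrics disagree. This is a sketch under one assumption: the 5 healthy patients are added here only so that the data contains both classes; they do not come from the example itself.

```python
# 10 sick patients (label 1) followed by 5 healthy patients (label 0).
# The 5 healthy patients are an assumption added for illustration.
y_true = [1] * 10 + [0] * 5

# The model flags only the first 2 sick patients and nobody else.
y_pred = [1, 1] + [0] * 8 + [0] * 5

# Count the outcomes that precision and recall depend on
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

toy_precision = tp / (tp + fp)  # every positive prediction was right
toy_recall = tp / (tp + fn)     # but most sick patients were missed

print("TP, FP, FN:", tp, fp, fn)
print("precision:", toy_precision)
print("recall:", toy_recall)
```

Precision comes out perfect while recall is poor, which is why looking at precision alone would be misleading for this model.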

3 . What is Precision/Recall Trade-off ?

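The Precision/Recall Trade-off says that, for a given classifier, increasing precision usually decreases recall, and increasing recall usually decreases precision. A classifier typically produces a score for each instance and predicts the positive class when that score is above a threshold. Raising the threshold makes positive predictions more selective, which tends to raise precision but misses more actual positives, so recall falls. Lowering the threshold does the opposite. By moving the threshold alone, we generally cannot increase both precision and recall at the same time.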
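To make the trade-off concrete, here is a small sketch that computes precision and recall at three decision thresholds. The labels and scores are hypothetical values invented for this illustration, not output from the model trained above, and `precision_recall_at` is a made-up helper name.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Convention used here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.3, 0.5, 0.7)}

for threshold, (p, r) in results.items():
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, precision rises and recall falls as the threshold goes up. For a real model, scikit-learn's `precision_recall_curve` performs the same kind of sweep over every threshold.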

4 . When to use Precision and when to use Recall?

In our case, if we predict that a patient has cancer but in reality he/she does not (a False Positive), it might not be that big a deal. The patient will feel nervous and scared and may have to go through another test to find out whether they actually have cancer. But now imagine that we predict that a patient does not have cancer when in reality he/she does (a False Negative). That could be catastrophic. So here we care more about False Negatives and want them to be as low as possible. When you care about False Negatives, you should focus on getting better Recall.
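One way to act on this, sketched below, is to lower the decision threshold until recall reaches a target. The data is hypothetical and `highest_threshold_for_recall` is a made-up helper, so treat this as an illustration rather than a recipe. With the pipeline trained earlier, the scores could come from `clf.predict_proba(X_test)[:, 1]`.

```python
def highest_threshold_for_recall(y_true, scores, target_recall):
    """Return the largest observed score that, used as a threshold, meets the target recall."""
    total_positives = sum(y_true)
    # Try thresholds from strictest to most lenient and stop at the first that qualifies
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
        if tp / total_positives >= target_recall:
            return threshold
    return None  # only reachable if target_recall is above 1.0


# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

catch_all = highest_threshold_for_recall(y_true, scores, target_recall=1.0)
catch_most = highest_threshold_for_recall(y_true, scores, target_recall=0.8)

print("threshold for recall >= 1.0:", catch_all)
print("threshold for recall >= 0.8:", catch_most)
```

Picking the highest threshold that still meets the recall target keeps False Positives as low as possible while catching the required share of sick patients.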

But if your objective is to reduce False Positives, then you should care about Precision. Suppose you are creating an email spam classifier. If you predict that an email is spam and it is not (False Positive), the user might miss an important email, because it will be sent to the spam folder instead of the inbox. But if you predict that an email is not spam and in reality it is (False Negative), the user just has to delete it or move it manually, which is a much smaller problem than the previous one.

So if you care about False Negatives, use Recall, and if you care about False Positives, use Precision. In the next post, we will talk about the F1 Score, which is the harmonic mean of precision and recall. So make sure to subscribe to our blog below.
