
The Area Under the Curve (AUC) of a Receiver Operating Characteristic (ROC) curve reduces a classifier’s ROC performance to a single number that summarizes how well it separates the two classes across all thresholds. It’s commonly used in Machine Learning to compare models and choose between them. This guide shows how to calculate the AUC in Python.
Part 1: Understanding AUC
Before diving into the computation of AUC, let’s understand what it is and why it’s used.
An ROC curve is a plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The y-axis represents the True Positive Rate (TPR), the fraction of actual positives that are predicted positive, TP / (TP + FN). The x-axis represents the False Positive Rate (FPR), the fraction of actual negatives that are predicted positive, FP / (FP + TN). The AUC measures the two-dimensional area underneath the whole ROC curve from (0,0) to (1,1), so it provides an aggregate measure of performance across all possible classification thresholds.
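To make the threshold sweep concrete, here is a minimal sketch on four made-up labels and scores (illustrative values, not taken from any dataset in this guide). It uses Scikit-learn’s roc_curve to get the (FPR, TPR) points, recomputes one of those points by hand at a threshold of 0.4, and integrates the curve with the trapezoidal rule via sklearn.metrics.auc.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy labels and scores, made up purely for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Recompute a single point by hand: predict positive when score >= 0.4
pred = y_score >= 0.4
tpr_manual = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)  # TP / (TP + FN)
fpr_manual = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)  # FP / (FP + TN)

# Area under the piecewise-linear ROC curve (trapezoidal rule)
roc_auc = auc(fpr, tpr)

print("FPR:", fpr)
print("TPR:", tpr)
print("Point at threshold 0.4:", (fpr_manual, tpr_manual))
print("AUC:", roc_auc)  # 0.75 for these four points
```

The hand-computed point (0.5, 0.5) is one of the points returned by roc_curve, which shows that each point on the curve is simply the TPR and FPR at one threshold.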
AUC ranges in value from 0 to 1. A model that ranks every positive example above every negative one has an AUC of 1, and a model that makes random predictions has an AUC of about 0.5. A value below 0.5 means the ranking is worse than random, which usually points to inverted labels or scores. Therefore, the higher the AUC, the better the model’s ability to distinguish between positive and negative classes.
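These reference values can be checked on synthetic data. The sketch below relies on a standard interpretation of AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper pairwise_auc is a hypothetical name introduced here for illustration, and the labels and scores are randomly generated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)   # synthetic binary labels

perfect_scores = y_true.astype(float)    # every positive scored above every negative
random_scores = rng.random(2000)         # scores unrelated to the labels


def pairwise_auc(y, s):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = s[y == 1]
    neg = s[y == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score gaps
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_random = roc_auc_score(y_true, random_scores)
auc_random_pairwise = pairwise_auc(y_true, random_scores)

print("Perfect ranking:", auc_perfect)
print("Random scores:  ", auc_random)
print("Pairwise check: ", auc_random_pairwise)
```

The pairwise helper compares every positive with every negative, so it is only practical for small samples. It is here to show what the number means, not as a replacement for roc_auc_score.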
Part 2: Computing AUC in Python
Having understood the theory behind AUC, let’s calculate the AUC for a machine learning model in Python. We will use the Breast Cancer Wisconsin dataset from Scikit-learn for this purpose.
Step 1: Import Necessary Libraries
First, we need to import the necessary libraries.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
Step 2: Load and Preprocess Data
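We’ll load the Breast Cancer Wisconsin dataset, which comes with Scikit-learn. This dataset has 569 samples, 30 numeric features, and a binary target variable indicating whether the tumor is malignant (label 0) or benign (label 1). Note that in this encoding the positive class, label 1, is benign.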
# Load the dataset
data = load_breast_cancer()
# Extract features and target
X = data.data
y = data.target
We’ll split the data into training and testing sets. The model will be trained on the training set and evaluated on the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
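Before training, it can help to check what was loaded and how the split came out. The following sketch prints the class names and counts and then repeats the split with stratify=y. Stratification is an optional variation that is not part of the split above; it keeps the class proportions the same in the training and testing sets.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)              # (n_samples, n_features)
print(data.target_names)    # class names in label order: index 0, then index 1
print(np.bincount(y))       # number of samples per label

# Optional variation (an assumption, not used elsewhere in this guide):
# stratify=y preserves the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of label 1 in each subset
```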
Step 3: Train the Model
Let’s train a Logistic Regression model on our training data.
# Create and train the model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
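The large max_iter value is presumably there because the features are on very different scales, which slows the solver’s convergence. A common alternative, shown here only as a sketch and not as this guide’s method, is to standardize the features inside a pipeline. The name scaled_model is introduced for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied before the classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

# Number of solver iterations actually used by the logistic regression step
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
test_accuracy = scaled_model.score(X_test, y_test)

print("Solver iterations:", n_iter)
print("Test accuracy:", test_accuracy)
```

Because the scaler lives inside the pipeline, calling predict_proba on scaled_model applies the same scaling to the test data automatically.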
Step 4: Predict Probabilities and Calculate AUC
The ROC curve requires scores or probabilities for the positive class, not hard class labels, because the curve is traced by sweeping a threshold over those scores. Therefore, we use model.predict_proba() instead of model.predict(). The predict_proba() method returns one column of probabilities per class, in the order given by model.classes_. We slice out column 1, the probabilities of the positive class (label 1).
Then, we calculate the AUC using roc_auc_score.
# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Calculate AUC
auc = roc_auc_score(y_test, y_prob)
print(f'AUC: {auc:.4f}')
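As a cross-check, the same number can be obtained by building the ROC curve explicitly with roc_curve and integrating it with sklearn.metrics.auc. The self-contained sketch below repeats the steps of this guide and does that. It also computes the AUC from hard predict() labels, to show why probabilities are used instead. The variable names pos_col, auc_from_probs, auc_from_curve, and auc_from_labels are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# predict_proba columns follow model.classes_, so look up the column for label 1
pos_col = list(model.classes_).index(1)
y_prob = model.predict_proba(X_test)[:, pos_col]

# Route 1: direct computation from labels and probabilities
auc_from_probs = roc_auc_score(y_test, y_prob)

# Route 2: build the ROC curve, then integrate it with the trapezoidal rule
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_from_curve = auc(fpr, tpr)

# For comparison: hard 0/1 labels collapse the curve to a single threshold
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

print(f"roc_auc_score on probabilities: {auc_from_probs:.4f}")
print(f"roc_curve + auc:                {auc_from_curve:.4f}")
print(f"roc_auc_score on hard labels:   {auc_from_labels:.4f}")
```

The first two values agree because they are two routes to the same area. The third is generally different because hard labels discard the ranking information inside each predicted class.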
Conclusion
In this guide, we learned how to compute the Area Under the Curve (AUC) in Python. AUC is a useful metric for evaluating a binary classification model because it does not depend on a single decision threshold. It is also less misleading than accuracy when the classes are imbalanced, although under severe imbalance it can still look optimistic and is often read alongside precision and recall. It measures the model’s ability to rank positive examples above negative ones: the higher the AUC, the better that separation. Python’s Scikit-learn library provides convenient functions to calculate AUC, such as roc_auc_score, which makes the computation a one-line step once the predicted probabilities are available.