# How to Calculate the Area Under the Curve (AUC) in Python

The Area Under the Curve (AUC) of a Receiver Operating Characteristic (ROC) curve summarizes the curve in a single number that reflects how well a classifier separates two classes. It’s commonly used in machine learning to compare models and to choose between them. This guide explains what AUC measures and shows how to calculate it in Python with Scikit-learn.

## Part 1: Understanding AUC

Before diving into the computation of AUC, let’s understand what it is and why it’s used.

An ROC curve plots the performance of a binary classifier as its decision threshold is varied. The y-axis shows the True Positive Rate (TPR), the fraction of actual positives that are predicted positive. The x-axis shows the False Positive Rate (FPR), the fraction of actual negatives that are predicted positive. The AUC is the area under this curve from (0,0) to (1,1), so it aggregates performance across all possible classification thresholds.
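To make the threshold sweep concrete, here is a minimal sketch that computes one (FPR, TPR) point per threshold by hand. The labels, the scores, and the helper name `roc_point` are made up for illustration and are not part of any library.

```python
import numpy as np

# Hypothetical toy data: true labels and classifier scores for six samples.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])


def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are predicted positive."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))  # positives correctly flagged
    fp = np.sum(y_pred & (y_true == 0))  # negatives wrongly flagged
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)


# Lowering the threshold moves the point from (0, 0) towards (1, 1).
for threshold in (1.0, 0.65, 0.5, 0.35, 0.0):
    print(threshold, roc_point(y_true, y_score, threshold))
```

Plotting these points with FPR on the x-axis and TPR on the y-axis traces the ROC curve for the toy scores.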

AUC ranges in value from 0 to 1. A model that scores every positive example above every negative example has an AUC of 1, and a model that makes random predictions has an AUC of about 0.5. A value below 0.5 means the model ranks the classes worse than chance. Therefore, the higher the AUC, the better the model’s ability to distinguish between positive and negative classes.
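One way to build intuition for these values: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this on made-up toy data by comparing every positive-negative pair and then comparing the result with `roc_auc_score`. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data, chosen only for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

sklearn_auc = float(roc_auc_score(y_true, y_score))

# Scores that separate the two classes perfectly give an AUC of 1.
perfect_auc = float(roc_auc_score(y_true, y_true))

print(pairwise_auc, sklearn_auc, perfect_auc)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so both computations give 8/9.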

## Part 2: Computing AUC in Python

Having understood the theory behind AUC, let’s calculate the AUC for a machine learning model in Python. We will use the Breast Cancer Wisconsin dataset from Scikit-learn for this purpose.

### Step 1: Import Necessary Libraries

First, we need to import the necessary libraries.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

### Step 2: Load and Preprocess Data

We’ll load the Breast Cancer Wisconsin dataset, which ships with Scikit-learn. It has 569 samples, 30 numeric features, and a binary target in which 0 means malignant and 1 means benign.

# Load the dataset
data = load_breast_cancer()

# Extract features and target
X = data.data
y = data.target

We’ll split the data into training and testing sets. The model will be trained on the training set and evaluated on the testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
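The split above is purely random. As an optional variation that is not part of the original walkthrough, `train_test_split` also accepts a `stratify` argument that keeps the class proportions nearly equal in the training and testing sets. This can make AUC estimates a little more stable on small test sets. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y preserves the benign/malignant ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
# Fraction of samples labeled 1 (benign) in each subset.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```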

### Step 3: Train the Model

Let’s train a Logistic Regression model on our training data.

# Create and train the model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
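The large `max_iter` gives the solver room to converge on features that sit on very different scales. A common alternative, shown here as a sketch and not as the method used in this guide, is to standardize the features in a pipeline so that the default iteration limit is usually enough. The name `scaled_model` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then reused at prediction time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

print(scaled_model.classes_)
```

The pipeline exposes `predict` and `predict_proba` just like the plain model, so the remaining steps work unchanged with it.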

### Step 4: Predict Probabilities and Calculate AUC

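Computing the ROC curve requires a continuous score for each sample, such as the predicted probability of the positive class, rather than hard class labels. Therefore, we use model.predict_proba() instead of model.predict(). The predict_proba() function returns one column of probabilities per class, ordered as in model.classes_. We slice out column 1, which holds the probability of the class labeled 1 (benign in this dataset).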

Then, we calculate the AUC using roc_auc_score.

# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate AUC
auc = roc_auc_score(y_test, y_prob)
print(f'AUC: {auc:.4f}')
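As a cross-check, the same number can be obtained in two steps: `roc_curve` returns the (FPR, TPR) points, and `sklearn.metrics.auc` integrates them with the trapezoidal rule. The sketch below repeats the steps above so that it runs on its own, and it also scores the hard labels from `predict()` to show why probabilities are preferred. The variable names are illustrative.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Column order follows model.classes_, so column 1 is the score for label 1.
y_prob = model.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per distinct threshold, then trapezoidal integration.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)
auc_from_curve = metrics.auc(fpr, tpr)
auc_direct = metrics.roc_auc_score(y_test, y_prob)

# Hard labels reduce the curve to a single threshold, so this value can differ.
auc_from_labels = metrics.roc_auc_score(y_test, model.predict(X_test))

print(auc_from_curve, auc_direct, auc_from_labels)
```

The first two values agree because `roc_auc_score` performs the same integration internally for the binary case. The `fpr` and `tpr` arrays can also be passed to a plotting library to draw the curve.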

## Conclusion

In this guide, we computed the Area Under the ROC Curve (AUC) in Python. AUC is a useful metric for evaluating a binary classification model because it measures how well the model separates the positive and negative classes without committing to a single threshold, which is one reason it is often used when the classes are imbalanced. The higher the AUC, the better the model ranks positive examples above negative ones. Scikit-learn’s roc_auc_score computes it in a single call from the true labels and the predicted scores.