How to Calculate Mahalanobis Distance in Python

Introduction

Mahalanobis distance measures how far a point lies from a distribution, relative to that distribution's spread. Unlike Euclidean distance, it accounts for the correlations between variables and is scale-invariant, so the result does not depend on the units in which each variable is measured. It was introduced by P.C. Mahalanobis in 1936 and is used in fields such as pattern recognition, data mining, machine learning, and bioinformatics.
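
To make the scale-invariance claim concrete, here is a minimal sketch that rescales one column of a small illustrative dataset and compares both distances before and after. The helper name mahalanobis_to_mean and the scaling factor are assumptions made for this example, not part of any library.

```python
import numpy as np

def mahalanobis_to_mean(data, point):
    """Mahalanobis distance from `point` to the mean of `data` (rows = observations).

    Hypothetical helper written for this illustration only.
    """
    mean = data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    diff = point - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

# Small illustrative dataset: 5 observations, 2 variables
X = np.array([[2, 2], [2, 5], [6, 8], [8, 8], [7, 2]], dtype=float)

# Assumed unit change: multiply the second column by 1000
scale = np.array([1.0, 1000.0])
X_scaled = X * scale

# Euclidean distance from the first observation to the column means
euclid_before = np.linalg.norm(X[0] - X.mean(axis=0))
euclid_after = np.linalg.norm(X_scaled[0] - X_scaled.mean(axis=0))

# Mahalanobis distance for the same observation
maha_before = mahalanobis_to_mean(X, X[0])
maha_after = mahalanobis_to_mean(X_scaled, X_scaled[0])

print("Euclidean:  ", euclid_before, "->", euclid_after)
print("Mahalanobis:", maha_before, "->", maha_after)
```

The Euclidean distance changes dramatically with the units, while the Mahalanobis distance stays the same up to floating-point error.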

In this article, we’ll explore how to calculate Mahalanobis distance in Python, first with a manual NumPy implementation and then with SciPy’s built-in function.

Manual Implementation

For an observation vector x = (x1, x2, ..., xn)^T, the squared Mahalanobis distance is defined by the following formula:

D^2 = (x - μ)^T Σ^-1 (x - μ)

Where:

  • D^2 is the square of the Mahalanobis distance; take the square root to obtain the distance D.
  • x is the vector of the observation (a row in the dataset).
  • μ is the vector of variable means (the mean of each column).
  • Σ^-1 is the inverse of the covariance matrix of the variables.
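
A special case can help build intuition for the formula. If the variables are uncorrelated, Σ is diagonal, and the squared distance reduces to a sum of squared standardized differences. The symbol σ_i^2 below is notation introduced here for the variance of the i-th variable. When Σ is the identity matrix, the expression is simply the squared Euclidean distance.

```latex
D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)
    = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}
\quad \text{when } \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)
```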

Step-by-Step Process

  1. Calculate the mean: Compute the mean of each column in the dataset.
  2. Calculate the covariance matrix: Use NumPy's cov function. By default it expects variables in rows, so pass the transposed dataset (or set rowvar=False).
  3. Invert the covariance matrix: The formula uses Σ^-1, so the covariance matrix must be invertible (non-singular).
  4. Subtract the mean from the observation: This gives the vector x - μ.
  5. Calculate the Mahalanobis distance: Evaluate (x - μ)^T Σ^-1 (x - μ) and take the square root.

Python Code for Manual Implementation

import numpy as np

# Assume X is your dataset 
X = np.array([[2, 2], [2, 5], [6, 8], [8, 8], [7, 2]])

# Calculate the mean of the dataset
mean = np.mean(X, axis=0)

# Calculate the covariance matrix
cov = np.cov(X.T)

# Calculate the inverse of the covariance matrix
inv_cov = np.linalg.inv(cov)

# Subtract the mean from the observation
obs = X[0]  # Assume we want to compute the distance for the first observation
diff = obs - mean

# Finally, apply the formula; the square root turns D^2 into D
# (diff is 1-D, so no explicit transpose is needed)
mahalanobis_dist = np.sqrt(diff @ inv_cov @ diff)

print(mahalanobis_dist)  # Approximately 1.2148 for this dataset
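
The same steps extend to every row of the dataset at once. The following is a minimal sketch, assuming the same example data, that uses np.einsum to evaluate the quadratic form row by row without a Python loop. The variable names are chosen for this illustration.

```python
import numpy as np

# Same example dataset: rows are observations, columns are variables
X = np.array([[2, 2], [2, 5], [6, 8], [8, 8], [7, 2]], dtype=float)

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# One difference vector (x - mean) per row
diffs = X - mean

# For each row i, compute diffs[i] @ inv_cov @ diffs[i]
sq_dists = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)
all_dists = np.sqrt(sq_dists)

print(all_dists)  # one Mahalanobis distance per observation
```

A quick sanity check on such a computation: when the mean and the sample covariance (NumPy's default, with n - 1 in the denominator) come from the same data, the squared distances sum to (n - 1) times the number of variables, which is 8 here.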

Using a Python Library

SciPy’s scipy.spatial.distance module has a built-in mahalanobis function that takes two vectors and the inverse covariance matrix, so you don’t have to write out the matrix product yourself. You still need to compute the mean and the inverse covariance matrix beforehand. Below is an example of how to use it:

import numpy as np
from scipy.spatial import distance

# Assume X is your dataset 
X = np.array([[2, 2], [2, 5], [6, 8], [8, 8], [7, 2]])

# Calculate the mean of the dataset
mean = np.mean(X, axis=0)

# Calculate the covariance matrix
cov = np.cov(X.T)

# Calculate the inverse of the covariance matrix
inv_cov = np.linalg.inv(cov)

# Alias SciPy's function for brevity; it takes two vectors and the inverse covariance matrix
mahalanobis = distance.mahalanobis

# Use the function to calculate the Mahalanobis distance of the first observation
obs = X[0]  # Assume we want to compute the distance for the first observation
mahalanobis_dist = mahalanobis(obs, mean, inv_cov)

print(mahalanobis_dist)  # Outputs the Mahalanobis distance
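
If distances are needed for every row rather than a single observation, one option is SciPy's cdist function with metric="mahalanobis", which accepts the inverse covariance matrix through its VI argument. The sketch below assumes the same example data and measures each row against the column means.

```python
import numpy as np
from scipy.spatial import distance

# Same example dataset: rows are observations, columns are variables
X = np.array([[2, 2], [2, 5], [6, 8], [8, 8], [7, 2]], dtype=float)

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# cdist expects two 2-D arrays, so the mean is reshaped to a single row
dists_to_mean = distance.cdist(
    X, mean.reshape(1, -1), metric="mahalanobis", VI=inv_cov
).ravel()

print(dists_to_mean)  # one distance per observation
```

The first entry should match the value returned by distance.mahalanobis for the first observation.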

Conclusion

Mahalanobis distance is a useful tool in data science wherever the relationships between variables matter. A common application is anomaly detection, where it is used to identify outliers in multivariate data. By knowing how to implement it in Python, either manually with NumPy or with SciPy’s spatial module, you’ll be well-equipped to handle tasks that require measuring the statistical distance between a point and a distribution.
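
As a closing illustration of the outlier use case, here is a minimal sketch under stated assumptions: the reference data is synthetic, the two candidate points are made up, and the cutoff relies on the common approximation that squared Mahalanobis distances of roughly multivariate-normal data follow a chi-square distribution with as many degrees of freedom as there are variables. The 97.5% quantile is an arbitrary choice for this example.

```python
import numpy as np
from scipy.stats import chi2

# Synthetic reference data (assumption): 500 correlated 2-D normal samples
rng = np.random.default_rng(0)
reference = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=500
)

# Estimate the distribution's parameters from the reference data
mean = reference.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))

# Made-up points to score: one near the centre, one against the correlation
candidates = np.array([[0.1, 0.1], [6.0, -6.0]])
diffs = candidates - mean
sq_dists = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)

# Assumed cutoff: 97.5% quantile of chi-square with df = number of variables
cutoff = chi2.ppf(0.975, df=reference.shape[1])
is_outlier = sq_dists > cutoff

print(is_outlier)
```

In practice the cutoff, and whether the normality assumption is reasonable at all, depends on the data at hand.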
