
The Manhattan distance, also known as the taxicab distance or the L1 distance (the distance induced by the L1 norm), is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. It reflects actual travel distance more accurately than straight-line distance when movement is restricted to a grid (as a taxi would be in Manhattan, hence the name).
In this article, we will go over how to calculate Manhattan distance in Python, starting with basic principles, then providing a Python function to achieve this, and eventually scaling it up for usage in machine learning applications.
Prerequisites
To understand this guide fully, you need some familiarity with Python and basic concepts in mathematics (particularly, geometry and vectors).
We’ll be using Python’s built-in functions, as well as functionality from the numpy and scikit-learn libraries. If you don’t have these libraries installed, you can install them with pip:
pip install numpy
pip install scikit-learn
Defining Manhattan Distance
The Manhattan distance between two points in a 2D plane is the absolute difference in their X-coordinates plus the absolute difference in their Y-coordinates. If we have two points P1(x1, y1) and P2(x2, y2), the Manhattan distance between these points is given by:
|x1 – x2| + |y1 – y2|
This formula can be generalized to n-dimensional space as:
Σ |ai – bi|
where ai and bi are the ith components of points A and B, respectively.
Calculating Manhattan Distance in Python
Manhattan Distance in a 2D Plane
Let’s start by creating a Python function that calculates the Manhattan distance between two points in a 2D plane.
def manhattan_distance_2D(point1, point2):
    # Sum the absolute differences of the x and y coordinates.
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])

point1 = [2, 3]
point2 = [5, 7]
print(manhattan_distance_2D(point1, point2))  # output: 7
In this function, point1 and point2 are lists holding the x and y coordinates of the two points. The function computes the absolute differences in the x and y coordinates and returns their sum.
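To make the grid intuition concrete, the following sketch compares the Manhattan distance with the straight-line (Euclidean) distance for the same two points. It uses only the standard library; math.dist requires Python 3.8 or later, and the lowercase function name manhattan_distance_2d is chosen here purely for illustration.

```python
import math

def manhattan_distance_2d(point1, point2):
    # Sum of absolute coordinate differences (grid-constrained movement).
    return abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])

point1 = [2, 3]
point2 = [5, 7]

grid_distance = manhattan_distance_2d(point1, point2)
straight_line = math.dist(point1, point2)  # Euclidean distance

print(grid_distance)   # 7
print(straight_line)   # 5.0
```

The taxi has to cover 7 units along the grid even though the two points are only 5 units apart in a straight line.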
Manhattan Distance in an n-Dimensional Space
The formula for Manhattan distance extends to more than just 2 dimensions. Here is a function that calculates the Manhattan distance between two points in an n-dimensional space:
def manhattan_distance_nd(point1, point2):
    # Pair up corresponding coordinates and sum their absolute differences.
    return sum(abs(a - b) for a, b in zip(point1, point2))

point1 = [2, 3, 1]
point2 = [5, 7, 3]
print(manhattan_distance_nd(point1, point2))  # output: 9
Here, we’re using the built-in zip function to pair the corresponding elements of the two points. The sum function then adds up the absolute differences of these pairs.
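One caveat with the zip-based approach: zip stops at the shorter input, so two points with different numbers of dimensions would still produce a result without any warning. A minimal sketch of a stricter variant, assuming you would rather fail loudly, might look like this (the name manhattan_distance_checked and the error message are illustrative choices, not part of any library):

```python
def manhattan_distance_checked(point1, point2):
    # zip() stops at the shorter input, so mismatched points would
    # otherwise yield a distance computed over only some coordinates.
    if len(point1) != len(point2):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(point1, point2))

print(manhattan_distance_checked([2, 3, 1], [5, 7, 3]))  # 9

try:
    manhattan_distance_checked([2, 3, 1], [5, 7])
except ValueError as err:
    print(err)  # points must have the same number of dimensions
```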
Using Numpy for Efficient Calculations
In practice, you will likely be dealing with large amounts of data, and efficiency will become important. Numpy, a powerful library for numerical computation in Python, can make these calculations much more efficient.
Here’s how you can use numpy to calculate Manhattan distance:
import numpy as np
def manhattan_distance_nd_numpy(point1, point2):
    # Convert to arrays, subtract element-wise, then sum the absolute values.
    return np.sum(np.abs(np.array(point1) - np.array(point2)))

point1 = [2, 3, 1]
point2 = [5, 7, 3]
print(manhattan_distance_nd_numpy(point1, point2))  # output: 9
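Numpy’s operations are vectorized, which means they operate on whole arrays (vectors) element-wise in compiled code rather than in a Python-level loop. This makes numpy significantly faster than standard Python on large arrays; for very small inputs, such as the three-element points above, the cost of converting lists to arrays can outweigh the gain.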
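Vectorization pays off most when many distances are computed at once. As a rough sketch, the following computes the distance from one query point to every row of a small array in a single expression, relying on numpy broadcasting. The data and the variable names points, query, and distances are made up for illustration.

```python
import numpy as np

# Hypothetical data: four 3-dimensional points and one query point.
points = np.array([[2, 3, 1], [5, 7, 3], [2, 1, 3], [5, 4, 1]])
query = np.array([2, 3, 1])

# Broadcasting subtracts `query` from every row of `points`;
# summing along axis=1 gives one Manhattan distance per row.
distances = np.abs(points - query).sum(axis=1)

print(distances)  # [0 9 4 4]
```

The same expression works unchanged for thousands of rows, with no explicit Python loop.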
Applying Manhattan Distance in Machine Learning
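The Manhattan distance is a useful tool in many areas, including machine learning, where it appears in classification and clustering algorithms such as K-Nearest Neighbors (KNN) and K-Medians, the Manhattan-distance counterpart of K-Means (standard K-Means minimizes squared Euclidean distance).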
Scikit-learn is a popular machine learning library for Python, and it conveniently includes functions for computing the Manhattan distance, among other metrics.
Here’s how you can compute the Manhattan distance between two points using scikit-learn’s manhattan_distances function:
from sklearn.metrics.pairwise import manhattan_distances
import numpy as np
point1 = np.array([[2, 3, 1]])
point2 = np.array([[5, 7, 3]])
print(manhattan_distances(point1, point2)) # output: [[9.]]
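The manhattan_distances function expects 2D arrays (one row per point), so we need to provide our points in that shape. The result is also a 2D array, which is why the output is [[9.]] rather than a plain number.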
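If your points start out as flat lists, one way to get them into the expected shape is to reshape each into a single-row 2D array and then pull the scalar back out of the result. This is only a sketch; the variable names a, b, result, and distance are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

point1 = [2, 3, 1]
point2 = [5, 7, 3]

# reshape(1, -1) turns a flat vector into a single-row 2D array.
a = np.asarray(point1).reshape(1, -1)
b = np.asarray(point2).reshape(1, -1)

result = manhattan_distances(a, b)  # 2D array of shape (1, 1)
distance = float(result[0, 0])      # extract the scalar value

print(distance)  # 9.0
```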
Note that the manhattan_distances function can also calculate the pairwise distances between multiple points at once. For instance:
from sklearn.metrics.pairwise import manhattan_distances
import numpy as np
points = np.array([[2, 3, 1], [5, 7, 3], [2, 1, 3], [5, 4, 1]])
print(manhattan_distances(points))
This prints a 4×4 symmetric matrix in which entry (i, j) is the Manhattan distance between points i and j. The diagonal is all zeros, since every point is at distance 0 from itself.
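To connect this back to the algorithms mentioned at the start of this section, here is a minimal sketch of a K-Nearest Neighbors classifier that uses the Manhattan distance, selected through the metric parameter of scikit-learn’s KNeighborsClassifier. The toy data set below is invented for illustration and is far too small for a real application.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated groups of 2D points.
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = [0, 0, 0, 1, 1, 1]

# metric="manhattan" makes the neighbor search use the L1 distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
knn.fit(X_train, y_train)

predictions = knn.predict([[2, 2], [7, 9]])
print(predictions)  # [0 1]
```

Each query point is assigned the majority label among its three nearest training points, with nearness measured along the grid rather than in a straight line.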
Conclusion
In this article, we have introduced the Manhattan distance, a fundamental concept in geometry and machine learning, and have shown how to calculate it in Python, both in basic Python and using the numpy and scikit-learn libraries.
Understanding and implementing such distance measures is essential in the field of data science and machine learning, as many algorithms rely heavily on distance computations. While we have focused on Manhattan distance here, there are many other distance measures (e.g., Euclidean, cosine) that may be more suitable depending on the problem at hand, so it’s beneficial to understand the differences and know how to implement each.