The Levenshtein distance, also known as the “edit distance,” is a measure of the difference between two sequences. Invented by the Russian scientist Vladimir Levenshtein, it calculates the minimum number of edits (insertions, deletions, or substitutions) required to transform one string into another.
In this article, we’ll explore how to calculate Levenshtein distance in Python, including both a manual implementation and the usage of a pre-built library.
- Initialization: Create a matrix with one row for each character in the first string (plus one extra) and one column for each character in the second string (plus one extra). The matrix’s size will be
(len(string1)+1)*(len(string2)+1). Initialize the first row with numbers
len(string1), and the first column with numbers
len(string2). These represent the cost of turning the string into an empty string by deleting all characters.
- Matrix filling: Iterate over the matrix, starting from the second row and column. For each cell, calculate and fill in the minimum of the following:
- The value of the cell diagonally above and to the left, plus 1 if the current characters in the two strings are not equal (0 if they are equal). This represents a substitution.
- The value of the cell above plus 1. This represents a deletion.
- The value of the cell to the left plus 1. This represents an insertion.
- Final result: The Levenshtein distance is in the cell at the bottom-right corner of the matrix, representing the minimum edit distance between the two strings.
Python Code for Manual Implementation
def levenshtein_distance(string1, string2): size_x = len(string1) + 1 size_y = len(string2) + 1 matrix = [*size_y for _ in range(size_x)] for x in range(size_x): matrix [x] = x for y in range(size_y): matrix [y] = y for x in range(1, size_x): for y in range(1, size_y): if string1[x-1] == string2[y-1]: substitution_cost = 0 else: substitution_cost = 1 matrix [x][y] = min( matrix[x-1][y] + 1, # deletion matrix[x][y-1] + 1, # insertion matrix[x-1][y-1] + substitution_cost # substitution ) return matrix[size_x - 1][size_y - 1]
Using a Python Library
While the manual implementation is instructive, it may be more practical to use a Python library that has the Levenshtein distance calculation built-in. One such library is
python-Levenshtein. This is not a standard Python library, so you might need to install it using
pip install python-Levenshtein
You can then use this library to calculate the Levenshtein distance like so:
import Levenshtein as lev distance = lev.distance('kitten', 'sitting') print(distance) # Outputs: 3
lev.distance() function does exactly what our manual implementation did. It’s also more performance-optimized, so it’s a good choice for larger applications.
The Levenshtein distance is a valuable tool in many fields, including natural language processing, DNA comparison, and spell checking, to name just a few. By understanding the concept and knowing how to implement it in Python, you have added a versatile tool to your data processing arsenal.