How to Calculate Levenshtein Distance in Python

Spread the love

Introduction

The Levenshtein distance, also known as the “edit distance,” is a measure of the difference between two sequences. Invented by the Russian scientist Vladimir Levenshtein, it calculates the minimum number of edits (insertions, deletions, or substitutions) required to transform one string into another.

In this article, we’ll explore how to calculate Levenshtein distance in Python, including both a manual implementation and the usage of a pre-built library.

Manual Implementation

Step-by-Step Process

  1. Initialization: Create a matrix with one row for each character in the first string (plus one extra) and one column for each character in the second string (plus one extra). The matrix’s size will be (len(string1)+1)*(len(string2)+1). Initialize the first row with numbers 0 through len(string1), and the first column with numbers 0 through len(string2). These represent the cost of turning the string into an empty string by deleting all characters.
  2. Matrix filling: Iterate over the matrix, starting from the second row and column. For each cell, calculate and fill in the minimum of the following:
    • The value of the cell diagonally above and to the left, plus 1 if the current characters in the two strings are not equal (0 if they are equal). This represents a substitution.
    • The value of the cell above plus 1. This represents a deletion.
    • The value of the cell to the left plus 1. This represents an insertion.
  3. Final result: The Levenshtein distance is in the cell at the bottom-right corner of the matrix, representing the minimum edit distance between the two strings.

Python Code for Manual Implementation

def levenshtein_distance(string1, string2):
    size_x = len(string1) + 1
    size_y = len(string2) + 1

    matrix = [[0]*size_y for _ in range(size_x)]
    
    for x in range(size_x):
        matrix [x][0] = x
    for y in range(size_y):
        matrix [0][y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if string1[x-1] == string2[y-1]:
                substitution_cost = 0
            else:
                substitution_cost = 1

            matrix [x][y] = min(
                matrix[x-1][y] + 1,                   # deletion
                matrix[x][y-1] + 1,                   # insertion
                matrix[x-1][y-1] + substitution_cost  # substitution
            )

    return matrix[size_x - 1][size_y - 1]

Using a Python Library

While the manual implementation is instructive, it may be more practical to use a Python library that has the Levenshtein distance calculation built-in. One such library is python-Levenshtein. This is not a standard Python library, so you might need to install it using pip.

pip install python-Levenshtein

You can then use this library to calculate the Levenshtein distance like so:

import Levenshtein as lev

distance = lev.distance('kitten', 'sitting')
print(distance)  # Outputs: 3

The lev.distance() function does exactly what our manual implementation did. It’s also more performance-optimized, so it’s a good choice for larger applications.

Conclusion

The Levenshtein distance is a valuable tool in many fields, including natural language processing, DNA comparison, and spell checking, to name just a few. By understanding the concept and knowing how to implement it in Python, you have added a versatile tool to your data processing arsenal.

Leave a Reply