I recently wrote an implementation of the Soundex Algorithm, which attempts to assign the same encoding to words which are pronounced the same but spelled differently. In this post I’ll cover the Levenshtein Word Distance algorithm, a related concept which measures the “cost” of transforming one word into another by totalling the minimum number of letters which need to be inserted, deleted or substituted.
The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decide which words to suggest as alternatives to mis-spelled words: if the distance between a mis-spelled word and an actual word is low then it is likely that word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.
The Levenshtein Word Distance
I usually leave the program’s output until near the end of each post but this time I’ll put it up front as it is the easiest way of showing how the algorithm works. I’ll actually show two program outputs, the first being after the required data structure has been initialized but before the algorithm has been run, and the second after the algorithm has been run.
Here we have a hypothetical situation where somebody has typed “banama” and we want to see how close it is to “banana”. The source word “banama” is shown down the left and the target word “banana” is along the top. We also have a grid of integers with one more row than the number of letters in the source, and one more column than the number of letters in the target.
Most values are initially set to 0, but the first row of numbers represents the cumulative number of letters which need to be inserted to form the target, and the first column shows the cumulative number of deletions to remove the source. In other words, to transform “banama” into “banana” you could delete six letters and insert six more. Obviously that’s not the most efficient way though, as only one operation needs to be performed: the substitution of the “m” with an “n”. Let’s now look at the program output after the algorithm has been run.
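Building that initial grid takes only a few lines of Python. This is a standalone sketch rather than the project’s own code, and the function name is mine:

```python
def init_grid(source, target):
    """Create the (len(source)+1) x (len(target)+1) grid.

    The first row holds cumulative insertion counts 0..len(target),
    the first column holds cumulative deletion counts 0..len(source),
    and every other cell starts at 0.
    """
    rows = len(source) + 1
    cols = len(target) + 1
    grid = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        grid[0][col] = col      # insertions needed to build the target
    for row in range(rows):
        grid[row][0] = row      # deletions needed to remove the source
    return grid

grid = init_grid("banama", "banana")
# First row and first column both run 0, 1, 2, 3, 4, 5, 6.
```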
All the other numbers have been filled in, and the number in the bottom right is the total cost of transforming “banama” to “banana” using the minimum number of operations, in this case 1.
The values for each of the 36 numbers initialized to 0 are calculated as follows:
- Find the value diagonally top-left and if the corresponding letters are different add 1 (substitution cost)
- Find the value above and add 1 (deletion cost)
- Find the value to the left and add 1 (insertion cost)
- Set the current value to the lowest of the above three values
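The four steps above translate directly into a nested loop. The following is a minimal standalone function rather than the project’s class-based code:

```python
def word_distance(source, target):
    """Levenshtein distance with all three operation costs set to 1."""
    rows, cols = len(source) + 1, len(target) + 1
    grid = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        grid[0][col] = col
    for row in range(rows):
        grid[row][0] = row
    for row in range(1, rows):
        for col in range(1, cols):
            # Substitution: diagonal top-left, +1 if the letters differ
            substitution = grid[row - 1][col - 1] + (source[row - 1] != target[col - 1])
            # Deletion: value above, +1
            deletion = grid[row - 1][col] + 1
            # Insertion: value to the left, +1
            insertion = grid[row][col - 1] + 1
            grid[row][col] = min(substitution, deletion, insertion)
    # Bottom-right cell holds the total minimum cost
    return grid[-1][-1]

print(word_distance("banama", "banana"))   # 1
```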
For this project I will use 1 as the “cost” of all three possible operations: substitution, insertion and deletion. However, this is not mandatory and in some circumstances different costs might be considered appropriate. For example in a spell checker you might feel someone is more likely to type the wrong letter than to miss out a letter or type an extra letter. Setting the insert and delete cost to > 1 will therefore make words with the same number of letters more likely to be suggested as alternatives than words where inserts or deletes are required.
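As a sketch of that idea, the three costs can be made parameters. The parameter names here are my own, not the project’s:

```python
def weighted_distance(source, target,
                      insert_cost=1, delete_cost=1, substitute_cost=1):
    """Levenshtein distance with configurable operation costs."""
    rows, cols = len(source) + 1, len(target) + 1
    grid = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        grid[0][col] = col * insert_cost
    for row in range(rows):
        grid[row][0] = row * delete_cost
    for row in range(1, rows):
        for col in range(1, cols):
            sub = grid[row - 1][col - 1] + (
                substitute_cost if source[row - 1] != target[col - 1] else 0)
            grid[row][col] = min(sub,
                                 grid[row - 1][col] + delete_cost,
                                 grid[row][col - 1] + insert_cost)
    return grid[-1][-1]

# With insertions and deletions twice the price of a substitution,
# same-length candidates look comparatively better:
print(weighted_distance("cat", "cart", insert_cost=2, delete_cost=2))  # 2
```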
The overall plan of the implementation of Levenshtein’s Word Distance should now be clear — given two words we just need to create a suitably sized 2D list, initialize the numbers and then iterate the rows and columns, setting the elements to their final values.
This project consists of two files which you can clone or download from the GitHub repository.
This is the first file.
The __init__ method simply creates a few attributes in our class with default values.
The calculate method is central to the whole project. After calling __init_grid it iterates the source word letters and then the target word letters in a nested loop. Within the inner loop it calculates the substitution, deletion and insertion costs according to the rules described above, finally setting the grid value to the minimum of the three. After the loop we set the minimum cost to the bottom-right value.
The print_grid method is necessarily rather fiddly, and prints the words and grid in the format shown above. The separate print_cost method is much simpler and just prints the cost with a suitable message, or a warning if it has not yet been calculated.
The __init_grid method, called in calculate, first empties the grid for those circumstances where we are reusing the object. It then loops over the rows and columns, setting the first row and column to 1, 2, 3… and the other values to 0. Finally minimum_cost is set to -1 as an indicator that the cost has not yet been calculated; this is used in print_cost.
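The file itself is in the repository; as a rough sketch of the structure just described (details such as the grid formatting may differ from the actual repository code):

```python
class Levenshtein:
    """Sketch of the class described above; the repository
    version may differ in detail."""

    def __init__(self):
        # A few attributes with default values.
        self.source = ""
        self.target = ""
        self.grid = []
        self.minimum_cost = -1

    def calculate(self):
        self.__init_grid()
        for row in range(1, len(self.source) + 1):
            for col in range(1, len(self.target) + 1):
                substitution_cost = self.grid[row - 1][col - 1] + (
                    1 if self.source[row - 1] != self.target[col - 1] else 0)
                deletion_cost = self.grid[row - 1][col] + 1
                insertion_cost = self.grid[row][col - 1] + 1
                self.grid[row][col] = min(substitution_cost,
                                          deletion_cost,
                                          insertion_cost)
        self.minimum_cost = self.grid[-1][-1]

    def __init_grid(self):
        # Empty the grid in case the object is being reused.
        self.grid = []
        for _ in range(len(self.source) + 1):
            self.grid.append([0] * (len(self.target) + 1))
        for col in range(len(self.target) + 1):
            self.grid[0][col] = col
        for row in range(len(self.source) + 1):
            self.grid[row][0] = row
        # -1 indicates the cost has not yet been calculated.
        self.minimum_cost = -1

    def print_grid(self):
        # Simplified layout: target along the top, source down the left.
        print("    " + " ".join(self.target))
        for row, values in enumerate(self.grid):
            letter = self.source[row - 1] if row > 0 else " "
            print(letter + " " + " ".join(str(n) for n in values))

    def print_cost(self):
        if self.minimum_cost == -1:
            print("Cost not yet calculated")
        else:
            print(f"Minimum cost: {self.minimum_cost}")
```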
Let’s now move on to main.py.
First we create two lists of word pairs to run the algorithm on, and then create a Levenshtein object. We then iterate the lists, setting the words and calling the methods.
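The actual main.py is in the repository; the driving pattern looks roughly like this. To keep the sketch self-contained I use a compact two-row distance function in place of the class, and the word pairs are those discussed below:

```python
def word_distance(source, target):
    """Compact Levenshtein distance keeping only two grid rows in memory."""
    previous = list(range(len(target) + 1))
    for row, s in enumerate(source, 1):
        current = [row]
        for col, t in enumerate(target, 1):
            current.append(min(previous[col - 1] + (s != t),  # substitution
                               previous[col] + 1,             # deletion
                               current[col - 1] + 1))         # insertion
        previous = current
    return previous[-1]

# Word pairs along the lines of those used in main.py.
word_pairs = [("banama", "banana"),
              ("banama", "elephant"),
              ("levinstein", "levenshtein")]

for source, target in word_pairs:
    print(f"{source} -> {target}: distance {word_distance(source, target)}")
```

Keeping only the previous and current rows is a common space optimization: it gives the final cost without storing the full grid, though of course the full grid is needed if you want to print it as the project does.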
Run the code with this command.
I have already included the banama/banana output above so won’t repeat it here. The word distance there was 1, so banana easily qualifies as a sensible suggestion if somebody mis-types banama. However, if you typed banama you wouldn’t expect your word processor to suggest elephant, so let’s try those two words.
The word distance here is 7 as banama needs to be almost completely re-typed to get to elephant. Clearly not a valid alternative.
Finally, levinstein/levenshtein. One substitution and one insertion give us a word distance of 2, also a sensible suggestion. You would probably consider word distances of 2 or perhaps 3 to be reasonable for alternative suggestions, but no more.