Building A Spelling Correction System With Soundex And Levenshtein Distance
With the computer tool now ubiquitous, I imagine you all use word processing tools such as Microsoft Word. If so, you’ve probably already wondered how Microsoft Word can offer you relevant words when you misspell a word in your text. In this article, I will answer this great question and show you how to build your own Spelling Correction System in Java.
In order to implement their Spelling Correction System, word processing tools generally rely on two fundamental and old concepts: the Soundex Algorithm and Levenshtein Distance.
To start, it is essential to do some theory. Soundex is a phonetic algorithm used to index words according to the sounds used to pronounce them in English. The goal is to obtain an identical representation for homophones. As a reminder, homophones are words with the same phonetic pronunciation but can be spelled slightly differently.
Soundex mainly encodes consonants. Thus, a vowel will not be encoded unless it is the first letter of the word. It should be noted that Soundex is probably the most well known phonetic algorithm and is implemented as standard in most database software from Oracle to PostgreSQL and MySQL.
In concrete terms, Soundex will produce a code for a word that consists of a letter followed by three digital digits. The letter is simply the first letter of the word used at the entrance of the Soundex. The next three digits correspond to the encoding of the remaining consonants.
To implement Soundex, you must follow the following 4 steps:
- Retain the first letter of the word. Drop all other occurences of A, E, I, O, U, Y, H, W.
- Replace consonants after the first letter with digits as follows :
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
- If two or more letters with the same digit are adjacent in the original word (before step 1), only retain the first letter.
- If you have too few letters in your word that you can’t assign three digits, right pad with zeros. If you have more than three digits, you must truncate.
It gives us the following implementation in Java:
We can test the execution of our Soundex implementation and see that we get the same result for “Robert” and “Rupert” with code R163:
As detailed on Wikipedia, the Levenshtein Distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein Distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
For implementing our Spelling Correction System, we will need to use the Levenshtein Distance. So, we implement it in Java as follows:
We can test the execution of our Levenshtein Distance implementation and see the value of the distance between the words “Robert” and “Rupert” for example:
Creating A Dictionary Of Homophones
The next step is to create a dictionary of homophones indexed by the Soundex code of words. To do this, we first need a dictionary of the most common English words. There are a whole bunch of them on the Internet that can be freely downloaded.
To make it simple, I use a dict.txt file containing words without accents only. Each line of the file contains a word which gives us a file with the following form:
You can download the dict.txt file just there: https://gist.github.com/ssaurel/a3c52eb19c083f78730ee496c8228f91
In order to create the dictionary of homophones, we will read each line of the dict.txt file and then calculate the Soundex code of the word. Then, we will add this word in a Map that can contain multiple values for each key.
This type of map is available in different code libraries such as Google’s Guava or Apache Commons for example. In my case, I use a custom implementation called MultiMap:
The creation of our dictionary of homophones indexed by the Soundex code of each word in our dict.txt file is therefore done as follows:
Searching For Matching Words
Now we will be able to finalize the building of our Spelling Correction System. As an entry, we will take a misspelled word as an entry to correct.
Then, we will search for the homophones of this word by calculating its Soundex code. Finally, to prun and obtain relevant results, we will only keep homophones with a Levenshtein Distance strictly less than 3 with our word to correct.
This threshold of 3 is purely arbitrary and can be lower or higher depending on the number of results you want to obtain. Finally, for proposing better results to the user, we sort the words got by descending Levenshtein Distance.
It gives us the following complete code for the Main class of our Spelling Correction System:
Our Spelling Correction System In Action
Best part of the tutorial is there since we are going to put our Spelling Correction System in action. The word mispelled is “revoluton”.
When we run our program, we get the following matching words to propose to the user:
So, our Spelling Correction System indicates us that the best result to fix the word “revoluton” is “revolution” which is a great result.
Feel free to use comments if you have some questions concerning this tutorial.