Spell check and correction [NLP, Python]

Yash Jain
Feb 19, 2022

In Natural Language Processing, it’s important to keep spelling errors to a minimum so that whatever we build on top of the text is as accurate as possible. Instead of doing all the checking and correction yourself, you can use libraries that handle this tedious task.

We’ll use Levenshtein distance, Hamming distance and Needleman-Wunsch to check the accuracy of the output.
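To make the first two measures concrete, here is a minimal pure-Python sketch. The function names and the normalisation to a 0..1 similarity are my own assumptions for illustration, not necessarily how the scores in this article were computed. Hamming distance is classically defined only for strings of equal length, so the sketch assumes that extra trailing characters count as mismatches.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; the length difference counts as mismatches (assumption)."""
    mismatches = sum(ca != cb for ca, cb in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def similarity(distance: int, a: str, b: str) -> float:
    """Turn a distance into a 0..1 score, where 1.0 means identical (illustrative normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - distance / longest


print(levenshtein("wherre", "where"), hamming("wherre", "where"))
```

Note how a single extra letter costs 1 under Levenshtein but shifts every later character under Hamming, which is why the two measures can rank the same output differently.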

Where and why to use

  • Typing and spelling errors are common in conversations with chatbots and make it harder to understand context; this is where spell correction comes in handy.

  • OCR post-processing: so far no OCR engine gives 100% accurate results, and some misspellings always slip through.

  • Fuzzy search and approximate string matching are another area where spell checking and correction can be used.

There are many more applications.

Libraries we will be using:

  1. Jamspell (pip install jamspell) is a modern spell-checking library. It is lightweight, fast and accurate, and has the following features:
    It considers the surrounding words (context) for better correction
    Nearly 5K words per second
    Multi-language: it’s written in C++ and available for many languages through SWIG bindings
  2. Symspellpy (pip install symspellpy) is a Python port of SymSpell. The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent. An average 5-letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time.
  3. Textblob (pip install textblob): TextBlob’s spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library.
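The "only 25 deletes" figure for SymSpell is easy to check by hand: a 5-letter word has 5 + 10 + 10 ways to drop one, two or three characters. The sketch below is only an illustration of the delete-generation idea, not symspellpy’s actual implementation; the helper name deletes_within is hypothetical.

```python
from itertools import combinations


def deletes_within(word: str, max_distance: int) -> set:
    """All distinct strings obtained by deleting 1..max_distance characters from word."""
    variants = set()
    for k in range(1, min(max_distance, len(word)) + 1):
        # Choose which character positions to keep, then join them in order
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


# A 5-letter word with distinct letters yields 5 + 10 + 10 = 25 delete variants
print(len(deletes_within("words", 3)))

# A misspelling and the dictionary word meet on shared delete variants,
# so no inserts, replaces or transposes ever need to be generated
print(deletes_within("wrods", 1) & deletes_within("words", 1))
```

The second print shows the core trick: the deletes of the dictionary word are pre-computed once, and at lookup time only the deletes of the input need to be generated and matched.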

Let’s take some sample paragraphs and introduce some spelling errors.

Error-induced paragraphs:

para_1:

para_1 = "wherre is the love hehad dated forImuch of the past who couqdn’tread in sixthgrade and ins pired him"

para_2:

para_2 = """As far as I am abl to judg, after long attnding to the sbject, the condiions of lfe apear to act in two ways—directly on the whle organsaton or on certin parts alne and indirectly by afcting the reproducte sstem. Wit respct to te dirct action, we mst bea in mid tht in every cse, as Profesor Weismann hs latly insistd, and as I have inidently shwn in my wrk on "Variatin undr Domesticcation," thcere arae two factrs: namly, the natre of the orgnism and the natture of the condiions. The frmer sems to be much th mre importannt; foor nealy siimilar variations sometimes aris under, as far as we cn juddge, disimilar conditios; annd, on te oter hannd, disssimilar variatioons arise undder conditions which aappear to be nnearly uniiform. The efffects on tthe offspring arre ieither definnite or in definite. They maay be considdered as definnite whhen allc or neearly all thhe ofefspring off inadividuals exnposed tco ceertain conditionas duriing seveal ggenerations aree moodified in te saame maner."""

para_3:

para_3 = """Cinderella came frm a grea family. She is the only daughter of an affluent and widowrr duke who has rewed to provide her witha stepmom and two stepsistrs. Cinderella’s mother died due to illness when she was stilll a younng girl, leawing her with a doll, faworite dress, and a pair of glasss slipppers."""
  • Jamspell
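# Download and unpack the pre-trained English model (notebook shell commands)
!wget https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz
!tar -xvf en.tar.gz

import jamspell

jsp = jamspell.TSpellCorrector()
# Stop early if the language model could not be loaded
assert jsp.LoadLangModel('en.bin')

# FixFragment returns the corrected version of the whole text
print(jsp.FixFragment(para_1))
print(jsp.FixFragment(para_2))
print(jsp.FixFragment(para_3))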

The output for all three paragraphs, the correct text and the similarity metrics are given below.

Note: You might run into problems installing and running jamspell, so I have made a Docker container that exposes the jamspell package as an API, which you can run on your local machine. You can find instructions here on how to run the Docker image and make a request.

  • Symspellpy

Here I have loaded freq_dictionay_symspellpy.txt, which is used as the corpus of words. The data has two columns separated by a space: the first column is the word and the second column is the frequency of that word.

freq_dictionay_symspellpy.txt

You can use your own corpus. The corpus used in the code can be found here.
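If you want to build your own corpus file, a rough sketch could look like the one below. It assumes plain English text, a simple lowercase alphabetic tokenizer and a made-up output file name; the function build_frequency_dictionary is hypothetical and only produces the "word frequency" layout described above.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def build_frequency_dictionary(text: str, out_path: Path) -> Counter:
    """Count lowercase alphabetic words and write one 'word count' line per word."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Most frequent words first, word and count separated by a single space
    lines = [f"{word} {count}" for word, count in counts.most_common()]
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return counts


# Toy text standing in for a real corpus (hypothetical example data)
sample_text = "The cat sat on the mat and the dog sat too"
dictionary_path = Path(tempfile.gettempdir()) / "my_freq_dictionary.txt"
counts = build_frequency_dictionary(sample_text, dictionary_path)
print(dictionary_path.read_text(encoding="utf-8").splitlines()[:2])
```

A file written this way has the word in column 0 and the count in column 1, so it should be loadable with term_index=0, count_index=1 and separator=' '. A real corpus needs to be far larger than this toy text for the frequencies to be useful.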

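from symspellpy import SymSpell

symsp = SymSpell()
# Each line of the file is "<word> <frequency>"
symsp.load_dictionary(
    'freq_dictionay_symspellpy.txt',
    term_index=0,    # column holding the word
    count_index=1,   # column holding its frequency
    separator=' ',
)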

Now that we have loaded our corpus of correct words into symsp, let’s try correcting the misspelled words.

# max_edit_distance is the maximum number of character edits allowed
# between an input word and its suggested correction, i.e. how many
# wrong characters per word it can tolerate
for para in (para_1, para_2, para_3):
    terms = symsp.lookup_compound(para, max_edit_distance=2)
    print(terms[0].term)

The output for all three paragraphs, the correct text and the similarity metrics are given below.

  • TextBlob
from textblob import TextBlob

print(str(TextBlob(para_1).correct()))
print(str(TextBlob(para_2).correct()))
print(str(TextBlob(para_3).correct()))

The output for all three paragraphs, the correct text and the similarity metrics are given below.
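Since TextBlob’s corrector follows Peter Norvig’s approach, it helps to see the idea in miniature. The sketch below is a simplified version, assuming only edit distance 1 and a tiny made-up word-count table; it is not TextBlob’s or pattern’s actual code, and WORD_COUNTS is hypothetical data.

```python
import string
from collections import Counter

# Toy frequency table standing in for a real corpus (hypothetical data)
WORD_COUNTS = Counter({"spelling": 10, "spewing": 2, "the": 50, "love": 8})


def edits1(word: str) -> set:
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words) -> set:
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word: str) -> str:
    """Prefer the word itself, then known words one edit away, else leave it unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correct("speling"), correct("lvoe"))
```

Both "spelling" and "spewing" are one edit away from "speling", and the more frequent word wins. This is also why the approach has no notion of context: each word is corrected on its own.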

We built the matrix below: the cells highlighted in yellow are the highest scores (higher is better) among symspell, jamspell and textblob for para1, para2 and para3, measured with three different similarity measures: Damerau–Levenshtein, Hamming distance and Needleman-Wunsch. It is only a rough estimate; you can check the string output below.
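For readers who want to reproduce this kind of comparison, here is a rough sketch of a Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) are assumptions, and the candidate strings are made up for illustration; they are not the real library outputs.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score between a and b (scoring values are assumptions)."""
    previous = [j * gap for j in range(len(b) + 1)]  # aligning "" with prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i * gap]  # aligning a[:i] with ""
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            current.append(max(
                diagonal,             # align ca with cb
                previous[j] + gap,    # ca aligned with a gap
                current[j - 1] + gap  # cb aligned with a gap
            ))
        previous = current
    return previous[-1]


# Made-up strings for illustration only, not real library output
reference = "where is the love"
candidates = {
    "candidate_a": "where is the love",
    "candidate_b": "were is the love",
}
scores = {name: needleman_wunsch_score(reference, text)
          for name, text in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Higher is better here, so the highlighted cell for a paragraph would be the candidate with the largest score against the correct text.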

We have looked at three different spell correction libraries, and every output still contains some erroneous text. The metrics give a basis for comparison, and there is little difference between the packages’ outputs. You might have to experiment and go through the algorithm behind each package to pick the library that suits your needs.

Yash Jain

Data Scientist/ Data Engineer at IBM | Alumnus of @niituniversity | Natural Language Processing | Pronouns: He, Him, His