Evaluating legal words in three open source Java spell checkers: Hunspell, Basic Suggester, and Jazzy

While doing some data analysis on our database of bankruptcy cases, we noticed misspelled words in some of the cases (see the header image). We want our machine learning algorithms to evaluate all cases in order to empower our collaborators (they help our system learn) with the best possible answers. However, if there is a misspelling in a case, there is a chance that passages of that case will not even be considered as an answer. So, we decided to evaluate different spell checkers and run the best of them against our bankruptcy corpus to find those mistakes and teach our AI algorithms to deal with them.

The Stack Overflow community provides some suggestions about good Java spell checkers here and here. There is also a blog post listing different commercial and free spell checkers available for Java. Unfortunately, those suggestions come with little supporting evidence or experimentation. So, we decided to run our own evaluation and post the results here. Even though the assessment is focused on legal terms, it should give you a good idea of how the evaluated spell checkers perform. After this first phase, we applied some other techniques and tuned the spell checkers to improve their performance. However, those steps will not be covered here. Maybe in a later blog post.

Initially, we came up with a list of four free (JOrtho, Hunspell, Basic Suggester, and Jazzy) and one commercial (JSpell) spell checkers to evaluate. However, JOrtho and JSpell are designed as spell checkers for Java user interfaces (Java Swing) and do not provide developer-friendly access when used as an API. Consequently, Hunspell, Basic Suggester, and Jazzy were selected for our experiments.


Here are more details about our selected candidates:

Hunspell

“Written in C++ but available for Java using JNA wrappers, Hunspell is the spell checker of LibreOffice, OpenOffice, Mozilla Firefox 3, Thunderbird, Google Chrome, and it is also used by proprietary software packages, like Mac OS X, InDesign, memoQ, Opera and SDL Trados.”
last update: March 22, 2016 (Maven repository)
version: 1.0.3
dictionary: download here

Try out Hunspell using the groovyConsole:

@Grab(group='com.atlascopco', module='hunspell-bridj', version='1.0.3')
import com.atlascopco.hunspell.Hunspell
String misspelled = "banrupcty"
String hunspellDictionaryPath = "/Users/pargles/Downloads/en_US/en_US.dic" // .dic word list
String hunspellAffixPath = "/Users/pargles/Downloads/en_US/en_US.aff"      // .aff affix rules
Hunspell hunspell = new Hunspell(hunspellDictionaryPath, hunspellAffixPath)
println hunspell.spell(misspelled)   // false: the word is not in the dictionary
println hunspell.suggest(misspelled) // list of suggested corrections

Basic Suggester

It claims to be one of the best open source spell checkers. Its web page provides a comparison with Jazzy, Dictionary.com, Microsoft Word 2000, and the Google spell checker.
last update: August 17, 2013
version: 1.1.2
dictionary: about 200,000 words (available in the suggester-basic.zip file)

Try out Basic Suggester using the groovyConsole (you have to add suggester-1.1.2.jar to your script's classpath. Select: Script > Add Jar(s) to ClassPath):

import com.softcorporation.suggester.BasicSuggester
import com.softcorporation.suggester.Suggestion
import com.softcorporation.suggester.dictionary.BasicDictionary
String misspelled = "banrupcty"
String suggesterDictionaryPath = "file:///Users/pargles/Downloads/suggester-basic/dic/english.jar"
BasicSuggester suggester = new BasicSuggester()
BasicDictionary suggesterDictionary = new BasicDictionary(suggesterDictionaryPath)
suggester.attach(suggesterDictionary)
println suggester.hasExactWord(misspelled)      // false: the word is not in the dictionary
println suggester.getSuggestions(misspelled, 5) // up to 5 ranked suggestions

Jazzy

“Jazzy is a 100% pure Java library implementing a spell checking algorithm similar to GNU Aspell.”
last update: November 23, 2005 (Maven repository)
version: 0.5.2
dictionary: download here

Try out Jazzy using the groovyConsole:

@Grab(group='net.sf.jazzy', module='jazzy-core', version='0.5.2')
import com.swabunga.spell.engine.SpellDictionaryHashMap
import com.swabunga.spell.event.SpellChecker
String misspelled = "banrupcty"
File jazzyDictionaryFile = new File("/Users/pargles/Downloads/english.0.txt")
SpellDictionaryHashMap jazzyDictionary = new SpellDictionaryHashMap(jazzyDictionaryFile)
SpellChecker jazzy = new SpellChecker(jazzyDictionary)
println jazzy.isCorrect(misspelled)         // false: the word is not in the dictionary
println jazzy.getSuggestions(misspelled, 5) // list of suggested corrections

Expectations and first impressions:

Hunspell seems to be very popular among well-known applications such as OpenOffice/LibreOffice, Firefox, and Chrome. So, I am expecting pretty good results from this one, probably first place.

Basic Suggester (from now on just Suggester) claims better results than Jazzy on its web page, but there is no comparison with Hunspell. Are they hiding something? Or is Basic Suggester just a humble name? I am expecting second place for this one.

Jazzy, as mentioned above, was already beaten by Basic Suggester. But was that evaluation false advertising, focusing only on Jazzy's weak points?


While grammar checkers consider the context of the sentence to check its correctness, spell checkers only look at one word at a time. In a future post, we will evaluate different grammar checkers. But for this task, we only need spell checkers.

In a nutshell, spell checkers are pretty simple. They contain a dictionary of words, and if a typed word cannot be found in the dictionary, a list of suggestions is provided, ordered by each suggestion's "distance" to the misspelled word. There are different algorithms to calculate this distance, but one of the most famous is the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into the other. It can be normalized into a similarity percentage, where the higher the percentage, the closer the two words are. For instance, the similarity between banruptcy and bankruptcy is 90% (one edit), while the similarity between banruptcy and bankrupt is 67% (three edits).
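To make those percentages concrete, here is a minimal sketch in Java of the classic dynamic-programming Levenshtein distance, normalized into a similarity percentage (the class and method names are just for illustration, not any library's API):

```java
public class Levenshtein {
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn s into t.
    static int distance(String s, String t) {
        int[][] dp = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) dp[i][0] = i; // delete all of s
        for (int j = 0; j <= t.length(); j++) dp[0][j] = j; // insert all of t
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1,   // deletion
                                             dp[i][j - 1] + 1),  // insertion
                                    dp[i - 1][j - 1] + cost);    // substitution
            }
        }
        return dp[s.length()][t.length()];
    }

    // Normalize the edit distance into a similarity percentage.
    static long similarityPercent(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        return Math.round(100.0 * (longest - distance(s, t)) / longest);
    }

    public static void main(String[] args) {
        System.out.println(similarityPercent("banruptcy", "bankruptcy")); // 90
        System.out.println(similarityPercent("banruptcy", "bankrupt"));   // 67
    }
}
```

Running this reproduces the two numbers above: one edit out of ten characters gives 90%, three edits out of nine give 67%.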

So, perhaps Hunspell, Basic Suggester, and Jazzy are using different dictionaries and different algorithms to calculate the distance between suggestions. The question, then, is: how can we evaluate their performance? How can we find which one performs better on legal words?

There are different papers proposing different techniques to evaluate spell checkers (here and here). However, to keep things as simple as possible, we will apply a methodology similar to TEMAA (A Testbed Study of Evaluation Methodologies: Authoring Aids), which is based on the ISO 9126 specifications.

In order to find out how Hunspell, Basic Suggester, and Jazzy perform on legal terms, and following TEMAA-like standards, two lists of legal words will be necessary:

Base list: a list of valid words. What better source than a legal dictionary? I hope there are no spelling mistakes in it :P. This list contains 13,054 unique words from Black's Law Free Online Dictionary, 2nd Edition (noun phrases and expressions are not included because, as mentioned before, spell checkers analyze one word at a time).

Error list: a list of corrupted words. For this one, a book called The Legal Dictionary for Bad Spellers will be used. According to the book:

“It provides an extensive list of words that are typically misspelled”
“The corpus of the dictionary, …, has been collected by the authors over years among legal-brief writers and proofreaders”

So, assuming the misspellings are in fact misspelled (we did a quick review), there are a total of 10,979 unique misspelled legal words in this list.

Using the base list and the error list, we will be able to extract the following three metrics:

Recall: This metric evaluates the word coverage, or completeness, of a spell checker as a percentage. In summary, the base list (100% valid words) is checked with the spell checker; since all the words are correct, the spell checker should not flag any of them as misspelled. The higher the recall, the better the spell checker. The impact of a low recall is annoying suggestions for words that are not wrong, also known as false positives or false alarms.

Precision: This metric evaluates how well a spell checker finds misspelled words. It is similar to recall, but this time the error list (100% invalid words) is checked; the spell checker should detect a spelling error in every word of this list. The higher the precision, the better the spell checker. The consequence of poor precision is a spell checker that does not tell you there is a spelling mistake and leaves your text with errors.

Correctness: This metric evaluates whether a spell checker can provide the correct word in its list of suggestions for a misspelled word. As with recall and precision, the higher the correctness, the better the spell checker. Even supposing a spell checker has a perfect precision of 100% (it can find all misspelled words), it is still necessary to evaluate whether the correct word is in the list of suggestions and, depending on your limit on the number of suggestions, whether the correct word appears within that limit. For this experiment, the first 5 suggestions will be checked, and the correctness metric will be the sum of the percentages across those suggestion positions. The effect of a low correctness is a spell checker with useless suggestions and the need to search the internet for the correct spelling.
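The three metrics above can be sketched in Java against a generic checker interface. Note that the Checker interface and its method names below are hypothetical placeholders standing in for whichever library is under test, not the actual API of Hunspell, Suggester, or Jazzy:

```java
import java.util.List;
import java.util.Map;

public class Metrics {
    // Hypothetical stand-in for any of the three spell checkers.
    interface Checker {
        boolean isCorrect(String word);
        List<String> suggest(String word, int max);
    }

    // Recall: share of valid base-list words the checker accepts.
    static double recall(Checker c, List<String> baseList) {
        long accepted = baseList.stream().filter(c::isCorrect).count();
        return 100.0 * accepted / baseList.size();
    }

    // Precision: share of error-list words the checker flags as misspelled.
    static double precision(Checker c, List<String> errorList) {
        long flagged = errorList.stream().filter(w -> !c.isCorrect(w)).count();
        return 100.0 * flagged / errorList.size();
    }

    // Correctness: share of misspellings whose expected correction
    // appears among the first maxSuggestions suggestions.
    static double correctness(Checker c, Map<String, String> corrections,
                              int maxSuggestions) {
        long hits = corrections.entrySet().stream()
                .filter(e -> c.suggest(e.getKey(), maxSuggestions)
                              .contains(e.getValue()))
                .count();
        return 100.0 * hits / corrections.size();
    }
}
```

The corrections map pairs each misspelled word from the error list with its intended spelling, which is what the book conveniently provides.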


Here comes the most exciting part: the results.

The first bar chart shows the recall over the base list (13,054 correct legal words). As you can see, the best recall (58%) was achieved by Suggester, closely followed by Hunspell (56%) and well ahead of Jazzy (39%). Many legal words are clearly missing from their dictionaries, and consequently their recall is considerably low. This can be fixed by adding the missing legal words to their dictionaries.

The next bar chart shows the precision over the error list (10,979 misspelled legal words). The precision results are impressive for all spell checkers. However, do not be misled by these numbers. As briefly explained at the beginning of this post, if a word is not contained in a spell checker's dictionary, the word is considered misspelled. The recall chart illustrates the huge lack of legal words in Jazzy's dictionary; consequently, it will consider most words misspelled. Let's take a look at the correctness to get a better idea of how good Hunspell, Suggester, and Jazzy really are.

To evaluate the correctness and avoid discrepancies between spell checkers, we took, from the original list of 10,979 unique misspelled legal words, the 10,829 words that were considered misspelled by all three spell checkers.
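That filtering step is just a set intersection over the words each checker flagged. A minimal sketch in Java (the class name and data shape are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Intersection {
    // Keep only the words every checker flagged as misspelled, so the
    // correctness comparison runs on identical inputs for all three.
    static Set<String> commonMisspellings(List<Set<String>> flaggedByChecker) {
        Set<String> common = new HashSet<>(flaggedByChecker.get(0));
        for (Set<String> flagged : flaggedByChecker.subList(1, flaggedByChecker.size())) {
            common.retainAll(flagged); // drop anything this checker did not flag
        }
        return common;
    }
}
```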

As you can see in the pie charts, even though Jazzy achieved almost perfect precision in the previous bar chart, it performed poorly in terms of correctness over the first 5 suggestions (~53%). Suggester achieved an impressive 84% correctness, followed by Hunspell (80%). These numbers could be even more impressive with better coverage of legal words in their dictionaries.

To give a better idea of the difference in correctness among the spell checkers, the next three images show the first 24 misspelled words and the list of suggestions from each spell checker.

While the top left image shows Hunspell's first 24 suggestions, the top right image shows Suggester's. As you noticed in the pie charts, Suggester more often gets the correct option in first place, whereas Hunspell finds more correct words close to the end of its list of suggestions (e.g., occurrence, adequate, and additive). This result could be related to the algorithms they use to calculate the distance between words.

And the last image, on the left, shows Jazzy's list of suggestions. As you can see, there is rarely more than one suggestion per misspelling. Compared to Hunspell and Suggester, Jazzy seems like a binary spell checker: either it gets it right or it does not. Furthermore, Jazzy's dictionary does not seem to cover as many words as Hunspell's and Suggester's. For instance, while the first two could not find accusant, adiratus, and additur, Jazzy also missed accrual, actionable, accusatorial, and additive.

That's it for now; draw your own conclusions. In a later, less technical post, I will share our findings at ROSS Intelligence.

Thanks for reading :), and please leave your comments, suggestions, or questions!