Sheldon: The Erlang spell checker

“I’m not crazy. My mother had me tested.”

Sheldon is a very simple Erlang spell checker that suggests correct words when some word is misspelled. It was created by for Inaka.

The first time I heard about Sheldon from was when we discussed ideas for creating rebar plugins. After doing some research on the project, I was impressed by the project idea. Nevertheless, several problems needed to be fixed/improved before writing a plugin for this excellent project. And thus, I started to dig deeper into Sheldon to try and improve it.

What is Sheldon?

First of all, let me show you what Sheldon can do…

1> sheldon:start().
ok
2> sheldon:check("I want to check this correct text").
ok
3> sheldon:check("I want to check this misspeled text").
#{bazinga => <<"That's no reason to cry. One cries because one is sad. For example, I cry because others are stupid, and that ma"...>>,
misspelled_words => [#{candidates => ["misspeed","misspelled"],
line_number => 1,
word => "misspeled"}]}

So… How does it work? In its first versions, Sheldon used Peter Norvig’s algorithm. This algorithm works pretty well. However, we later detected that it had one big issue: for long words (the ones including, for instance, 45 chars), it generates too many possible alternatives (11553868 in the case of 45 chars). The processing time used to compute that list may take, on average, about 9 seconds. Naturally, it is quite costly to perform such a check each time. So I decided to optimize that and, after some code adjustments, it was possible to achieve a speed of about one second for large words. For reference, other algorithms can handle 2K words in about two seconds.

Of course, with this speed, integrating Sheldon into some other project was not a very good idea. Because of this, I started trying to find another way. After some googling, I found these interesting blog articles: SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking and Исправляем опечатки с учётом контекста (in Russian). In them, the authors provide a lot of useful info.

The SymSpell algorithm is basically a collection of some other algorithms but what caught my attention about it was the indexed typos model. The idea is a pretty simple one: generate a typo model by removing each char with 1 or 2 char spaces and linking it with the correct word. Let’s say we have the correct word “text”. In this case, the indexes of this word will be txt, tet, tex, ext. Now, let’s try to check the incorrect term “tedt”. The indexes of this word will be tdt, tet, ted, edt. If we compare those two lists of indexes, we can see that tet belongs to both. Since we know that text is a correct word, that means that it should be one of our candidates.

Improving Sheldon performance with SymSpell

My first step to improve Sheldon was to replace Peter Norvig’s algorithm with a SymSpell-like one. Let’s check its impact on performance…

%% Peter Norvig algorithm for only one big incorrect word
1> LongWord = base64:encode(crypto:strong_rand_bytes(45)).
2> timer:tc(sheldon, check, [LongWord]).
{8693825, %% More than 8.5 seconds
#{bazinga => <<"I know the real reason you "...>>,
misspelled_words => [
#{candidates => [],
line_number => 1,
word =>"AA9VaG90nSWUwdEQYs+lqTBZRU1vBaxjK7iU/1..."}]}}
%% SymSpell-like algorithm checking 2k incorrect words
1> LongWord = base64:encode(crypto:strong_rand_bytes(45)).
2> timer:tc(fun() ->
[begin
{T, _} = timer:tc(sheldon, check, [LongWord]),
T
end || _ <- lists:seq(1, 2000)] end).
{1109351, %% Approximately 1.1 seconds
[2955,2360,2285,2378,2998,2711,2743,2411,2547,2756,2681,
2373,1683,2393,2229,1551,1532,1149,799,1541,1326,1130,877,
1142,834,737,802|...]}

The SymSpell-like algorithm showed a speed of 1,109351 sec. for 2k words. Really really fast, particularly when compared to the original algorithm used in Sheldon.

More Improvements

Now that we’ve reached that impressive speed, Sheldon was ready for integration into any project. But still, Sheldon needed a few more configurations to be actually useful. In a series of pull requests, I added the ones that you can find below.

Starting with a way to ignore some words or some texts entirely:

%% The word "Sheldon" is not exist in default dictionary
1> sheldon:check("Sheldon doesn't know his name").
#{bazinga => <<"I am not crazy, my mother had me tested.">>,
misspelled_words =>
[#{candidates => ["Sheldon"],
line_number => 1,word => "Sheldon"}]
%% Ignore word in text
2> Config = #{ignore_words => ["Sheldon"]}.
3> ok = sheldon:check("Sheldon doesn't know his name", Config).
%% Ignore regular expression
4> Config2 = #{ignore_patterns => ["^begin"]}.
5> ok = sheldon:check("begindddd should be ignored", Config).

By default, Sheldon is a plain-text spellchecker (i.e., it doesn’t know anything about text formatting), but one of its great features is the possibility of extending text parsing with adapters. I didn’t add this functionality, but I think it’s super-cool and worth mentioning in this article. These adapters are needed, for instance, when you need spell check HTML or markdown text. The markdown adapter is already implemented and can be used like this:

%% Ignore blocks and use adapters
1> Config =
#{adapters => [markdown_adapter],
ignore_blocks =>
[#{close => "^end-block$",open => "^start-block$"},
#{close => "^end-block1$",open => "^start-block1$"}]}.
2> MarkdownText = "start-block\nsome text\nend-block$".
3> ok = sheldon:check(MarkdownText, Config).

To create your own adapter for custom manipulation of text, you need to implement the sheldon_adapter behavior:

-module(own_adapter).-behaviour(sheldon_adapter).-export([adapt/1]).-spec adapt(binary()) -> iodata().
adapt(Line) ->
custom_manipulation:text(Line).

By default, Sheldon uses its own English dictionary. Of course, not all kinds of words are added to this dictionary, and sometimes you need to expand the dictionary with your own set of words. Because of this reason, I decided to add a new config option to be used when you need to extend the default dictionary. Now you can use this configuration in your application:

[{sheldon, [
{additional_dictionary,
["path/to/additional_dictionary.txt",
"path/to/other_additional_dictionary.txt"]}]}].

It was not possible to replace the default dictionary with a custom one in previous versions, either. I created a new config option to set a custom dictionary as the default. Now you can replace the default dictionary entirely using the following configuration:

[{sheldon, [
{default_dictionary, "path/to/your/default_dictionary.txt"}]}].

Of course, you can combine both:

[{sheldon, [
{default_dictionary, "path/to/your/default_dictionary.txt"},
{additional_dictionary,
["path/to/additional_dictionary.txt",
"path/to/other_additional_dictionary.txt"]}]}].

Future Development

Naturally, the current implementation of Sheldon still requires more love, and, of course, Sheldon will get it.

Among the future improvements, it’s worth noting that Sheldon is focused on English only, but we can add support for Unicode dictionaries in the future.

Also, together with , we created a rebar3_sheldon plugin for spell checking strings, binaries, and comments in the Erlang source code of rebar3 projects. We’ll talk about this plugin in an article “Spell check your Erlang code with Sheldon”.

Besides that, , , I, and others still have many ideas in mind on how to extend, improve or use Sheldon. But we will need help implementing them. So, if you want to join us in this effort, you can find us in the #sheldon room of the Erlang Slack. Give us a ping there, and you can become part of this fantastic project.

References