Rethinking spelling correction in the 21st century
I implemented my first spelling corrector years ago based on Peter Norvig’s excellent tutorial — a spelling corrector in 21 lines of Python code.
And it sucked.
So I tried to fix it. I piled on double metaphone phonetic similarity, unicode support, multi-word expressions, weighted Damerau-Levenshtein edit-distance, efficient Trie and smart caching.
And it still sucked.
And the reason it failed to go beyond a simple toy is very simple — I am not Google (and neither are you).
Even in its simplest form, spelling a short word took a long time — say ~0.1 seconds. That might be ok for some uses, but when you are dealing with real-time chat, faced with the upcoming bot gold rush, interactivity reigns supreme. And don’t forget that spelling is only a single chain in the long processing pipeline of delivering value via Natural Language Understanding, dialog management, business logic and of course the actual application.
The root cause of the abysmal performance is that the speller is trying to brute-force its way to the right solution. Here is the core of Norvig’s code:
The function looks at every possible edit to the input — a deletion of any character, a transposition of any 2 adjacent characters, replacing any character in the input with a random character or simply inserting a random character. And for each of the results in the set of edited strings — it calculates every possible edit again!
The result is a challenging amount of computation that needs to happen, and which grows exponentially with respect to the length of the input string.
IANA-Neuroscientist but I’m pretty confident our brain does not spell correct this way. We look at jumbled words and we automatically compensate for the noise.
Can yu read this massage despite thehorible sppeling msitakes?
I bet you kan.
Sometimes this process happens subconsciously and so intuitively that we miss the fact that there is a mistake in the text at all.
Do you think our brain implements edit-3 distance functions?
There must be a better way than brute force.
There must be a way for a computer to learn this “intuition”.
So I tried a different approach. It’s called Deep Learning — a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers composed of multiple non-linear transformations (I’m quoting Wikipedia here, sorry…).
Deep learning was popularized recently when an AI named AlphaGo defeated the top Go player in the world in 4 out of 5 games.
The specifics of my approach are less important as I think many different Deep Learning architectures can achieve similar results due to the fact that they are all based on the same premise:
- Build an artificial network of sufficient “computational power” (artificial neurons, connectivity)
- Feed it a lot of data until it figures out what to do with it.
Nonetheless, here are some technical details:
- Open source tools: Python / Theano / Keras
- Did not use Google’s TensorFlow as it is a pain to install on EC2 and is slower(?!)
(Funny thing — Google is marking “TensorFlow”, the term it coined, as a spelling mistake in my Chrome. The irony).
- I used a sequence-to-sequence architecture as the input size can be different than the output size. The input size can also vary greatly.
This is very similar to models used for Machine Translation tasks.
- Character based; I preprocessed very little and kept the 75 most popular characters.
Yes, this means digits and some punctuation as well.
Yes, the embedded “knowledge” of the system is basically a character based language model. I don’t see a reason to tokenize the input (break it up into words) adding noise to the process and applying “feature engineering”. IANA-Linguist either and my model learns features much better than I ever could design them.
Besides — How would you tokenize “Whereis th elove”?
There is a reason the space bar is so much larger than other keys on most keyboards — people tend to misplace it.
I did reverse the order of the input because it generates many short term dependencies between the “question” and the target “answer” which made bootstrapping the optimization problem easier.
- Long Short Term Memory layers (2 for the encoder and 2 for the decoder) with plenty of Dropout (0.3).
- Gaussian initialization scaled by fan-in (He et al.)
- Adam, stochastic optimization.
- My objective was Categorical Cross-entropy: Also known as multiclass logloss.
- Trained on the Amazon cloud (EC2 g2.2xlarge GPU instances).
No, I do not have multiple 32 Tesla K40 GPU setups for 3 weeks.
- It took 12 hours to achieve 90% accuracy and 3 more days to reach 95.5% on the validation set.
Total cost of training = $40.
Artificial Neural Networks are VERY loosely based on biological neurons in our brains and I am not trying to imply that this algorithm is in any way related to how our brain actually works.
You might be asking where the data came from. Data is key for spelling (and for deep learning in general) and if you are Google and have in your database thousands of common misspellings of “Britney Spears” and “Arnold Schwarzenegger” you can leverage this treasure to implement very effective algorithms.
But if you are a startup you need to cleverly and creatively improvise.
I used a billion word dataset that Google released for Language Modeling research, and injected artificial noise. The noise is the simulated spelling mistakes and the model tries to learn how to correct the input by comparing the output to the original text — an Autoencoder.
We also already have several millions of logs at Evature which I plan to use for domain adaptation.
Here are some examples of input and output:
All the examples look like this:
Q The messsed up text
A The original text
☑ The output of the algorithm, given the 1st line as input
The algorithm quickly learned the identity function:
Q and it can render further applications
A and it can render further applications
☑ and it can render further applications
But then it also learned to remove added noise:
Q he had dated forI much of the past
A he had dated for much of the past
☑ he had dated for much of the past
To replace erroneous characters:
Q Since then, the bigjest players in
A Since then, the biggest players in
☑ Since then, the biggest players in
Fix transposed characters:
Q strong summer film slatew ith plenty
A strong summer film slate with plenty
☑ strong summer film slate with plenty
And re-insert missing characters:
Q in te third quarter of last year,
T in the third quarter of last year,
☑ in the third quarter of last year,
Some of the errors are not really errors:
Q In addition to personal-injury and
A In addition to personal-injury and
☒ In addition to personal injury and
Some might say this is even better than the original:
Q had learned of Ca secret plan y Iran
T had learned of a secret plan by Iran
☒ had learned of a secret plan I ran
Yet some errors are even introduced by the algorithm…
Q post-Thanksgiving performances, but
A post-Thanksgiving performances, but
☒ post-thanks gving performances, but
Here is the heart of the algorithm in 8 lines of code.
Apples and oranges, of course. This is high level Keras code and not pure Python.
model = Sequential()
model.add(recurrent.LSTM(HIDDEN_SIZE, input_shape=(None, len(chars)), init="he_normal", dropout_W=DROPOUT_W, dropout_U=DROPOUT_U))
for _ in range(LAYERS):
model.add(recurrent.LSTM(HIDDEN_SIZE, return_sequences=True, init=INITIALIZATION), dropout_W=DROPOUT_W, dropout_U=DROPOUT_U)
- Continue training
- Optimize hyperparameters, maybe using Spearmint
- Add “attention” to the model
- Add more Regularizers.
- Publish code to Github.
- Publish IPython Notebook.
- Expand on the theory.
- Adapt to other domains — Maybe try the Reddit corpus released a few months ago.
- Try to extract real world spelling mistakes from Wikipedia as suggested by this awesome Stack Overflow answer.
- Use smoothed version of confusion matrix to generate the noise, such as the ones from Probability Scoring for Spelling Correction by Church and Gale.
From the life of a Machine Learning engineer:
6:10 am — My phone sounds an annoying alarm and it is time to get up and prepare for the new day (yeah, kids, yay).
But before I get out of bed I SSH to the server and Tmux A to see the latest numbers. Every morning is like christmas as performance improves.
- This post and task is about Engineering, not science. That’s where the state of the art is right now. The technology works and it is up to us, the lowly engineers to put is to good use.
- You might be wondering about Language support. Nothing about this task used any prior knowledge about the input language, so the algorithm should work, as is, for any language.
- I got some good results on getting from noisy characters to spelling correction but there is no reason to stop there. I see no reason why you cannot continue to higher level tasks such as Named Entity Recognition as demonstrated in one of my favorite papers Natural Language Processing (almost) from Scratch, only using noisy character streams as the input and having the model magically “handle” the spelling mistakes automatically.