A machine learning model to understand fancy abbreviations, trained on Tolkien
Recently I bumped into a question on Stackoverflow, how to recover phrases from abbreviations, e.g. turn “wtrbtl” into “water bottle”, and “bsktball” into “basketball”. The question had an additional complication: lack of comprehensive list of words. That means, we need an algorithm able to invent new likely words.
I was intrigued and started researching, which algorithms and math lie behind modern spell-checkers. It turned out that a good spell-checker can be made with an n-gram language model, a model of word distortions, and a greedy beam search algorithm. The whole construction is called a noisy channel model.
With this knowledge and Python, I wrote a model from scratch. After training on “The Fellowship of the Ring” text, it was able to recognize abbreviations of modern sports terms.
Spell checkers are widely used: from your phone’s keyboard to search engines and voice assistants. It’s not easy to make a good spell checker, because it has to be really fast and universal (able to correct unseen words) at the same time. That’s why there is so much science in spell checkers. This article is aimed to give idea of this science and just to make fun.
The math behind a spell checker
In the noisy channel model, each abbreviation is viewed as the result of a random distortion of the original phrase.
To recover the original phrase, we need to answer two questions: which original phrases are likely, and which distortions are likely?
By the Bayes theorem,
Here distortion is any change of the original phrase, which turns it into the observable abbreviation. The “~”symbol means “proportional”, because LHS is a probability distribution, but RHS is generally not.
Both original phrase likelihood and distortion likelihood can be estimated with statistical models. I will use the simplest models — character-level n-grams. I could use more difficult models (e.g. recurrent neural networks), but it doesn’t change the principle.
With such models, we can reconstruct probable original phrases letter by letter, using a greedy directed search algorithm.
N-gram language model
An n-gram model looks at the previous n-1 letters and estimates the probability of the next (n’th) letter conditional on them. For example, probability of letter “g” appearing after “bowlin” sequence would be calculated by 4-gram model as p(g|bowling)=p(g|lin), because the model ignores all the characters before these 4, for the sake of simplicity. Conditional probabilities, such as this, are determined (“learned”) on a training corpus of texts. In my example,
Here #(ling) is number of occurrences of “ling” in the training text, #(lin•) is number of all 4-grams in the text, starting with “lin”.
In order to estimate correctly even the rare n-grams, I apply two tricks. First, for each counter I add a positive number δ=1. It guarantees that I will not divide by zero. Second, I use not only n-grams (which can occur rarely in the text), but also n-1 grams (more frequent), and so on, down to 1-grams (unconditional probabilities of letters). But I discount lesser-order counters with an α multiplier. Thus, in fact I calculate p(g|lin) as
For those who prefer implementation to theory, there is the Python code for my n-gram model and everything that follows.
The model of abbreviations
We needed the language model to understand which original phrases are probable. We need the model of abbreviations (or “distortions”) to understand, how the original phrases usually change.
I will assume that the only possible distortion is exclusion of some characters (including whitespace) from the phrase. The model may be modified to take into account other distortion types, e.g. replacements and permutations of characters.
I don’t have a large sample to train a complex distortion model on it. Therefore, for distortions I will use 1-grams. It means, the model will just remember for each character the probability of its exclusion from abbreviation. Nevertheless, I still code it as a general n-gram model, just in case.
Greedy search for the most probable phrase
Having models of language and distortions, theoretically we can estimate likelihood of any original phrase. But for this we need to loop over all the possible (original phrase, distortion) pairs. There are just too many of them: e.g. with 27 character alphabet there are 27^10 possible 10-letter phrases. We need a smarter algorithm to avoid this near-infinite looping.
We will exploit the fact that the models are single-character-based, and will construct the phrase letter by letter. We wil make a heap of incomplete candidate phrases, and evaluate likelihood of each. The best candidate will be extended with multiple possible one-letter continuations, and added to the heap. To cut the number of options, we will save only the “good enough” candidates. The complete candidates will be set aside, to be returned as a solution in the end. The procedure will be repeated unless either the heap or the maximum number of iterations runs out.
The quality of candidates will be evaluated as log-probability of the abbreviation, given that the original phrase begins with the candidate and ends as the abbreviation itself (because the candidate is incomplete). To manage the search, I introduce two parameters: “optimism” and “freedom”. The “optimism” evaluates how the likelihood will improve when the candidate fully cover the abbreviation. It makes sense to set “optimism” between 0 and 1; the closer it is to 1, the faster the algorithm will try to add new characters. “Freedom” is the allowable loss of quality in comparison to the current best candidate. The higher the “freedom”, the more options would be included, and the slower the algorithm would be. If the “freedom” is too low, the heap may deplete before any reasonable original phrase is found.
Testing on hobbits
To really test the algorithm, we need a good language model. I was wondering, how well a model could decipher the abbreviations, if it had been trained on a deliberately limited corpus — just one book on an unusual topic. The first such book that got into my hand was “The Lord of the Rings: The Fellowship of the Ring”. Well, let’s see how well the hobbit language can help to decipher the modern sports terms.
But first we need to train the models. For the abbreviation model, I manually created a short training corpus, where 25% of consonants and 75% of vowels are abbreviated away.
I chose the 5-gram language model, after comparing average likelihood of different models on the “test set” (the end of the book). It seems that the quality of character probability prediction grows with the order of the model. But I didn’t go beyond the order of 5, because large order means slow training and application of the model.
After training, I applied the algorithm to different contractions. To begin with, I asked for “sm”, meaning “Sam”. The model recognized him easily.
“Frodo” was also deciphered from “frd” without problems.
And the “ring” too, from “rng”.
Before running “wtrbtl”, I tried the first part, “wtr”. “Water” was deciphered perfectly.
With “bottle” the model was less confident. After all, battles appear more frequently in “The Lord of the Rings” than bottles.
But in some contexts, this is exactly what is needed.
For “wtrbtl” the model proposed multiple options, but “water bottle” was the second among them.
“Basketball”, never seen before, was recognized almost correctly, because the word “basket” has occured in the training text. But I had to extend the width of the search beam from 3 to 5, to discover this option.
The word “ball” has never occurred in the training text, so the model failed to recognize “bowling ball” in the “bwlingbl”. But it proposed “bewilling Bilbo”, “bowling blow”, and several other alternatives. The word “bowling” has also never occurred in “The Lord of the Rings”, but the model somehow managed to reconstruct it with its common understanding of English language.
Generation of weird texts
In the end, just for fun, I tried to abbreviate the beginning of “The Lord of the Rings” with my abbreviation model. It looks weird.
This bok s largly cncernd wth Hbbts, nd frm its pges readr ma dscver much f thir charctr nd littl f thir hstr. Furthr nfrmaton will als b fond n the selction from the Red Bok f Wstmarch that hs already ben publishd, ndr th ttle of The Hobbit. Tht stor was dervd from the arlir chpters of the Red Bok, cmpsed by Blbo hmslf, th first Hobbit t bcome famos n the world at large, nd clld b him There and Bck Again, sinc thy tld f his journey into th East and his return: n dvntr whch latr nvolved all the Hobbits n th grat vnts of that Ag that re hr rlatd.
The language model may be used to generate a completely new text. It has some Tolkien style, but completely no sense.
Frodo would me but them but his slipped in he see said pippin silent the names for follow as days are or the hobbits rever any forward spoke ened with and many when idle off they hand we cried plunged they lit a simply attack struggled itself it for in a what it was barrow the will the ears what all grow.
Generation of meaningful texts from scratch is still beyond the reach of data science. For it requires real artificial intelligence, able to understand the complex storyline that took place in the Middle-earth.
Natural language processing is a complex mixture of science, technology, and magic. Even linguistic scientists cannot fully understand the laws of human speech. The times when machines indeed understand texts are not to come soon.
Natural language processing is also fun. Armed with a couple of statistical models, you can both recognize and generate non-obvious abbreviations. For those who want to continue my playing with data, there is a jupyter notebook with the complete code of the experiment. As for this blog, the other experiments will follow. So subscribe! :-)