Written Comm Analyzer — Scoring English Texts

Sajid Rahman
Published in Prod.IO · 11 min read · Sep 19, 2018

OVERVIEW

At KNOLSKAPE, we develop business simulations for learning and development in workplaces. At the core of these simulations, a user interacts with a virtual workplace and takes business decisions. Some of our simulations had the option to take action by sending emails or business proposals to virtual companies. The effect of these actions was reflected in the user’s scores. The scores were dependent on the simulation algorithm and were not affected by the content of those emails. We thought of changing that by analyzing the content of the emails/proposals as well and reflecting that in the score. This would enable a better judgement of the user’s actions and subsequently help generate a better report for the user.

When we thought of a communication analyzer, we were looking at different modules within the suite that would measure different, independent metrics for any given text. Segmenting the entire suite helps in generating a granular report rather than a consolidated one. The modules measure spelling accuracy, readability, sentence structure, vocabulary range and vocabulary mastery independently. We will go through each of these modules and look at the algorithms we used to qualitatively measure each of them.

COMM ANALYZER MODULES

We will take a brief look at these modules and then later dive into the more technical aspects.

  1. Spelling accuracy — Spelling is the most basic parameter that one has to be really careful about. This module had to be built to recognize misspellings as well as suggest the closest matching word in a given text.
  2. Readability — It tells us how hard or easy it is to understand a given text. As you would have guessed, the readability level should depend on the audience and the context. Hence, the most widely used qualitative metric is the minimum school/college grade level expertise that is required to comprehend the text. For example, Harvard Law Review articles would have a readability grade equivalent to a college graduate whereas a Reader’s Digest article would be about the grade level of 6 or 7.
  3. Sentence Structure — A measure of how well the sentence is structured. We were looking at the complexity of the sentence based on whether it is in the active or passive voice, the relative placement of verbs, adverbs and adjectives, the sentence clause structure (simple, compound, complex), etc.
  4. Grammar — Grammar cannot be compromised in any professional communication. Qualitatively measuring grammar is a challenge given the complexity of languages. We have tried two approaches to tackle this challenge: a rule engine and deep learning.
  5. Vocabulary Range — We take a look at the range of words used in the passage. A wide vocabulary range gives the impression of a strong command over the language.

Spelling Accuracy

A basic yet salient feature to consider when analyzing texts is spelling accuracy. Spelling errors could be due to a typo (e.g. fro vs for), genuine unawareness of the spelling (recieve vs receive) or context unawareness (e.g. their vs there; homonyms). These categories present varying challenges in identification and rectification.

  1. Typos — These are easy to detect and rectify, as typos in most cases have one letter that has been incorrectly placed or missed in the word. Once identified, each letter can be replaced by letters of the English alphabet one at a time until a valid word is formed. More details are given in a later section.
  2. Genuine Unawareness — Most words that are incorrectly spelt are so because we are genuinely unaware of, or confused about, the spelling. These mistakes are harder to rectify as they can involve any combination of mistakes (insertion/deletion/replacement).
  3. Context Unawareness — This is the most complicated of all. Words that have similar spelling and pronunciation but different meanings (also called homonyms) fall in this category. These are impossible to detect and rectify if words are selected and checked individually. Because, well, the words are spelt damn correctly! Example:

I like pizza more then burgers.

If you are not careful, you’ll loose your phone.

To deal with homonyms, it is absolutely necessary to be aware of the context for identification. Deep Learning based solutions with Recurrent Neural Networks have been shown to give state-of-the-art results. We will be looking at the implementation later in this post.

Approach 1 — Bag of Words (BoW)

The first approach we tried was based on the BoW principle. We take every word in the passage and match it against our dictionary, which we can build by parsing a couple of large books and retaining the unique words. If the word matches a dictionary entry, voila! We have a match and move on to the next word. If not, then we need to roll up our sleeves and do some work. If you are familiar with conditional probability and tf-idf, this might ring a bell.

Step 1: Build a dictionary

We took a collection of books from Project Gutenberg and wrote a script that reads the books and maintains a list of all unique words along with the number of times each word was encountered. Notice that just after this first step we already have a model that can identify misspelt words in a text: we check every word in the passage against the dictionary to see if it is present, and if it is not, we have found a spelling error. The rest of the steps deal with rectifying the misspelt word with the alternatives that have the highest probability of replacing it. We will discuss how we calculate these probabilities.
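A minimal sketch of that script, in the spirit of Peter Norvig’s well-known spell corrector (the file name corpus.txt is a placeholder, not our actual data path):

```python
import re
from collections import Counter

def words(text):
    # Lowercase the text and extract alphabetic tokens
    return re.findall(r'[a-z]+', text.lower())

# 'corpus.txt' stands in for the concatenated Project Gutenberg books
WORDS = Counter(words(open('corpus.txt').read()))
```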

Step 2: Build candidates to replace the word

We shall call all the words that can replace the misspelt word ‘candidates’. In this step, we find a list of possible candidates that can replace the word. Candidates can be found by deletion (delete a letter, e.g. wisee → wise), insertion (insert a letter, e.g. nt → not), replacement (replace a letter with some other letter, e.g. cuppuccino → cappuccino) and transposition (swap two adjacent letters, e.g. beleive → believe). The code below demonstrates the implementation. This follows Peter Norvig’s algorithm for candidate creation.
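A sketch of the candidate-generation function, following Norvig’s published implementation:

```python
def edits1(word):
    # All strings that are one edit (delete / transpose / replace / insert) away from `word`
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```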

The above code generates candidates that are 1 edit apart, i.e. words that can be generated by only one deletion/transposition/replacement/insertion. For example, for the word ‘ab’ it would generate ‘an’ (1 replacement), ‘am’ (1 replacement) and ‘a’ (1 deletion), but not ‘is’ (2 replacements) or ‘and’ (1 replacement, 1 insertion).

This function gives us a list of all possible candidates that can be generated using 1 edit. But of course a majority of them would be gibberish, so we need to run these candidates through the dictionary that we created in step 1, to check if they are valid candidates.
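A sketch of that filtering step (WORDS is the frequency dictionary built in Step 1):

```python
def known(words):
    # Keep only the candidates that actually appear in our dictionary
    return set(w for w in words if w in WORDS)
```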

This function returns the set of candidates that are valid dictionary words. ‘WORDS’ here refers to our dictionary of words with their frequencies of occurrence. Note that the set of 1-edit candidates itself can be big: for a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates).

Step 3: Calculate the probabilities of valid candidates

After the 2nd step, we have a list of all the valid candidates that can replace the misspelt word. But how do we tell which candidate has the highest probability of being the intended word? Here comes conditional probability.

We are trying to find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

argmax_{c ∈ candidates} P(c|w)

According to Bayes’ theorem, this evaluates to:

argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

Since P(w) is the same for every possible candidate c, we can factor it out, giving:

argmax_{c ∈ candidates} P(c) P(w|c)

So, we can sort our list of valid candidates by P(c) P(w|c). So all that is left to do is to find P(c) and P(w|c).

To calculate P(c), we use the frequency count of the word in our dictionary and divide it by the total number of word occurrences in the corpus. In our corpus:

There were 32,192 distinct words, which together appear 1,115,504 times, with ‘the’ being the most common word, appearing 79,808 times (or a probability of about 7%).

So P(‘the’) would give around 0.07 and other words would be less probable. Words that don’t exist in the dictionary would give a probability of 0.
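A sketch of this probability function, with N being the total number of word occurrences in the corpus (1,115,504 in our case):

```python
N = sum(WORDS.values())  # total number of word occurrences in the corpus

def P(word):
    # Relative frequency of the word in the corpus; 0 if the word is unknown
    return WORDS[word] / N
```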

Step 4: Calculate the probability that the user meant ‘c’ when they typed ‘w’, i.e. P(w|c)

We used a workaround for calculating this quantity. We gave more weight to valid candidates 1 edit away than to those 2 edits away, placing our bets on the intuition that it is more probable to mistype 1 letter than 2 while spelling a word. This could be a little flawed, but the priorities work most of the time. To overcome this flaw, multiple spelling suggestions can be shown to the user instead of just one, so that we include not just the word rated most probable, but also other, less probable words that the user might have intended. Continuing with this intuition, we made our priorities like this:

  1. The original word, if it is known; otherwise
  2. The list of known words at edit distance one away, if there are any; otherwise
  3. The list of known words at edit distance two away, if there are any; otherwise
  4. The original word, even though it is not known.

So, all valid candidates at edit distance 1 belong to the 2nd priority, and each of those has the same P(w|c). The first priority goes to the word itself if it is found in the dictionary, then to candidates 1 edit away, then to candidates 2 edits away, and lastly to the word itself even if it wasn’t found in the dictionary.
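A sketch of how these priorities can be encoded, again in the spirit of Norvig’s corrector (edits2 simply applies edits1 twice):

```python
def edits2(word):
    # All strings that are two edits away from `word`
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    # Priority order: known word > known 1-edit candidates > known 2-edit candidates > the word itself
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # Within the chosen priority level, pick the candidate with the highest P(c)
    return max(candidates(word), key=P)
```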

Using this methodology, we found that spell-suggestion accuracy was close to 75% if we chose only the candidate with the highest probability, but the accuracy shoots up to around 90% if we suggest a list of the top 3 candidates. This means that 90% of the time the intended word ‘w’ appeared in the top three suggestions.

Step 5: Finally combine the probabilities and generate a suggestion

We calculated P(c) and P(w|c) in steps 3 and 4 respectively. Now we just multiply those and create a list of suggestions. Some observations regarding multiplying P(c)P(w|c):

  • For the same edit distance, candidates with a higher occurrence in the corpus are given preference over candidates with a lower occurrence.
  • Irrespective of P(c), candidates with 1-edits will be preferred over candidates with 2-edits.

The 2nd observation has scope for improvement, although, as noted, we could achieve 90% accuracy by suggesting the top 3 words instead of 1.
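For completeness, here is one way a ranked list of suggestions could be produced (a sketch of the top-3 behaviour described above, not our exact implementation):

```python
def suggestions(word, n=3):
    # Candidates 1 edit away are ranked ahead of those 2 edits away;
    # within each group, candidates are ordered by corpus probability P(c)
    ranked = sorted(known(edits1(word)), key=P, reverse=True)
    ranked += sorted(known(edits2(word)) - set(ranked), key=P, reverse=True)
    return ranked[:n] if ranked else [word]
```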

So finally, we have a module that can identify misspelt words and offer suggestions to correct them. This module works pretty well for its simplicity: we just created a corpus from English novels, took out the unique words with their frequencies and added a little bit of conditional probability. But it harbours a major flaw of not looking at the context of the word. The module doesn’t care about the rest of the words in the sentence while identifying and rectifying a word. This is a major pitfall, as it can’t tackle the category-3 spelling errors mentioned above.

For this, we needed a technique that would keep the other words in the sentence in context and then proceed to identify a misspelt word. This led us to Recurrent Neural Networks, as this architecture of neural networks is used to build LSTM (Long Short-Term Memory) modules that are widely used in NLP problems requiring contextual analysis. We shall look at it in the next section.

Approach 2 — LSTMs

LSTMs are widely used to process sequential data, like text and audio, that requires persisting information from the preceding stream of data. LSTMs are too big a topic to cover in a short blog post; they require at least a few weeks of tutorials, reading and implementation practice after covering Deep Learning and RNNs. We will try to provide a basic intuition of how LSTMs can persist context while processing a sequence (a sequence of words in our case).

Images are taken from the following brief tutorial about how LSTMs work. We recommend reading this article to have a better intuition.

A Recurrent Neural Network Unit

Those who are familiar with Neural Networks will recognize that the image shows an activation function A that takes a set of input features X and outputs h. What is interesting in this unit is that there is a loop that takes the output of A and feeds it back into the next run of A along with X. So in the next ‘run’, A takes both X2 and the output of the previous run, h1, to compute h2. This enables using the output of previous steps of the sequence to compute the output for the current step. Hence, indirectly, we could say that the unit persists previous data to affect what the present computation outputs. This is the basic intuition behind RNNs. A sequence of such units put together creates an LSTM: a long chain of short-term memories stored in the RNN that provides a longer persistence of data.
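As a rough illustration of that loop (a toy example with arbitrary shapes and random weights, not our actual model), a single recurrent step could be written as:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The current input and the previous output are combined: the 'loop' in the diagram
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), np.zeros(16)

h = np.zeros(16)                        # h0: no context yet
for x_t in rng.normal(size=(5, 8)):     # a sequence of 5 inputs (X1..X5)
    h = rnn_step(x_t, h, W_x, W_h, b)   # h_t depends on x_t and on h_{t-1}
```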

LSTM

RNNs are pretty good at retaining information that appeared just a short while ago, but as the gap increases, it becomes difficult to retain complete information. This is where LSTMs make an entry. They elegantly combine a sequence of RNN units to tackle the long-dependency problem. A brief mathematical explanation is given in the tutorial linked above.

We are using LSTMs to retain information about previously occurring words in order to identify the current word as correctly spelt or misspelt, and to rectify it later if misspelt. LSTMs’ ability to retain long-term context enabled us to identify words belonging to the 3rd category of spelling errors.

Sequence to Sequence Encoding

Encoding is an essential step that prepares data to be fed into a neural network. Encoding simply means transforming one type of data into another type that the network can process. For example, in our case, textual data must be transformed into numbers that can be fed into the network. These ‘numbers’ are fixed-length sequences that can represent either a sequence of letters or a sequence of words.
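As a toy illustration of this idea (the character vocabulary and sequence length here are purely illustrative, not our production encoder):

```python
# Toy character vocabulary; id 0 is reserved for padding and unknown characters
chars = "abcdefghijklmnopqrstuvwxyz '"
char_to_id = {c: i + 1 for i, c in enumerate(chars)}

def encode(sentence, max_len=32):
    # Map each character to an integer id and pad the sequence to a fixed length
    ids = [char_to_id.get(c, 0) for c in sentence.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(encode("i like pizza"))
```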

Sequence to Sequence modelling involves a certain architecture to encode data before sending it to the network and later decode the data coming from the network to make sense of it. We highly recommend you go through the tensorflow tutorial (link above) to understand more about the architecture and the encoding/decoding.

LSTMs for spell correction

We used LSTMs for our use case in an indirect way. LSTMs have been trained to predict the next word in an incomplete sentence; for example, on mobile phones, the keyboard suggests the next word that can be inserted at the current cursor position. Our intuition was that if we used a similar model for spell checking, it could correctly predict which word out of a homonym set is most appropriate for the context.
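Conceptually, picking the right homonym then reduces to asking the model which variant of the sentence it finds most likely. The sketch below assumes a hypothetical lm_score function that returns the trained model’s score for a token sequence; it is not our actual API:

```python
HOMONYM_SETS = [{"then", "than"}, {"loose", "lose"}, {"their", "there"}]

def fix_homonyms(tokens, lm_score):
    # lm_score(tokens) -> a likelihood score from the trained language model (hypothetical)
    for i, tok in enumerate(tokens):
        for group in HOMONYM_SETS:
            if tok in group:
                # Keep whichever member of the homonym set the model finds most likely in this context
                tokens[i] = max(group, key=lambda w: lm_score(tokens[:i] + [w] + tokens[i + 1:]))
    return tokens
```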

Training Corpus — Project Gutenberg, News Crawl Data

After 2 weeks of training the model using GPUs, we achieved a cross-validation score of 72%. This wasn’t enough for a stand-alone spell-checker module, so we used both solutions (bag of words and LSTM) in a pipeline for spell identification and correction. The LSTM provides a fair amount of correction with respect to homonyms, while the BoW approach tackles typos and genuine spelling errors quite well.
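At a high level, the combined pipeline looks something like the sketch below (function names are placeholders tying together the sketches above, not our production code):

```python
def check_text(text, lm_score):
    tokens = text.lower().split()  # punctuation handling omitted for brevity
    # Pass 1: BoW correction for typos and genuine misspellings
    tokens = [tok if tok in WORDS else correction(tok) for tok in tokens]
    # Pass 2: context-aware pass over homonyms using the LSTM language model
    tokens = fix_homonyms(tokens, lm_score)
    return " ".join(tokens)
```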

The LSTM approach can be improved considerably with more training on a larger dataset comprising a sizeable usage of homonyms. The BoW solution improves with dictionary size: the more unique words it knows, the better the suggestions it will be able to make.
