Master your Lexical Processing skill in 9 steps — NLP

Niwratti Kasture
Analytics Vidhya

--

Have you ever wondered how mail systems are able to intelligently differentiate between spam and ham (good mail)? Or how mobile applications make a similar judgement for SMS messages?

What if there were fully functional intelligent systems that could precisely predict (i.e. classify) patients into “Risk zone”, “Ill” or “Risk free” categories based on the details captured in various medical test reports? Such systems could support various ailments such as diabetes, cataract, hypertension, cancer and so on.

Or applications that can accurately classify a claim into “Approved”, “Denied” or “Partially approved” categories.

These are typical problems that depend on going through a large set of text data and performing text analysis (done manually in the absence of an intelligent system), which eventually produces an outcome in the form of classification into a certain category.

The manual way is not a scalable solution, considering that tons of text data are generated every minute through various platforms and applications.

A more sophisticated, advanced and less tiresome solution is a machine learning model from the classification family. Many different models, such as Naive Bayes, SVM and decision trees, are available to meet the classification objective. Each of these models also has implementations in different Python libraries, such as scikit-learn and NLTK, which differ in their implementation methods.

But each of these has a basic dependency on the quality of the data that is supplied. It is in fact essential to supply good-quality data to achieve accurate results; otherwise the model just turns out to be a manifestation of garbage in, garbage out.

In general, there is always a high possibility of getting noisy data, as most of the content for such problems is user-generated, unstructured data in raw format. The data may well contain a mix of text in different languages, domain-specific terms, spelling errors, numbers, errors in language construction, special characters, mixed or ambiguous words and many other data quality issues.

Think of an analogy from chemistry, where various distillation methods are applied to remove impurities and produce a concentrated form of the main chemical substance.

Similarly, a set of pre-processing steps needs to be applied before you can do any kind of text analytics, such as building language models, chatbots or sentiment analysis systems. These pre-processing steps are used in almost all applications that work with textual data. Let’s look at the steps required to improve the quality of the data, or to extract meaningful information that can be supplied to the model for classification. These steps fall into the following lexical processing techniques:

  1. Case conversion
  2. Word frequencies and removing stop words
  3. Tokenisation
  4. Bag of word formation
  5. TF-IDF Representation
  6. Stemming
  7. Lemmatization
  8. Phonetic Hashing
  9. Spelling error correction with Levenshtein Edit Distance
1. Converting text into lower case — The most common step is to convert the text into lower case letters, unless capitalised letters contribute to the accuracy of the model. For example, in spam detection, the mails will often have capitalised words such as CONGRATULATIONS or BONANZA, which is an indication of the mail being spam. Otherwise, it makes sense to convert the text into lower case with a simple Python command.

# assuming the raw text sits in a hypothetical 'text' column of a pandas DataFrame
dataframe1['text'] = dataframe1['text'].str.lower()

2. Word Frequencies and Stop Words — This step is basically a data exploration activity. The main idea here is to understand the structure of the given text in terms of the characters, words, sentences and paragraphs it contains. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e. visualising the word frequencies of a given text. You will be amazed to see an interesting pattern when you plot the word frequencies of a fairly large set of text.

This word frequency pattern is explained by Zipf’s law (discovered by the linguist-statistician George Zipf). It states that the frequency of a word is inversely proportional to the rank of the word, where rank 1 is given to the most frequent word, rank 2 to the second most frequent, and so on. This is also called a power law distribution.

Text content generally consists of three types of words:

  • Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc.
  • Significant words, which are typically more important to understand the text
  • Rarely occurring words, which are again less important than significant words

As a general practice, stop words are removed because they don’t add any meaningful information in applications such as spam detection or question answering. Removing them also improves execution speed.
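As a rough sketch (the sample_text string below is just a placeholder for your own corpus), NLTK’s stop word list and FreqDist can be used to explore word frequencies and drop stop words:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

sample_text = "the spam mail promises a prize and the prize is a lottery win"
words = sample_text.lower().split()

# word frequency distribution (on a large corpus the rank-frequency plot follows Zipf's law)
freq_dist = nltk.FreqDist(words)
print(freq_dist.most_common(5))

# remove highly frequent stop words such as 'the', 'a', 'is'
stop_words = set(stopwords.words('english'))
significant_words = [w for w in words if w not in stop_words]
print(significant_words)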

3. Tokenisation — Even after removing stop words, the input data is still a continuous string, and it is not possible to extract any useful information from it in this format. To deal with this problem, tokenisation is used, which splits the text into smaller elements, or tokens. These elements can be characters, words, sentences, or even paragraphs, depending on the application you are working on. There are multiple ways of fetching these tokens from the given text (a short sketch follows the list below).

  • The split() method just splits text on white spaces by default, but it can’t handle contractions such as “can’t”, “hasn’t” or “wouldn’t”. For such cases, more advanced functions from the NLTK library can be used.
  • Word tokeniser splits text into different words.
  • Sentence tokeniser splits text into different sentences.
  • Tweet tokeniser handles emojis and hashtags that are generally seen in social media texts.
  • Regex tokeniser allows you to build your own custom tokeniser using regex patterns of your choice.
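A minimal sketch of these NLTK tokenisers, on a made-up example sentence:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, regexp_tokenize

nltk.download('punkt')

text = "I can't wait for the #holidays! Flights to Prague are cheap. :)"

print(word_tokenize(text))              # word tokeniser, handles contractions better than split()
print(sent_tokenize(text))              # sentence tokeniser
print(TweetTokenizer().tokenize(text))  # keeps hashtags and emoticons intact
print(regexp_tokenize(text, r'#\w+'))   # custom regex tokeniser, here extracting hashtags only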

4. Bag of Words (BoW) formation — This is an approach to form an amalgamation of the words in the given data after removing stop words, where the sequence of occurrence does not matter. The central idea of this approach is to maintain a list of all significant words that help to achieve the desired outcome, such as spam detection or answering a given question. For example, if you ask a chatbot “Suggest me the cheapest flights from Bengaluru to Prague”, the BoW formed from this question will keep only the significant words, i.e. ‘cheapest’, ‘Bengaluru’, ‘Prague’.

Let’s take another common example.

Have you noticed such messages in the spam folder of your mailbox? Most spam messages contain words such as prize, lottery, etc., and most legitimate mails don’t. Now, whenever a new mail is received, the available BoW helps to classify the message as spam or ham.

These bags of words need to be supplied in a numerical matrix format to ML algorithms such as naive Bayes, logistic regression, SVM, etc., to do the final classification.

To prepare this matrix, each input entry (a line, sentence, document, etc.) gets its own row and each word of the vocabulary gets its own column. These vocabulary words are also called the features of the text. Each cell of the matrix is filled in one of two ways:

  • fill the cell with the frequency of a word (i.e. a cell can have a value of 0 or more)
  • fill the cell with either 0, in case the word is not present or 1, in case the word is present (binary format).
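A minimal sketch of both variants using scikit-learn’s CountVectorizer (the three documents are toy examples standing in for real mails):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "You have won a lottery prize",
    "Claim your prize now",
    "Meeting rescheduled to Monday",
]

# frequency counts: one row per document, one column per vocabulary word (feature)
vectorizer = CountVectorizer(stop_words='english')
bow_matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # scikit-learn 1.0+
print(bow_matrix.toarray())

# binary=True fills each cell with 0 or 1 instead of a count
binary_vectorizer = CountVectorizer(stop_words='english', binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())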

5. TF-IDF Representation — An advanced method of matrix formation which is more commonly used in practice. It not only captures the frequency of a word in a document but also reflects the relative importance of that word by considering its occurrence across documents. The central idea is that a word occurring in too many documents carries less importance for the machine learning model.

This approach of weighting the importance of each word makes the method superior to the vanilla BoW method explained earlier. The formula to calculate the TF-IDF weight of a word in a document is:

  • tf(t, d) = (frequency of term ‘t’ in document ‘d’) / (total number of terms in document ‘d’)
  • idf(t) = log(total number of documents / number of documents that contain the term ‘t’)
  • tf-idf(t, d) = tf(t, d) × idf(t)

Essentially, higher weights are assigned to terms that are present frequently in a document and which are rare among all documents. On the other hand, a low score is assigned to terms which are common across all documents.
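A minimal sketch with scikit-learn’s TfidfVectorizer, reusing the toy documents from the bag-of-words example:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "You have won a lottery prize",
    "Claim your prize now",
    "Meeting rescheduled to Monday",
]

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
# 'prize' appears in two of the three documents, so within a document it gets
# a lower weight than equally frequent but rarer words such as 'lottery'
print(tfidf_matrix.toarray().round(2))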

However, a limitation of these representations is that they don’t consolidate redundant words that are similar or share the same root, such as ‘sit’ and ‘sitting’, ‘do’ and ‘does’, ‘watch’ and ‘watching’. This eventually increases the complexity of machine learning models due to high dimensionality. To handle such cases, we need to apply canonicalisation, i.e. methods that reduce a word to its base form. Stemming and lemmatisation are two specific methods to achieve the canonical form.

6. Stemming — A rule-based technique that simply chops off the suffix of a word to get its root form, which is called the ‘stem’. For example, ‘warn’, ‘warning’ and ‘warned’ are represented by a single token, ‘warn’, because as a feature in a machine learning model they should be counted as one. There are two popular stemmers:

  • Porter stemmer: a basic stemmer that works with the English language only
  • Snowball stemmer: an advanced stemmer that supports additional languages such as French, German, Italian, Finnish, Russian, etc.

A stemmer is much faster than a lemmatizer but gives less accurate results.
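A minimal sketch with NLTK’s Porter and Snowball stemmers:

from nltk.stem import PorterStemmer, SnowballStemmer

words = ['warn', 'warning', 'warned', 'sitting', 'does']

porter = PorterStemmer()
snowball = SnowballStemmer('english')   # also supports 'french', 'german', 'russian', ...

print([porter.stem(w) for w in words])
print([snowball.stem(w) for w in words])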

7. Lemmatisation — A more sophisticated technique that handles more complex or inflected forms of a token. It takes an input word and searches for its base form among dictionary words. The base word in this case is called the lemma. Words such as ‘teeth’ and ‘brought’ can’t be reduced to their correct base form using a stemmer, but a lemmatizer can reduce them. The most popular lemmatizer is the WordNet lemmatizer.

A lemmatizer is slower because of the dictionary lookup but gives better results than a stemmer as long as POS (parts of speech) tagging has happened accurately.
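A minimal sketch with NLTK’s WordNet lemmatizer (the pos argument supplies the part-of-speech tag, 'v' for verb, 'n' for noun):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('teeth'))              # -> 'tooth'
print(lemmatizer.lemmatize('brought', pos='v'))   # -> 'bring'
print(lemmatizer.lemmatize('watching', pos='v'))  # -> 'watch'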

Even after going through all the pre-processing steps we have seen so far, there can still be a lot of noise in the data, which requires the more advanced techniques mentioned below.

8. Phonetic Hashing — Certain words have different pronunciations in different languages, and as a result they end up being spelt differently. Examples of such words include names of people, city names, food items, etc. For example, Pune is also pronounced as Poona in Hindi, so it is not surprising to find both variants in an uncleaned text set. Similarly, the surname ‘Chaudhari’ has various spellings and pronunciations. Performing stemming or lemmatisation on these words will not be of any use unless all the variations of a particular word are converted to a common form.

To handle such words, the phonetic hashing method is used, which is based on the Soundex algorithm. In this method, words with similar phonemes (a similar sound or pronunciation) are classified into a single bucket and given a single hash code, irrespective of language or spelling.

The process follows the steps below to obtain a four-character phonetic code (a rough sketch follows the list):

  • The first letter of the code is the first letter of the input word.
  • Map all the consonant letters (except the first letter) to the standard Soundex codes: b, f, p, v → 1; c, g, j, k, q, s, x, z → 2; d, t → 3; l → 4; m, n → 5; r → 6.
  • The third step is to remove all the vowels.
  • The fourth step is to truncate or pad the code to make it a four-character code: suffix it with zeroes if it is less than four characters long, or truncate it from the right if it is longer than four characters.
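A rough, simplified sketch of these steps (real Soundex implementations have a few extra rules, e.g. around 'h' and 'w' separators):

def soundex(word):
    word = word.upper()
    codes = {
        'B': '1', 'F': '1', 'P': '1', 'V': '1',
        'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
        'D': '3', 'T': '3',
        'L': '4',
        'M': '5', 'N': '5',
        'R': '6',
    }
    hash_code = word[0]                       # step 1: keep the first letter as-is
    for letter in word[1:]:
        digit = codes.get(letter, '')         # steps 2-3: map consonants, drop vowels and h/w/y
        if digit and digit != hash_code[-1]:  # skip immediately repeated codes
            hash_code += digit
    return (hash_code + '000')[:4]            # step 4: pad with zeroes or truncate to four characters

print(soundex('Pune'), soundex('Poona'))           # both hash to 'P500'
print(soundex('Chaudhari'), soundex('Chaudhary'))  # same code for both spellings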

There is always a possibility that the input text contains variations of words which are phonetically correct but misspelt due to a lack of vocabulary knowledge, or because multiple common forms of the same word are used across different cultures, for example Center and Centre, Advise and Advice, Color and Colour. With the rise of social media, spelling deviations often happen by choice (informal words such as ‘ROFL’, ‘aka’, etc.). You need an additional pre-processing method to find the common root word for such cases.

9. Edit Distance — An edit distance is a non-negative number measuring the distance between two strings: the number of edits needed to convert a source string into a target string. An edit operation can be one of the following (a sketch of the Levenshtein distance follows the list):

  • Insertion of a letter in the source string. To convert ‘color’ to ‘colour’, you need to insert the letter ‘u’ in the source string.
  • Deletion of a letter from the source string. To convert ‘Ashoka’ to ‘Ashok’, you need to delete ‘a’ from the source string.
  • Substitution of a letter in the source string. To convert ‘Iran’ to ‘Iraq’, you need to substitute ‘n’ with ‘q’.
  • Transposition: a swap between two adjacent characters, which costs only one edit instead of two. This edit operation was introduced because swapping is a very common mistake; for example, typing ‘acheive’ instead of ‘achieve’. This should be counted as a single mistake (one edit distance), not two. The Levenshtein distance extended with this operation is known as the Damerau-Levenshtein distance.
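A minimal dynamic-programming sketch of the plain Levenshtein distance (insertions, deletions and substitutions only); NLTK also ships a ready-made nltk.edit_distance function:

def levenshtein(source, target):
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]

    for i in range(rows):
        dist[i][0] = i                        # i deletions turn a prefix of source into ''
    for j in range(cols):
        dist[0][j] = j                        # j insertions turn '' into a prefix of target

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,           # deletion
                dist[i][j - 1] + 1,           # insertion
                dist[i - 1][j - 1] + cost,    # substitution (or match)
            )
    return dist[-1][-1]

print(levenshtein('color', 'colour'))     # 1 (one insertion)
print(levenshtein('Iran', 'Iraq'))        # 1 (one substitution)
print(levenshtein('acheive', 'achieve'))  # 2 without the transposition operation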

“Every solution to every problem is simple. It’s the distance between the two where the mystery lies.” ― Derek Landy

While the field of NLP is vast, with numerous ways of handling text data, the methods mentioned above are the tools that help uncover the mystery and reach a simple solution to most NLP or text analytics problems.
