How Does Text Preprocessing In NLP Work?
What are NLP pre-processing techniques?
When your intent as a human is to learn and to educate others, and you are willing to pass on knowledge gathered over ages to these lifeless yet powerful gadgets and computers, you are making this planet a wonderful and innovative place to live in.
Welcome back folks,
This is Part 2 of my NLP series. Part 1 was:
NLP Fundamentals For Absolute Beginners
What is natural language processing, What Is NLU & NLG?
There, we discussed the fundamentals of natural language processing and covered:
- What is NLP?
- What is Natural Language Understanding?
- What is Natural Language Generation?
Today, we will get into the details of how NLP actually works and which key NLP techniques you should understand before getting your hands dirty with Python.
Let’s get started.
To grasp the concept of NLP, we need to understand these 6 steps:
- Tokenization
- Removing Stop Words
- Stemming
- Lemmatization
- Parts Of Speech Tagging
- Named Entity Recognition
Let’s get into the details of each step, one by one.
Tokenization: the first step in the NLP text pre-processing pipeline
This is the NLP technique where the machine processes your raw text and chops it into small pieces called tokens. Tokens can be words, sentences, characters, numbers, and so on. Tokenization usually also involves getting rid of certain characters, such as punctuation, escape sequences, and extra spaces.
Generally, each word is a token. For example, the sentence “Welcome to NLP learning” is broken into the following word tokens: Welcome / to / NLP / learning
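To make the idea concrete, here is a minimal sketch of word and sentence tokenization using only Python's standard `re` module. The function names `tokenize_words` and `tokenize_sentences` are made up for this illustration, and the rules are deliberately simple; libraries such as NLTK and spaCy use far more careful tokenizers.

```python
import re

def tokenize_words(text):
    """Split raw text into word tokens, dropping punctuation and whitespace."""
    # \w+ matches runs of letters, digits, and underscores; everything else
    # (spaces, punctuation, escape sequences such as \n) acts as a separator.
    return re.findall(r"\w+", text)

def tokenize_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(tokenize_words("Welcome to NLP learning "))
# ['Welcome', 'to', 'NLP', 'learning']
print(tokenize_sentences("Welcome to NLP learning. Today we cover tokenization!"))
# ['Welcome to NLP learning.', 'Today we cover tokenization!']
```

Note that the sentence splitter would wrongly break on abbreviations such as "Dr. Smith", which is one reason real tokenizers are more involved.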
What are tokens?
Tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Challenges In Tokenization:
One of the basic challenges in tokenization is deciding how best to split the text. You have to be able to answer questions such as:
- Is it wise to simply split on all non-alphanumeric characters, such as periods and spaces?
- How should apostrophes be treated?
- What about multi-word names like ‘West Bengal’, which should arguably stay together as one token?
- What about compound words in languages like Sanskrit and German?
So how one should go about tokenization largely depends on the language of the given document.
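The questions above are easier to appreciate when you see how different splitting rules treat the same sentence. The sketch below compares three options using the standard `re` module. The sample sentence, the `MULTI_WORD_NAMES` list, and the `tokenize_with_phrases` helper are all made up for this illustration.

```python
import re

text = "Nancy's cousin isn't from West Bengal."

# Option 1: split on every non-alphanumeric character.
# Apostrophes break words apart, so "Nancy's" becomes "Nancy" and "s".
naive = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Option 2: allow one apostrophe-plus-letters group inside a word.
keep_apostrophes = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

# Option 3: protect known multi-word names before splitting.
MULTI_WORD_NAMES = ["West Bengal"]  # hypothetical list, for illustration only

def tokenize_with_phrases(text, phrases=MULTI_WORD_NAMES):
    # Temporarily join the words of each known phrase with an underscore.
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"[A-Za-z0-9_]+(?:'[A-Za-z]+)?", text)
    # Restore the spaces (this sketch assumes the text has no other underscores).
    return [t.replace("_", " ") for t in tokens]

print(naive)
# ['Nancy', 's', 'cousin', 'isn', 't', 'from', 'West', 'Bengal']
print(keep_apostrophes)
# ["Nancy's", 'cousin', "isn't", 'from', 'West', 'Bengal']
print(tokenize_with_phrases(text))
# ["Nancy's", 'cousin', "isn't", 'from', 'West Bengal']
```

None of these is "the right" answer; which one you pick depends on the language and on what the later steps in your pipeline expect.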
Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns.
Stop Word Removal :
As per Wikipedia:
In the world of text mining, stop words are words which are filtered out before or after processing of natural language text. Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
Some common English stop words are: a, an, the, is, at, on, in, and, of, to.
Stop words do not carry enough meaning to be useful in search queries. They are usually filtered out of search queries because they match a vast amount of unnecessary information.
spaCy and NLTK are two popular Python libraries for removing stop words in a given language. Each ships with a predefined list of stop words, which is used to filter low-information words out of the given corpus.
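Here is a minimal sketch of stop word removal in plain Python. The `STOP_WORDS` set below is a tiny hand-picked list for illustration only; it is not the list that NLTK or spaCy ships, and `remove_stop_words` is a hypothetical helper name.

```python
import re

# A tiny, hand-picked stop word list (an assumption for this example,
# not the list used by any particular library).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "of", "to", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words("The cat is sitting on the mat in the hall"))
# ['cat', 'sitting', 'mat', 'hall']
```

In practice you would usually swap in a library list, for example `nltk.corpus.stopwords.words("english")` (after downloading the `stopwords` corpus) or `spacy.lang.en.stop_words.STOP_WORDS`, instead of maintaining your own.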
Normalization: Equivalence classing of terms
Normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing. By transforming the words to a standard format, other operations are able to work with the data and will not have to deal with issues that might compromise the process.
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
There are many cases when two character sequences are not quite the same but you would like a match to occur.
- If you search for USA, you might hope to also match documents containing U.S.A, U.S, US, but not ‘us’
- Words like school, schools, school’s can be normalized into the single word school
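The two examples above can be sketched as a small equivalence-classing function. Everything here is an assumption made for illustration: the `EQUIVALENTS` table, the `normalize_token` name, and the crude plural rule are not taken from any library.

```python
import re

# Hypothetical equivalence classes: surface form -> canonical form.
# The lookup is case-sensitive on purpose, so "US" matches but "us" does not.
EQUIVALENTS = {
    "U.S.A": "USA",
    "U.S.A.": "USA",
    "U.S": "USA",
    "U.S.": "USA",
    "US": "USA",
}

def normalize_token(token):
    """Map a token to a canonical form so superficially different tokens match."""
    if token in EQUIVALENTS:
        return EQUIVALENTS[token]
    token = token.lower()
    # Strip a possessive 's (straight or curly apostrophe).
    token = re.sub(r"['’]s$", "", token)
    # Crude plural handling: drop a trailing "s" on longer words,
    # but leave words ending in "ss" (e.g. "class") alone.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

print([normalize_token(t) for t in ["U.S.A", "U.S", "US", "us"]])
# ['USA', 'USA', 'USA', 'us']
print([normalize_token(t) for t in ["school", "schools", "school's"]])
# ['school', 'school', 'school']
```

The plural rule is intentionally naive and will mangle words such as "this"; it is only meant to show what mapping several forms onto one looks like.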
Why Is Normalization Needed?
When we normalize a natural language resource, we reduce the possible randomness in it and bring it closer to its standardized form. This helps to reduce the amount of different information that the machine has to deal with and therefore improves efficiency.
Stemming: normalizing words into their base or root (stem) form
Stemming usually refers to a crude heuristic process that chops off the ends of words in order to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
- Nancy’s gets stemmed to Nancy; here the ’s got chopped
- Caresses becomes caress; here the ‘es’ got chopped
The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm (Porter, 1980).
One of the most popular NLP libraries for stemming is NLTK. Two commonly used stemmers in NLTK are the Porter stemmer and the Snowball stemmer, which implement different algorithms.
While performing natural language processing tasks, you will encounter various scenarios where you find different words with the same root, for instance compute, computer, computing, and computed. You may want to reduce these words to their root form for the sake of uniformity. This is where stemming comes into play.
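To show what "chopping off the ends of words" means, here is a toy suffix stripper. It is not Porter's algorithm, which applies several ordered rule phases with conditions on the stem; the `SUFFIXES` tuple and `crude_stem` function are assumptions made up for this sketch.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")

def crude_stem(word, min_stem_len=3):
    """Strip one known suffix, as long as at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["compute", "computer", "computing", "computed", "caresses"]:
    print(w, "->", crude_stem(w))
# compute -> comput
# computer -> comput
# computing -> comput
# computed -> comput
# caresses -> caress
```

Notice that the stem "comput" is not a real word. That is normal for stemming: the goal is a shared key, not a dictionary entry. For real work you would more likely reach for `nltk.stem.PorterStemmer` or `nltk.stem.SnowballStemmer("english")` and call their `stem()` method.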
Lemmatization: reducing the word to its lemma
In lemmatization, we look for the lemma of each word, that is, its dictionary form.
For instance:
- Nannies becomes nanny
- Initialized becomes initialize
- Privatizing becomes privatize
So, we can say that:
Lemmatization usually refers to doing things properly with the use of vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Lemmatization is not easy to perform, so we fall back on robust libraries like spaCy, which handle it well. For example, if you pass these words to spaCy:
“computer, compute, computing”
after lemmatization you would typically get output along these lines (the exact result depends on the model and on the part of speech it assigns to each word):
- computer → computer
- compute → compute
- computing → compute
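Unlike stemming, lemmatization relies on a vocabulary plus the word's part of speech rather than on chopping suffixes. The sketch below imitates that with a tiny lookup table. The `LEMMA_TABLE` entries, the POS labels, and the `lemmatize` function are all hypothetical and exist only to illustrate the idea; a real lemmatizer uses a full lexicon and morphological rules.

```python
# Hypothetical mini-lexicon: (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("nannies", "NOUN"): "nanny",
    ("computers", "NOUN"): "computer",
    ("computes", "VERB"): "compute",
    ("computed", "VERB"): "compute",
    ("computing", "VERB"): "compute",
}

def lemmatize(word, pos):
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("Nannies", "NOUN"))    # nanny
print(lemmatize("computer", "NOUN"))   # computer
print(lemmatize("computing", "VERB"))  # compute
```

The key contrast with stemming is that "computer" stays "computer": it is already a dictionary form, so nothing is chopped. With spaCy, the lemma of each token is available through the `token.lemma_` attribute once a language model has been loaded.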
In Part 3 of this NLP series, we will continue with some other core concepts behind natural language processing, such as:
- POS tagging
- NER: Named Entity Recognition
We will also go hands-on and write Python code in a Jupyter notebook, using the NLTK and spaCy libraries, to cover all the NLP pre-processing techniques we discussed above.
So, time to sign off with this food for thought:
“When the language of humans becomes the language of our machines, the level of communication will be magical as well as mystical.”
See you in NLP Part 3: NLP pre-processing hands-on using Python
Here is the link:
Hands-On Lab On Text Preprocessing in NLP Using Python
Hands-On Workshop On NLP Text Preprocessing Using Python