Journey through the World of NLP -NLP Pipeline! — Part-2.1

Dinesh Kumar · Published in The Startup · 9 min read · Jan 29, 2021

In the last part, we briefly discussed what NLP is, the areas where it is used heavily, and why it is such a hard problem to solve. This part looks into the steps involved in building an NLP solution.

Fig-1 — NLP Pipeline

The image above illustrates the high-level steps involved in building any NLP model. This step-by-step process is called a pipeline. The process is not strictly linear; some steps need to be revisited when required.

Data Collection

The first step of any machine learning problem is to collect data relevant to the task. There are many ways to acquire the data you need; some common data collection techniques in NLP are listed below.

Public Dataset

There are lots of open-sourced data repositories available on the internet that can be searched for a dataset matching the task.

Web Scraping

Web scraping is another method of collecting the necessary data directly from web pages. There are lots of tools available in Python for such tasks, such as Beautiful Soup (bs4), which we use later in this article.

Product Intervention

Collect the data from the product itself, the way Netflix, Google, and Amazon collect data from user search and viewing activity and recommend products or shows based on it. In a similar way, we can instrument our own product to collect the data we need.

Data Augmentation

Consider the case where data is available but not enough to train any meaningful machine learning model. We can generate more data from the existing dataset; this process is called data augmentation. Some methods to create such augmented data are:

Synonym Replacement: Randomly choose N words from the sentence/document, making sure they are not stop words (a, the, an, etc.), and replace the selected words with synonyms. Synsets in NLTK's WordNet can help generate the synonyms, as in the sketch below.
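A minimal sketch of synonym replacement with NLTK's WordNet (nltk.download('wordnet') and nltk.download('stopwords') are needed first); the sample sentence and the choice of two replacements are just illustrations.

```python
import random
from nltk.corpus import stopwords, wordnet

def synonym_replace(tokens, n=2):
    """Replace up to n non-stop-word tokens with a random WordNet synonym."""
    stop_words = set(stopwords.words("english"))
    candidates = [w for w in tokens if w.lower() not in stop_words and wordnet.synsets(w)]
    random.shuffle(candidates)
    out = list(tokens)
    for word in candidates[:n]:
        # gather lemma names from every synset of the word, excluding the word itself
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(word) for lemma in syn.lemmas()} - {word}
        if synonyms:
            out[out.index(word)] = random.choice(sorted(synonyms))
    return out

print(synonym_replace("the quick brown fox jumps over the lazy dog".split()))
```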

Back Translation: Translate the data into another language with the help of Google Translate or Amazon Translate, then translate it back to the source language (English -> Japanese -> English), as sketched after Fig-2.

Yours sincerely -> 敬具 -> Best regards

Fig-2 — Back translation
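As a rough illustration of back translation, here is a sketch using the unofficial googletrans package (pip install googletrans==4.0.0rc1). This library can be flaky; the paid Google Cloud Translation or Amazon Translate services mentioned above are the more reliable options.

```python
from googletrans import Translator  # unofficial wrapper around Google Translate

translator = Translator()
original = "Yours sincerely"
japanese = translator.translate(original, src="en", dest="ja").text
back = translator.translate(japanese, src="ja", dest="en").text
print(f"{original} -> {japanese} -> {back}")  # e.g. Yours sincerely -> 敬具 -> Best regards
```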

Replacing Entities: Replace person names, locations, titles, company names, roles, etc. For example: "I went to Berlin" -> "I went to New York". A rough sketch using named entity recognition follows.
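A sketch of entity replacement using spaCy's pre-trained NER model (python -m spacy download en_core_web_sm); the replacement table below is an invented illustration, not a recommended mapping.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# hypothetical substitutes per entity type, purely for illustration
replacements = {"GPE": "New York", "PERSON": "Alice", "ORG": "Acme Corp"}

def replace_entities(text):
    doc = nlp(text)
    out = text
    # replace from the end so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in replacements:
            out = out[:ent.start_char] + replacements[ent.label_] + out[ent.end_char:]
    return out

print(replace_entities("I went to Berlin"))  # I went to New York
```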

Adding Noise to the Data: Intentionally add noise such as spelling errors and typos, because much NLP data already contains noise due to its source, social media data being a typical example. We can choose random words and replace them with close misspellings of the same word.

TF-IDF Based Word Replacement: Back translation helps generate lots of data, but there is no guarantee that keywords and meaning are preserved after translation. TF-IDF-based replacement tackles this limitation. The idea behind TF-IDF is that rare words contribute more weight to the model, and a word's importance increases with the number of occurrences within the same document. A TF-IDF score is computed for each token, and words are replaced according to that score so that the informative keywords are preserved, as sketched below.
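A minimal sketch of the idea with scikit-learn's TfidfVectorizer: score every token, then replace only the lowest-scoring (least informative) words so the high-scoring keywords survive. The tiny corpus and the single replacement are illustrative assumptions.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the bus departs from liverpool street station",
    "the train arrives at manchester piccadilly station",
    "bus timetables change during public holidays",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)          # documents x vocabulary score matrix
vocab = vectorizer.get_feature_names_out()

def tfidf_replace(doc_index, n_replace=1):
    tokens = corpus[doc_index].split()
    row = tfidf[doc_index]
    scores = {vocab[j]: row[0, j] for j in row.nonzero()[1]}
    low_value = sorted(scores, key=scores.get)[:n_replace]   # least informative words
    return " ".join(random.choice(list(vocab)) if t in low_value else t for t in tokens)

print(tfidf_replace(0))
```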

Bigram Flipping: Divide the sentence into bigrams (a bigram is a sequence of two adjacent elements from a string of tokens), take one bigram at random, and flip it. For example, in "I am going to New York", we take the bigram "going to" and replace it with the flipped version "to going". A tiny sketch follows.
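Bigram flipping needs nothing beyond plain Python: pick one random adjacent pair of tokens and swap it.

```python
import random

def flip_random_bigram(sentence):
    tokens = sentence.split()
    if len(tokens) < 2:
        return sentence
    i = random.randrange(len(tokens) - 1)          # choose a random adjacent pair
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

print(flip_random_bigram("I am going to New York"))  # e.g. "I am to going New York"
```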

Snorkel: A system for programmatically building and managing training datasets without manual labelling. With Snorkel, users can develop large training datasets in hours or days rather than hand-labelling them over weeks or months. https://www.snorkel.org/
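A minimal labelling-function sketch loosely following the patterns in Snorkel's own tutorials (pip install snorkel); the label values, the refund heuristic, and the toy DataFrame are invented for illustration.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_refund(x):
    # weak heuristic: messages asking for a refund are probably negative
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["I want a refund now", "Great service, thank you!"]})
applier = PandasLFApplier(lfs=[lf_contains_refund])
label_matrix = applier.apply(df=df)   # one column per labelling function
print(label_matrix)
```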

EDA_NLP & nlpaug: These are two Python packages that help to create synthetic samples.
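A short sketch with nlpaug (pip install nlpaug); class names can shift between versions, so treat this as illustrative rather than canonical.

```python
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")   # WordNet-based synonym replacement
print(aug.augment("The quick brown fox jumps over the lazy dog"))
```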

To use the above techniques effectively, we need to start with text cleaning; clean text is the key requirement for them to work well. With that in mind, let's dive into text extraction and cleaning.

Text Extraction and cleaning

This is the process of extracting raw text from the input data by removing all other non-text information. Nowadays, much of the data contains smileys, symbols, and text in varying encoding formats. Text cleaning helps to remove or normalize those characters. Some common text cleaning processes are below:

HTML Parsing and clean up

Let's take an example where we want to collect bus timings for Manchester, England. First, we need to find out whether an official API service is available. Since there are many bus operators in the UK, it is a bit difficult to find an API that covers all available bus routes. There is one website dedicated to showing bus timings in the UK, https://bustimes.org. We will use the bs4 Python package to parse the HTML and retrieve the data we need from the HTML document. Below is a sample code snippet to retrieve the Liverpool Street station departure board.

Fig-3 HTML Parsing
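Since Fig-3 is an image, here is a hedged reconstruction of that kind of scraper with requests and bs4. The stop URL and the decision to dump every table row are assumptions; the real page structure on bustimes.org needs to be inspected before relying on any selector.

```python
import requests
from bs4 import BeautifulSoup

url = "https://bustimes.org/stops/490000254Z"   # hypothetical stop/departure page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# print every table row as a rough departure board
for row in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if cells:
        print(" | ".join(cells))
```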

Unicode Normalization

When collecting data from various sources, it is not uncommon to encounter encoding issues when the dataset spans different locales. Social media data, especially, is full of smileys and other special characters.

Fig-4 Unicode normalization
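A minimal sketch using the standard-library unicodedata module: NFKD normalization decomposes ligatures and accented characters, after which an ASCII encode with errors="ignore" drops whatever cannot be represented. Whether dropping those characters is acceptable depends on the task.

```python
import unicodedata

text = "I love pizza 🍕! café, naïve, ﬁnance"
nfkd = unicodedata.normalize("NFKD", text)                    # decompose characters
ascii_only = nfkd.encode("ascii", "ignore").decode("ascii")   # drop what ASCII cannot hold
print(ascii_only)   # "I love pizza ! cafe, naive, finance"
```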

Spelling correction

The data we receive often has spelling errors due to fat-finger mistakes or fast typing. There are lots of third-party spelling correction APIs available that can save us tons of time, and we can use those tools to correct the spelling of a given dataset. Alternatively, we can build our own spell checker for the target language using a huge dictionary of words.
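A hedged example with the pyspellchecker package (pip install pyspellchecker); the third-party APIs or a hand-rolled dictionary checker mentioned above are the alternatives.

```python
from spellchecker import SpellChecker

spell = SpellChecker()                     # English dictionary by default
for word in ["speling", "korrect", "language"]:
    print(word, "->", spell.correction(word))   # e.g. speling -> spelling
```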

System specific error correction

Textual data can come from various sources, not only raw HTML from the internet or social media text. It can come from PDFs or OCR-scanned images, and each PDF may not share a common encoding format, so we need to use different packages depending on the encoding. There are some great Python packages for extracting text from PDFs, e.g. pypdf and pdfminer.
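A short sketch with pypdf (pip install pypdf); pdfminer.six is the alternative when layout-aware extraction matters. "report.pdf" is a placeholder path.

```python
from pypdf import PdfReader

reader = PdfReader("report.pdf")           # placeholder file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])                          # preview the first 500 characters
```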

Pre-Processing

Extracting specific and relevant information from raw text is one of the crucial steps in an NLP pipeline. NLP software typically works at the sentence level and expects the text to be separated into words, so any NLP pipeline has to start with a reliable way to split the text into sentences and further split those sentences into words. Below are some common pre-processing techniques.

Sentence segmentation

We can do sentence segmentation by breaking up text into sentences at full stops and question marks. Some text, such as abbreviations like "Dr." or "Er.", breaks this simple rule, so we can use the Python NLTK package to perform sentence segmentation instead.

Fig-5 Sentence Segmentation
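Since Fig-5 is an image, here is a runnable equivalent using NLTK's Punkt sentence tokenizer (nltk.download('punkt') first); the sample text is an invented illustration.

```python
import nltk

text = "I met Dr. Brown in Manchester. We discussed the bus timetable. Was it accurate?"
for sentence in nltk.sent_tokenize(text):
    print(sentence)   # "Dr." is not treated as a sentence boundary
```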

Word Tokenization

Similar to sentence segmentation, word tokenization breaks text into words rather than sentences.

Fig-6 Word tokenization
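A matching sketch for word tokenization with NLTK; note how punctuation and contractions become their own tokens.

```python
import nltk

sentence = "Don't just split on spaces; punctuation matters too."
print(nltk.word_tokenize(sentence))
# ['Do', "n't", 'just', 'split', 'on', 'spaces', ';', 'punctuation', 'matters', 'too', '.']
```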

Stop words, digit, punctuation removal & Lower case the text

Tasks like news classification need more pre-processing than simple sentence/word segmentation. We need to think about what information is relevant for grouping the text into specific categories. Usually stop words don't carry much weight; in the same way, upper case versus lower case may not make a difference for the problem.

Fig 7 — Stop word removal
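A minimal cleaning sketch combining lowercasing with stop word, punctuation, and digit removal (the NLTK stopwords and punkt data are needed); the example sentence is invented.

```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def clean(text):
    tokens = word_tokenize(text.lower())
    return [t for t in tokens
            if t not in stop_words              # drop stop words
            and t not in string.punctuation     # drop punctuation tokens
            and not t.isdigit()]                # drop standalone digits

print(clean("The 2 buses to Manchester were delayed, again!"))   # ['buses', 'manchester', 'delayed']
```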

Stemming and Lemmatization

Stemming is the process of removing prefixes and suffixes to reduce a word to a base form, e.g. bike/bikes -> bike. This is done by applying a fixed set of rules, so stemming may not always produce a linguistically correct form. Stemming is commonly used in search engines to match user queries and retrieve relevant documents.

Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. Lemmatization requires more linguistic knowledge, and modelling and developing efficient lemmatizers remains an open problem in NLP.

Fig-8 Stemming and lemmatization
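A side-by-side sketch of NLTK's Porter stemmer and WordNet lemmatizer (nltk.download('wordnet') is needed for the latter); passing pos="v" treats every word as a verb, which is a simplification for the demo.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["bikes", "studies", "running", "better"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
# stems such as "studi" show that stemming is not always a real word
```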

Text Normalization

Working with social media posts needs a different set of rules, because the data contains non-standard spellings, shorthand forms, code-mixed text, emoji, and so on. We need a common representation of the text that captures all these variations in one form; this method is called text normalization. It includes lower-casing the text, expanding abbreviations, formatting or removing numbers, Unicode conversion, etc., as in the small sketch below.
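A tiny normalization sketch in plain Python; the shorthand map is an invented illustration of abbreviation expansion, and a real pipeline would combine it with the Unicode and case handling shown earlier.

```python
import re

SHORTHAND = {"u": "you", "r": "are", "gr8": "great", "b4": "before"}

def normalize(text):
    tokens = text.lower().split()
    tokens = [SHORTHAND.get(t, t) for t in tokens]     # expand known shorthand forms
    text = " ".join(tokens)
    return re.sub(r"\d+", "", text).strip()            # drop standalone numbers

print(normalize("U r gr8 see you b4 9"))   # "you are great see you before"
```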

Language Detection

Not all the data we receive from the internet will be in English. Sometimes it will be in another language, so we need a language-specific pipeline. When input data arrives, we can detect the language using Python packages such as polyglot, and the following steps can then use the pipeline for that language.
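The article names polyglot; as a lighter-weight illustration (polyglot needs system-level ICU dependencies), here is a sketch with the langdetect package (pip install langdetect), which returns an ISO 639-1 language code.

```python
from langdetect import detect

samples = [
    "Where is the nearest bus stop?",
    "¿Dónde está la parada de autobús más cercana?",
]
for text in samples:
    print(detect(text), "-", text)   # e.g. "en - ..." and "es - ..."
```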

Code mixing and Translation

This is where the content is in a non-English language or in more than one language. Code mixing refers to switching between languages within the same text. When people use multiple languages in their writing, they tend to mix in romanized spellings, symbols, non-English words, and so on. This data needs to be handled properly before moving on to other steps in the pipeline.

Not all of the steps are always necessary, and not all of them are performed in order. For example, if we were to remove digits and punctuation, which is removed first may not matter much. However, we typically lower-case the text before stemming. We also don't remove tokens or lower-case the text before lemmatization, because we have to know the part of speech of a word to get its lemma, and that requires all tokens in the sentence to be intact. A good practice is to prepare a sequential list of pre-processing tasks after developing a clear understanding of how to process our data.

A common pre-processing flow is: full text → sentence tokenization → sentences → lowercasing → removal of punctuation → stemming/lemmatization.

Part-of-speech (POS) Tagging

Some tasks, such as identifying specific terms like names, locations, and addresses in a large collection of data, need additional methods in the pre-processing stage. One such method is POS tagging. Most POS taggers fall under rule-based POS tagging, stochastic POS tagging, or transformation-based tagging.

Rule-based POS Tagging: One of the oldest tagging techniques. It uses a dictionary or lexicon to get the possible tags for each word; if a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one.

Stochastic POS Tagging: Any model that incorporates frequency or probability can be called stochastic. The simplest stochastic approaches use word frequencies and tag sequence probabilities.

Transformation-based Tagging: Also called Brill tagging, this is an instance of transformation-based learning, a rule-based algorithm for automatically assigning POS tags to the given text.
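As a quick illustration of POS tagging in practice, NLTK's default pos_tag uses a pre-trained averaged-perceptron model, i.e. a statistical tagger in the stochastic family (nltk.download('averaged_perceptron_tagger') first); the sentence is invented.

```python
import nltk

tokens = nltk.word_tokenize("Dinesh booked a bus ticket to Manchester yesterday")
print(nltk.pos_tag(tokens))
# [('Dinesh', 'NNP'), ('booked', 'VBD'), ('a', 'DT'), ('bus', 'NN'), ('ticket', 'NN'), ...]
```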

Feature Engineering

When we want to develop ML models, the pre-processed text needs to be fed into an ML algorithm. Feature engineering is the set of methods that accomplishes this; it is also called feature extraction. The goal of feature engineering is to capture the text in a numeric vector that can be understood by ML algorithms.

Traditional NLP Pipeline:

Features are heavily inspired by the task at hand as well as domain knowledge. The advantage of handcrafted features is that the model remains interpretable: it is possible to quantify exactly how much each feature influences the model. A small example follows.
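A small sketch of a classic handcrafted feature: a bag-of-words count matrix built with scikit-learn, where every column is a vocabulary word and stays directly interpretable. The two documents are toy examples.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the bus was late again", "the train was on time"]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the vocabulary, one column per word
print(features.toarray())                   # interpretable counts per document
```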

DL Pipeline:

Handcrafted features can become a bottleneck for both model performance and model development, and a noisy feature can harm performance. In a DL pipeline, the raw data is fed directly into the model, which learns the features itself. Since the features are learned rather than handcrafted, the model loses interpretability; it is hard to explain a DL prediction, which is the main disadvantage for business use cases.

This is a good point to stop this article. We will continue with modelling, evaluation, deployment, monitoring, and model updating in the next part of the series. Thank you!

Dinesh Kumar is a passionate full-stack NLP engineer who is trying to uncover the grand mystery of language.