NLP Pipeline: An Overview

Get started with Natural Language Processing pipelines

Yohan Kulasinghe
LinkIT
4 min read · May 15, 2020


Image by Gerd Altmann from Pixabay

What comes to mind when you hear the word pipeline? A plumber’s work of supplying water to your house? No. Here, we are talking about the process and steps you need to follow to get your desired outcome from an NLP-enabled system.

What Is an NLP Pipeline?

Basically, there are three steps in the NLP pipeline.

Text Processing >> Feature Extraction >> Modeling

  • The first step of this pipeline takes raw text, cleans and normalizes it, and converts it into a format from which features can be extracted.
  • The second step is feature extraction, which produces a feature representation of the text.
  • That feature representation is the input to the third step, modeling, where you choose a model that suits your representation and the task you want the NLP system to accomplish.

This is not a strictly linear process. If you are not happy with the outcome you get from the model, you can go back and change how you extracted features. You can even return to the first step, where you normalize the raw text, and change or optimize the way you normalize the data.
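The three stages can be sketched as a chain of functions. Everything here is illustrative: the function names are not from any particular library, and the "model" is a trivial placeholder.

```python
import re
from collections import Counter

def text_processing(raw_text):
    # Clean and normalize: lowercase, keep only word tokens.
    return re.findall(r"[a-z]+", raw_text.lower())

def feature_extraction(tokens):
    # Bag-of-words feature representation: word -> count.
    return Counter(tokens)

def modeling(features):
    # Placeholder "model": label a text as long or short by token count.
    return "long" if sum(features.values()) > 10 else "short"

raw = "The pipeline turns raw text into a prediction."
prediction = modeling(feature_extraction(text_processing(raw)))
print(prediction)  # short
```

If the prediction is poor, you iterate: swap in a different `feature_extraction`, or go back and change `text_processing`, exactly as the loop described above suggests.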

Text Preprocessing

Image by pch.vector from www.freepik.com

Why do we need to preprocess the text? Let me take an example. Most of the time, we get text from internet sources, because the internet is the richest and most common source of text in the world. When you extract text from websites, it includes markup and HTML tags, many of which are irrelevant to the meaning of the content. So those need to be removed.

Text input can come from many sources: speech recognition endpoints, PDF documents, or an OCR-scanned copy of a book. All of these sources include source-specific markings. Our goal here is to extract plain text from any source, so all the source-specific markings need to be removed.
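As a rough illustration, HTML tags can be stripped with a regular expression. This is a simplification: a real project would normally use an HTML parser such as BeautifulSoup rather than a regex.

```python
import re

def strip_html(html):
    # Drop anything between angle brackets, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

page = "<p>NLP <b>pipelines</b> start with plain text.</p>"
print(strip_html(page))  # NLP pipelines start with plain text.
```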

Cleaning is not the only work here; some normalization also needs to be performed. For example, since capitalization usually does not affect meaning, we can convert the text to lowercase. Punctuation should be removed. Common words like “a”, “an”, and “the” (stop words) are often removed because they usually add little to the meaning. Text preprocessing helps avoid unnecessary complexity in later stages.
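The normalization steps just described might look like this. The stop-word list below is a tiny illustrative sample; a real pipeline would use a fuller list, such as the one shipped with NLTK.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative sample

def normalize(text):
    text = text.lower()                   # get rid of capitalization
    tokens = re.findall(r"[a-z]+", text)  # drop punctuation and digits
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The pipeline removes Punctuation, and the stop words!"))
# ['pipeline', 'removes', 'punctuation', 'and', 'stop', 'words']
```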

Feature Extraction

Image by rawpixel.com from www.freepik.com

Once you have clean, normalized text, you can proceed further. But you cannot feed this text directly into a model, because of how text is encoded in modern computers.

A computer has no standard semantic representation for words. To the computer, a word is just a sequence of characters (for example, a sequence of ASCII codes).

You need to represent the preprocessed data according to the model and the task you want to accomplish. For example, if you want a graph-based representation to extract insights, you need a symbolic representation that maps the relationships between chunks of text. If you want statistical analysis, you need some sort of quantitative representation. If you want to perform document-level analysis or sentiment analysis, you need word vectors or a bag-of-words representation.

Likewise, you need to represent the preprocessed text according to your analysis. There are many representations you will learn along the way; only through experience will you get to know which one best fits your task.
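As one concrete example, a bag-of-words representation turns each document into a vector of word counts over a shared vocabulary. Here is a hand-rolled sketch; in practice you would typically use scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared, sorted vocabulary across all documents.
    vocab = sorted({word for doc in docs for word in doc.split()})
    # Represent each document as a count vector over that vocabulary.
    vectors = [[Counter(doc.split())[w] for w in vocab] for doc in docs]
    return vocab, vectors

vocab, vectors = bag_of_words(["nlp is fun", "nlp is hard is it"])
print(vocab)    # ['fun', 'hard', 'is', 'it', 'nlp']
print(vectors)  # [[1, 0, 1, 0, 1], [0, 1, 2, 1, 1]]
```

Note that this representation discards word order entirely, which is exactly why it suits document-level tasks like sentiment analysis better than tasks that depend on syntax.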

Modeling

Image by freepik from www.freepik.com

This is the final stage, and it usually requires statistical or machine learning knowledge. After tuning the parameters for better performance, you can use the system to make predictions on unseen data. Many techniques can be used to build models, such as decision trees, SVMs, ANNs, or combinations of models.
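To make the modeling stage concrete, here is a minimal multinomial Naive Bayes classifier over word counts, written from scratch with add-one smoothing. This is just one simple technique among the many mentioned above, and a real project would more likely use a library implementation such as scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes over word counts (add-one smoothing)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)      # label -> doc count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            prior = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            # Add-one smoothing keeps unseen words from zeroing the score.
            return prior + sum(
                math.log((counts[w] + 1) / (total + len(self.vocab)))
                for w in doc.split()
            )
        return max(self.label_counts, key=log_score)

model = NaiveBayes().fit(
    ["great movie loved it", "terrible movie hated it"],
    ["pos", "neg"],
)
print(model.predict("loved this great film"))  # pos
```

Once a model like this performs well enough, the trained system is what you deploy, which leads to the final point below.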

At the end of this pipeline, you can deploy your fully trained system as a web-based application, integrate it with a mobile application, or embed it in an existing application.

I hope you got an overview of the NLP pipeline! Stay tuned for a more in-depth article with examples.

Undergraduate at Faculty of IT, University of Moratuwa