NLP Preprocessing Pipeline — what, when, why?

Tiago Duque
Published in Analytics Vidhya
9 min read · Jan 21, 2020

This article is part of a series that aims to clarify the most important details of NLP. You can refer to the main article here.

After some backstory, we get to when and why to apply NLP. Along this track there's an important concept called "preprocessing", one that is common to every area of Data Science (you want your data to be neat and clean, right?).

But while with numerical data you'll usually apply some normalization rules (rescaling values between the min and the max), drop or fill NaNs (that is, empty values) and detect outliers, in NLP you'll have a ton more work.

Since words and phrases are more complex than integers or even real numbers (ok, no pure mathematicians here, but since you can represent a word with a set of real numbers, we can assume that they are more complex), the data has to pass through several stages of preprocessing; hence the term Preprocessing Pipeline.

The stages of the Pipeline depend mainly on your project and purposes. Before mentioning common combinations, let us first take a quick look at the most common stages:

  • Bare String Preprocessing: usually not considered part of the NLP Preprocessing Pipeline itself, this is the act of using programming-language-specific functions to modify input strings (such as replacing characters, applying regexes, etc.).
text = "My friend is at his home now. "
text = text.replace("My", "His")
text = text.strip()
  • Tokenization: This is usually considered the first stage of the Pipeline (and it is needed for all future stages, therefore it is mandatory). Tokenization "splits" the phrase into "tokens" (or bits); each token can hold a word, a punctuation mark or a special character (if that's the case for the language). While it may look simple, a good Tokenizer goes way beyond just splitting at space characters.
Tokenization Example from my Master’s.
  • Stemming: A Stem is the "core structure" of a plant; all leaves are connected to distinct parts of the Stem. A similar concept can be applied to words: words are leaves, their "core" is the stem (think of a trie data structure, if you know what I mean). Stemming is the process of reducing each word (encompassed by a token) to its core elements, stripping away tense, gender (if the language has it) and number/degree variation (considering that this information is appended to the last part of the word). With word reduction, this means less sparse data to work with (and less information, if you must)! Therefore, in an ideal world, eating would become eat, making and make would become mak (yeah, looks weird) and so on… Be aware that Stemming does not care whether the resulting stem exists in your language's vocabulary.
Stem, Ste, St, S
  • [Mid-step] Tagging: Before presenting the next step, there's one that is sometimes forgotten by practitioners. The Tagging process is a step where you assign a Tag (or more than one) to each token. Usually, this tag is the part of speech (noun, verb, adjective, pronoun, etc.) of the word represented by the token (a form of morphological analysis). The tagging process is usually done using either rule-based systems (classical) or machine learning (more recently). This means that the token will carry more information than just a string of characters, which can help in future stages of the pipeline. Be aware that if you use stemming, this tagging may (will) not work properly.
Simple POS Tagging courtesy of parts-of-speech.info. Remember, Computers make mistakes too! 🤖
  • Lemmatizing: Now this is an "evolution" of stemming, and you'll usually work with it more often. Instead of just reducing a word to its stem (which in practice you'll notice is not as useful as it seems), Lemmatizing reduces a word to its Lemma, or core representative word. However, to work properly, the Lemmatizer requires the word's Part of Speech (POS) to be identified; this is how a lemmatizer can decide whether banks should be lemmatized to bank or not (surname, Banks, get it!?). Currently this is done by a combination of machine learning techniques and hand-made rules.
One example where we see the POS tagging making a difference.
  • Stop Word Removal: This step removes "non-relevant" words, shrinking the dictionary down to what's supposed to add value to the model (the criteria for what is relevant and what is not may vary with context and purpose). In many cases it is common to remove "stop words": pronouns, conjunctions, determiners and prepositions (not to forget punctuation). This can be done with a crude dictionary or by using POS tagging (remember that you first need these words in place to do the tagging, and only later can you remove them 😅🌴). Another method is removing the most frequent words. Be aware that this means losing information! Do it at your own intellectual risk. A minimal code sketch of all these stages follows this list.
🛑 Stop words list from nltk. Nice and simple tutorial from tutorialexample.com.
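To make these stages concrete, here is a minimal sketch of the steps above using NLTK (the sample sentence, the Porter stemmer and the WordNet lemmatizer are choices I made for this illustration; resource names may vary slightly between NLTK versions):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# One-time downloads of the required NLTK resources.
for resource in ["punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"]:
    nltk.download(resource, quiet=True)

text = "My friend is eating bananas at his home now."

# Tokenization: split the raw string into tokens.
tokens = word_tokenize(text)

# Stemming: crude reduction of each token to its "core" (may produce non-words).
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Tagging: attach a part-of-speech tag to each token.
tagged = nltk.pos_tag(tokens)

# Lemmatizing: reduce each token to its lemma, using the POS tag as a hint.
lemmatizer = WordNetLemmatizer()
def to_wordnet_pos(tag):
    # Map Penn Treebank tags to the coarse tags WordNet expects.
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")
lemmas = [lemmatizer.lemmatize(token, to_wordnet_pos(tag)) for token, tag in tagged]

# Stop word removal: drop very common, low-information words (and punctuation).
stop_words = set(stopwords.words("english"))
content = [lemma for lemma in lemmas if lemma.lower() not in stop_words and lemma.isalpha()]

print(tokens, stems, tagged, content, sep="\n")

spaCy does most of this (except stemming) in a single call to its default pipeline, which shows up again in the examples further below.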

Input Transformation

Now, presumably, we have preprocessed our natural language input, adding and removing tidbits of information at will. If you're applying more specialized NLP activities, there are more steps to understand. However, suppose you're doing a shallow text-classification task. Where to go now?

The next preprocessing steps aim to prepare your input to be used by machine learning models (in other words, how to vectorize it).

  • Count-Vectorization (Bag-of-Words): The classical way of preparing a text for machine learning processing. This technique creates a dictionary of all words in the universe of documents (the vocabulary has to be defined prior to training). Then, for each word in the input document, the count at that word's position in the dictionary goes up by one. This results in a HUGE sparse vector (a list where many positions are empty, since each position represents a word). That is why some of the earlier preprocessing steps focus on reducing vocabulary variety. By the way, this is what is called Bag of Words (a bag, because bags are messy: there's no order to words once they enter).
Behold the power of Count-Vectorizer (Bag of Words) — it can make you rich! 😆
  • TF-IDF: Short for Term Frequency-Inverse Document Frequency. It is an advancement over Bag of Words where, instead of just giving scores based on raw frequency, it balances each word's frequency within a single document against its frequency across all documents. The idea is that it provides better insight into which are the most important words in a document. Still kills word order, though (and it also creates a huge vector to stand in for the dictionary). A minimal scikit-learn sketch of both vectorizers follows below.
This picture from computersciencemaster.com.br shows the idea behind TF-IDF: the most important words are neither the most frequent nor the least frequent, but what lies in the middle.
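As a quick illustration of both vectorizers, here is a minimal scikit-learn sketch (assuming scikit-learn 1.x; the toy documents are made up for this example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag of Words: each column counts how often a vocabulary word appears in a document.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)  # sparse matrix of shape (n_docs, vocab_size)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: the counts are re-weighted by how rare each word is across all documents.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))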
  • Word Embeddings: This one will be a little longer, so bear with me.

Now, the previous attempts had that inconvenient problem of losing word order — this can have several unwanted side effects, such as the one mentioned in the BoW picture.

But how do we provide machine learning programs with their much-needed numerical input if the provided value is text, in a sequential manner? Someone asked: what if each word could be represented by an n-dimensional vector? That's the basic idea behind word embeddings: words represented as vectors (remember: a vector has a magnitude and a direction in each of the space's dimensions). The first and most famous implementation of word embeddings is Word2Vec, by Mikolov et al.

First there’s the need to “train” the embeddings, then you can use them — this means assigning a vector in a n-dimensional vector space for each word — the word vector is “positioned” based on the context in which it appears in the “training data”. Then, instead of an array (lets mix words) of frequencies with the vocabulary size, there’s a fixed length array (say, the maximum input size) filled with the embeddings (those vectors) for each word.

This picture represents one of the greatest advantages of Word Embeddings: they preserve word order. Another advantage is the reduced sparsity of the input vectors (no more big zero vectors, sir!)

Interestingly, it was shown that these embeddings can capture some word semantics, such as having related words “closer” (or almost equally distant) in the vector space.

Graphical representation for the results obtained by Mikolov et al with Word Embeddings — this picture shows that embeddings can capture some word semantics based on context (notice that words are points in space — vectors!).

Finally, it has to be added that Word Embeddings are especially good for Artificial Neural Networks (and Deep Learning), which run very well on vector multiplication.
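Here is a minimal sketch of training and querying embeddings with gensim's Word2Vec (assuming gensim 4.x parameter names; the toy corpus is made up for this example):

from gensim.models import Word2Vec

# A toy corpus: each "sentence" is already tokenized.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# Train small embeddings: each word becomes a 50-dimensional vector,
# positioned according to the contexts it appears in.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["king"].shape)         # the 50-dimensional vector for "king"
print(model.wv.most_similar("king"))  # nearest words in the space (meaningless on a corpus this tiny)

In practice you would train on a much larger corpus, or simply load pretrained vectors.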

!Important! Sentence Padding/Truncating: If you're using Word Embeddings to prepare your input, it is very important that every input sentence has the same length. For that, Padding is used to increase sentence length by adding a special neutral 'word' (tag) to the end of each sentence as many times as needed. Conversely, if the sentence is too long, it has to be reduced, which is usually done with simple truncating mechanisms (this loses data, but if you're working with big enough datasets it won't be so problematic).
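A minimal pure-Python sketch of that idea (the <PAD> token and the target length of 6 are arbitrary choices for this illustration):

def pad_or_truncate(tokens, max_len, pad_token="<PAD>"):
    # Truncate sentences that are too long...
    if len(tokens) > max_len:
        return tokens[:max_len]
    # ...and pad sentences that are too short with a neutral token.
    return tokens + [pad_token] * (max_len - len(tokens))

print(pad_or_truncate(["my", "friend", "is", "home"], 6))
# ['my', 'friend', 'is', 'home', '<PAD>', '<PAD>']
print(pad_or_truncate(["this", "sentence", "is", "way", "too", "long", "indeed"], 6))
# ['this', 'sentence', 'is', 'way', 'too', 'long']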

Pipeline Examples:

In the following section, I'll give a few simple examples of which steps can be used for some common NLP (and machine learning) tasks. Some of them will be implemented in future posts, but if you want to use some of the already available tools and get some results, here they are:

Example of Default spaCy Pipeline
  • Classical Sentiment Analysis:

(1) Clean the data by removing special characters: keep only what can be useful in the context;

(2) Tokenization

(3) POS tagging

(4) Lemmatizing: this will allow us to reduce our vocabulary. Use stemming if precision is not needed and speed is preferred.

(5) Remove stopwords: they won’t help a lot here. Doing it after POS Tagging can help to filter unwanted POS’s (keep adjectives, adverbs, verbs and nouns)

(6) Use BoW or TF-IDF (TF-IDF can suppress the need for stopword removal, but a large vocabulary will still be maintained).

(7) Apply your traditional Machine Learning Normalization Techniques.

The rest is default Machine Learning.
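Here is a minimal sketch of steps (6) and (7) plus a classifier in scikit-learn (the toy labeled texts, assumed to be already cleaned and lemmatized, and the choice of Logistic Regression are assumptions of this illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy dataset: in a real project these texts come out of the preprocessing pipeline.
texts = ["great movie , love it", "terrible plot , waste of time",
         "wonderful acting", "boring and bad"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF vectorization followed by a plain classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["what a wonderful movie", "a boring waste of time"]))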

  • Rule Based Information Extraction from Text:

This is an activity that I find very frequently in StackOverflow NLP questions. It is about extracting structured information from unstructured text input. The preprocessing pipeline is very simple, because we want to make the most of the morphological features:

(1) Tokenization

(2) POS tagging

(3) Parsing: that's a step that we did not talk about, because I don't consider it "preprocessing" but rather a core activity.

(4) NER: another step I don’t consider “preprocessing”.

→ Apply syntax/semantic rule matchers.
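Here is a minimal sketch of such a rule matcher with spaCy (assuming spaCy 3.x with the small English model installed via python -m spacy download en_core_web_sm; the adjective-followed-by-noun pattern is just an illustrative choice):

import spacy
from spacy.matcher import Matcher

# The default small English pipeline handles tokenization, POS tagging, parsing and NER.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox bought an expensive house in London.")

# Rule: an adjective immediately followed by a noun.
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "brown fox", "expensive house"

# Named entities come for free from the default pipeline.
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. [('London', 'GPE')]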

  • Smart Text Autocomplete/NLG:

(1) Tokenization

(2) Padding/Truncating

(3) Embedding

→ Train an RNN or Transformer model.

P.S.: In Transformer models there's almost no need for preprocessing, since the use of a large quantity of documents smooths over any input irregularity. Basically, only good embeddings are needed.
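Here is a minimal sketch of the embedding + RNN part with Keras (the vocabulary size, sequence length and layer sizes are arbitrary assumptions; tokenization, padding and the actual training loop are omitted):

import tensorflow as tf

VOCAB_SIZE = 10_000  # number of distinct tokens after tokenization
MAX_LEN = 20         # fixed input length after padding/truncating

# Next-word prediction: embed each token, read the sequence with an LSTM,
# then predict a probability distribution over the whole vocabulary.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()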

  • Classical Question Answering:

(1) Clean data removing special characters

(2) Tokenization

(3) POS tagging

(4) Lemmatizing

(5) BoW

Apply the same preprocessing to both questions and answers. After that, compare the question vector to all the answer vectors (there are many ways to do that, and this is not the purpose of this post).
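One common way to do that comparison is cosine similarity. Here is a minimal sketch with scikit-learn (the toy question and candidate answers, assumed to be already preprocessed, are made up for this illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = ["when be the meeting"]  # already lemmatized
answers = ["the meeting be at noon",
           "the cat sit on the mat",
           "lunch be serve at one"]

# Fit one shared vocabulary, then vectorize question and answers with BoW.
vectorizer = CountVectorizer().fit(question + answers)
q_vec = vectorizer.transform(question)
a_vecs = vectorizer.transform(answers)

# Compare the question vector to every answer vector and pick the closest one.
scores = cosine_similarity(q_vec, a_vecs)[0]
print(scores)
print(answers[scores.argmax()])  # "the meeting be at noon"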

Tiago Duque is a Data Scientist passionate about data and text, trying to understand and clearly explain all the important nuances of Natural Language Processing.