Building A Basic Text Preprocessing Pipeline In Python

How to build a text preprocessing pipeline for unstructured text data in Python

--

Although the most cutting-edge NLP language models, such as BERT, come with their own tokenization code, removing the need for any preprocessing, many more basic NLP techniques still require varying levels of preprocessing before they can be applied. Python’s large array of libraries provides a helping hand for this, but it can sometimes be difficult to decide which preprocessing tasks to implement, and how to implement them successfully.

In this article, we will go through the following standard preprocessing techniques to understand what they do, when to use them and how to implement them in Python.
This list is by no means exhaustive:

  • Whitespace normalization
  • Lower case
  • Punctuation and number removal
  • Stopword removal
  • Spelling correction
  • Stemming and lemmatization

Every time I worked on a new NLP project, I found myself creating a new preprocessing pipeline in Python. To reduce this workload, over time I gathered the code for the different preprocessing techniques and amalgamated it into a TextPreProcessor GitHub repository, which allows you to create an entire text preprocessing pipeline in a few lines of Python code. All the code below is adapted from this repository.

1. Whitespace Normalization

This is the replacement of multiple sequential whitespace characters with a single space, as well as the removal of leading and trailing whitespace (whitespace at the start or end of a string).


When would you use it?

  • Removes whitespace that may have been left behind by the removal of stopwords, punctuation, etc.
  • Prevents stray spaces ending up in tokens if you are using a tokenizer that splits on a single space.
  • Improves readability if the text will be outputted for human review.
  • When the text will be searched for multi-word phrases using a basic phrase match.

How do you implement it?

Python code:

s = '   The quick brown    fox jumps over the lazy dog   '
print(' '.join(s.split()))

Output:

The quick brown fox jumps over the lazy dog
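The same result can be achieved with a regular expression, which is handy if you want to wrap the step in a reusable function (the function name below is just for illustration):

import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space
    # and strip leading/trailing whitespace
    return re.sub(r'\s+', ' ', text).strip()

s = '   The quick brown    fox jumps over the lazy dog   '
print(normalize_whitespace(s))

This prints the same normalized sentence as above.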

2. Lower Case

Replacing all upper case alphabetical characters with their lowercase counterparts.


When would you use it?

  • When case provides no extra information. Generally, this is when you don’t need to know if a word is at the start of a sentence, or you’re not interested in tone (“That’s enough” vs. “THAT’S ENOUGH”).
  • When using a pre-trained language model that was trained only on lower case text, if you are not using its prebuilt tokenizer.
  • You may want to avoid lower casing with languages, such as German, that capitalise all nouns. In this case, retaining the capitals is useful for identifying and distinguishing nouns from other words.

How do you implement it?

Python Code:

s = 'The quick brown Fox jumps over the lazy Dog'
print(s.lower())

Output:

the quick brown fox jumps over the lazy dog

3. Punctuation and Number Removal

Remove all punctuation and numerical characters from a text. In some cases, numbers may instead be replaced with a placeholder token, such as 'NUMB' (a sketch of this is shown at the end of this section).


When would you use it?

  • If using Bag of words, especially unigrams, which is based on counting occurrences of words. As punctuation and numbers often only retain significant meaning in their given context, they will provide little useful information in isolation.
  • Number removal with bigrams, trigrams, etc. if there is no interest in knowing the number of objects (we need to know there are cats, not that there are three of them).
  • Remove punctuation when you are not interested in tone (“That’s it!” and “That’s it?” have different meanings).

How do you implement it?

The following example removes these punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Python code:

import re
import string

punctuation = string.punctuation
s = 'The 1 quick, brown fox jumps over the 2 lazy dogs.'
# Remove all punctuation characters
print(s.translate(str.maketrans('', '', punctuation)))
# Remove all numbers
print(re.sub(r'\d+', '', s))

Output:

The 1 quick brown fox jumps over the 2 lazy dogs
The  quick, brown fox jumps over the  lazy dogs.

Note: the double spaces left behind by number removal show why whitespace normalization may need to follow this step.
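If you want to preserve the fact that a number was present, a simple variation is to replace each number with a placeholder token instead of deleting it. A minimal sketch, reusing the 'NUMB' token mentioned above:

import re

s = 'The 1 quick, brown fox jumps over the 2 lazy dogs.'
# Replace each run of digits with a placeholder token rather than removing it
print(re.sub(r'\d+', 'NUMB', s))

This prints: The NUMB quick, brown fox jumps over the NUMB lazy dogs.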

4. Stopword Removal

Remove common words which generally do not provide any extra information, such as “a”, “and” and “is”.


When would you use it?

  • Bag of words, where counts of common unigrams do not add any information and so unnecessarily increase the size of the feature space. You may, in some cases, want to consider keeping them for bigrams, trigrams, etc.
  • When you are not interested in negation, i.e. you don't need to know whether a word is preceded by words like 'not' or 'no'.
  • You may want to add additional words to a stopword list in order to remove words, or any sequence of characters, that add no useful information in your specific corpus (e.g. the word "car" in a corpus of car reviews); a sketch of this follows the example below.

How do you implement it?

Here, we use the spaCy stopword list. However, other libraries, such as NLTK, have their own stopword lists. As mentioned above, you could even create your own list.

Python code:

from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
s = 'the quick brown fox jumps over the lazy dog'
# Keep only the words that are not in spaCy's English stopword list
print(' '.join([word for word in s.split(' ') if word.lower() not in stopwords]))

Output:

quick brown fox jumps lazy dog
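To extend this with your own domain-specific stopwords, as suggested above, one simple option is to build a larger set on top of the spaCy list. A minimal sketch, using made-up example words:

from spacy.lang.en.stop_words import STOP_WORDS

# Combine spaCy's stopwords with domain-specific words that add no information here
custom_stopwords = STOP_WORDS | {'car', 'drive'}
s = 'the car is quick and fun to drive'
print(' '.join([word for word in s.split(' ') if word.lower() not in custom_stopwords]))

This prints: quick fun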

5. Spelling Correction

Replacing words not present in a predefined list with a similar word on that list.


When would you use it?

  • Effective spelling correction is useful in almost all NLP use-cases. However, it can be very computationally expensive to find a replacement word. Therefore, the benefit needs to be weighed against the increased processing compute needed.
  • In some cases, it may be more efficient just to remove unrecognised words rather than replace them, if computational cost is a significant factor (see the sketch at the end of this section).

How do you implement it?

The pyspellchecker library provides an implementation of spelling correction based on Levenshtein distance. This is the minimum number of single-character edits (deletions, insertions, substitutions) needed to get from one word to the other.
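To make the metric concrete, here is a purely illustrative sketch of the classic dynamic-programming calculation of Levenshtein distance (this is not how pyspellchecker implements it internally):

def levenshtein(a, b):
    # dp[i][j] holds the edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(levenshtein('jmps', 'jumps'))   # 1: a single insertion
print(levenshtein('qiuck', 'quick'))  # 2: two substitutions

Using pyspellchecker itself is much simpler: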

Python code:

from spellchecker import SpellChecker
spell = SpellChecker()
s = 'The qiuck brown fox jmps over the lazy dog'
print(' '.join([spell.correction(word) for word in s.split(' ')]))

Output:

The quick brown fox mps over the lazy dog

Note: the word “jmps” was incorrectly replaced with “mps” here, showing that this method is by no means perfect.
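If correction proves too slow, the same library can also be used to simply drop words it does not recognise, as mentioned above. A minimal sketch using pyspellchecker's known() lookup:

from spellchecker import SpellChecker

spell = SpellChecker()
s = 'The qiuck brown fox jmps over the lazy dog'
# Keep only the words that the dictionary recognises instead of correcting them
print(' '.join([word for word in s.split(' ') if spell.known([word])]))

This keeps 'The brown fox over the lazy dog' (the exact result depends on the dictionary shipped with the library).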

6. Stemming or Lemmatization

Often in text a word can appear in several different forms (e.g. jump, jumps, jumping) and in other cases, words may derive from a common meaning (e.g. democracy, democratic). The goal of both stemming and lemmatization is to replace these different forms with a singular base form of the word.

Stemming achieves this by following a set of heuristics that chop off, and sometimes replace, the ends of words. E.g.
Quick → Quick
Quicker → Quicker
Quickly → Quick
Quickened → Quicken

Lemmatization is a more involved process that analyses each word and replaces it only with a dictionary-defined lemma. E.g.
Went → Go
Going → Go
Gone → Go
Came → Come


When would you use it?

  • Smaller text corpora, where there aren’t enough occurrences of different forms of a word to learn their meaning independently.
  • Information retrieval, where you are looking for a certain subject matter (if you search for running, you also want results for run, runs, runner, etc.).
  • Document clustering, where you want to split documents by subject.

How do you implement it?

There are many different libraries that implement stemming and lemmatization, and many different methods of stemming. In this example, we will use NLTK's Snowball stemmer for stemming and spaCy for lemmatization:

Python code:

import spacy
from nltk.stem.snowball import SnowballStemmer
s = 'The quickest brown fox was jumping over the lazy dog'
sp = spacy.load('en_core_web_sm')
stemmer = SnowballStemmer(language='english')
print(f"Stemming: {' '.join([stemmer.stem(word) for word in s.split(' ')])}")
print(f"Lemmatization: {' '.join([token.lemma_ for token in sp(s)])}")

Output:

Stemming: the quickest brown fox was jump over the lazi dog
Lemmatization: the quick brown fox be jump over the lazy dog

Conclusion

Knowing which text preprocessing techniques to use is less a science and more trial and error mixed with experience. For many NLP tasks, say a classification model using bag of words, it might be best to try training your model with little to no preprocessing first. From there, you can go back and add preprocessing steps to see what improves model performance.
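As a starting point for that kind of experimentation, the individual steps above can be chained into a single function and switched on or off one at a time. A minimal sketch (the function and parameter names here are purely illustrative, not the TextPreProcessor API):

import re
import string

def preprocess(text, lower=True, remove_punct=True, remove_numbers=True):
    # Apply each optional step, then normalize whitespace at the end
    if lower:
        text = text.lower()
    if remove_punct:
        text = text.translate(str.maketrans('', '', string.punctuation))
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    return ' '.join(text.split())

print(preprocess('The 1 quick, brown fox jumps over the 2 lazy dogs.'))

This prints: the quick brown fox jumps over the lazy dogs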

This list is by no means exhaustive, but I hope it provides a good reference for any NLP tasks you face!
