Text Preprocessing for NLP Applications

Understand the basics of text preprocessing with examples

Published in

LinkIT

3 min read · May 17, 2020

Text preprocessing is typically the first step of a Natural Language Processing (NLP) application. These are the main areas you can work on when preprocessing your text:

Cleaning
Normalization
Tokenization
Stop Word Removal

To give practical examples for these areas, I use “nltk”, the Natural Language Toolkit library for Python. You can install it by running the following command in your preferred terminal. (I’m assuming that you have Python installed on your machine and have basic Python skills.)

pip install nltk

1. Cleaning

The text we work with often carries source-specific markup such as HTML tags. I’ve written a dedicated article on scraping web sources and cleaning them to get plain text. Please read that article before you move on to the normalization section.

2. Normalization

Case Normalization: Before further processing, the whole text should be brought to the same case. We can convert the text to lower case with Python’s “lower()” string method.

# source should contain your text sample
source = source.lower()
print(source)
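As a quick illustration, here is a minimal sketch of case normalization on a made-up sample string. The sample text and the variable name “normalized” are assumptions chosen for this example only.

```python
# Hypothetical sample text, used only for illustration
source = "Text Preprocessing is the FIRST step of every NLP application."

# str.lower() returns a new string; the original string is left unchanged
normalized = source.lower()
print(normalized)
# text preprocessing is the first step of every nlp application.
```

Because strings are immutable in Python, the result has to be assigned to a variable; calling “lower()” on its own does not change “source”.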

Punctuation Removal: Depending on the task that you want to accomplish, you may want to remove punctuation. That can be achieved with a regular expression that replaces every character other than a letter or digit with a space, as follows.

import re

# source should contain your text sample
source = re.sub(r"[^a-zA-Z0-9]", " ", source)
print(source)
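To make the effect of this filter concrete, here is a small sketch on a made-up sentence. The sample text and the extra whitespace-collapsing step are my own additions for illustration, not a required part of the technique.

```python
import re

# Hypothetical sample text, used only for illustration
source = "Hello, world! It's 2020... let's preprocess."

# Replace every character that is not a letter or digit with a space
no_punct = re.sub(r"[^a-zA-Z0-9]", " ", source)

# Collapse the runs of spaces left behind and trim both ends
cleaned = " ".join(no_punct.split())
print(cleaned)
# Hello world It s 2020 let s preprocess
```

Notice that the apostrophes are removed too, so “It's” becomes “It s”. If contractions matter for your task, you may want a less aggressive pattern.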

3. Tokenization

In NLP, tokens are usually individual words (and, depending on the tokenizer, punctuation marks). So tokenization simply means splitting text into a list of such tokens. This can be done very easily with “nltk”’s “word_tokenize()” function.

from nltk.tokenize import word_tokenize

# source should contain your text sample
words = word_tokenize(source)
print(words)

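If you get an error saying that a resource is missing, open the Python shell and run the following lines, then run the above code again. You usually get this error the first time you use the tokenizer. Newer “nltk” releases may ask for “punkt_tab” instead of “punkt”; download whichever resource the error message names.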

import nltk
nltk.download('punkt')
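To see what a tokenizer adds over a plain “split()”, here is a rough sketch that uses only the standard library. The “simple_tokenize” helper is a hypothetical stand-in written for this example; it is not how “nltk” tokenizes text and will differ from it on contractions, abbreviations, and similar cases.

```python
import re

def simple_tokenize(text):
    """Return word tokens and single punctuation marks as separate items."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample text, used only for illustration
sample = "Hello, world! NLP is fun."

print(sample.split())
# ['Hello,', 'world!', 'NLP', 'is', 'fun.']
print(simple_tokenize(sample))
# ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
```

With a plain split, punctuation stays glued to the words (“Hello,” and “Hello” would count as different tokens), which is the kind of problem a real tokenizer handles for you.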

4. Stop Word Removal

Stop words are uninformative words, such as “a”, “an”, and “the”. The meaning of a sentence can usually still be inferred after those words are removed, and removing them helps to avoid unnecessary complexity in later stages.

A word may be a “stop word” in one application but useful in another application. So this stage needs some attention. You can see the default “stop words” by running the following line.

from nltk.corpus import stopwords

print(stopwords.words("english"))

If you get an error saying that a resource is missing, open the Python shell and run the following lines, then run the above code again. (You usually get this error the first time.)

import nltk
nltk.download('stopwords')

Removing “stop words” can be achieved using the following lines of code.

from nltk.corpus import stopwords

# words should contain your list of tokens
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
words = [w for w in words if w not in stop_words]
print(words)
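Since a word can be a stop word in one application and useful in another, you may want to adjust the list before filtering. Here is a minimal sketch of that idea. The tiny stop word set and the sample tokens are hand-written assumptions for illustration; with “nltk” you would start from set(stopwords.words("english")) instead.

```python
# Hypothetical, hand-written stop word set, used only for illustration.
# With nltk you would start from set(stopwords.words("english")) instead.
default_stop_words = {"a", "an", "the", "is", "this", "not"}

# For a task such as sentiment analysis, negations carry meaning, so keep them
custom_stop_words = default_stop_words - {"not"}

# Hypothetical list of tokens
words = ["this", "is", "not", "a", "good", "movie"]

with_default = [w for w in words if w not in default_stop_words]
with_custom = [w for w in words if w not in custom_stop_words]

print(with_default)
# ['good', 'movie']
print(with_custom)
# ['not', 'good', 'movie']
```

With the default set, “not a good movie” and “a good movie” end up with the same tokens, which is why this stage needs some attention.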

There are more text preprocessing techniques, such as “Named Entity Recognition” and “Stemming and Lemmatization”. I’ll discuss those in future articles.

Stay tuned for more articles.