Text Preprocessing for NLP Applications

Understand the basics of text preprocessing with examples

Yohan Kulasinghe
LinkIT
3 min read · May 17, 2020

Text preprocessing is typically the first step of any Natural Language Processing (NLP) application. These are the main areas to work on when preprocessing your text:

  1. Cleaning
  2. Normalization
  3. Tokenization
  4. Stop Word Removal

To give practical examples for these areas, I use “nltk”, the Natural Language Toolkit library for Python. You can install it by running “pip install nltk” in your preferred terminal. (I’m assuming that you have Python installed on your machine and have basic Python skills.)
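To confirm that the installation worked before moving on, a small check like the one below can help. This is only a sketch, and the helper name “nltk_is_installed” is made up for illustration.

```python
import importlib.util


def nltk_is_installed() -> bool:
    """Return True if this Python interpreter can find the nltk package."""
    return importlib.util.find_spec("nltk") is not None


if nltk_is_installed():
    print("nltk is available")
else:
    # Installation happens in a terminal, not inside Python
    print("nltk not found; run: pip install nltk")
```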

1. Cleaning

The text we work with often carries source-specific markup such as HTML tags. I’ve written a dedicated article on scraping web sources and cleaning them to get plain text. Please read that article before moving on to the normalization section.
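For a quick feel of what cleaning involves, here is a minimal sketch that strips HTML tags using only Python’s standard library. It is not the approach from the dedicated article; the class name “TextExtractor”, the function name “strip_html”, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


raw = "<p>Text <b>preprocessing</b> is the first step.</p>"
print(strip_html(raw))
```

Note that this simple version also keeps the contents of script and style elements, so real web pages usually need more careful handling.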

2. Normalization

Case Normalization: Before further processing, the whole text should be converted to a single case. We can convert the text to lower case with Python’s “lower()” string method.

Screenshot 2.0 by Author
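As a minimal sketch of case normalization (the sample sentence is made up for illustration):

```python
text = "Text Preprocessing is the FIRST step of every NLP application."

# str.lower() returns a new string; the original is left unchanged
lowered = text.lower()
print(lowered)
```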

Punctuation Removal: Depending on the task you want to accomplish, you may want to remove punctuation. This can be done by applying a regular expression filter.

Screenshot 2.1 by Author
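One possible regular expression filter is sketched below. The exact pattern is an assumption: it removes every character that is neither a word character nor whitespace, which may be too aggressive for some tasks (for example, it also drops apostrophes).

```python
import re

text = "Hello, world! Isn't NLP fun?"

# [^\w\s] matches anything that is not a letter, digit, underscore or whitespace
no_punct = re.sub(r"[^\w\s]", "", text)
print(no_punct)
```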

3. Tokenization

In this article, our tokens are individual words, so tokenization simply means splitting sentences into a list of words. This is easy to do with nltk’s “word_tokenize()” function.

Screenshot 3.0 by Author
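A minimal sketch of word tokenization with nltk’s “word_tokenize()” could look like this. The wrapper name “tokenize” and the regex fallback are assumptions added so the example still runs when nltk or its tokenizer data is missing; the fallback is much cruder than nltk’s tokenizer.

```python
import re


def tokenize(text):
    """Split text into word tokens, preferring nltk's word_tokenize()."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # nltk or its tokenizer data is not available: fall back to a crude
        # split into runs of word characters and single punctuation marks
        return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Tokenization splits text into words.")
print(tokens)
```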

If you get an error saying a resource was not found, which usually happens the first time you run the tokenizer, open the Python shell and download the missing tokenizer data with “nltk.download()”. Then run the tokenization code again.
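The download step can be sketched as follows. The function name “download_tokenizer_data” is made up for illustration, and which resource name your nltk version asks for is an assumption: older releases look for “punkt”, while newer ones look for “punkt_tab”, so the sketch requests both.

```python
def download_tokenizer_data():
    """Fetch the tokenizer data that word_tokenize() relies on."""
    import nltk

    # The resource name depends on the nltk version, so request both;
    # the error message you received names the one that is actually missing
    for name in ("punkt", "punkt_tab"):
        nltk.download(name)
```

Calling “download_tokenizer_data()” once from the Python shell should be enough, because nltk keeps the downloaded data on disk. It needs an internet connection.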

4. Stop Word Removal

Stop words are common words that carry little information, such as “a”, “an”, and “the”. The meaning of a sentence can still be inferred after removing them, and removing them helps avoid unnecessary complexity in the later stages.

A word may be a stop word in one application but useful in another, so this stage needs some attention. You can see nltk’s default stop words for a language, for example English, through “stopwords.words()” in “nltk.corpus”.

Screenshot 4.0 by Author
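Listing the default English stop words might look like the sketch below. The wrapper name “default_stop_words” and the short stand-in list are assumptions added so the example runs even without nltk’s stop word data; the real list is much longer.

```python
def default_stop_words(language="english"):
    """Return nltk's stop word list, or a tiny stand-in if it is unavailable."""
    try:
        from nltk.corpus import stopwords
        return stopwords.words(language)
    except (ImportError, LookupError):
        # Hypothetical stand-in, used only when nltk or its data is missing
        return ["a", "an", "the", "is", "of", "and"]


words = default_stop_words()
print(len(words), words[:10])
```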

If you get an error saying a resource was not found (usually the first time you run this), open the Python shell and download the stop word list with “nltk.download('stopwords')”. Then run the code again.
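As a sketch, the download can be wrapped in a small helper; the function name “download_stop_words” is made up for illustration.

```python
def download_stop_words():
    """Fetch nltk's stop word lists (needs an internet connection)."""
    import nltk

    nltk.download("stopwords")
```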

Stop words can then be removed by keeping only the tokens that are not in the stop word list.

Screenshot 4.1 by Author
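A minimal sketch of that filtering step is shown below. The function name “remove_stop_words” and the short stop word list are assumptions for illustration; in practice you would pass nltk’s list, for example “stopwords.words('english')”.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words (case-insensitive)."""
    stop_set = {word.lower() for word in stop_words}  # set gives fast lookups
    return [token for token in tokens if token.lower() not in stop_set]


# Small illustrative stop word list, not nltk's full list
stop_words = ["a", "an", "the", "is", "of"]
tokens = ["The", "cat", "is", "on", "a", "mat"]
print(remove_stop_words(tokens, stop_words))
```

Comparing in lower case means “The” and “the” are treated the same, which matches the case normalization step described earlier.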

There are more text preprocessing techniques, such as “Named Entity Recognition” and “Stemming and Lemmatization”. I’ll discuss those in future articles.

Stay tuned for more articles.

Yohan Kulasinghe
LinkIT

Undergraduate at Faculty of IT, University of Moratuwa