A Tutorial on Natural Language Processing (NLP)

Pre-processing your data

Aroshi Ghosh
Student Spectator
5 min read · Dec 23, 2020


This tutorial helps you understand how to pre-process your data for natural language processing.

What is NLP?

Have you ever wondered how smart assistants such as Microsoft’s Cortana or Amazon’s Alexa “understand” what we tell them? Perhaps you have used Optical Character Recognition (OCR) software to take a picture of a textbook page and convert it into an editable text document? These are examples of Natural Language Processing (NLP).

Natural Language Processing is the field in which computers understand and generate human language. There are many parts to NLP, but one of the most important is learning how to pre-process your data effectively. To do this, you must understand the variety of pre-processing techniques that are available as well as what your data represents. Throughout this article, we will explore various kinds of pre-processing algorithms and when to use them.

Pre-processing Tutorial

Use the following steps to pre-process your data and use NLP:

Step 1: Import NLTK library and other packages to obtain sample data

To begin, we must import a few Python libraries.

Our main library will be NLTK, the Natural Language Toolkit. Additionally, we will import pandas, matplotlib.pyplot, and NumPy for data handling and visualization. These lines of code give us access to a few sample text datasets as well as the tools needed to process our data.
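The original code cells are not reproduced here, so the snippet below is a minimal sketch of this step; the choice of NLTK’s Gutenberg corpus and the austen-emma.txt sample is an assumption, and any sample corpus would work.

```python
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# One-time downloads: tokenizer models, the stop word list,
# and a sample corpus to practice on
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('gutenberg')

# Load one of NLTK's built-in sample texts
# (assumed choice; any corpus works here)
from nltk.corpus import gutenberg
sample_text = gutenberg.raw('austen-emma.txt')
```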

Step 2: Tokenize your data to break it up into words and phrases

The next step in pre-processing is tokenization. Tokenization allows us to break up large amounts of text into phrases or words so that the computer can find value and meaning within the data. There are two main types of tokenization:

  1. Sentence-level tokenization
  2. Word-level tokenization

Depending on your goal and data, each type has its own benefits. For example, if you have short text messages and you want to sort them, it might be more beneficial to break up each text message into word tokens. However, if you are looking at a long article or book, it might be better to look at sentence tokens.

Before you decide which type of tokenization to use, print out a sample of your data tokenized in both formats. Then inspect the tokens and choose the format that will be most useful for your project.
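As a rough sketch (assuming the sample_text variable loaded in Step 1), you can produce both token formats with NLTK’s built-in tokenizers and compare them:

```python
from nltk.tokenize import sent_tokenize, word_tokenize

# Sentence-level tokens: one string per sentence
sentences = sent_tokenize(sample_text)

# Word-level tokens: one string per word or punctuation mark
words = word_tokenize(sample_text)

# Print a small sample of each format before deciding which to use
print(sentences[:3])
print(words[:20])
```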

Step 3: Use the Frequency Distribution function to display frequently used words

For the rest of this tutorial, we will be utilizing the sample text we imported earlier from the NLTK library.

To fully understand what is happening within our data, we can use a frequency distribution function that will display the most frequent tokens and words.
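Here is a sketch of this step using NLTK’s FreqDist class, assuming the words list from the tokenization step:

```python
from nltk import FreqDist

fdist = FreqDist(words)

# Show the 20 most frequent tokens and their counts
print(fdist.most_common(20))

# FreqDist also has a built-in matplotlib plot of the top tokens
fdist.plot(20)
```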

When you view the results, you might see the same word counted separately due to differences in capitalization, or meaningless words such as “a” and “to” along with punctuation. All of these extraneous items must be removed so that the NLP model can focus on the important parts of the data.

Step 4: Cleaning up your data

(i) Remove capitalization

While working with frequency distributions, we noticed that many words were identified as “different” tokens because some instances were capitalized while others were not. To combine these groups, we must convert all characters to their lowercase forms.
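Continuing the sketch, lowercasing is a one-line list comprehension over the word tokens:

```python
# Convert every token to lowercase so "The" and "the"
# are counted as the same token
words = [w.lower() for w in words]
```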

(ii) Remove punctuation

Next, we must remove all meaningless punctuation. This allows our model to focus entirely on the important phrases in the text instead of, say, the number of periods.

There are a few different ways to remove punctuation.

Try out all three methods below and print out the results. Double-check that your data appears the way you want it to, and note that each method removes a slightly different set of characters.
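The following is a sketch of three common approaches; they behave slightly differently (the first drops any token containing a non-letter, while the others only strip punctuation characters):

```python
import re
import string

# Method 1: keep only purely alphabetic tokens
method_1 = [w for w in words if w.isalpha()]

# Method 2: drop tokens that are themselves single punctuation marks
method_2 = [w for w in words if w not in string.punctuation]

# Method 3: strip punctuation characters out of each token with a regex,
# then discard any tokens left empty
method_3 = [re.sub(r'[^\w\s]', '', w) for w in words]
method_3 = [w for w in method_3 if w]
```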

(iii) Remove stop words

As you may have noticed in the frequency distribution example, there are many instances of words such as “a” and “to,” but these words do not add meaning to the data. They are called “stop words.” Depending on your goal, you may want to remove stop words before you feed your data into your model.
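NLTK ships stop word lists for English and other languages. A sketch, assuming the cleaned token list (method_1) from the previous step:

```python
from nltk.corpus import stopwords

# A set lookup is faster than a list lookup for filtering
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
filtered_words = [w for w in method_1 if w not in stop_words]
```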

(iv) Process collocated data

Often, when dealing with long passages of text, you may encounter certain words that appear together, such as names of people, places, or things that should not be separated. These are examples of collocated data.

There are two main types of collocations:

  1. Bigrams
  2. Trigrams

Bigrams consist of two words that appear together; trigrams consist of three. Collocations of more than three words are rare in practice, and when they do appear, they are often a sign of repetitive writing.

In order to sort out bigrams and trigrams, you can run cells like the following.
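The original notebook cells are not shown, so this is a sketch using NLTK’s collocation finders, scored here with pointwise mutual information (PMI) and applied to the filtered_words list from the previous step:

```python
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

# Find bigrams, ignoring pairs that appear fewer than three times
bigram_finder = BigramCollocationFinder.from_words(filtered_words)
bigram_finder.apply_freq_filter(3)
print(bigram_finder.nbest(BigramAssocMeasures.pmi, 10))

# Same idea for trigrams
trigram_finder = TrigramCollocationFinder.from_words(filtered_words)
trigram_finder.apply_freq_filter(3)
print(trigram_finder.nbest(TrigramAssocMeasures.pmi, 10))
```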

Conclusion

Hopefully, you enjoyed reviewing these steps to pre-process your data for Natural Language Processing. These algorithms and functions can help you get started, but it is still up to you to continue learning and apply these skills to various projects. You may switch the order in which you process the data based on what works best for your project. When working with NLP, always ask yourself two questions: “What is my goal?” and “What is my data?”
