Preprocessing Steps for Natural Language Processing (NLP): A Beginner’s Guide

Maleesha De Silva
5 min read · Apr 30, 2023


Machine learning models rely heavily on the quality of the data fed into them, so data preprocessing plays a crucial role in ensuring the accuracy and efficiency of a model. In this article, we will discuss the main text preprocessing techniques used in NLP.

1. Text Cleaning

In this step, we perform fundamental actions to clean the text: converting all text to lowercase, removing URLs, removing characters that are not words or whitespace, and removing numerical digits.

I. Converting to lowercase

Python is a case-sensitive programming language: string comparisons distinguish uppercase from lowercase. Therefore, to avoid inconsistencies when processing the text, we convert all of it to lowercase.

This way, “Free” and “free” will be treated as the same word, and our data analysis will be more accurate and reliable.

df = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)
Dataset before and after converting all text into lowercase
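If only the message text needs to be lowercased, the same result can be achieved with pandas' vectorized string accessor. Here is a minimal sketch, assuming the text column is named 'Message' as in the rest of this article.

# Lowercase only the 'Message' column using the vectorized .str accessor
df['Message'] = df['Message'].str.lower()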

II. Removing URLs

When building a model, URLs are typically not relevant and can be removed from the text data.

To remove URLs, we can use Python's built-in ‘re’ module.

import pandas as pd
import re

# Define a regex pattern to match URLs
url_pattern = re.compile(r'https?://\S+')

# Define a function to remove URLs from text
def remove_urls(text):
    return url_pattern.sub('', text)

# Apply the function to the 'Message' column, overwriting it with the cleaned text
df['Message'] = df['Message'].apply(remove_urls)

III. Removing non-word and non-whitespace characters

It is essential to remove any characters that are not words or whitespace from the text dataset.

These non-word and non-whitespace characters can include punctuation marks, symbols, and other special characters that do not provide any meaningful information for our analysis.

df = df.replace(to_replace=r'[^\w\s]', value='', regex=True)
Dataset before and after removing all non-word and non-whitespace characters

IV. Removing digits

It is important to remove all numerical digits from the text dataset. This is because, in most cases, numerical values do not provide any significant meaning to the text analysis process.

Moreover, they can interfere with natural language processing algorithms, which are designed to understand and process text-based information.

df = df.replace(to_replace=r'\d', value='', regex=True)
Dataset before and after removing digits

2. Tokenization

Tokenization is the process of breaking down large blocks of text such as paragraphs and sentences into smaller, more manageable units.

Tokenization of text
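To see the idea on a single piece of text, the short sketch below splits a made-up example string first into sentences and then into words; it assumes NLTK is installed and downloads the ‘punkt’ tokenizer data it needs.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models, needed only once

sample = "Congratulations! You have won a free ticket. Call now to claim."
print(sent_tokenize(sample))  # splits the text into sentences
print(word_tokenize(sample))  # splits the text into individual words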

In this step, we will be applying word tokenization to split the data in the ‘Message’ column into words.

By performing word tokenization, we can obtain a more accurate representation of the underlying patterns and trends present in the text data.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed only once

df['Message'] = df['Message'].apply(word_tokenize)
Dataset after tokenization

3. Stopword Removal

Stopwords refer to the most commonly occurring words in any natural language.

For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. Therefore, removing stopwords can help us to focus on the most important information in the text and improve the accuracy of our analysis.

One of the advantages of removing stopwords is that it can reduce the size of the dataset, which in turn reduces the training time required for natural language processing models.

Stopword removal

Various libraries such as ‘Natural Language Toolkit’ (NLTK), ‘spaCy’, and ‘Scikit-Learn’ can be used to remove stopwords.
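For a rough sense of how these built-in lists differ, the sketch below prints the size of each library's English stopword list. It assumes spaCy and scikit-learn are installed alongside NLTK; the exact counts vary by library version.

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en.stop_words import STOP_WORDS

nltk.download('stopwords')  # stopword corpus, needed only once

print(len(stopwords.words('english')))  # NLTK's English stopword list
print(len(ENGLISH_STOP_WORDS))          # scikit-learn's built-in list
print(len(STOP_WORDS))                  # spaCy's English stopword list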

In this example, we will use the NLTK library to remove stopwords in the ‘Message’ column of our dataset.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # stopword corpus, needed only once

stop_words = set(stopwords.words('english'))
df['Message'] = df['Message'].apply(lambda x: [word for word in x if word not in stop_words])
Dataset after removal of stopwords

4. Stemming/Lemmatization

What’s the difference between Stemming and Lemmatization? Stemming strips word endings using heuristic rules, so the result is often not a dictionary word (e.g. ‘studies’ becomes ‘studi’), whereas lemmatization uses a vocabulary and the word’s part of speech to return its base form (e.g. ‘studies’ becomes ‘study’).

Stemming vs Lemmatization
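The contrast is easiest to see on individual words. Here is a minimal sketch comparing the two; it downloads the ‘wordnet’ corpus the lemmatizer needs and passes pos='v' so each word is lemmatized as a verb.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'running', 'cries']:
    # e.g. 'studies' stems to 'studi' but lemmatizes (as a verb) to 'study'
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))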

There are various algorithms that can be used for stemming (a quick comparison of two of them appears after this list),

· Porter Stemmer algorithm

· Snowball Stemmer algorithm

· Lovins Stemmer algorithm
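Of these, the Porter and Snowball stemmers ship with NLTK; to my knowledge the Lovins stemmer is not bundled with NLTK, so it is left out of the minimal sketch below, which simply compares the first two on a few sample words.

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # Snowball is also known as Porter2

for word in ['generously', 'fairly', 'running']:
    # the two algorithms occasionally produce different stems for the same word
    print(word, '->', porter.stem(word), '|', snowball.stem(word))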

Stemming

Let’s take a look at how we can use the ‘Porter Stemmer’ algorithm on our dataset.

The Porter Stemmer works by applying a sequence of suffix-stripping rules, for example SSES → SS (‘caresses’ → ‘caress’) and IES → I (‘ponies’ → ‘poni’).

import nltk
from nltk.stem import PorterStemmer
import pandas as pd

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Define a function to stem each token in a list of words
def stem_words(words):
    return [stemmer.stem(word) for word in words]

# Apply the function to the 'Message' column and store the result in 'stemmed_messages'
df['stemmed_messages'] = df['Message'].apply(stem_words)

Lemmatization

Next, let’s take a look at how we can implement Lemmatization for the same dataset.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd

# Download the POS tagger and WordNet data (needed only once)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# convert a word's POS tag to WordNet format (defaulting to noun)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# lemmatize each token in a list of words using its POS tag
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]

# apply the lemmatization function to the 'Message' column of the dataframe
df['lemmatized_messages'] = df['Message'].apply(lemmatize_tokens)

The above code segments will produce outputs as shown below.

Stemmed and Lemmatized dataset

Note that we apply either stemming or lemmatization to our dataset, not both, depending on the requirement.

Conclusion

In this article, we discussed the main preprocessing steps in building an NLP model: text cleaning, tokenization, stopword removal, and stemming/lemmatization. Implementing these steps helps improve model accuracy by reducing noise in the text data and converting it into a structured format that the model can analyze more easily.
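To tie everything together, here is a minimal end-to-end sketch of the pipeline discussed above. It assumes a pandas DataFrame ‘df’ with a ‘Message’ column (as used throughout this article), that the required NLTK resources (‘punkt’, ‘stopwords’, ‘wordnet’) have been downloaded, and that lemmatizing with the default (noun) part of speech is acceptable for brevity.

import re
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                   # 1. text cleaning: lowercase
    text = re.sub(r'https?://\S+', '', text)              # remove URLs
    text = re.sub(r'[^\w\s]', '', text)                   # remove non-word characters
    text = re.sub(r'\d', '', text)                        # remove digits
    tokens = word_tokenize(text)                          # 2. tokenization
    tokens = [t for t in tokens if t not in stop_words]   # 3. stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # 4. lemmatization

df['clean_tokens'] = df['Message'].apply(preprocess)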
