Preprocessing Steps for Natural Language Processing (NLP): A Beginner’s Guide
Machine Learning heavily relies on the quality of the data fed into it, and thus, data preprocessing plays a crucial role in ensuring the accuracy and efficiency of the model. In this article, we will discuss the main text preprocessing techniques used in NLP.
1. Text Cleaning
In this step, we will perform fundamental actions to clean the text: converting all text to lowercase, removing URLs, eliminating characters that are neither word characters nor whitespace, and stripping numerical digits.
I. Converting to lowercase
Python is a case sensitive programming language. Therefore, to avoid any issues and ensure consistency in the processing of the text, we convert all the text to lowercase.
This way, “Free” and “free” will be treated as the same word, and our data analysis will be more accurate and reliable.
# Lowercase every string cell in the DataFrame, leaving non-string values untouched
df = df.applymap(lambda x: x.lower() if isinstance(x, str) else x)
II. Removing URLs
When building a model, URLs are typically not relevant and can be removed from the text data.
To remove URLs, we can use Python’s built-in ‘re’ module for regular expressions.
import pandas as pd
import re
# Define a regex pattern to match URLs
url_pattern = re.compile(r'https?://\S+')
# Define a function to remove URLs from text
def remove_urls(text):
    return url_pattern.sub('', text)
# Apply the function to the 'Message' column, replacing its contents with the cleaned text
df['Message'] = df['Message'].apply(remove_urls)
III. Removing non-word and non-whitespace characters
It is essential to remove any characters that are not considered as words or whitespace from the text dataset.
These non-word and non-whitespace characters can include punctuation marks, symbols, and other special characters that do not provide any meaningful information for our analysis.
df = df.replace(to_replace=r'[^\w\s]', value='', regex=True)
IV. Removing digits
It is important to remove all numerical digits from the text dataset. This is because, in most cases, numerical values do not provide any significant meaning to the text analysis process.
Moreover, they can interfere with natural language processing algorithms, which are designed to understand and process text-based information.
df = df.replace(to_replace=r'\d', value='', regex=True)
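Putting the four cleaning steps above together, here is a minimal sketch applied to a single made-up message (the sample text is invented for illustration; the regex patterns mirror the ones used in this section):

```python
import re

# A made-up sample message for illustration
text = "Check https://spam.example 50% OFF!!! Call 0800"

text = text.lower()                       # I. convert to lowercase
text = re.sub(r'https?://\S+', '', text)  # II. remove URLs
text = re.sub(r'[^\w\s]', '', text)       # III. remove non-word, non-whitespace characters
text = re.sub(r'\d', '', text)            # IV. remove digits

print(text.split())  # ['check', 'off', 'call']
```

In the article these same patterns are applied to an entire DataFrame column via ‘replace(..., regex=True)’; the transformation each pattern performs is identical.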
2. Tokenization
Tokenization is the process of breaking down large blocks of text such as paragraphs and sentences into smaller, more manageable units.
In this step, we will be applying word tokenization to split the data in the ‘Message’ column into words.
By performing word tokenization, we can obtain a more accurate representation of the underlying patterns and trends present in the text data.
import nltk
nltk.download('punkt')  # word_tokenize requires the 'punkt' tokenizer models
from nltk.tokenize import word_tokenize
df['Message'] = df['Message'].apply(word_tokenize)
3. Stopword Removal
Stopwords refer to the most commonly occurring words in any natural language.
For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. Therefore, removing stopwords can help us to focus on the most important information in the text and improve the accuracy of our analysis.
One of the advantages of removing stopwords is that it can reduce the size of the dataset, which in turn reduces the training time required for natural language processing models.
Various libraries such as ‘Natural Language Toolkit’ (NLTK), ‘spaCy’, and ‘Scikit-Learn’ can be used to remove stopwords.
In this example, we will use the NLTK library to remove stopwords in the ‘Message’ column of our dataset.
import nltk
nltk.download('stopwords')  # the stopword lists must be downloaded once
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df['Message'] = df['Message'].apply(lambda x: [word for word in x if word not in stop_words])
4. Stemming/Lemmatization
What’s the difference between Stemming and Lemmatization? Stemming trims word endings using heuristic rules and can produce forms that are not real words (e.g. ‘studies’ becomes ‘studi’), whereas lemmatization uses a vocabulary and part-of-speech information to return the dictionary base form, or lemma (e.g. ‘studies’ becomes ‘study’).
There are various algorithms that can be used for stemming:
· Porter Stemmer algorithm
· Snowball Stemmer algorithm
· Lovins Stemmer algorithm
Stemming
Let’s take a look at how we can use ‘Porter Stemmer’ algorithm on our dataset.
Some basic rules defined under the Porter Stemmer algorithm are: SSES → SS (e.g. caresses → caress), IES → I (e.g. ponies → poni), and S → ‘’ (e.g. cats → cat).
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import pandas as pd
# Initialize the Porter Stemmer
stemmer = PorterStemmer()
# Define a function to perform stemming on the 'text' column
def stem_words(words):
    return [stemmer.stem(word) for word in words]
# Apply the function to the 'Message' column and create a new column 'stemmed_messages'
df['stemmed_messages'] = df['Message'].apply(stem_words)
Lemmatization
Next, let’s take a look at how we can implement Lemmatization for the same dataset.
import nltk
nltk.download('averaged_perceptron_tagger')  # needed by nltk.pos_tag
nltk.download('wordnet')                     # needed by the WordNet lemmatizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd
# initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# define function to lemmatize tokens
def lemmatize_tokens(tokens):
    # convert a POS tag to WordNet format
    def get_wordnet_pos(word):
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)
    # lemmatize each token using its POS tag
    lemmas = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]
    return lemmas
# apply lemmatization function to column of dataframe
df['lemmatized_messages'] = df['Message'].apply(lemmatize_tokens)
Note that we apply either Stemming or Lemmatization to our dataset, not both, depending on the requirement.
Conclusion
In this article, we discussed the main preprocessing steps in building an NLP model: text cleaning, tokenization, stopword removal, and stemming/lemmatization. Implementing these steps helps improve model accuracy by reducing noise in the text data and converting it into a structured format that the model can easily analyze.