Data Preprocessing Steps for NLP

The Complete NLP Guide: Text to Context #2

Merve Bayram Durna
7 min read · Jan 9, 2024

Welcome back, NLP enthusiasts! As we continue our journey through the vast landscape of Natural Language Processing (NLP), we’ve already explored its history, applications, and challenges in the initial blog. Today, we dive deeper into the heart of NLP — the intricate world of data preprocessing.

This post marks the second installment in our “The Complete NLP Guide: Text to Context” blog series. Our focus is crystal clear: we delve into the crucial data preprocessing steps essential for laying the groundwork for NLP tasks. While advancements in NLP have enabled the development of applications capable of perceiving and understanding human language, a critical prerequisite remains — preparing and supplying our machines with data in a format they can comprehend. This process involves a series of diverse and vital preprocessing steps.

Here’s what to expect in this deep dive:

  1. Tokenization and Text Cleaning: Discover the art of breaking down text into meaningful units and ensuring pristine, understandable language. This includes handling punctuation and refining the text for further processing.
  2. Stop Words Removal: Learn why removing certain words is essential for focusing on more meaningful content in the dataset.
  3. Stemming and Lemmatization: Dive into text normalization techniques, understanding when and how to use stemming or lemmatization to simplify words to their root forms.
  4. Part-of-Speech Tagging (POS): Explore how assigning grammatical categories to each word aids in a deeper understanding of sentence structure and context.
  5. Named Entity Recognition (NER): Uncover the role of NER in enhancing language understanding by identifying and classifying entities within the text.

Each of these steps is a critical building block in translating raw text into a language that machines can comprehend, setting the stage for more advanced NLP tasks.

By the end of this exploration, you’ll not only have a firm grasp of these fundamental preprocessing steps but also be prepared for the next phase of our journey — exploring advanced Text Representation Techniques. Let’s dive in and empower ourselves with the essentials of NLP data preprocessing. Happy coding!

1. Tokenization and Text Cleaning

At the heart of NLP lies the art of breaking down text into meaningful units. Tokenization is the process of splitting text into words, phrases, or even sentences (tokens). It’s the initial step that sets the stage for further analysis. Coupled with text cleaning, where we remove unnecessary characters, numbers, and symbols, tokenization ensures we work with pristine, understandable language units.

# !pip install nltk
import nltk
nltk.download('punkt')  # tokenizer models (download once)

# Example: Tokenization and Text Cleaning
text = "NLP is amazing! Let's explore its wonders."
tokens = nltk.word_tokenize(text)
# Keep only purely alphabetic tokens, lowercased
cleaned_tokens = [word.lower() for word in tokens if word.isalpha()]
print(cleaned_tokens)
['nlp', 'is', 'amazing', 'let', 'explore', 'its', 'wonders']
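
Tokenization also works at the sentence level. As a quick sketch (the paragraph below is just an illustration), NLTK's sent_tokenize splits text into sentences using the same punkt models:

# Sentence-level tokenization
from nltk.tokenize import sent_tokenize

paragraph = "NLP is amazing! Let's explore its wonders. It powers search engines and chatbots."
print(sent_tokenize(paragraph))
# ['NLP is amazing!', "Let's explore its wonders.", 'It powers search engines and chatbots.']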

2. Stop Words Removal

Not all words contribute equally to the meaning of a sentence. Stop words like “the” or “and” are often filtered out to focus on more meaningful content.

# Example: Stop Words Removal
from nltk.corpus import stopwords
nltk.download('stopwords')  # stop word lists (download once)

stop_words = set(stopwords.words("english"))
filtered_sentence = [word for word in cleaned_tokens if word not in stop_words]
print(filtered_sentence)
['nlp', 'amazing', 'let', 'explore', 'wonders']
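
NLTK ships stop word lists for many languages, and the set is easy to extend with domain-specific terms. A quick sketch (adding "nlp" is just a hypothetical domain-specific choice):

# Peek at the list and extend it with custom terms
print(stopwords.words("english")[:5])  # ['i', 'me', 'my', 'myself', 'we']
custom_stops = stop_words | {"nlp"}    # hypothetical domain-specific stop word
print([word for word in cleaned_tokens if word not in custom_stops])
# ['amazing', 'let', 'explore', 'wonders']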

3. Stemming and Lemmatization

Stemming and lemmatization are both text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base or root forms. While they share the goal of simplifying words, they operate differently in terms of the linguistic knowledge they apply.

Stemming: Reducing to Root Forms

Stemming involves cutting off prefixes or suffixes of words to obtain their root or base form, known as the stem. The purpose is to treat words with similar meanings as if they were the same. Stemming is a rule-based method that doesn’t always result in a valid word, but it’s computationally less intensive.

Lemmatization: Transforming to Dictionary Form

Lemmatization, on the other hand, involves reducing words to their base or dictionary forms, known as lemmas. It takes into account the context of the word in a sentence and applies morphological analysis. Lemmatization results in valid words and is more linguistically informed compared to stemming.

When to Use Stemming vs. Lemmatization:

Stemming:

  • Pros: Simple and computationally less expensive.
  • Cons: May not always result in valid words.

Lemmatization:

  • Pros: Produces valid words; considers linguistic context.
  • Cons: More computationally intensive than stemming.

Choosing Between Stemming and Lemmatization:

(Comparison image source: https://www.nomidl.com/natural-language-processing/stemming-and-lemmatization/)

The choice between stemming and lemmatization depends on the specific requirements of your NLP task. If you need a quick and straightforward method for text analysis, stemming might be sufficient. However, if linguistic accuracy is crucial, especially in tasks like information retrieval or question answering, lemmatization is often preferred.

In practice, the choice often depends on the trade-off between computational efficiency and linguistic accuracy based on the specific characteristics of your NLP application.

# Example: Stemming and Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # lemmatizer dictionary (download once)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_words = [stemmer.stem(word) for word in filtered_sentence]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_sentence]

print(stemmed_words)
print(lemmatized_words)
['nlp', 'amaz', 'let', 'explor', 'wonder']
['nlp', 'amazing', 'let', 'explore', 'wonder']
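
Notice that "amazing" comes through lemmatization unchanged: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech hint. A small sketch of the difference:

# The lemmatizer defaults to pos='n' (noun); a POS hint changes the result
print(lemmatizer.lemmatize("running"))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (treated as an adjective)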

4. Part-of-Speech Tagging (POS)

Part-of-speech tagging (POS tagging) is a natural language processing task where the goal is to assign a grammatical category (such as noun, verb, adjective, etc.) to each word in a given text. This provides a deeper understanding of the structure and function of each word in a sentence.
The Penn Treebank POS Tag Set is a widely used standard for representing these part-of-speech tags in English text.

# Example: Part-of-Speech Tagging
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')  # POS tagger model (download once)

pos_tags = pos_tag(filtered_sentence)
print(pos_tags)
[('nlp', 'RB'), ('amazing', 'JJ'), ('let', 'NN'), ('explore', 'NN'), ('wonders', 'NNS')]
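
The tags follow the Penn Treebank convention: JJ is an adjective, NN a singular noun, NNS a plural noun. (Note that "nlp" is mistagged as RB, an adverb, partly because lowercasing during cleaning removed the capitalization cues taggers rely on.) NLTK bundles the tag set documentation, so you can look any tag up:

# Look up what a Penn Treebank tag means
nltk.download('tagsets')  # tag set documentation (download once)
nltk.help.upenn_tagset('JJ')
# JJ: adjective or numeral, ordinal ...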

5. Named Entity Recognition (NER)

NER takes language understanding to the next level by identifying and classifying entities like names, locations, organizations, etc., in a given text. This is crucial for extracting meaningful information from unstructured data.

# Example: Named Entity Recognition (NER)
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')  # NER model (download once)
nltk.download('words')              # word list used by the chunker

ner_tags = ne_chunk(pos_tags)
print(ner_tags)
(S nlp/RB amazing/JJ let/NN explore/NN wonders/NNS)
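
The output here contains no entity subtrees, and that is no accident: lowercasing during cleaning stripped the capitalization NLTK's chunker depends on. Running the same pipeline on an unmodified sentence (a made-up example; exact groupings can vary by NLTK version) shows entities being picked out:

# NER works best on original-cased text
sentence = "Barack Obama visited Google headquarters in California."
ner_tree = ne_chunk(pos_tag(nltk.word_tokenize(sentence)))
print(ner_tree)
# Entities appear as labeled subtrees such as PERSON, ORGANIZATION, and GPE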

Real-World Applications of NLP Preprocessing Steps

While we’ve delved into the technical aspects of NLP preprocessing, it’s equally important to understand how these steps are applied in real-world scenarios. Let’s explore some notable examples:

Tokenization and Text Cleaning in Social Media Sentiment Analysis
In social media sentiment analysis, tokenization and text cleaning are crucial. For instance, when analyzing tweets to gauge public sentiment about a new product, tokenization helps break tweets down into individual words or phrases. Text cleaning removes noise like hashtags, mentions, and URLs, which are common in social media text.

import re

def clean_tweet(tweet):
    tweet = re.sub(r'@\w+', '', tweet)    # Remove mentions
    tweet = re.sub(r'#\w+', '', tweet)    # Remove hashtags
    tweet = re.sub(r'http\S+', '', tweet) # Remove URLs
    return tweet

tweet = "Loving the new #iPhone! Best phone ever! @Apple"
print(clean_tweet(tweet))
'Loving the new ! Best phone ever! '

Stop Words Removal in Search Engines
Search engines extensively use stop word removal. When processing search queries, common words like ‘the’, ‘is’, and ‘in’ are often removed to focus on the keywords that are more likely to be relevant to the search results.
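
A minimal sketch of the idea, reusing the NLTK stop word set from earlier (the query string is just an illustration):

# Reduce a search query to its content-bearing keywords
query = "what is the best restaurant in the city"
keywords = [word for word in query.split() if word not in stop_words]
print(keywords)
# ['best', 'restaurant', 'city']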

Stemming and Lemmatization in Text Classification
News agencies and content aggregators often use stemming and lemmatization for text classification. By reducing words to their base or root forms, algorithms can more easily categorize news articles into topics like ‘sports’, ‘politics’, or ‘entertainment’.
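
As a toy illustration (the headlines are made up), stemming collapses different inflections into shared features, so a classifier sees the same token "elect" whether an article says "election", "elections", or "elected":

# Different inflections collapse to a single stem before classification
headlines = ["Elections held today", "The elected candidate spoke"]
for headline in headlines:
    print([stemmer.stem(word) for word in headline.lower().split()])
# ['elect', 'held', 'today']
# ['the', 'elect', 'candid', 'spoke']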

Part-of-Speech Tagging in Voice Assistants
Voice assistants like Amazon’s Alexa or Apple’s Siri use part-of-speech tagging to improve speech recognition and natural language understanding. By determining the grammatical context of words, these assistants can more accurately interpret user requests.

Named Entity Recognition (NER) in Customer Support Automation
NER is widely used in customer support chatbots. By recognizing and classifying entities such as product names, locations, or user issues, chatbots can provide more effective and tailored responses to customer inquiries.

These examples highlight the practical implications of NLP preprocessing steps in various industries, making the abstract concepts more tangible and easier to grasp. Understanding these applications not only provides context but also inspires ideas for future projects.

Conclusion

Throughout this article, we’ve meticulously navigated various data preprocessing steps essential for enhancing text for NLP tasks. From the initial breakdown of text through tokenization and cleaning to the more advanced processes of stemming, lemmatization, POS tagging, and Named Entity Recognition, we’ve laid a solid foundation for understanding and processing language data effectively.

However, our journey doesn’t end here. The processed text, while now more structured and informative, still requires further transformation to become fully comprehensible to machines. In our next installment, we will delve into Text Representation Techniques. These techniques, including the Bag-of-Words model, TF-IDF (Term Frequency-Inverse Document Frequency), and an introduction to Word Embeddings, are pivotal in converting text into a format that machines can not only understand but also utilize for various complex NLP tasks.

So, stay tuned as we continue to unravel the intricacies of NLP. Our exploration will equip you with the knowledge to transform raw text into meaningful data, ready for advanced analysis and application. Happy coding, and see you in our next post!

Explore the Series on GitHub

For a comprehensive hands-on experience, visit our GitHub repository. It houses all the code samples from this article and the entire “The Complete NLP Guide: Text to Context” blog series. Dive in to experiment with the code and deepen your understanding of NLP. Check it out here: https://github.com/mervebdurna/10-days-NLP-blog-series

Feel free to clone the repository, experiment with the code, and even contribute to it if you have suggestions or improvements. This is a collaborative effort, and your input is highly valued!

Happy exploring and coding!
