Text Processing with Python for NLP Beginners: Fundamental Techniques (part 2)

Alexei Stepanov
4 min read · Jan 29, 2024


In this article, we explore more advanced text processing methods, building on the concepts covered in the first part of this series (Basics of Text Processing with Python for NLP Tasks). I use Python’s NLTK library for now; I will introduce spaCy in upcoming articles.

We will explore tokenization, text cleaning, the use of stop words, and the processes of stemming and lemmatization. Understanding and implementing these techniques is crucial for laying the foundation of your NLP work, because they prepare your data for more complex tasks like sentiment analysis, topic modeling, and automated text summarization.

Do not let NLP hurt you. Source: https://www.behance.net/harsharya

1. Tokenization

Tokenization is the first and fundamental step in text preprocessing for NLP. By breaking down text into smaller parts, such as words, phrases, or symbols, tokenization allows for a more detailed and nuanced analysis of language. Each token acts like a puzzle piece, and by examining these pieces individually, we gain a better understanding of the overall picture.

Advanced NLP tasks, particularly those involving complex models like BERT (Bidirectional Encoder Representations from Transformers), heavily rely on tokenization. These models need the text to be broken down into tokens so they can process and understand language patterns more effectively.

# Import NLTK tokenizer
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models (newer NLTK versions may ask for 'punkt_tab')

# Tokenizing
sample_text = "Natural Language Processing with NLTK is fun and educational."
tokens = word_tokenize(sample_text)
print(tokens)
# Output:
# ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.']

Basically, you tokenize your documents every time you start an NLP task; it lays the groundwork for all subsequent analysis and processing. One key reason to tokenize is that the corpus becomes easy to handle as a list of lists in Python. This structure greatly simplifies data manipulation, which matters for the later steps of your pipeline, such as cleaning, normalizing, and transforming the text. Effective tokenization makes those steps easier to apply, improving the quality and interpretability of the data and, ultimately, the accuracy of your results.
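To make the list-of-lists idea concrete, here is a minimal sketch for a tiny two-sentence corpus. It reuses NLTK’s tokenizers (sent_tokenize also relies on the 'punkt' models downloaded above), and the sample text is made up purely for illustration:

from nltk.tokenize import sent_tokenize, word_tokenize

corpus = "NLTK makes tokenization easy. Each sentence becomes a list of tokens."

# Split into sentences, then split each sentence into word tokens: a list of lists
tokenized_corpus = [word_tokenize(sentence) for sentence in sent_tokenize(corpus)]
print(tokenized_corpus)
# [['NLTK', 'makes', 'tokenization', 'easy', '.'],
#  ['Each', 'sentence', 'becomes', 'a', 'list', 'of', 'tokens', '.']]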

2. Cleaning Text and Stop Words

Although string manipulation in Python is a common skill among data practitioners, text cleaning still deserves explicit attention: it is an indispensable part of any NLP workflow, especially when preparing data for complex models. Recognizing this, I’ve included this section to emphasize its importance.

Cleaning text involves removing irrelevant characters, correcting typos, standardizing the format, and other preprocessing tasks that significantly impact the performance of your NLP model. Additionally, the removal of stop words — commonly used words in a language that are filtered out before processing — is a crucial step. Stop words like ‘the’, ‘is’, and ‘in’ usually carry little unique information and their removal can help in reducing the noise and focusing on more meaningful words in the text. This is particularly important for tasks like topic modeling or keyword extraction, where the emphasis is on identifying significant words.

Here’s a Python function designed to clean textual data, fixing several common inconsistencies:

import re
import string
from nltk.corpus import stopwords

def clean_text(text, remove_stopwords=True):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)

    # Optionally remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stop_words])

    return text
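Here is how the function behaves on a small sample, building on the definition above. Note that the NLTK stop word list has to be downloaded once before stopwords.words('english') can be used; the sample sentence is just for illustration:

import nltk
nltk.download('stopwords')  # one-time download of the NLTK stop word list

raw = "The 3 quick brown foxes jumped over 2 lazy dogs!"
print(clean_text(raw))
# quick brown foxes jumped lazy dogs  ('the' and 'over' are in NLTK's English stop word list)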

3. Stemming and Lemmatization

By default, we aim to extract the most semantically rich content from our corpus: the “juice” our models can learn from. As we’ve seen, that involves stripping away punctuation and stop words. Stemming and lemmatization go one step further and trim the words themselves, reducing them to a common base form so the model keeps only the juice. But they do it differently:

Stemming involves paring down a word to its stem or root form, often by simply chopping off word endings. It’s a blunt-force approach that doesn’t involve understanding the context or meaning of the word, which can sometimes lead to incorrect generalizations, but it’s computationally efficient.

Lemmatization, in contrast, is a more nuanced approach that considers the morphological analysis of the words to return them to their dictionary form. It requires understanding the part of speech and the context in which a word is used. This means it’s more accurate but also more resource-intensive.

Example of Stemming and Lemmatization results
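The figure above shows example results; here is a minimal sketch of how you might reproduce a similar comparison with NLTK’s PorterStemmer and WordNetLemmatizer. The word list and part-of-speech tags are chosen purely for illustration, and the lemmatizer needs the WordNet data downloaded once:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data used by the lemmatizer
# some NLTK versions also need: nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "running", "better", "geese"]

print([stemmer.stem(word) for word in words])
# Stems are often not real dictionary words, e.g. 'studies' -> 'studi'

print([lemmatizer.lemmatize("studies", pos="v"),   # -> 'study'
       lemmatizer.lemmatize("running", pos="v"),   # -> 'run'
       lemmatizer.lemmatize("better", pos="a"),    # -> 'good'
       lemmatizer.lemmatize("geese", pos="n")])    # -> 'goose'
# Lemmas are real dictionary forms, but the lemmatizer needs the right part of speech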

Conclusion

The transformation of text through tokenization, cleaning, and the reduction of words to their base forms via stemming and lemmatization is not merely a preparatory step but a critical process for the effective application of machine learning and other advanced computational techniques. These techniques “squeeze” the text to its semantic core, allowing NLP models to learn from the most relevant and informative content.

The next article in this series will introduce advanced text-processing techniques that are essential for any NLP beginner. I will cover the construction and use of n-grams and part-of-speech (POS) tagging, and demonstrate how a TF-IDF matrix can be used to highlight word relevance and transform text into a numerical vector form suitable for machine learning algorithms.

Alexei Stepanov

Hi! I am a Data Scientist and this is my blog, where I sometimes express genuine curiosity about data science.