NLP — Text PreProcessing (Part 1)

Chandu Aki · Published in The Deep Hub · 8 min read · Feb 15, 2024

Introduction to Text PreProcessing

Once upon a time in the vast realm of computers, there existed a treasure trove of information — data! This data, like magical potions, came in various forms, including numeric and text. Our quest was to unlock the secrets hidden within the text data, but little did we know, the path was riddled with challenges.

As we delved into the enchanting world of text data, we noticed it wasn’t all sunshine and rainbows. The data was, well, a bit messy. Punctuations, URLs, and other curious elements adorned the text like unexpected obstacles on our journey.

The Mystery of Clean Data: “Is this data clean?” we wondered. Clean data, like a well-kept library, is essential for understanding and analysis. However, our text data was more like a dusty ancient manuscript, full of quirks.

Why do we need text preprocessing?

  • Data quality significantly influences the performance of a machine learning model. Inadequate or low-quality data can lead to lower accuracy and effectiveness of the model.
  • Text preprocessing aims to remove or reduce the noise and variability in text data and make it more uniform and structured. This helps NLP models focus on the meaningful and relevant information in the text and improves their efficiency and effectiveness.

What are some common text preprocessing techniques?

Some or all of these text preprocessing techniques are commonly used by NLP systems. The order in which these techniques are applied may vary depending on the needs of the project.

Text Removal

Lowercasing

Imagine a magical spell that changes words like “HELLO” to “hello” and “GoOD mOrNinG” to “good morning.” Lowercasing is the hero that makes all words bow before its consistency.

In our Python spellbook, we found a simple charm to transform words to lowercase:

Sample Code:

text = "The Magic Of LoWErCasIng is SimPLe yEt PoWErFUL!"
lowercased_text = text.lower()
print(lowercased_text)

Output: “the magic of lowercasing is simple yet powerful!”

Why Lowercasing Matters:

  1. Consistency: It brings uniformity to words, making them easily comparable.
  2. Analysis: Lowercasing ensures that the computer sees “apple” and “Apple” as the same, preventing confusion in analyses.

Where to Use Lowercasing:

  1. Text Processing: Before tokenization or any text analysis, to treat words uniformly.
  2. Comparisons: When comparing strings, ensuring case-insensitive matching.

When to Be Cautious: While lowercasing is a trusty ally, be mindful of its limitations. In certain contexts, like sentiment analysis where “GREAT” and “great” may carry different meanings, preserving case sensitivity could be crucial.
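If you do need to keep that signal, one gentle trick is to record it before casting the lowercasing spell. Here is a minimal sketch (the emphasis list is an illustrative feature of our own making, not a standard one):

text = "This movie was GREAT!"
# Capture shouted words (a possible intensity cue) before lowercasing
emphasis = [word for word in text.split() if word.isupper() and len(word) > 1]
lowercased_text = text.lower()
print(emphasis)

Output: “['GREAT!']”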

Remove Punctuations

Picture a sentence, adorned with dots, commas, exclamation points, and question marks. While these symbols add flair to the language, they can be distracting when our goal is to focus on the words themselves.

In Python, a powerful incantation using regular expressions helps us rid our text of these mischievous symbols:

import re
text = "The dragon soared through the sky, breathing fire! What an incredible sight."
cleaned_text = re.sub(r'[^\w\s]', '', text)
print(cleaned_text)

Output: “The dragon soared through the sky breathing fire What an incredible sight”

Understanding the Spell:

  1. The Incantation (re.sub): This is like the magic wand. It replaces patterns in the text with something else.
  2. The Pattern ([^\w\s]): This is the magical rune specifying what to find. Here, it says, "Find anything that is not a word character (\w) or whitespace (\s)."
  3. The Replacement (''): This is what we replace the found patterns with. In our case, with nothing: an empty string.

Why Remove Punctuations:

  1. Focus on Words: By banishing punctuations, we allow the words to take center stage in our analysis.
  2. Simplify Analysis: For tasks like tokenization or counting words, removing distractions makes the process smoother.

When to Use this Spell:

  1. Text Preprocessing: Before diving into advanced text analysis, it’s wise to clear the stage of unnecessary symbols.
  2. Word Frequency Count: If you’re counting how often words appear, removing punctuations ensures accurate counts.
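To see point 2 in action, here is a small sketch of a word count after the punctuation spell; without it, “fire!”, “fire,” and “fire” would each be tallied separately:

import re
from collections import Counter

text = "Fire! The dragon breathed fire, more fire."
# Lowercase and strip punctuation so all three "fire" variants merge
cleaned = re.sub(r'[^\w\s]', '', text.lower())
print(Counter(cleaned.split()))

Output: “Counter({'fire': 3, 'the': 1, 'dragon': 1, 'breathed': 1, 'more': 1})”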

When to Be Cautious:

While this spell works wonders for many situations, be cautious in creative writing or poetry analysis where punctuation might carry significant meaning.

Remove Special Characters

Imagine a sentence adorned with symbols like @, #, $, % — these are the special characters that, while fascinating, can create chaos when we seek clarity and simplicity in our text.

In Python, we wield the magic wand of regular expressions to cast a spell that banishes these special characters:

import re
text = "The magic@l land of NLP! ✨ Removing $pecial characters is crucial for clarity."
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(cleaned_text)

Output: “The magical land of NLP Removing special characters is crucial for clarity”

Understanding the Spell:

  1. The Incantation (re.sub): This is our magic wand, replacing patterns in the text.
  2. The Pattern ([^a-zA-Z0-9\s]): This magical rune says, "Find anything that is not a letter (a-z, A-Z), number (0-9), or whitespace (\s)."
  3. The Replacement (''): We replace the found patterns with nothing, vanquishing them from our text.

Why Remove Special Characters:

  1. Text Purity: Cleansing the text of special characters ensures a pristine canvas for analysis.
  2. Preventing Distractions: For tasks like sentiment analysis or word frequency, eliminating unnecessary symbols keeps the focus on the essence of the words.

When to Use this Spell:

  1. Data Cleaning: Before diving into any text analysis, it’s wise to purify the text of unwanted symbols.
  2. Machine Learning Models: When preparing text data for training models, removing special characters helps in creating cleaner features.

When to Be Cautious:

Exercise caution if special characters convey essential information, such as in emoticons or domain-specific symbols.
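One way to heed this caution is to whitelist the symbols that matter. A sketch, assuming hashtags and mentions should survive the purge:

import re
text = "Loving #NLP, thanks @wizard!"
# Keep letters, digits, and whitespace, plus the domain symbols # and @
cleaned_text = re.sub(r'[^a-zA-Z0-9\s#@]', '', text)
print(cleaned_text)

Output: “Loving #NLP thanks @wizard”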

Remove URLs

Imagine a narrative adorned with web links, pointing to distant lands and realms beyond the text. While URLs serve a purpose in the online world, they can be distractions when we seek to focus solely on the words within.

In Python, we summon the magic of regular expressions to cast a spell that banishes URLs from our text:

import re
text = "Exploring the magic of NLP! Check out more at <https://magicaltexts.com>. #NLP #Magic"
cleaned_text = re.sub(r'http\\S+|www\\S+|https\\S+', '', text)
print(cleaned_text)

Output: “Exploring the magic of NLP! Check out more at  #NLP #Magic” (the trailing period travels with the URL, since \S+ keeps matching until whitespace)

Understanding the Spell:

  1. The Incantation (re.sub): Our magic wand, replacing patterns in the text.
  2. The Pattern (http\S+|www\S+|https\S+): This mystical rune says, "Find anything that starts with http, www, or https, followed by any non-whitespace characters (\S+)." (The https\S+ branch is already covered by http\S+, but it does no harm.)
  3. The Replacement (''): We replace the found patterns with nothing, erasing the URLs from our text.

Why Remove URLs:

  1. Focus on Content: By banishing URLs, we keep our attention on the core content of the text.
  2. Clean Text Analysis: For tasks like sentiment analysis or keyword extraction, eliminating URLs simplifies the analysis.

When to Use this Spell:

  1. Text Preprocessing: Before embarking on any NLP adventure, it’s wise to clear the text of distracting URLs.
  2. Word Frequency Analysis: When counting word occurrences, removing URLs ensures accurate counts.

When to Be Cautious:

Exercise caution when URLs convey essential information, such as in academic texts or references.
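When the links themselves matter, a common compromise (sketched below) is to collect them into a separate list before erasing them from the prose:

import re
text = "Exploring the magic of NLP! Check out more at https://magicaltexts.com. #NLP #Magic"
url_pattern = r'http\S+|www\S+'
urls = re.findall(url_pattern, text)          # keep the URLs for later use
cleaned_text = re.sub(url_pattern, '', text)  # then remove them from the text
print(urls)

Output: “['https://magicaltexts.com.']”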

Removal of HTML Tags

Imagine a tale, beautifully written but marred by the presence of HTML tags — those invisible scripts that structure content on the web. While crucial for webpages, they are distractions when we aim to immerse ourselves solely in the language.

In Python, we wield the magic wand of regular expressions to cast a spell that banishes HTML tags from our text:

import re
html_text = "<p>In a land far, far away...</p> <a href='https://example.com'>Explore</a> the magic!"
cleaned_text = re.sub(r'<.*?>', '', html_text)
print(cleaned_text)

Output: “In a land far, far away… Explore the magic!”

Understanding the Spell:

  1. The Incantation (re.sub): Our magic wand, replacing patterns in the text.
  2. The Pattern (<.*?>): This mystical rune says, "Find anything between < and > (HTML tags), matching as few characters as possible (the non-greedy *?)."
  3. The Replacement (''): We replace the found patterns with nothing, erasing the HTML tags from our text.

Why Remove HTML Tags:

  1. Focus on Pure Text: Banishing HTML tags ensures our text is free from the web’s structural elements.
  2. Text Analysis Purity: For tasks like sentiment analysis or language modeling, eliminating HTML tags simplifies the analysis.

When to Use this Spell:

  1. Web Scraping Results: When extracting text from web pages, removing HTML tags is crucial for clean data.
  2. Preprocessing for NLP: Before diving into Natural Language Processing, cleanse your text from unwanted HTML artifacts.

When to Be Cautious:

Exercise caution when HTML tags convey essential information, such as in academic texts or references.
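Also note that the <.*?> rune is a blunt instrument: it can misfire on attributes containing a stray > or on embedded scripts. For real web pages, an HTML parser is the safer wand. A sketch using BeautifulSoup (assuming the beautifulsoup4 package is installed):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html_text = "<p>In a land far, far away...</p> <a href='https://example.com'>Explore</a> the magic!"
# get_text() walks the parsed tree and returns only the human-readable text
cleaned_text = BeautifulSoup(html_text, 'html.parser').get_text()
print(cleaned_text)

Output: “In a land far, far away... Explore the magic!”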

Removal of Stopwords

Imagine a story, rich and vibrant, yet cluttered with words like “the,” “and,” “is” — the humble stopwords. While essential for language, they can be distractions when we seek to unravel the essence of a narrative.

In Python, we summon the magic of libraries like NLTK to cast a spell that banishes stopwords from our text:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')      # one-time download: tokenizer models
nltk.download('stopwords')  # one-time download: stopword lists
text = "In the mystical land, the wise wizard and the brave knight embarked on a daring quest."
stop_words = set(stopwords.words('english'))
tokenized_text = word_tokenize(text)
filtered_text = [word for word in tokenized_text if word.lower() not in stop_words]
print(filtered_text)

Output: “['mystical', 'land', ',', 'wise', 'wizard', 'brave', 'knight', 'embarked', 'daring', 'quest', '.']”

Understanding the Spell:

  1. The Incantation (stopwords.words('english')): This conjures a list of common English stopwords.
  2. The Magical Charm (word.lower() not in stop_words): This is the enchantment itself, ensuring we keep only those words that aren't stopwords.

Why Remove Stopwords:

  1. Focus on Significance: By banishing stopwords, we spotlight words with more substantial meanings, bringing clarity to our analysis.
  2. Efficient Analysis: For tasks like sentiment analysis or keyword extraction, eliminating stopwords streamlines the process.

When to Use this Spell:

  1. Text Preprocessing: Before embarking on any NLP journey, cleanse your text of distracting stopwords.
  2. Word Frequency Analysis: When counting word occurrences, removing stopwords ensures accurate and meaningful counts.

When to Be Cautious:

Exercise caution when stopwords carry essential meaning, such as in specific domains or certain analyses where these words hold significance.
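For instance, standard stopword lists include negations like “not” and “no”, which can flip a sentence’s meaning in sentiment analysis. A minimal sketch of keeping them, assuming the NLTK list from above:

from nltk.corpus import stopwords

# Remove negations from the stopword set so "not good" survives intact
stop_words = set(stopwords.words('english')) - {'not', 'no', 'nor'}
text = "the potion was not good"
filtered = [word for word in text.split() if word not in stop_words]
print(filtered)

Output: “['potion', 'not', 'good']”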

In the realm of Natural Language Processing (NLP), text preprocessing involves several techniques to refine and prepare textual data for analysis. Beyond the previously mentioned tasks of removing punctuation, URLs, HTML tags, and stopwords, here are a few more aspects to consider:

1. Removing Numbers: Numbers might not always contribute meaning to certain analyses.

import re
text = "There are 5 apples on the tree."
cleaned_text = re.sub(r'\d+', '', text)
print(cleaned_text)

Output: “There are  apples on the tree.” (note the leftover double space, which the whitespace cleanup in point 2 will catch)
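A gentler alternative, sketched below, is to replace digits with a placeholder token rather than deleting them, so downstream models still know a number was there (the <NUM> token is a common convention, not a fixed standard):

import re
text = "There are 5 apples on the tree."
# Swap each run of digits for a placeholder instead of deleting it
cleaned_text = re.sub(r'\d+', '<NUM>', text)
print(cleaned_text)

Output: “There are <NUM> apples on the tree.”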

2. Removing Extra Whitespaces: Extra spaces can affect the accuracy of text analyses.

text = "   Too      many     spaces!" 
cleaned_text = ' '.join(text.split())
print(cleaned_text)

Output: “Too many spaces!”

3. Handling Contractions: To ensure consistent representation of words like “don’t” or “can’t.”

import contractions  # pip install contractions
text = "I don't know what's happening."
# contractions.fix() expands contractions like "don't" -> "do not"
cleaned_text = contractions.fix(text)
print(cleaned_text)

Output: “I do not know what is happening.”

4. Handling Emoticons or Emoji: Depending on the analysis, it might be essential to either preserve or remove these symbols.

import emoji  # pip install emoji
text = "Feeling 😊 today!"
cleaned_text = emoji.demojize(text)
print(cleaned_text)

Output: “Feeling :smiling_face_with_smiling_eyes: today!”
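If the analysis calls for banishing emoji rather than spelling them out, the same library provides replace_emoji (available in emoji 2.x):

import emoji
text = "Feeling 😊 today!"
# Strip emoji entirely instead of converting them to text aliases
cleaned_text = emoji.replace_emoji(text, replace='')
print(cleaned_text)

Output: “Feeling  today!”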

5. Handling Special Characters: Addressing special characters that might hold significance in certain contexts.

import string 
text = "Let's go #NLP!"
cleaned_text = ''.join([char for char in text if char not in string.punctuation])
print(cleaned_text)

Output: “Lets go NLP”

Summary

Text preprocessing in NLP is often task-specific, and the techniques employed may vary based on the objectives of the analysis. Combining these methods creates a harmonious text, ready for the magical adventures of Natural Language Processing.
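As a parting sketch, here is one way the charms above might be chained into a single cleaning pipeline. The order shown (HTML and URLs first, then lowercasing, punctuation, and whitespace) is a common choice, not a fixed rule:

import re

def preprocess(text):
    """Chain the basic cleaning spells from this article."""
    text = re.sub(r'<.*?>', '', text)           # remove HTML tags
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = text.lower()                         # lowercase
    text = re.sub(r'[^\w\s]', '', text)         # remove punctuation
    return ' '.join(text.split())               # normalize whitespace

raw = "<p>Visit https://magicaltexts.com NOW!</p>   The Magic awaits..."
print(preprocess(raw))

Output: “visit now the magic awaits”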
