Step 1: A Comprehensive Guide to Text Cleaning

Gourav Didwania
Sep 3, 2023 · 4 min read



Why do data scientists always have such clean and fresh text data? Well, all thanks to the new powerful “Text-o-Matic 3000” washing machine! Just kidding!

But jokes aside, cleaning text is a crucial part of the data science process. It’s like giving your data a nice bath to get rid of any dirt and grime. This makes your data more accurate and trustworthy, just like that fresh feeling after a good shower!

Text cleaning is super important in the world of natural language processing (NLP). It helps get rid of unnecessary stuff and gets your data all spick and span for analysis and modeling. In this article, we’ll dive into some common techniques for text cleaning and how you can use them on real-world data.

These techniques are manual practices aimed at optimizing our text data for improved model performance. Let’s go through each of them in more detail:

1. Correction of Typos: Written text often contains errors, such as “Fen” instead of “Fan.” To rectify these errors, a dictionary is used to map misspelled words to their correct forms based on similarity; this process is known as typo correction. Its application is limited, though, since manually correcting real-world data at scale is impossible.
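
As a minimal illustration of the idea, the sketch below uses Python’s built-in difflib to look up the closest word in a small, made-up vocabulary; a real system would use a full dictionary or a dedicated spell-checking library.

import difflib

# Toy vocabulary of known correct words; in practice this would be a full dictionary
vocabulary = ["fan", "before", "great", "sentence"]

def correct_typo(word, vocab=vocabulary):
    # Return the closest known word, or the original word if nothing is similar enough
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.6)
    return matches[0] if matches else word

print(correct_typo("Fen"))

## Output: fan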

2. Mapping and Replacement: This involves mapping words to standardized language equivalents. For instance, words like “b4” and “ttyl,” commonly understood by humans as “before” and “talk to you later,” pose challenges for machines. Normalization entails mapping such words to their standardized counterparts.

texting_abbreviations = {
    "b4": "before",
    "ttyl": "talk to you later",
    "lol": "laugh out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "gr8": "great",
    "idk": "I don't know"
}

text = "b4 we go, ttyl! lol"
for abbr, full_form in texting_abbreviations.items():
    text = text.replace(abbr, full_form)

print(text)

## Output: before we go, talk to you later! laugh out loud
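
One caveat with the simple replace loop above: it also rewrites matches inside longer words (for example, “lol” inside “lollipop”). A word-boundary regex, sketched below as one possible variant, only replaces whole words.

import re

# Reuses the texting_abbreviations dictionary defined above, but only replaces
# whole words, so "lol" inside "lollipop" is left untouched.
text = "b4 we go, ttyl! lol at the lollipop"
for abbr, full_form in texting_abbreviations.items():
    text = re.sub(rf"\b{re.escape(abbr)}\b", full_form, text)

print(text)

## Output: before we go, talk to you later! laugh out loud at the lollipop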

3. Expanding Contractions: Convert contractions like “don’t” to “do not” or “I’ll” to “I will.”

import contractions

text = "I'll be there, don't worry."
expanded_text = contractions.fix(text)
print(expanded_text)

## Output: I will be there, do not worry.

4. Removing Accents and Diacritics: Text often contains accented characters such as é and ö, which mark how a letter is pronounced and are common in loanwords and names. Normalize text by removing accents and diacritical marks from characters.

import unicodedata

text = "Café au Lait"
normalized_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
print(normalized_text)

## Output: Cafe au Lait

5. Removing Extra Whitespace: Normalize text by collapsing repeated spaces and trimming leading/trailing whitespace.

text = "   Too    many   spaces   "
normalized_text = ' '.join(text.split())
print(normalized_text)

## Output: Too many spaces

6. Eliminating HTML Tags: In cases where raw text originates from sources such as web scraping or screen capture, it often carries along HTML tags. These tags introduce unwanted noise and contribute little to the comprehension and analysis of the text. Therefore, it becomes necessary to strip them.

from bs4 import BeautifulSoup

def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    clean_text = soup.get_text()
    return clean_text

html_text = "<p>This is <b>HTML</b> text.</p>"
cleaned_text = remove_html_tags(html_text)
print(cleaned_text)

## Output: This is HTML text.

7. Handling URLs: People frequently include URLs, particularly in social media content, to point readers to additional information. However, URLs vary from sample to sample and usually amount to noise, so it is best to strip them.

import re 

def remove_url(text):
    # Match http(s) links and bare www-style links, plus one trailing space if present
    return re.sub(r"https?://\S+\s?|www\.\S+\s?", "", text)

print(remove_url('using https://www.google.com/ as an example'))

## Output: using as an example

8. Handling Abbreviations: Detecting abbreviations in text with regular expressions can be challenging, as abbreviations vary widely in format. The regular expression below covers many of the common forms.

import re

text = "The U.S.A. and NASA are well-known abbreviations. i.e. , lots of abb. can be used in a sentence U.S.A."

def find_abbr(text):
    abbr = list()
    for i in re.finditer(r"([A-Za-z]+| )([A-Za-z]\.){2,}|\b[A-Z\.]+\b", text):
        abbr.append(i.group())
    return set(abbr)  ## To remove duplicates

abbreviations = str(find_abbr(text))

print(abbreviations)

## Output: {' U.S.A.', ' i.e.', 'NASA'}

9. Case Standardising and Removing Special Characters: Special characters are non-alphanumeric characters such as %, $, and &. In most NLP tasks, these characters add no value to text understanding and only induce noise into algorithms. We can lower-case the text and use regular expressions to remove special characters.

import re

text = "The $5,000 prize for the 'Best 3D-Printed masterpiece' went to John's team, led by @john_doe! Congratulations!"
text = text.lower()

expression = r"[^a-zA-Z0-9 ]"
cleaned_text = re.sub(expression, "", text)

print(cleaned_text)

## Output: the 5000 prize for the best 3dprinted masterpiece went to johns team led by johndoe congratulations

10. Removing Stopwords: In most cases, stopwords like I, am, me, etc. don’t add any information that can help in modeling. Keeping them in the text introduces unnecessary noise and can significantly increase the dimensionality of feature vectors, which can negatively impact both computation cost and model accuracy. Therefore, it is recommended to eliminate stopwords.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # needed by word_tokenize

text = "This is an example sentence with some stopwords that we want to remove."

words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
cleaned_text = ' '.join(filtered_words)

print(cleaned_text)

## Output: example sentence stopwords want remove .

In addition to the methods discussed above, there can be numerous other text-cleaning requirements, such as removing mentions and hashtags or decoding emojis. These too can be addressed with the same techniques we’ve covered, as in the sketch below.
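
For instance, mentions and hashtags can be stripped with a small regular expression, roughly along these lines (the pattern below is a simple illustration, not an exhaustive rule).

import re

text = "Loved the keynote! Thanks @conf_org #NLP #textcleaning"

# Drop @mentions and #hashtags, then collapse the leftover whitespace
text = re.sub(r"[@#]\w+", "", text)
text = " ".join(text.split())

print(text)

## Output: Loved the keynote! Thanks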

Stay tuned for the next article in this series, where we’ll dive deeper into the world of Text Preprocessing. Until then, try out the versatile tools and techniques we’ve covered here to enhance your text data preparation skills.


Gourav Didwania

Data Scientist @ Ola 📈 | MLOps enthusiast 🤖 | Medium Blogger🖋️ | Let's dive into the world of AI together!💡 Collaborate at https://linktr.ee/gouravdidwania