NLP — Text Preprocessing (15/100)

ayushtankha
4 min read · Apr 17, 2024


MSc Data Science and Business Analytics (ESSEC x Centralesupelec) (15/100)

In this article we will cover the steps commonly used to preprocess text before passing it through a model.

Lowercasing

Python is a case-sensitive language, so during tokenization two words with the same meaning might be treated differently.

Example - "Bright" and "bright" will get tokenized differently due to case sensitivity, which creates redundancy in the data and might confuse our model.

To resolve this we simply use Python's built-in lower() method to lowercase all words in our text.

sentence.lower() or df['column'].str.lower()
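
A minimal sketch of both variants; the 'review' column here is just a made-up example:

import pandas as pd

sentence = "Bright mornings feel Bright"
print(sentence.lower())  # bright mornings feel bright

# the same operation applied to a whole DataFrame column
df = pd.DataFrame({'review': ['Great Movie', 'BRIGHT visuals']})
df['review'] = df['review'].str.lower()
print(df['review'].tolist())  # ['great movie', 'bright visuals']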

Removing HTML Tags

HTML tags left in the text can also confuse our model, so we use regex to create a custom function that removes them.

import re

def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

df['column'].apply(remove_html_tags)

Remove URLs

We again use regex to build a custom function that removes URLs.

def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)
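
A quick usage sketch (the sample URLs are just illustrative):

text = "Read the docs at https://example.com/docs or visit www.example.org"
print(remove_url(text))  # "Read the docs at  or visit " (note the leftover spaces)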

Remove Punctuation

Punctuation can lead to inconsistency during tokenization: the same word may end up in different tokens depending on the punctuation attached to it, confusing the model with extra variants.

Example -

-> Hello! how are you ?
can be tokenized as ['Hello!', 'how', 'are', 'you?']

Here our model would treat 'Hello!' differently from 'Hello' even though they have the same meaning.

import string

exclude = string.punctuation

# Slower method: replace each punctuation character one by one
def remove_punc(text):
    for char in exclude:
        text = text.replace(char, '')
    return text

# Faster method: a single translation-table pass
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))
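
A rough timing sketch, reusing the two functions above (exact timings will vary by machine, and the sample text is arbitrary):

import time

sample = "Hello! how are you? Fine, thanks... really!!" * 1000

start = time.time()
remove_punc(sample)
print("replace loop:", time.time() - start)

start = time.time()
remove_punc1(sample)
print("translate   :", time.time() - start)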

Chat Word Treatment

We often use shorthand like rofl, lmao, lol, asap, etc. while communicating via text. We replace these with their actual meanings so the model can process them more easily.

For our sample code we build a dictionary of slang terms and their meanings from an existing source → https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
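
A minimal sketch for loading the dictionary, assuming the file has been saved locally as slang.txt and that each line maps an abbreviation to its meaning with an '=' separator (adjust the split character if the actual file format differs):

chat_words = {}
with open('slang.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if '=' in line:
            abbr, meaning = line.split('=', 1)
            chat_words[abbr.upper()] = meaning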

def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

Spelling Correction

Misspelled words are treated as different tokens from their correctly spelled forms during tokenization (ex - Bright and Brght).

from textblob import TextBlob

# Incorrect sentence
sentence = "I havv goood speling"
print("Original sentence:", sentence)

# Create a TextBlob object
blob = TextBlob(sentence)

# Correct the sentence
corrected_sentence = blob.correct()
print("Corrected sentence:", corrected_sentence)

Remove Stop Words

Stop words are words used for sentence formation that don't add any meaning to the sentence (ex - a, and, the).

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # needed once to fetch the stop word list
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = []
    for word in text.split():
        if word not in stop_words:
            new_text.append(word)
    return " ".join(new_text)

df['column'].apply(remove_stopwords)
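
A quick usage sketch of remove_stopwords on a lowercased sentence (exactly which words get dropped depends on NLTK's stop word list):

sample = "this movie was not as good as the original one"
print(remove_stopwords(sample))
# roughly: "movie good original one"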

Handling Emojis

There are two ways to handle emojis: either remove them or replace them with their meaning.

# Removing emojis
import re

def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

clean_text = remove_emoji("Here is a text with an emoji 😊")
print(clean_text)  # Output: "Here is a text with an emoji "

# Replacing emojis with their meaning
import emoji

# Original text with emojis
text_with_emojis = "Hello there! 😊🚀"

# Convert emojis to their colon-separated text representation
demojized_text = emoji.demojize(text_with_emojis)

print(demojized_text)
# Output: "Hello there! :smiling_face_with_smiling_eyes::rocket:"

Tokenization

Finally, we cover one of the most important parts of text preprocessing. Here we can use any of the following methods for tokenization, depending on the complexity of our problem.

  1. Split function
  2. Regular expressions
  3. NLTK word_tokenize
  4. spaCy (best option)
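
Before the spaCy example, here is a quick sketch of the first three options on the same sentence (the regex pattern is a simple illustrative one):

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # needed once for NLTK's tokenizer models

text = "Hello, world! Here's a simple tokenization example."

# 1. Split function: splits on whitespace only, punctuation stays attached
print(text.split())

# 2. Regular expression: here, tokens are runs of letters, digits, or apostrophes
print(re.findall(r"[\w']+", text))

# 3. NLTK word_tokenize: separates punctuation and splits contractions
print(word_tokenize(text))

spaCy, used below, produces similar tokens and handles many more edge cases out of the box.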
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Process a text
text = "Hello, world! Here's a simple spaCy tokenization example."
doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]

print(tokens)
# Output:
# ['Hello', ',', 'world', '!', 'Here', "'s", 'a', 'simple',
#  'spaCy', 'tokenization', 'example', '.']

Stemming

Inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, gender, and mood.

Ex - walks, walked, walking (all have the root form walk)

Stemming, on the other hand, reduces inflected words to their root form by mapping a group of words to the same stem, even if the stem itself is not a valid word in the language.

Ex - Movie → Movi, Story → Stori, etc. (this can be a problem)

Two commonly used stemmers are the Porter stemmer (for English) and the Snowball stemmer (which also supports several other languages). We will take a look at the Porter stemmer for now.

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

sample = "walk walks walking walked"
stemmed_text = stem_words(sample)
print(stemmed_text)

# Output -> walk walk walk walk
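
For other languages, a minimal sketch with NLTK's SnowballStemmer (the French words are just an illustration):

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)  # languages supported by the Snowball stemmer

french_stemmer = SnowballStemmer("french")
for word in ["manger", "mangeons", "mangeaient"]:
    print(word, "->", french_stemmer.stem(word))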

Lemmatization

Unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma: the canonical, dictionary, or citation form of a set of words.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # tokenizer models, needed once
nltk.download('wordnet')  # lexical database used by the lemmatizer, needed once

# Initialize the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# List of punctuation marks to remove from the sentence
punctuations = "?!.,;"

# Tokenize the sentence
sentence_words = nltk.word_tokenize(sentence)

# Remove punctuation from the list of words
sentence_words = [word for word in sentence_words if word not in punctuations]

# Display the words and their lemmatized forms
print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

# The output will display each word along with its lemma
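
Note that by default the WordNet lemmatizer treats every word as a noun, so verbs like "running" are left unchanged above. Passing a part-of-speech tag fixes this; a minimal sketch reusing the lemmatizer from above:

# pos="v" tells the lemmatizer to treat the word as a verb
print(wordnet_lemmatizer.lemmatize("running"))            # running (treated as a noun)
print(wordnet_lemmatizer.lemmatize("running", pos="v"))   # run
print(wordnet_lemmatizer.lemmatize("eating", pos="v"))    # eat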

Note: Stemming is much faster since it is purely algorithmic, while lemmatization is slower because it essentially searches through a dictionary. But if we have to show the output to the user, we should use lemmatization since its output makes more logical sense to a reader.
