Complete Guide to Text Preprocessing in NLP

Devang Chavan
6 min read · Jan 13, 2024


Text preprocessing is a crucial step in building NLP solutions and applications. Once we acquire data for a project, the first thing we need to do is preprocess the text so that it is suitable as input to a machine learning / deep learning model. Preprocessing makes the data consistent, removes noise, and aligns it with the project requirements. It is an important step in the NLP pipeline.

Basic Steps in Text Preprocessing

  1. Lowercasing
  2. Remove HTML Tags
  3. Remove URLs
  4. Remove Punctuation
  5. Chat Word Treatment
  6. Spelling Correction
  7. Removing Stopwords
  8. Handling Emojis
  9. Tokenization
  10. Stemming
  11. Lemmatization

It is not necessary to apply all of the above steps to every dataset. Use your judgment and choose the preprocessing steps that fit the project requirements.

Let us first pick a dataset on which we can apply these preprocessing steps. For this post, we will use the IMDB movie reviews dataset, which can be found on Kaggle.

import pandas as pd
import numpy as np
from IPython.display import display

data = pd.read_csv('Datasets/IMDB Dataset.csv')
display(data.head(10))

1. Lowercasing

Lowercasing involves converting all text to lowercase. This is crucial to ensure uniformity in text data, as it treats “Word” and “word” as the same entity. Lowercasing is often applied as a preliminary step to maintain consistency throughout the dataset.

# To lowercase text in Python we simply use str.lower
data['review'] = data['review'].str.lower()
display(data.head())

2. Remove HTML Tags

HTML tags, often present in scraped data, are irrelevant for training machine learning models. Removing these tags ensures that the model focuses on the actual textual content rather than HTML formatting.

import re

def remove_html_tag(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

data['review'] = data['review'].apply(remove_html_tag)
data.head()
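A regex like <.*?> handles simple markup but can trip over edge cases such as attributes that contain a > character. If robustness matters and BeautifulSoup is available (pip install beautifulsoup4), here is a sketch of an alternative; remove_html_tag_bs4 is a name introduced here:

from bs4 import BeautifulSoup

def remove_html_tag_bs4(text):
    # get_text() parses the markup and returns only the textual content
    return BeautifulSoup(text, 'html.parser').get_text()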

3. Remove URLs

URLs in text data might not contribute valuable information to the model. Removing them aids in simplifying the data and avoiding potential confusion.

def remove_urls(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

# Example
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'
print(remove_urls(text4))

# To apply URL removal to the whole dataset, we use pandas' apply function
data['review'] = data['review'].apply(remove_urls)

4. Remove Punctuation

Punctuation removal eliminates characters like commas and periods, focusing on the core words in the text. This step is beneficial for tasks such as sentiment analysis.

import string

exclude = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~'  # all punctuation except the full stop, to retain sentence structure
def remove_punctuation_optimized(text):
    return text.translate(str.maketrans('', '', exclude))

data['review'] = data['review'].apply(remove_punctuation_optimized)
data.head()
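A quick sanity check that the full stop survives while other punctuation is stripped:

# Example
remove_punctuation_optimized('Hello, world! NLP is: fun.')  # -> 'Hello world NLP is fun.'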

5. Chat Word Treatment

In many chat applications, chat words like “FYI (For Your Information)” or “IMHO (In My Honest Opinion)” can be ambiguous for models. Converting them to their full forms helps the model better comprehend the text.


# We found a repo on GitHub that contains a dictionary of chat words and their full forms. We will use it here.
slang_dict = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It\'s Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laughter',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': 'I don\'t care',
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'BFF': 'Best friends forever',
    'CSL': 'Can\'t stop laughing'
}

def chat_conversion(text):
    new_text = [slang_dict[w.upper()] if w.upper() in slang_dict else w for w in text.split()]
    return ' '.join(new_text)

# Example
text_slang = "FYI I already knew"
chat_conversion(text_slang)
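The call above returns 'For Your Information I already knew'. Note that this split-based lookup misses tokens with attached punctuation, so it is best applied after punctuation removal:

chat_conversion("FYI, I already knew")  # 'FYI,' does not match the key 'FYI' and is left unchanged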

6. Spelling Correction

Spelling correction enhances the quality of textual data by fixing typos and inaccuracies. Utilizing libraries like TextBlob can aid in automatically correcting misspelled words.

from textblob import TextBlob

incorrect_text = 'texxt preprocesing is a curciaal stap in naturla languag procesing'
text_blob = TextBlob(incorrect_text)
text_blob.correct().string
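TextBlob needs its corpora on first use (python -m textblob.download_corpora), and correct() is slow, so running it over an entire corpus can take a long time. A sketch of applying it to the dataset, commented out because of the cost:

# Warning: very slow on the full dataset; consider trying a small sample first
# data['review'] = data['review'].apply(lambda t: str(TextBlob(t).correct()))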

7. Removing Stopwords

Stopwords are words that help in sentence formation but contribute very little to the actual meaning of the sentence, for example: the, and, for. They can be easily removed using the NLTK library, which streamlines the data and retains only the significant words. However, there are some tasks where we keep the stopwords, such as part-of-speech tagging.

from nltk.corpus import stopwords

def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stopwords.words('english')])

# Example
text = 'This is a really great time for the field of AI. It is advancing exponentially'
remove_stopwords(text)
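The stopword list has to be downloaded once with nltk.download('stopwords'). Also, stopwords.words('english') rebuilds the list on every call, which is slow across a whole dataframe; a minimal optimization sketch (remove_stopwords_fast is a name introduced here) builds a set once:

import nltk
nltk.download('stopwords')  # one-time download

# Build the lookup set once instead of on every word
stop_set = set(stopwords.words('english'))

def remove_stopwords_fast(text):
    return " ".join([word for word in text.split() if word not in stop_set])

data['review'] = data['review'].apply(remove_stopwords_fast)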

8. Handling Emojis

Emojis contribute expressive content but can be challenging for models. Options include removing emojis entirely or replacing them with their meanings.

def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Example
remove_emoji("You are very funny 😂😂😂")
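To replace emojis with their meanings instead of dropping them, the third-party emoji package offers demojize (assuming it is installed via pip install emoji):

import emoji

# Converts each emoji to its textual name
emoji.demojize("You are very funny 😂")
# -> 'You are very funny :face_with_tears_of_joy:'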

9. Tokenization

Tokenization involves breaking text into smaller units, such as words or sentences. It aids in preparing data for analysis, providing meaningful chunks.

from nltk.tokenize import word_tokenize, sent_tokenize

# NLTK's tokenizers need the 'punkt' models: import nltk; nltk.download('punkt')
sent1 = 'I am going to Mumbai'

print(word_tokenize(sent1))  # ['I', 'am', 'going', 'to', 'Mumbai']
print(sent_tokenize(sent1))  # ['I am going to Mumbai']
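Since the sample is a single sentence, sent_tokenize returns it unchanged; a two-sentence string shows the split:

print(sent_tokenize('I am going to Mumbai. I will travel by train.'))
# -> ['I am going to Mumbai.', 'I will travel by train.']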

10. Stemming

Stemming is the process of reducing inflection in words to their root forms. (Inflection means the same word is expressed in different grammatical categories such as tense, case, or voice; for example walk, walking, walked, or do, undo, doable, undoable.) It maps a group of words to the same stem even if the stem itself is not a valid word in the language.

Stemming reduces words to their root form, helping in tasks like information retrieval. The NLTK library provides stemmers for this purpose.

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

sample = 'walk walks walking walked'
stem_words(sample)
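This returns 'walk walk walk walk'. As noted above, the stem is not always a valid word; a quick demonstration:

print(stem_words('studies studying study'))  # -> 'studi studi studi'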

11. Lemmatization

It’s essentially the same idea as stemming, but stemming sometimes produces stems that are not actual words in the English language. Lemmatization solves this: its output is always a valid dictionary word. The trade-off is that lemmatization is slower than stemming.

import nltk
from nltk.stem import WordNetLemmatizer

# WordNet data is needed once: nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Filter punctuation tokens with a list comprehension; removing items from a
# list while iterating over it skips elements.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))
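Note the pos='v' argument: WordNetLemmatizer treats words as nouns by default, so many verb forms would pass through unchanged without it:

wordnet_lemmatizer.lemmatize('running')           # -> 'running' (default pos is noun)
wordnet_lemmatizer.lemmatize('running', pos='v')  # -> 'run'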

Conclusion

Mastering text preprocessing is fundamental for unleashing the true potential of text data in machine learning applications. Each preprocessing step plays a role in refining raw text and ensuring it aligns with the specific requirements of your project. From lowercasing and removing HTML tags to handling emojis and lemmatization, these techniques collectively contribute to a robust and meaningful dataset. Tailoring these steps to your project's needs is key: understanding the intricacies of your textual data is pivotal for achieving good results in natural language processing tasks.
