Text Preprocessing in Python

Coursesteach

📚Introduction

Natural Language Processing (NLP) has become a cornerstone of modern data science, allowing computers to understand, interpret, and respond to human language in a valuable way. Among the fundamental techniques in NLP are tokenization, stemming, lemmatization, stop words removal, and part-of-speech (POS) tagging. This blog will explore these essential processes and how they contribute to effective text analysis.

📚Table of Contents

Install requirements
Dataset
Lower Case
Removing digits
Removing punctuations
Removing trailing whitespaces
Tokenizing
Stemming
Lemmatization
POS Tagging

📚Install requirements

# Run this cell to install only the requirements of this notebook

# ===========================

!pip install numpy==1.19.5
!pip install nltk==3.2.5
!pip install spacy==2.2.4

📚Dataset

# This will be the corpus we work on; corpus_original keeps the raw version for POS tagging later
corpus_original = "Need to finalize the demo corpus which will be used for this notebook and it should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"
corpus = "Need to finalize the demo corpus which will be used for this notebook & should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"

📚Lower Case

One of the simplest yet most important steps in text preprocessing is lowercasing: converting all characters in the text to lowercase. Despite its simplicity, lowercasing plays a significant role in ensuring consistent and accurate text analysis.


#lower case the corpus
corpus = corpus.lower()
print(corpus)
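
As a quick illustration (my addition, not from the original notebook), lowercasing keeps differently-cased spellings of the same word from being counted as distinct tokens:

# without lowercasing, "Apple", "apple", and "APPLE" are three tokens
words = ["Apple", "apple", "APPLE"]
print(set(words))                   # three distinct entries
print({w.lower() for w in words})   # a single entry: {'apple'}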

📚Removing digits

Digits and numbers can add noise to the text data, especially when they are not relevant to the analysis. Removing them helps focus on the textual content that carries meaningful information.

#removing digits in the corpus
import re
corpus = re.sub(r'\d+','', corpus)
print(corpus)
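
A common variation (an assumption on my part, not in the original notebook) is to replace digits with a placeholder instead of deleting them, so that sentence structure such as "run 4 times" is preserved:

import re

# substitute every digit run with a <NUM> placeholder token
print(re.sub(r'\d+', '<NUM>', "This notebook has been run 4 times"))
# -> This notebook has been run <NUM> times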

📚Removing punctuations

Punctuation marks, while essential for human readability, often do not contribute to the meaning in the context of text analysis and can be considered noise in many NLP tasks. Removing punctuation is a common preprocessing step that helps in cleaning and standardizing the text data.

#removing punctuations
import string
corpus = corpus.translate(str.maketrans('', '', string.punctuation))
print(corpus)
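
To make the one-liner above less opaque: str.maketrans('', '', string.punctuation) builds a translation table that maps every punctuation character to None, and translate() then deletes those characters in a single pass. A minimal sketch (my own example strings):

import string

table = str.maketrans('', '', string.punctuation)
print(string.punctuation)                 # the characters that will be removed
print("done soon !!.".translate(table))   # -> "done soon " (punctuation gone, spaces kept)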

📚Removing trailing whitespaces

Leading, trailing, and repeated whitespace can make the text data appear messy and unstructured. Removing it helps in creating a cleaner dataset, which is easier to process and analyze.

# Sample text with trailing whitespace
text = "This is an example sentence with trailing spaces. "
print(' '.join(text.split()))

# apply the same cleanup to the corpus: split() discards all extra whitespace
corpus = ' '.join(corpus.split())
print(corpus)

# download the spaCy English model used in the sections below
!python -m spacy download en_core_web_sm

📚Tokenizing

Tokenization is one of the most fundamental steps in Natural Language Processing (NLP). It involves breaking down a text into smaller units called tokens, which can be words, phrases, or even punctuation marks.

from pprint import pprint
##NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
stop_words_nltk = set(stopwords.words('english'))

tokenized_corpus_nltk = word_tokenize(corpus)
print("\nNLTK\nTokenized corpus:",tokenized_corpus_nltk)
tokenized_corpus_without_stopwords = [i for i in tokenized_corpus_nltk if i not in stop_words_nltk]
print("Tokenized corpus without stopwords:",tokenized_corpus_without_stopwords)
##SPACY
import spacy
spacy_model = spacy.load('en_core_web_sm')

stopwords_spacy = spacy_model.Defaults.stop_words
print("\nSpacy:")
tokenized_corpus_spacy = [token.text for token in spacy_model(corpus)]
print("Tokenized Corpus:", tokenized_corpus_spacy)
tokens_without_sw = [word for word in tokenized_corpus_spacy if word not in stopwords_spacy]
print("Tokenized corpus without stopwords:", tokens_without_sw)

print("Difference between NLTK and spaCy output:\n",
set(tokenized_corpus_without_stopwords)-set(tokens_without_sw))
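
The difference printed above comes from two sources: the tokenizers split text differently, and the two libraries ship different stop-word lists. A small sketch (my addition) to inspect the stop-word lists directly:

# both variables are sets, so set operations show where the libraries disagree
print(len(stop_words_nltk), "NLTK stop words")
print(len(stopwords_spacy), "spaCy stop words")
print(sorted(stopwords_spacy - stop_words_nltk)[:10])  # sample of words only spaCy filters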

📚Stemming

Stemming reduces different forms of a word to a common base form by chopping off affixes. For example, "running" and "runs" are both reduced to "run." (Irregular forms such as "ran" are generally not handled by stemmers; that is where lemmatization helps.) This normalization lets related words be treated as a single token.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

print("Before Stemming:")
print(corpus)

print("After Stemming:")
for word in tokenized_corpus_nltk:
    print(stemmer.stem(word), end=" ")
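
A short demonstration (my addition) of what PorterStemmer actually produces; note that the stems are not always dictionary words, and irregular forms are left untouched:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["running", "runs", "ran", "easily", "fairly"]:
    print(w, "->", stemmer.stem(w))
# running -> run, runs -> run, but ran -> ran (irregular form missed)
# easily -> easili, fairly -> fairli (stems need not be real words)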

📚Lemmatization

Lemmatization is a critical text preprocessing step in Natural Language Processing (NLP) that reduces words to their base or dictionary form, known as the lemma.

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

for word in tokenized_corpus_nltk:
    print(lemmatizer.lemmatize(word), end=" ")
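
One caveat worth showing (my addition): WordNetLemmatizer assumes every word is a noun unless you pass a part-of-speech hint, so verbs often come back unchanged. Supplying pos="v" fixes this:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("ran"))            # 'ran'  (treated as a noun by default)
print(lemmatizer.lemmatize("ran", pos="v"))   # 'run'  (lemmatized as a verb)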

📚POS Tagging

Part-of-Speech (POS) tagging is a crucial step in Natural Language Processing (NLP) that involves labeling words in a text with their respective parts of speech, such as nouns, verbs, adjectives, adverbs, etc.

# POS tagging using spaCy
print("POS Tagging using spacy:")
doc = spacy_model(corpus_original)
# print each token with its coarse POS tag
for token in doc:
    print(token, ":", token.pos_)

#pos tagging using nltk
nltk.download('averaged_perceptron_tagger')
print("POS Tagging using NLTK:")
pprint(nltk.pos_tag(word_tokenize(corpus_original)))
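
Note that the two taggers use different tag sets: spaCy's token.pos_ gives coarse Universal POS tags (NOUN, VERB, ...), while nltk.pos_tag returns fine-grained Penn Treebank tags (NN, VBZ, ...). spaCy also exposes a fine-grained tag as token.tag_, which is the closer comparison, as this short sketch (my addition) shows:

# compare spaCy's coarse and fine-grained tags on a tiny example
for token in spacy_model("She runs fast"):
    print(token.text, token.pos_, token.tag_)
# e.g. runs -> VERB (coarse) and VBZ (fine-grained, Penn Treebank)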

Please follow and 👏 clap for Coursesteach to see the latest updates on this story

🚀 Elevate Your Data Skills with Coursesteach! 🚀

Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!

🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️‍🗨️ Computer Vision, 🔬 Research — all in one place!

Don’t Miss Out on This Exclusive Opportunity to Enhance Your Skill Set! Enroll Today 🌟 at

Machine Learning projects course

🔍 Explore free projects from top universities in Computer Vision, NLP, Machine Learning, Deep Learning, Time Series, and Python; access insightful slides and source code; and tap into a wealth of free online resources and GitHub repositories related to Machine Learning projects. Connect with like-minded individuals on Reddit, Facebook, and beyond, and stay updated with our YouTube channel and GitHub repository. Don't wait: enroll now and unleash your Machine Learning potential!

Stay tuned for our upcoming articles, where we will explore specific topics related to Deep Learning in more detail!

Remember, learning is a continuous process. So keep learning, keep creating, and keep sharing with others! 💻✌️

📚GitHub Repository

Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!

Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to any of the courses, or if you have suggestions for improving any Coursesteach content, feel free to reach out and follow.

Together, let’s make this the best AI learning Community! 🚀

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

Source

Text preprocessing
