Comprehensive Guide to Text Cleaning in Python: Effective NLP Preprocessing with NLTK

Amruthjithraj V.R
Published in Analytics Vidhya · 3 min read · Jun 4, 2020

A simple code snippet for text cleaning!

1. Text Cleaning and Its Importance:

Once the data has been acquired, it needs to be cleaned. Raw data usually contains duplicate entries, errors, and inconsistencies. Data pre-processing is an important step before applying any machine learning model, and text data is no exception. Pre-processing text means cleaning out noise: removing stop words, punctuation, terms that don’t carry much weight in the context of the text, and so on. In this article, we describe in detail how to pre-process text data for machine learning algorithms using Python (NLTK).

Without any further ado, let’s dive into the code.

2. Importing Important Libraries:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

3. Using a for loop to implement all the text cleaning techniques in one go

corpus = []
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Loop over every row of the dataset (1,732 rows in this example)
for i in range(0, 1732):
    # Keep only English letters; everything else becomes a space
    text_data = re.sub('[^a-zA-Z]', ' ', Raw_Data['Column_With_Text'][i])
    text_data = text_data.lower()
    text_data = text_data.split()
    # Lemmatize each word and drop stop words
    text_data = [wl.lemmatize(word) for word in text_data if word not in stop_words]
    text_data = ' '.join(text_data)
    corpus.append(text_data)

4. Now let’s see what the for loop actually does

4.1. The first step removes every character that is not an English letter. This step is essential because special characters and numbers add noise to the data, which can adversely affect the performance of the machine learning model. A regular expression is used to replace every non-alphabetic character with a space.
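Here is a minimal sketch of this step on its own, using a made-up input string purely for illustration:

import re

raw = "Order #42: costs $9.99!!"
# Every character that is not a-z or A-Z is replaced with a space
cleaned = re.sub('[^a-zA-Z]', ' ', raw)
print(cleaned)  # only the words 'Order' and 'costs' survive, padded by spaces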

4.2. The second step normalizes the text. Normalizing the text is an essential step because it reduces the dimensionality of the model. If the text is not normalized, the same word in different cases (for example, “Python” and “python”) gets duplicated in the vocabulary. The lower() string method in Python is used for this: it converts all characters to lowercase, which solves the problem.
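A quick illustration of how lowercasing collapses case variants of the same word (the sample string is made up):

text = "Python PYTHON python"
# All three tokens become identical, so they map to a single
# vocabulary entry in a bag-of-words model instead of three
print(text.lower().split())  # ['python', 'python', 'python']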

4.3. The third step lemmatizes the words. Lemmatizing is an essential step because it also reduces data duplication. Related words such as work, working, and worked express the same concept, but they would be counted as three different words when creating a bag-of-words model. The WordNetLemmatizer class from the NLTK library is used to tackle this problem: it reduces any given word to its base (dictionary) form, known as the lemma.
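A minimal sketch of the lemmatizer in isolation. One caveat worth knowing: NLTK’s WordNetLemmatizer treats every word as a noun by default, so verb forms such as “working” only collapse to “work” when the part of speech is passed explicitly (the loop above uses the default noun mode):

from nltk.stem import WordNetLemmatizer

wl = WordNetLemmatizer()
print(wl.lemmatize('workers'))           # 'worker' (default: treated as a noun)
print(wl.lemmatize('working', pos='v'))  # 'work'   (lemmatized as a verb)
print(wl.lemmatize('worked', pos='v'))   # 'work'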

4.4. The fourth step removes all the stop words. Removing stop words is an essential step because stop words add dimensionality to the model without carrying much meaning, and this extra dimensionality hurts performance. The stopwords corpus from the NLTK library is used here: every word in the text is compared against the stop word list, and any word that appears in the list is removed.
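A small sketch of the stop word filter on a made-up token list; converting the list to a set first, as the loop above does, makes each membership check fast:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
words = ['this', 'is', 'a', 'useful', 'sentence']
print([w for w in words if w not in stop_words])  # ['useful', 'sentence']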

This article is for people who are starting with NLP and are stuck with text cleaning. Text cleaning can be a headache in most cases. This code can help you with the most basic text-cleaning techniques and can be used straight away.

Thank you for reading! Please consider following for more such blogs!

I hope you learned something new!

Cheers.
