Text Preprocessing in Python

Adarsh Verma
Published in Deep Data Science
4 min read · Apr 28, 2019
Photo by Shawn Rodgers on Unsplash

[ This is part of 100 Days of ML ]

Machine learning algorithms need data in numeric form, but real-world data usually mixes feature types: numeric, text, categorical, ordinal. That’s why we need to preprocess text before feeding it into machine learning models. In this post, I use a dataset from Kaggle that contains the text and other features of blog posts from blogger.com. Check out the dataset:

Dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

The code can be found at the end of this post. This is how the data looks:

We will only consider the text feature from the dataset, which is the blog post content. I will apply text preprocessing techniques to this feature and show the results. Let’s begin!

1. Lowercasing — Change all words to lower case (or upper case) so that duplicates are avoided: “Python” and “python” would otherwise be treated as two separate words. Data after lowercasing:
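A minimal sketch of this step with pandas (the toy frame below stands in for the Kaggle corpus; the column name "text" is an assumption):

```python
import pandas as pd

# Toy frame standing in for the blog corpus; "text" is an assumed column name
df = pd.DataFrame({"text": ["Python is GREAT", "I love python"]})

# Lower-case every post so "Python" and "python" collapse into one token
df["text"] = df["text"].str.lower()
print(df["text"].tolist())  # ['python is great', 'i love python']
```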

2. Remove punctuation — Punctuation marks such as ‘?’, ‘,’ and ‘!’ support the readability and understandability of the English language. However, we don’t need them here, so we remove them, keeping only words and spaces. This step should be done after extracting features like hashtags and user mentions; see here for details.
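One way to sketch this is with a regular expression that keeps only word characters and whitespace (the sample sentence is made up for illustration):

```python
import re

post = "wow!! python, is #1 -- isn't it?"
# [^\w\s] matches anything that is not a letter, digit, underscore or space
cleaned = re.sub(r"[^\w\s]", "", post)
print(cleaned)
```

Note that digits survive this pattern; they are handled separately in step 7.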

3. Remove stop words — Words such as ‘the’, ‘a’ and ‘in’ are stop words: they carry no significant meaning that we can extract from the text. They are also the most frequently occurring words, and if left untreated they may introduce irrelevant bias into our machine learning models. Hence, it is better to remove them. Text after stop-word filtering:
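A sketch of stop-word filtering; the tiny stop-word set below is illustrative only — in practice you would use a full list such as `nltk.corpus.stopwords.words("english")`:

```python
# Tiny illustrative stop-word set (an assumption, not a complete list)
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "to"}

post = "the quick brown fox is in the garden"
filtered = [w for w in post.split() if w not in STOP_WORDS]
print(" ".join(filtered))  # quick brown fox garden
```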

4. Remove frequent words — Apart from stop words, there are some other words also present in the text which occur more frequently than others.

The intuition here is that if a word occurs in every instance of the text, it contributes nothing to a classification task (if that is what you are doing): a feature present in all instances cannot be used to differentiate between classes. The most frequently occurring words can be seen on the left.

To remove these words, we first take the whole text, split it into words and calculate their frequencies. Then we choose how many of the top words to remove and filter them out.
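The two steps above can be sketched with `collections.Counter` (the toy posts are made up; in practice you would count over the whole "text" column):

```python
from collections import Counter

# Toy corpus standing in for the real blog posts
posts = ["i love my blog", "my blog is my life", "read my blog"]

# Step 1: split every post into words and count frequencies over the corpus
freq = Counter(w for p in posts for w in p.split())

# Step 2: pick how many top words to drop (1 here, for illustration)
top = {w for w, _ in freq.most_common(1)}
posts = [" ".join(w for w in p.split() if w not in top) for p in posts]
print(posts)  # ['i love blog', 'blog is life', 'read blog']
```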

5. Remove rarely occurring words — Rarely occurring words appear only a few times in the whole dataset. If we compute bag-of-words or TF-IDF features from the text to train the model, these words only make our dataset sparser and slow down the model’s performance. So filter out these words as well. Here’s a glimpse of some words which occur only once in the entire dataset.
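This mirrors the previous step, but filters by a minimum count instead of a top-N cutoff (toy data again; the threshold of one occurrence is an illustrative choice):

```python
from collections import Counter

posts = ["python is fun", "python is powerful", "serendipity"]
freq = Counter(w for p in posts for w in p.split())

# Words seen only once across the whole corpus are treated as rare
rare = {w for w, c in freq.items() if c == 1}
posts = [" ".join(w for w in p.split() if w not in rare) for p in posts]
print(posts)  # ['python is', 'python is', '']
```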

6. Remove whitespace — Removing whitespace is a data cleaning step that strips leading and trailing whitespace and collapses repeated spaces, tabs and newlines inside the text.
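A one-line sketch: `str.split()` with no argument splits on any run of whitespace, so joining the pieces back normalizes it all:

```python
post = "  too   many \t spaces \n here  "
# split() handles spaces, tabs and newlines; join rebuilds with single spaces
cleaned = " ".join(post.split())
print(cleaned)  # too many spaces here
```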

7. Remove numbers — Depending on the problem you want to solve, numeric values present in the data may or may not contribute to your model, so be careful while performing this step. After removing numbers:
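A regex sketch that strips digit runs and then renormalizes the spacing (the sample sentence is made up):

```python
import re

post = "i woke up at 6 and wrote 2 posts in 2019"
# \d+ matches runs of digits; split/join cleans up the leftover double spaces
cleaned = " ".join(re.sub(r"\d+", "", post).split())
print(cleaned)  # i woke up at and wrote posts in
```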

Note: you can also remove “None/none” or other values that are not relevant to your problem in the same ways mentioned above.

8. Spelling correction — Spelling mistakes are one of the most common issues found in text data, and typos and shortcuts make the problem even worse. Spelling correction can be handled with TextBlob, a powerful Python library that can also be used for stemming and lemmatization. Spelling correction also removes duplication created by misspellings such as “python”, “pythom”, etc. Check out the code for performing spelling correction.

Warning: this step can take a very long time (sometimes hours, depending on the data size and system performance).
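A hedged sketch using TextBlob’s `correct()` method. TextBlob is a third-party package (`pip install textblob`), so the import is guarded and the snippet falls back to the raw text if it is missing; the misspelled sentence is made up:

```python
post = "speling mistaks are evrywhere"
try:
    from textblob import TextBlob  # third-party: pip install textblob
    corrected = str(TextBlob(post).correct())
except ImportError:
    corrected = post  # fall back to the raw text if textblob is unavailable
print(corrected)
```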

9. Lemmatization — Lemmatization is preferred over stemming because it reduces each word to an actual root word (the lemma), which avoids duplication in the text. Text after lemmatization:
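A sketch using NLTK’s `WordNetLemmatizer` (an assumption — TextBlob, mentioned above, can also lemmatize). NLTK and its WordNet corpus must be installed separately (`pip install nltk`, then `nltk.download("wordnet")`), so the snippet falls back to the raw tokens if they are unavailable:

```python
post = "the children are running faster than the dogs"
try:
    from nltk.stem import WordNetLemmatizer  # third-party: pip install nltk
    lemmatizer = WordNetLemmatizer()
    # Default part of speech is noun, so e.g. "dogs" -> "dog"
    lemmas = [lemmatizer.lemmatize(w) for w in post.split()]
except (ImportError, LookupError):
    lemmas = post.split()  # fall back if nltk or the wordnet corpus is missing
print(" ".join(lemmas))
```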

Find the code here.
