Beginner’s Guide to Text Preprocessing in Python

Yasmeen Hitti
Published in BiaslyAI
Jan 31, 2019

What is Text Preprocessing?

Text preprocessing is a step that occurs after text mining (collecting your text data). Text data can be sourced from different places: it can come from online books, it can be web scraped, and it may also come from online documentation. Text preprocessing is essential in order to further manipulate your text data. In natural language processing, one thing to keep in mind is that whatever you do to the raw data may have an impact on how your model will be trained. For example, stripping punctuation and lowercasing may change the meaning of your sentences. This is something to keep in mind while going through your data and deciding what you want to have as an end result.

Imports

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import re  # regular expressions

NLTK is a suite of libraries that helps tokenize (break down) text into the desired pieces of information (words and sentences). The nltk.stem package will allow for stemming and lemmatization (normalization techniques). Both NumPy and Pandas are imported in case you have a preference when manipulating your data. Finally, regular expressions are imported to help us identify and remove specific characters.
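One small note: the stopwords corpus imported above from nltk.corpus also has to be downloaded once before it can be used, mirroring the punkt and wordnet downloads already shown:

nltk.download('stopwords')  # required before calling stopwords.words('english')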

Import Data

Df_pd = pd.read_csv("C:/Your/files/textdata.csv", encoding='utf-8', header=None)
Df_np = np.asarray(pd.read_csv("C:/Your/files/textdata.csv", encoding='utf-8', header=None))

Two options are shown here for importing your file into Python. Although they look the same since they both use pd.read_csv, the manipulations will be different further downstream. The encoding in this example is utf-8; this encoding is a good choice when processing text taken from online sources, since you may be dealing with more than 256 possible characters (256 is the limit of the single-byte latin-1 encoding). If you do not know what encoding your file was saved in, you can inspect the file or try to detect it programmatically, as sketched below.
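One non-authoritative way to guess an unknown encoding is the third-party chardet package (assumed to be installed separately, e.g. with pip install chardet); keep in mind the detection is a heuristic, not a guarantee:

import chardet  # third-party package for encoding detection

with open("C:/Your/files/textdata.csv", "rb") as f:  # read the raw bytes
    guess = chardet.detect(f.read(100000))  # look at the first ~100 KB

print(guess)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}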

Sentence Tokenization

Tokenization means chopping up existing text into smaller chunks. For example, a paragraph can be tokenized into sentences, and sentences further into words. We will start with a simple string that we would like to divide into sentences.

paragraph = " Hello world! I am going for coffee before work! "
sentences = sent_tokenize(paragraph)
print(sentences)
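The printed result should be a list with one string per sentence, roughly like the following (exact whitespace handling can vary slightly between NLTK versions):

['Hello world!', 'I am going for coffee before work!']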

If you are preprocessing a pandas DataFrame, you will want to loop the tokenization procedure over all of your rows. This is achieved by creating a function. If you are preprocessing a NumPy array, you will want to loop over each element of the array in order to tokenize the sentences.

def tokenization_s(sentences):  # same can be achieved for word tokens
    s_new = []
    for sent in sentences[:][0]:  # for NumPy use sentences[:]
        s_token = sent_tokenize(sent)
        if s_token:  # skip rows that produced no sentences
            s_new.append(s_token)
    return s_new
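As a quick sanity check, here is a hypothetical two-row DataFrame run through the function (the input sentences are purely illustrative):

raw = pd.DataFrame(["Hello world! I am going for coffee before work!",
                    "It is raining today. I will bring an umbrella."])
print(tokenization_s(raw))
# [['Hello world!', 'I am going for coffee before work!'],
#  ['It is raining today.', 'I will bring an umbrella.']]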

Regular Expressions

This function will return a clean version of your text. For pandas users, each row will contain a clean version of the text that was in that row. For NumPy users, the function will return a clean text element for each element of the array.

def preprocess(text):
    clean_data = []
    for x in text[:][0]:  # for a pandas DataFrame (e.g. Df_pd); for a NumPy array (Df_np) use text[:]
        new_text = re.sub('<.*?>', '', x)            # remove HTML tags
        new_text = re.sub(r'[^\w\s]', '', new_text)  # remove punctuation
        new_text = re.sub(r'\d+', '', new_text)      # remove numbers
        new_text = new_text.lower()                  # lower case; use .upper() for upper
        if new_text != '':
            clean_data.append(new_text)
    return clean_data
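For example, passing a made-up row containing HTML tags, punctuation, and digits through the function should behave roughly as follows (the double space is simply left behind where the number was removed):

dirty = pd.DataFrame(["<p>I drank 2 coffees before work!</p>"])
print(preprocess(dirty))
# ['i drank  coffees before work']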

Word Tokenization

The same principle applies as with the sentence tokenizer; here we use word_tokenize from the nltk.tokenize package. First we will tokenize words from a simple string.

sentences = " Hello world! I am going for coffee before work! "
words = word_tokenize(sentences)
print(words)
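The printed output should look roughly like this; note that the punctuation marks come out as their own tokens:

['Hello', 'world', '!', 'I', 'am', 'going', 'for', 'coffee', 'before', 'work', '!']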

It is important to consider the order in which you perform your preprocessing steps; the preprocess function was applied before the word tokenization function in order to avoid punctuation being stored as words. If your data is stored in pandas, you will want to create a function to loop over every row in your DataFrame.

def tokenization_w(words):
    w_new = []
    for w in words[:][0]:  # for NumPy use words[:]
        w_token = word_tokenize(w)
        if w_token:  # skip rows that produced no tokens
            w_new.append(w_token)
    return w_new

Stemming

Stemming is used to normalize parts of text data. What does this mean exactly? When a verb is conjugated in multiple tenses throughout a document you would like to process, stemming will reduce all of these conjugated forms to a common, shortened root; it preserves the root of the verb in this case. Stemming is applied to all types of words, adjectives included, that share the same root.

In this tutorial we will use the SnowballStemmer from the nltk.stem package. Stemming can also be achieved with the PorterStemmer. This example is once again performed on a pandas DataFrame.

snowball = SnowballStemmer(language='english')

def stemming(words):
    new = []
    stem_words = [snowball.stem(x) for x in words[:][0]]  # stem each token
    new.append(stem_words)
    return new

Let’s run the function to show its output:

test = ['You like to eat apples. He has eaten many apples because he likes eating.']
test_pd = pd.DataFrame(test)  # makes this into a pandas DataFrame
clean_test = preprocess(test_pd)  # removes punctuation, see above
clean_words = tokenization_w(pd.DataFrame(clean_test))  # word tokenization (wrap the cleaned list back into a DataFrame so column 0 can be indexed)
stem_test = stemming(clean_words)  # stemming similar words
print(stem_test)
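Assuming the steps above, the printed result should look something like the following (the extra level of nesting comes from the append inside the stemming function):

[['you', 'like', 'to', 'eat', 'appl', 'he', 'has', 'eaten', 'mani', 'appl', 'becaus', 'he', 'like', 'eat']]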

What we can observe is no punctuation, no upper case letters, and each word stored individually (tokenized). Stemming occurred on the verb “to eat” (“eating” becomes “eat”), on “because”, and on “apples”.

Lemmatization

Lemmatization is another normalization technique used in natural language processing. The linguistic difference with respect to stemming is that lemmatization can group together words that do not share the same surface root, so that they can be processed as one item.
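A minimal sketch of the difference, calling WordNetLemmatizer directly on a few words (the pos argument tells the lemmatizer to treat the word as a verb or adjective; by default it assumes a noun):

lem = WordNetLemmatizer()
print(lem.lemmatize('apples'))           # 'apple' (default part of speech is noun)
print(lem.lemmatize('eaten', pos='v'))   # 'eat' (the stemmer left 'eaten' untouched)
print(lem.lemmatize('better', pos='a'))  # 'good' (no shared root, yet grouped with 'good')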

lemmatizer = WordNetLemmatizer()

def lemmatization(words):
    new = []
    lem_words = [lemmatizer.lemmatize(x) for x in words[:][0]]  # lemmatize each token
    new.append(lem_words)
    return new

If we use the same tokenized words (clean_words) as the input to the lemmatization function, we can see what has been grouped together and how the result differs from stemming.

lemtest = lemmatization(clean_words)
print(lemtest)
