Text Preprocessing — NLP Basics

Nupur Kapur
Published in Analytics Vidhya · Jul 15, 2020

Text preprocessing is the first step in the Natural Language Processing (NLP) pipeline, and it can have a significant impact on the final results. Text preprocessing is the process of bringing text into a form that is predictable and analyzable for a specific task. A task is the combination of an approach and a domain: for example, extracting top keywords with TF-IDF (approach) from tweets (domain) is a task. The main objective of text preprocessing is to break the text into a form that machine learning algorithms can digest. In this article, we will perform text preprocessing on a corpus of toxic comments and categorize the comments based on different types of toxicity.

We will be using the dataset given in the link below:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Text Preprocessing Techniques

There are different ways to preprocess your text. Listed below are some of the techniques that help in preprocessing the input text.

Noise removal

Noise removal is about removing digits, characters, and pieces of text that interfere with the process of text analysis. It is one of the most important steps of text preprocessing and is highly domain-dependent. For example, in tweet data, noise could be all the special characters except hashtags, since a hashtag signifies a concept that can characterize a tweet. The problem with noise is that feeding uncleaned data to machine learning models can produce inconsistent results.

There are various ways to remove noise, including punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal, and more. It all depends on which domain you are working in and what counts as noise for your task.
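As a minimal sketch, noise removal with regular expressions could look like the snippet below. The exact patterns are illustrative assumptions, not rules taken from the comments dataset.

import re

def remove_noise(text):
    # strip HTML formatting (illustrative pattern)
    text = re.sub(r"<[^>]+>", " ", text)
    # drop the domain-specific keyword 'RT' (retweet marker)
    text = re.sub(r"\bRT\b", " ", text)
    # remove digits
    text = re.sub(r"\d+", " ", text)
    # remove special characters, but keep '#' so hashtags survive
    text = re.sub(r"[^A-Za-z#\s]", " ", text)
    # collapse the leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("RT <b>Check</b> this out!! #NLP 2020"))
# -> 'Check this out #NLP'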

Tokenization

Tokenization is defined as the process of splitting text into smaller units, i.e., tokens, perhaps at the same time throwing away certain characters, such as punctuation. Tokens could be words, numbers, symbols, n-grams, or characters. N-grams are a combination of n words or characters together. Tokenization does this task by locating word boundaries.

Input: Friends, Romans, Countrymen, lend me your ears
Output: ['Friends', ',', 'Romans', ',', 'Countrymen', ',', 'lend', 'me', 'your', 'ears']

The most widely used tokenization process is whitespace tokenization. In this process, the entire text is split into tokens based on the whitespace between words.
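As a quick illustration, Python's built-in split() gives a bare-bones whitespace tokenization. Note that it leaves punctuation attached to the words, which is why NLTK's word_tokenize is used further below.

sentence = "Friends, Romans, Countrymen, lend me your ears"
print(sentence.split())
# -> ['Friends,', 'Romans,', 'Countrymen,', 'lend', 'me', 'your', 'ears']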

The first task we perform on data is to split the comments into smaller units called tokens that could be words, numbers, or symbols. After splitting the text into tokens, we count the number of each type of token.

From nltk.tokenize we can import word_tokenize to perform the task of tokenization.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

sentence = "Hello, I am Nupur"
tokens = word_tokenize(sentence)
print(tokens)
Output = ['Hello', ',', 'I', 'am', 'Nupur']

We can visualize the frequency of each token present in the dataset using the wordcloud package:
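A minimal sketch of how such a word cloud can be generated, assuming the tokens of the corpus have been joined back into a single string:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(tokens)  # tokens from the tokenization step
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()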

Limitations of Tokenization

Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by spaces. Languages such as Chinese and Thai are said to be unsegmented, as words do not have clear boundaries. Tokenizing an unsegmented language requires additional lexical and morphological information. Tokenization is also affected by the writing system. The structures of languages can be grouped into three categories:

Isolating: Words do not divide into smaller units. Example: Mandarin

Agglutinative: Words divide into smaller units. Example: Japanese, Tamil

Inflectional: Boundaries between morphemes are not clear and are ambiguous in terms of grammatical meaning. Example: Latin

Lowercasing

This is the simplest technique of text preprocessing: it consists of lowercasing every single token of the input text. It helps in dealing with sparsity issues in the dataset. For example, a text may contain mixed-case occurrences of the same token, i.e., ‘canada’ in some places and ‘Canada’ in others. To eliminate this variation so that it does not cause further problems, we lowercase the text, which reduces sparsity and shrinks the vocabulary size.

Despite its effectiveness in reducing sparsity and vocabulary size, lowercasing sometimes hurts a system’s performance by increasing ambiguity. For example, take ‘Apple is the best company for smartphones’. When we lowercase it, ‘Apple’ is transformed into ‘apple’, and this creates ambiguity: the model cannot tell whether apple refers to the company or the fruit, and there is a higher chance that it interprets apple as a fruit.

In the given dataset, we perform the task of lowercasing after tokenization and lowercase all the tokens.

# lowercase every token produced by the tokenization step
lowercase_words = []
for word in tokens:
    word = word.lower()
    lowercase_words.append(word)

Normalization

Normalization is the process of converting a token into its base form (morpheme). Inflections are removed from the token to obtain the base form of the word. Normalization helps in reducing the number of unique tokens and the redundancy in the data. It reduces the data’s dimensionality and removes variations of a word from the text.

There are two techniques to perform normalization. They are Stemming and Lemmatization.

  • Stemming:

Stemming is the elementary rule-based process of removal of inflectional forms from a token. The token is converted into its root form. For example, the word ‘troubled’ is converted into ‘trouble’ after performing stemming.

There are different algorithms for stemming but the most common algorithm, which is also known to be empirically effective for English, is Porter’s Algorithm. Porter’s Algorithm consists of 5 phases of word reductions applied sequentially.

Since stemming follows a crude heuristic approach that chops off the ends of tokens in the hope of correctly transforming them into their root form, it may sometimes generate non-meaningful terms. For example, it may convert the token ‘increase’ into ‘increas’, causing the token to lose its meaning.

Stemming has two types of errors — over-stemming and under-stemming. Over-stemming refers to the problem where two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is the situation where two words with the same stem are not stemmed together. This is also known as a false negative. Light stemming tends to reduce over-stemming errors but increases the under-stemming errors whereas heavy stemming increases over-stemming errors but reduces under-stemming errors.

The NLTK package has a PorterStemmer class for stemming words.

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stemmed_words = []
for word in lowercase_words:
    stemmed_words.append(ps.stem(word))
print(stemmed_words)
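To see the over-stemming problem described above in action, here is a small check using the classic ‘universe/university/universal’ triple, an illustrative example rather than one drawn from the comments dataset:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
for word in ['universe', 'university', 'universal']:
    print(word, '->', ps.stem(word))
# all three collapse to 'univers', even though their meanings differ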
  • Lemmatization:

Lemmatization is similar to stemming, the difference being that lemmatization does things properly with the use of a vocabulary and morphological analysis of words, aiming to remove inflections from the word and return the base or dictionary form of that word, also known as the lemma. It does a full morphological analysis of the word to accurately identify the lemma for each word. It may use a dictionary such as WordNet for the mapping, or some other rule-based approach.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by the lemmatizer

wml = WordNetLemmatizer()
lemma = []
for word in lowercase_words:
    lemma.append(wml.lemmatize(word))

For example, if the token given for lemmatization is ‘increase’, it returns ‘increase’ as its lemma whereas stemming returns ‘increas’.
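A quick side-by-side check, using the stemmer and lemmatizer objects created above:

print(ps.stem('increase'))        # -> 'increas'
print(wml.lemmatize('increase'))  # -> 'increase'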

Though lemmatization is in principle more accurate than stemming, neither form of normalization tends to improve English IR performance in aggregate. In some cases it proves useful, while in others it hampers performance.

Stop-word removal

Stop-words are commonly used words in a language. Examples are ‘a’, ‘an’, ‘the’, ‘is’, ‘what’, etc. Stop-words are removed from the text so that we can concentrate on more important words and prevent stop-words from being analyzed. If we search ‘what is text preprocessing’, we want to focus more on ‘text preprocessing’ rather than ‘what is’.

Stop words can mean different things for different applications. In some applications, removing all stop words, from determiners to prepositions, is appropriate. But in applications like sentiment analysis, removing tokens like ‘not’ or ‘good’ can throw algorithms off their tracks.

From the comments dataset, we will remove all the stop words keeping in mind not to remove stop words like not or good, since these words are crucial for toxicity analysis for our corpus.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # stop-word lists for various languages

Stopwords = set(stopwords.words('english'))
filter_words = []
for word in lemma:
    if word not in Stopwords:
        filter_words.append(word)

After removing stop-words, a word cloud of the dataset shows only the remaining, more informative tokens.

Object Standardization

Text data often contains words and phrases that are not present in any lexical dictionary. If your application does not benefit from these words and they just lead to sparsity issues, you can consider removing them from the dataset.

Some examples are acronyms, hashtags attached to words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed.
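As a minimal sketch, standardization with a manually prepared dictionary could look like this; the lookup table below is a made-up illustration, not a dictionary used for the comments dataset.

# illustrative slang/acronym lookup table
standardization_dict = {'rt': 'retweet', 'dm': 'direct message', 'luv': 'love'}

def standardize_words(words):
    # replace each token with its standard form when one is known
    return [standardization_dict.get(word, word) for word in words]

print(standardize_words(['i', 'luv', 'this', 'dm']))
# -> ['i', 'love', 'this', 'direct message']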

Removing punctuation

The next step is to remove punctuation, as it adds no value to the model. Removing the punctuation will also help in reducing the size of the training set.

We will remove punctuation such as commas and full stops from the comments, as it does not add any extra information when treating the text data.

We can use regular expressions to remove all the punctuation from the comments by providing a set of punctuation characters, so that whenever any of the listed characters is encountered it is removed from the text.
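A minimal regular-expression sketch of this step, assuming Python's string.punctuation as the set of characters to strip:

import re
import string

def remove_punctuation(text):
    # delete every character that appears in string.punctuation
    return re.sub('[' + re.escape(string.punctuation) + ']', '', text)

print(remove_punctuation("Hello, I am Nupur!"))
# -> 'Hello I am Nupur'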

Removing whitespaces

After removing the punctuation, we remove the extra whitespace in the text data, as it is useless and only increases the size of the training set. We will remove all unnecessary whitespace from the comments, keeping only those tokens that contribute towards the toxicity analysis of the corpus.
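A short illustrative sketch, collapsing runs of whitespace into a single space and trimming the ends:

import re

def remove_extra_whitespace(text):
    # collapse runs of spaces, tabs, and newlines into a single space
    return re.sub(r'\s+', ' ', text).strip()

print(remove_extra_whitespace("  too   many    spaces \n here "))
# -> 'too many spaces here'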

In the next blog, we will discuss more about NLP.

Stay tuned and Happy Learning!
