Text exists everywhere in every aspect of human life: newspapers, SMS, Wikipedia, even the conversations in our chat apps. A huge amount of text is produced every day, and a lot of information and insight can be extracted from it.
Traditional data analysis is typically based on a relational model, in which data is stored in what we call tables. We call this form of data structured data. However, only about 20% of data is stored as structured data; the rest is what we call unstructured data (see: Big Data For Dummies).
Text is one of many forms of unstructured data. If we say that structured data is “big”, then unstructured data is definitely “huge”. Handling this type of data is not as easy as handling structured data. Computers are good at processing structured data, because structured data is basically numbers at its core: how many times a page was visited, how long someone was on your site, what products they bought, where they came in from. But what kind of interpretation can a computer get from raw text?
Many businesses take different approaches to this problem. One of them is applying brute force and manual labor: reading raw text by hand, interpreting emotion and sentiment, then converting it into structured data and putting it into visualizations. Unfortunately, brute force wastes time and is not efficient.
Given this background, a concept called Text Analytics was born.
Text Analytics is the practice, or set of processes, of converting unstructured text data into meaningful data for analysis.
As we can see, there is the word ‘converting’ in the definition above. So the question is, why should we convert the data? Why can’t we just feed the raw text to the computer? The answer is that a computer is a machine with no built-in capability to process text or words.
Then how do we prepare the data so it can be processed by a machine? To answer this, we should know how the data flows through something we call a pipeline. In general, the Text Analytics pipeline can be illustrated as follows.
As shown in the picture above, to get understandable insights and knowledge from text, there are several steps to take: Preprocessing, Feature Extraction, Processing with an Algorithm or Machine Learning Pipeline, and the Analytics Process. At the end of the process, we gain knowledge that allows us to do things such as sentiment analysis of product reviews or customer complaints, social media monitoring, email spam filtering, or text summarization.
In this post, we will only cover the fundamentals of preprocessing text data.
The amazing things we can do with text analytics do not happen like magic; they take effort. As I mentioned before, a machine does not understand text or words explicitly. This is why preprocessing becomes a critical step in transforming data so that it is well prepared for the algorithms that process it later. The process includes five stages, as follows.
1. Sentence Segmentation (End of Sentence)
Sentence Segmentation, or EoS (End of Sentence) detection, is the process of splitting paragraphs into sentences, so that we avoid feeding a whole paragraph to the algorithm directly.
The example below shows one paragraph as input and the sentences broken out of it as output.
In :
['Monique Corzilius was born and raised in Pine Beach, New Jersey, the youngest of the three children of Fred and Colette Corzilius. While she was a child, her mother began taking her to New York City on child acting auditions. Working under the stage name Monique Cozy, her first job came at age two, modeling in a print advertisement for Lipton soup.']

Out :
['Monique Corzilius was born and raised in Pine Beach, New Jersey, the youngest of the three children of Fred and Colette Corzilius.',
 'While she was a child, her mother began taking her to New York City on child acting auditions.',
 'Working under the stage name Monique Cozy, her first job came at age two, modeling in a print advertisement for Lipton soup.']
These sentences will later be chopped up in the next process: Tokenization.
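A minimal sketch of sentence segmentation in plain Python, assuming the naive rule that a sentence ends with '.', '!' or '?' followed by whitespace. Real tools, such as NLTK's sent_tokenize, handle abbreviations like "Dr." or "U.S." far more robustly than this:

```python
import re

def segment_sentences(paragraph):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [s for s in sentences if s]

text = ("While she was a child, her mother began taking her to New York City "
        "on child acting auditions. Working under the stage name Monique Cozy, "
        "her first job came at age two.")
print(segment_sentences(text))
```

This lookbehind-based split keeps the terminating punctuation attached to each sentence, which matters for later steps that inspect sentence boundaries.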
2. Tokenization / Chunking
Tokenization is the process of breaking a sequence of characters into smaller pieces called tokens. Tokens can be words, keywords, phrases, or even sentences themselves. During tokenization, some characters, such as punctuation, are discarded.
Let us use the same sample from the previous EoS step.
In :
['Monique Corzilius was born and raised in Pine Beach, New Jersey, the youngest of the three children of Fred and Colette Corzilius']

Out :
['Monique', 'Corzilius', 'was', 'born', 'and', 'raised', 'in', 'Pine', 'Beach', 'New', 'Jersey', 'the', 'youngest', 'of', 'the', 'three', 'children', 'of', 'Fred', 'and', 'Colette', 'Corzilius']
As we can see, the sentence is chopped up into tokens. Sometimes, though, the output of tokenization is not what we want. Look at the results ‘Monique’ and ‘Corzilius’, or ‘New’ and ‘Jersey’: these tokens belong together in one phrase, because they cannot stand independently. To handle this problem, we need another process called Chunking. Chunking takes part of speech into consideration in order to group several words into one phrase, or chunk. Let’s use the previous example:
In :
['Monique Corzilius was born and raised in Pine Beach, New Jersey, the youngest of the three children of Fred and Colette Corzilius']

Out :
['Monique Corzilius', 'was', 'born', 'and', 'raised', 'in', 'Pine Beach', 'New Jersey', 'the', 'youngest', 'of', 'the', 'three', 'children', 'of', 'Fred', 'and', 'Colette Corzilius']
Instead of breaking ‘Monique Corzilius’ into two tokens, we keep it as one chunk.
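To illustrate the first half of this step, here is a minimal tokenization sketch using a single regular expression; it is an assumption-laden toy, not a full tokenizer. Chunking phrases like 'Pine Beach' requires part-of-speech tagging (for example NLTK's pos_tag combined with a RegexpParser), which this sketch omits:

```python
import re

def tokenize(sentence):
    # Keep runs of letters (and apostrophes, for forms like "boy's");
    # punctuation such as commas is discarded.
    return re.findall(r"[A-Za-z']+", sentence)

sent = "Monique Corzilius was born and raised in Pine Beach, New Jersey"
print(tokenize(sent))
```

Note how the commas after 'Beach' and 'Jersey' disappear from the output, matching the behavior described above.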
3. Stop Words Removal
Words that are extremely common in a document, or even in a language as a whole, are called stop words. Stop words carry less meaning than other words, so it is usually safe to discard them. A simple way to detect whether words are stop words is as follows:
- Say you have a collection of documents. Group and aggregate the documents by their terms to get the total count of each term. This serves as the Term Frequency.
- Sort the terms in descending order of frequency; the terms at the top are candidates for stop words.
Some common words excluded from the previous step’s example:
In :
['Monique', 'Corzilius', 'was', 'born', 'and', 'raised', 'in', 'Pine', 'Beach', 'New', 'Jersey', 'the', 'youngest', 'of', 'the', 'three', 'children', 'of', 'Fred', 'and', 'Colette', 'Corzilius']

Out :
['Monique', 'Corzilius', 'born', 'raised', 'Pine', 'Beach', 'New', 'Jersey', 'the', 'youngest', 'the', 'three', 'children', 'Fred', 'Colette', 'Corzilius']
We removed the common words (‘was’, ‘and’, ‘in’, ‘of’) to keep only the tokens that carry more meaning. By removing stop words, we can significantly reduce the number of words or tokens a system has to store. Libraries such as NLTK in Python can handle this, but sometimes you need to define, or at least maintain, your own stop word list if the library does not support your language. We can also use TF-IDF (Term Frequency-Inverse Document Frequency) instead of plain TF (Term Frequency) to build a stop word list: TF-IDF evaluates the importance of each word or token within a corpus of documents.
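The two bullet steps above can be sketched in a few lines of plain Python. The stop word list here is a tiny hand-written set for illustration only; NLTK ships much fuller per-language lists:

```python
from collections import Counter

# A tiny illustrative stop word list; real lists are much larger.
STOP_WORDS = {"was", "and", "in", "of"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ['Monique', 'Corzilius', 'was', 'born', 'and', 'raised', 'in',
          'Pine', 'Beach', 'New', 'Jersey']
print(remove_stop_words(tokens))

# Term frequency over a toy "corpus": counting every term and sorting
# in descending order surfaces candidate stop words at the top.
corpus = ["the cat sat on the mat", "the dog sat"]
tf = Counter(word for doc in corpus for word in doc.split())
print(tf.most_common(3))  # 'the' ranks first with a count of 3
```

In a real system you would compute these counts over the whole document collection, not two toy strings, and inspect the head of the sorted list by hand before declaring anything a stop word.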
4. Stemming and Lemmatization
For grammatical reasons, documents use different forms of a word, such as organize, organization, organizing, or eat, eating, eaten, ate. Both stemming and lemmatization normalize these forms, as in the following examples.
am, are, is => be
eat, eating, eaten, ate => eat
boy, boys, boy's => boy
Stemming usually refers to the process of removing a word’s prefix or suffix, so that words or tokens are reduced to a base form. Lemmatization differs from stemming in that it also transforms words or tokens into a base form (the lemma), but takes part of speech into account rather than just stripping derivational affixes.
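A crude suffix-stripping stemmer can be sketched in a few lines; note this single-pass toy is only an illustration of the idea. Real stemmers, such as NLTK's PorterStemmer, apply an ordered series of context-sensitive rules, and lemmatizers such as NLTK's WordNetLemmatizer consult a dictionary plus part-of-speech information:

```python
def crude_stem(word):
    # Try longer suffixes first; refuse to strip if the remaining
    # stem would be too short to be meaningful.
    for suffix in ("ization", "izing", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["organizing", "organization", "boys", "raised"]:
    print(w, "->", crude_stem(w))
```

Irregular forms like "ate" => "eat" or "is" => "be" are exactly the cases a rule-based stemmer cannot handle; they are why lemmatization, with its dictionary lookup, exists.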
5. Word Embedding / Text Vector
Word Embedding is a modern technique for representing words as vectors. The aim of this stage is to map high-dimensional word features into low-dimensional feature vectors, so that each word becomes a point in a continuous vector space and similar words end up close together. Common models include Word2Vec and FastText.
Word embedding is not the only way to build text vectors; there are other techniques for transforming text into vectors, such as:
- Bag of Words
- Continuous Bag of Words
- One Hot Encoding
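The simplest of the techniques listed above, Bag of Words, can be sketched in plain Python. Unlike Word2Vec or FastText, which learn dense vectors, this just counts raw term occurrences over a shared vocabulary (Scikit-Learn's CountVectorizer does the same job with more options):

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared, sorted vocabulary, then represent each document
    # as a vector of raw term counts over that vocabulary.
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0] * len(vocab)
        for w, c in Counter(d.split()).items():
            vec[index[w]] = c
        vectors.append(vec)
    return vocab, vectors

docs = ["the cat sat", "the cat ate the cat"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['ate', 'cat', 'sat', 'the']
print(vectors)  # [[0, 1, 1, 1], [1, 2, 0, 2]]
```

One Hot Encoding is the special case where counts are capped at 1; Continuous Bag of Words is a different beast entirely, being the training objective behind one variant of Word2Vec.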
Preparing a good, well-prepared text dataset is a complicated art that requires choosing and applying the right algorithms. Many pre-built Python libraries, especially NLTK and Scikit-Learn, can help you a lot with text preprocessing, but sometimes you need to define your own functions to clean your text dataset. If you are interested in Text Analytics or Natural Language Processing, follow our blog!