Text Data: How to make it ML ready?

Gayathri Rajan
Let’s Deploy Data.
4 min read · Jul 20, 2020
Photo by Markus Spiske on Unsplash

There are different types of data we deal with in Machine Learning. The common types are numerical data, time-series data, categorical data, text data, and image data. Whatever the type of data we have, it must ultimately be converted into numerical form, because that is the only form a Machine Learning algorithm can process. Even text data and image data end up as numbers.

Text data needs to be pre-processed before it is fed into a Machine Learning algorithm. Let us look at the steps involved in pre-processing text data.

Steps involved in converting Text Data

1) Normalization

Sometimes in a dataset, we might come across words that hold the same meaning but are written differently. For example, a word can have different spellings, such as colour and color. Normalization is done to remove these variations.

Text normalization is the process by which the text is converted into a standard (canonical) form.

Some examples of normalization include lemmatization, stemming, removal of stop words, and tokenization. Let’s check out what each means.

(a) Lemmatization

Lemma means the dictionary form of a word. Lemmatization is the process of reducing the inflected forms of a word to its lemma.

Lemmatization

Inflected form → Lemma
troubled → trouble
troubles → trouble
troubling → trouble

From the table, we can see how the inflections of the word trouble are all lemmatized to trouble. It is done so that the same word doesn’t get stored multiple times.
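The idea can be sketched in a few lines of Python. In practice you would use a library such as NLTK’s WordNetLemmatizer or spaCy; the hand-made lookup table below only covers the “trouble” example and is there to show the mapping, not to be a real lemmatizer.

```python
# Illustrative lemmatizer backed by a hand-made lookup table.
# Real projects use NLTK's WordNetLemmatizer or spaCy instead.
LEMMAS = {
    "troubled": "trouble",
    "troubles": "trouble",
    "troubling": "trouble",
}

def lemmatize(word):
    """Return the dictionary form (lemma) of a word, or the word itself."""
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["troubled", "troubles", "troubling"]])
# → ['trouble', 'trouble', 'trouble']
```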

(b) Stemming

Stemming is another text normalization method that also reduces the inflected forms of a word. Unlike lemmatization, which maps a word to its dictionary form, stemming chops off suffixes using simple rules, so the result may not be a valid word (for example, troubling may become troubl).
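A minimal rule-based stemmer makes the difference from lemmatization visible. This sketch strips a few common English suffixes; real code would use NLTK’s PorterStemmer or SnowballStemmer, which apply many more rules.

```python
# Minimal rule-based stemmer: strips a few common English suffixes.
# Note the output "troubl" is not a dictionary word, which is the key
# difference between stemming and lemmatization.
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["troubling", "troubled", "troubles"]])
# → ['troubl', 'troubl', 'troubl']
```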

c) Stop Words

Stop words are words that occur with high frequency in a dataset, for example, is, the, which, etc. They carry little information, so they are usually removed during pre-processing. Removing stop words will not affect the overall meaning of a sentence or text.
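Stop-word removal is a simple filter over the tokens. The small stop list below is hand-written for the demo; libraries such as NLTK and spaCy ship much larger curated lists.

```python
# Removing stop words with a small hand-written stop list.
# NLTK and spaCy provide much larger lists for real use.
STOP_WORDS = {"is", "the", "which", "a", "an", "of", "in", "my"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["My", "name", "is", "Rose"]))
# → ['name', 'Rose']
```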

(d) Tokenization

Tokenization is the process of splitting the string into tokens or words.

For example, if the string is “My name is Rose”, then the tokenized form would look like this: [“My”, “name”, “is”, “Rose”]
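A basic tokenizer can be written with a regular expression. This sketch only splits on non-word characters; library tokenizers (NLTK’s word_tokenize, spaCy) handle contractions, punctuation, and many edge cases this one ignores.

```python
import re

# Simple regex tokenizer: pulls out runs of letters, digits,
# and apostrophes, discarding whitespace and punctuation.
def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("My name is Rose"))
# → ['My', 'name', 'is', 'Rose']
```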

2) VECTORIZATION

After normalization is completed, the actual encoding of the data begins. We look into the text, extract what is needed, and convert it into numerical form so that it is accessible to the machine learning algorithm.

Text vectorization is the process of converting text data into vectors. Vectors are nothing but an array of numbers. There are many ways to vectorize a text. We will be discussing two of them.

(a) TF-IDF (Term Frequency- Inverse Document Frequency) Vectorization

(b) Word Embedding, which is done with Word2Vec or GloVe (Global Vectors)

Let’s look into them in detail.

TF-IDF (Term frequency- Inverse Document Frequency) Vectorization

In TF-IDF, a weight is given to each word. The weight is the product of two factors: term frequency (how often the word appears in a document) and inverse document frequency (how rare the word is across all documents). More weight is given to words that are important and distinctive. Less weight is given to common words that carry little information, such as “the” and “is”.

Let’s consider an example. The process has three steps:

1) Normalization of the text data

2) Assigning a TF-IDF weight to each word of the normalized data

3) Arranging the weights in tabular form, with one row per document and one column per word

So in TF-IDF, each word gets a weight according to its usage and importance, and each document becomes a vector of those weights. The result is arranged in a tabular form for further analysis.
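The steps above can be sketched with a simplified TF-IDF implementation. This version assumes tf(t, d) = count of t in d divided by the document length, and idf(t) = log(N / number of documents containing t); scikit-learn’s TfidfVectorizer uses a smoothed variant of the same idea.

```python
import math
from collections import Counter

# Simplified TF-IDF:
#   tf(t, d) = count of t in d / number of words in d
#   idf(t)   = log(N / number of documents containing t)
def tfidf(docs):
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

w = tfidf([["rose", "is", "red"], ["sky", "is", "blue"]])
# "is" appears in every document, so idf = log(2/2) = 0 and its weight is 0,
# while "rose" appears in only one document and gets a positive weight.
print(w[0]["is"], w[0]["rose"])
```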

WORD EMBEDDING

It is another method of vectorization in which words having similar meanings are given similar vector representations. Word2Vec and GloVe are two popular methods of word embedding.
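The core idea, that similar words get similar vectors, can be illustrated with cosine similarity. The 3-dimensional vectors below are invented for the demo; real embeddings learned by Word2Vec or GloVe typically have 50 to 300 dimensions and come from training on large corpora (for example with the gensim library).

```python
import math

# Toy, hand-made "embeddings": words with related meanings point in
# similar directions, so their cosine similarity is close to 1.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```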
