Text/Word Vectorizing Techniques

Nageeta Kumari
4 min read · Feb 2, 2024


First, let’s understand why we want to learn text vectorizing techniques. What is the need?

When we have huge text data and we want to perform some NLP task on it, of course we cannot feed raw text to models.

Think of ML models in general: whenever we want to train a model on a dataset, we don’t pass all the data to the model directly. First we preprocess it to remove noise from the data, and then, based on our understanding of that particular data, we select a model for it.

The same goes for NLP: we need to preprocess textual data so we can identify the meaningful information in it. Text vectorization is one such technique, where you represent your data as a numeric matrix.

Dozens of different techniques have appeared since the birth of natural language processing to preprocess text data so that it can be given to a model to process, or, more specifically when we talk about NLP, to understand.

There are two main concepts: vectorizing and embeddings.

Vectorizing is a process of converting words or text into numerical vectors.

Embeddings are dense, continuous vector representations of words or text that capture semantic relationships.

In this article I will only talk about vectorizing, so let’s have a look at vectorization techniques.

First you have to tokenize your text/document. I assume you are familiar with tokenization; if not, read up on it first!

Count Vectorizer

It’s a simple preprocessing technique that converts your text documents into a matrix of token counts.

Let’s see how it works.

Step 1: Tokenize

It breaks the text down into tokens. For example, if your document contains the text “I eat mango and mango is sweet”, the tokens based on spaces would be [I, eat, mango, and, mango, is, sweet].
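In Python, whitespace tokenization is just a split:

```python
text = "I eat mango and mango is sweet"
tokens = text.split()  # split on whitespace
print(tokens)  # ['I', 'eat', 'mango', 'and', 'mango', 'is', 'sweet']
```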

Step 2: Counting

After tokenizing, it counts how many times each token occurs. Keeping the above example, the result would be { I: 1, eat: 1, mango: 2, and: 1, is: 1, sweet: 1 }.
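A Counter from Python’s standard library does exactly this:

```python
from collections import Counter

tokens = ["I", "eat", "mango", "and", "mango", "is", "sweet"]
counts = Counter(tokens)  # maps each token to its number of occurrences
print(counts)
# Counter({'mango': 2, 'I': 1, 'eat': 1, 'and': 1, 'is': 1, 'sweet': 1})
```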

Step 3: Matrix Representation

To summarize, it presents the above counts as a matrix, where each row corresponds to a document, each column is a unique word in the vocabulary, and each cell contains the count of that word in the corresponding document.
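Here is a minimal sketch using scikit-learn’s CountVectorizer (assuming scikit-learn is installed). Note that its default tokenizer lowercases the text and drops single-character tokens, so “I” disappears from the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I eat mango and mango is sweet"]

vectorizer = CountVectorizer()           # default: lowercase, keep tokens of 2+ chars
matrix = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['and' 'eat' 'is' 'mango' 'sweet']
print(matrix.toarray())
# [[1 1 1 2 1]]
```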

Now let’s see what information this CountVectorizer can give us, and what it cannot.

Having seen how it works, it’s clear that it’s a frequency counter, so one can use it to analyze the most and least frequent words in a document. It’s also simple and cheap to compute.

But apart from that, we can’t get any other information: it doesn’t consider the order or context of words, it just deals with their frequency. And if our documents have a large vocabulary, it produces a high-dimensional space that is hard to interpret.

Now let’s see another technique.

TF-IDF Vectorizer

Suppose you have several documents belonging to a corpus, and you want to identify the rare words that appear in only a few documents of that corpus, because common words don’t contribute much to your data. Imagine a document of 1,000 words that contains the word “the” 100 times. Does “the” provide any interesting information? Of course not! But if the word “cancer” appears in the document only once, it will definitely provide useful information.

But with a count vectorizer we cannot assign weights to words. Here comes the concept of TF-IDF (Term Frequency-Inverse Document Frequency), a technique similar to the count vectorizer, but it introduces a weighting scheme that judges the importance of words not only by their frequency within a document but also by their rarity across all documents.

Let’s see how it works!

Step 1: Calculate TF-IDF score

Calculate the term frequency (TF), which measures how often a term occurs in a document, and the inverse document frequency (IDF), which measures the rarity of the term across all documents in the corpus. Multiply the two to get the TF-IDF score.
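In the common textbook formulation (libraries differ in the exact details; scikit-learn, for instance, smooths the IDF), TF(t, d) is the count of term t in document d divided by the total number of terms in d, and IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Here is a minimal sketch on a toy corpus:

```python
import math

# Toy corpus: each document is a list of tokens
docs = [["i", "eat", "mango"],
        ["mango", "is", "sweet"],
        ["i", "like", "apple"]]

N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / number of documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tfidf("mango", docs[0]), 2))  # 0.14 -- "mango" appears in 2 of 3 docs
print(round(tfidf("sweet", docs[1]), 2))  # 0.37 -- "sweet" appears in only 1 doc
```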

Step 2: Matrix Representation

The final result is a matrix where each row corresponds to a document, each column corresponds to a unique term, and each cell contains the TF-IDF score of that term in the corresponding document.
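As a minimal sketch with scikit-learn’s TfidfVectorizer (assuming scikit-learn is installed; it uses a smoothed IDF and normalizes each row, so its numbers differ slightly from the plain formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["mango is sweet",
        "apple is sweet",
        "cancer is rare"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # rows: documents, columns: terms

print(vectorizer.get_feature_names_out())
# ['apple' 'cancer' 'is' 'mango' 'rare' 'sweet']
# "is" appears in every document, so it gets the lowest weight in each row
print(tfidf_matrix.toarray().round(2))
```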

One more thing to mention, the IDF component of TF-IDF emphasizes terms that are rare across the entire corpus. If a term is common and appears in many documents, its IDF value will be lower, resulting in a lower overall TF-IDF score. On the other hand, if a term is rare and appears in only a few documents, its IDF value will be higher, giving it a higher TF-IDF score.
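To make that concrete with the plain log formulation from the sketch above: in a corpus of 100 documents, a term that appears in all 100 gets IDF = log(100/100) = 0, which zeroes out its TF-IDF score entirely, while a term that appears in only 2 documents gets IDF = log(100/2) ≈ 3.9 (natural log), boosting its score.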

So with the TF-IDF vectorizer we can see not only the frequency of a term but also its importance.

Now let’s have a look at another technique!

One Hot Encoding

This is one of the most widely used techniques for converting categorical data to vectors. You can use it whenever you have a feature with categorical values.

So the question is: how does it work? It’s easy and simple!

Determine the distinct categories within your feature. If there are three unique values, generate a column for each. For each instance, put a one in the column that matches the instance’s value and zeros in all the other columns.
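Here is a minimal sketch with scikit-learn’s OneHotEncoder on a hypothetical color feature:

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three unique values (hypothetical example)
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # dense matrix for readability

print(encoder.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```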

The issue with this technique is that if you have many categorical values, it blows up the dimensionality of your data, which is not efficient.

And most importantly, it does not preserve the context or semantic meaning of text, but it is very useful for ML models when you have to preprocess categorical data.
