Traditional Text Vectorization Techniques in NLP

Saurabhk · Analytics Vidhya · Oct 26, 2020

Vectorization is the mapping of vocabulary words or tokens from a dataset to a corresponding vector of real numbers. These vectors are used as input to Machine Learning (ML) models. Nowadays, more recent word embedding approaches are used to carry out most downstream NLP tasks. In this post, let us look at text vectorization approaches from the pre-word-embedding era.

Statistics-based vectorization approaches

In the pre-word-embedding era, statistical vectorization approaches, such as counting word co-occurrences and weighting the resulting matrix, were used to extract features from text for later use as input to machine learning algorithms (Turney, P. D., & Pantel, P., 2010).

1. One-hot encoding

Table 1 presents a way of representing each unique word in the vocabulary: the position of that word in the vector is set to 1 and all other positions are 0.

Example
Sent. 1: They are playing football.
Sent. 2: They are playing cricket.
Vocab.: [They, are, playing, football, cricket]

The disadvantage is that the size of the vector equals the number of unique words in the vocabulary. One-hot encoding also misses the relationships between words and does not convey information about the context.
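As a rough sketch, the one-hot vectors for the example vocabulary above can be built in a few lines of plain Python (the lowercased, sorted vocabulary order here is just an illustrative choice, not part of the method):

```python
# A minimal one-hot encoding sketch for the two example sentences above.
sentences = ["They are playing football", "They are playing cricket"]

# Build the vocabulary from the corpus (sorted order is an arbitrary choice)
vocab = sorted({word.lower() for sent in sentences for word in sent.split()})

def one_hot(word, vocab):
    # 1 at the word's position, 0 everywhere else
    return [1 if word == v else 0 for v in vocab]

for word in vocab:
    print(word, one_hot(word, vocab))
# are      [1, 0, 0, 0, 0]
# cricket  [0, 1, 0, 0, 0]
# ... each vector's length equals the vocabulary size
```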

2. Bag-of-Words (BoW)

BoW is a vectorization technique that converts text content into numerical feature vectors (P. D. Turney, 2002). The BoW model keeps the count of each word with respect to the document it occurs in; each word count acts as a feature column for the ML model. Table 2 demonstrates an example of features per document.

Example
D1: They are playing football.
D2: They are playing cricket.

The disadvantage of BoW is that it doesn't preserve word order and does not allow us to draw useful inferences for downstream NLP tasks.
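As a sketch, scikit-learn's CountVectorizer produces exactly this kind of per-document count matrix for D1 and D2 (the library choice is illustrative, and get_feature_names_out assumes a recent scikit-learn version; a plain dictionary of counts would work just as well):

```python
# A minimal BoW sketch using scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["They are playing football", "They are playing cricket"]

vectorizer = CountVectorizer()        # word-level tokens, lowercased by default
X = vectorizer.fit_transform(docs)    # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['are' 'cricket' 'football' 'playing' 'they']
print(X.toarray())
# [[1 0 1 1 1]
#  [1 1 0 1 1]]
```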

3. n-gram

An n-gram considers a sequence of n contiguous words in the text, where n is 1, 2, 3, … (e.g., 1-gram for single tokens, 2-gram for token pairs). Unlike BoW, n-grams maintain local word order.

Example: A swimmer is swimming in the swimming pool.
Unigram (1-gram): A, swimmer, is, swimming, in, the, swimming, pool, …
Bigram (2-gram): A swimmer, swimmer is, is swimming, swimming in, …
Trigram (3-gram): A swimmer is, swimmer is swimming, is swimming in, …

The disadvantage of n-grams is that they produce too many features; the feature set becomes very sparse and computationally expensive.
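The same CountVectorizer can count n-grams instead of single words through its ngram_range parameter. A bigram sketch for the sentence above (the custom token_pattern is only there to keep the one-letter word "A", which the default pattern would drop):

```python
# A bigram (2-gram) sketch with CountVectorizer's ngram_range
from sklearn.feature_extraction.text import CountVectorizer

text = ["A swimmer is swimming in the swimming pool"]

bigrams = CountVectorizer(
    ngram_range=(2, 2),            # bigrams only; (1, 2) would add unigrams too
    token_pattern=r"(?u)\b\w+\b",  # keep single-letter tokens like "a"
)
X = bigrams.fit_transform(text)

print(bigrams.get_feature_names_out())
# ['a swimmer' 'in the' 'is swimming' 'swimmer is' 'swimming in'
#  'swimming pool' 'the swimming']
```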

4. Term frequency-inverse document frequency (TF-IDF)

TF-IDF gives more weight to rarely occurring terms and less weight to common, expected ones. It penalizes words that appear frequently across documents, like "the" and "is", and assigns greater weight to less frequent or rare words.

Formula
TF(t) = frequency of token t in document d / count of all words in document d
IDF(t) = log(total number of documents / number of documents containing token t)

The product TF × IDF of a word indicates how often the token t is found in a document and how unique the token is across the entire corpus of documents.
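As a sketch, the two formulas can be computed directly in Python for the BoW documents D1 and D2 (note that scikit-learn's TfidfVectorizer uses a smoothed variant of IDF, so its numbers would differ slightly from this plain definition):

```python
# A hand-rolled TF-IDF sketch that follows the formulas above
import math

docs = [
    "They are playing football".lower().split(),
    "They are playing cricket".lower().split(),
]

def tf(token, doc):
    # frequency of token t in document d / count of all words in document d
    return doc.count(token) / len(doc)

def idf(token, docs):
    # log(total number of documents / number of documents containing token t)
    containing = sum(1 for d in docs if token in d)
    return math.log(len(docs) / containing)

def tf_idf(token, doc, docs):
    return tf(token, doc) * idf(token, docs)

print(tf_idf("football", docs[0], docs))  # ~0.173: appears in only one document
print(tf_idf("they", docs[0], docs))      # 0.0: appears in every document, so IDF = 0
```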

5. Pointwise mutual information (PMI)

PMI is usually used to identify word-pair patterns in text (Turney, P. D., & Pantel, P., 2010).
Formula: PMI(word1, word2) = log( P(word1, word2) / (P(word1) × P(word2)) )

Example: Suppose word1 ("car") and word2 ("drive") each have fairly low individual probabilities in a corpus but tend to occur together; their co-occurrence probability is then high relative to the product of their individual probabilities, so the pair gets a high PMI score. Conversely, a pair of words whose individual probabilities are considerably higher than their probability of co-occurrence would suggest, like word1 ("that") and word2 ("is"), gets a small PMI score.
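A small sketch of the idea on a made-up corpus, estimating probabilities from sentence-level co-occurrence (both the toy sentences and the sentence-sized co-occurrence window are assumptions made purely for illustration):

```python
# A toy PMI sketch: probabilities are estimated from how many sentences
# contain a word (or a word pair); the corpus is invented for illustration.
import math

sentences = [
    "i drive my car to work",
    "we will drive the car home",
    "that plan is good",
    "is the garden big",
    "that house looks small",
]
docs = [set(s.split()) for s in sentences]
n = len(docs)

def p(*words):
    # fraction of sentences that contain all the given words
    return sum(1 for d in docs if all(w in d for w in words)) / n

def pmi(w1, w2):
    # log( P(w1, w2) / (P(w1) * P(w2)) )
    return math.log(p(w1, w2) / (p(w1) * p(w2)))

print(round(pmi("car", "drive"), 2))  # 0.92: the pair co-occurs more than chance predicts
print(round(pmi("that", "is"), 2))    # 0.22: both words are common but rarely co-occur here
```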

All of the approaches we saw here suffer from the vector sparsity issue; as a result, they do not capture complex word relations and are not able to model long sequences of text.
In the next post we'll look at newer text vectorization techniques.
