3 Types of Text Vectorization

kali prasad deverasetti
Nov 4 · 7 min read

Computers understand only numerical data, so text has to be converted into vectors before it can be fed into machine learning or deep learning models.

Converting text data into vectors is called vectorization or feature extraction. In this article, we cover three different vectorization techniques that can be applied to text data.

Let's say we have 4 reviews (train data) and 2 reviews (validation data) in lists that we want to train a machine learning or deep learning model on.

reviews = ['joker is an awesome movie',
           'joker is a must watch movie',
           'Samantha Calls Joaquin Phoenix Joker Greatest Film She Ever Watched',
           'Joaquin Phoenix gave his career best performance in joker']

The validation data is:

validation_reviews = ['I am big fan of joker movie',
                      'I will watch the joker movie again for Joaquin Phoenix']

1. Word Frequency Indexing using Sklearn CountVectorizer:

So the first vectorization technique that we are going to talk about is word frequency indexing.

To get the word frequency index vectorization, we perform the following steps:

  1. First, build a vocabulary of all the words present in the text corpus.
  2. Count how many times each word occurs in the whole corpus.
  3. Build a dictionary with each word as a key and its number of occurrences as the value.
  4. Sort the words from high to low by number of occurrences.
  5. After sorting, assign each word an index from 1 to the number of words in the corpus.
  6. The word that occurs the most gets index 1, followed by the second most frequent word, and so on.
  7. With an index for every word in the corpus, encode each word in the train reviews by its frequency index, thereby converting each review into a vector.
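As a quick sanity check, these steps can be sketched in plain Python with `collections.Counter` — a minimal illustration only, not the vectorizer we use below (note that, unlike CountVectorizer, this sketch keeps single-character tokens and does no token filtering):

```python
from collections import Counter

reviews = ['joker is an awesome movie',
           'joker is a must watch movie',
           'Samantha Calls Joaquin Phoenix Joker Greatest Film She Ever Watched',
           'Joaquin Phoenix gave his career best performance in joker']

# Steps 1-3: vocabulary with per-word occurrence counts over the whole corpus.
counts = Counter(word for review in reviews for word in review.lower().split())

# Steps 4-6: sort by frequency (high to low) and assign indexes starting at 1.
freq_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Step 7: encode each review as a list of frequency indexes.
encoded = [[freq_index[word] for word in review.lower().split()]
           for review in reviews]

print(freq_index['joker'])  # 1 -- 'joker' is the most frequent word (4 times)
```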

The above steps can be done easily with Sklearn CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Building the vocabulary of train reviews.
count_vect = CountVectorizer()
count_vect.fit(reviews)
count_vect.vocabulary_

Following is the count_vect vocabulary:

{'an': 0, 'awesome': 1, 'best': 2, 'calls': 3, 'career': 4, 'ever': 5, 'film': 6, 'gave': 7, 'greatest': 8, 'his': 9, 'in': 10, 'is': 11, 'joaquin': 12, 'joker': 13, 'movie': 14, 'must': 15, 'performance': 16, 'phoenix': 17, 'samantha': 18, 'she': 19, 'watch': 20, 'watched': 21}

As we can see, the vocabulary contains each word with its respective index. This indexing is done alphabetically by Sklearn CountVectorizer.

We call count_vect.transform to convert each review into a numerical vector based on the number of times each word occurs in that particular review. By default we get the output as a sparse matrix; we can convert it to a NumPy array to see the vectors.

train_vectors = count_vect.transform(reviews)
train_vectors_array = train_vectors.toarray()
print(train_vectors_array)
[[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1],
 [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]]

In the above array, each row corresponds to a review and each column corresponds to a word in the corpus, based on the index we got from count_vect.vocabulary_.

We can also get the feature (column) names using count_vect.get_feature_names().

The word 'joker' is present in every review; from the vocabulary we can see its index is 13.

In the train vectors array, the column at index 13 contains all 1's, indicating the word 'joker' appears once in each of the 4 reviews.

Adding the numbers column-wise gives the number of times each word occurs in the corpus, so we can get the word frequencies for the entire corpus by summing the columns. Let's implement this in code.

import numpy as np
import pandas as pd

word_frequencies = train_vectors.sum(axis=0)
word_count_list = [(word, count) for word, count in zip(count_vect.get_feature_names(), np.array(word_frequencies)[0])]
word_freq_df = pd.DataFrame(sorted(word_count_list, key=lambda x: x[1], reverse=True), columns=['word', 'frequency'])
# giving frequency indexing
word_freq_df['freq_index'] = np.array(word_freq_df.index) + 1
print(word_freq_df.head())
      word  frequency  freq_index
0    joker          4           1
1       is          2           2
2  joaquin          2           3
3    movie          2           4
..     ...        ...         ...

Next, we encode each word in a review with its corresponding frequency index:

vocab_dict = {}
for row in word_freq_df.iterrows():
    vocab_dict[row[1]['word']] = [row[1]['frequency'], row[1]['freq_index']]

train_reviews_list = []
for review in reviews:
    review_list = []
    for word in review.lower().split():
        try:
            review_list.append(vocab_dict[word][1])
        except KeyError:
            pass
    train_reviews_list.append(review_list)
train_reviews_list
[[1, 2, 6, 7, 4],
 [1, 2, 17, 21, 4],
 [19, 9, 3, 5, 1, 14, 12, 20, 11, 22],
 [3, 5, 13, 15, 10, 8, 18, 16, 1]]

As we can see we got the frequency index for each word in the review.

The good thing about CountVectorizer is that when we pass a new review containing words outside the trained vocabulary, it ignores those words and builds the vectors with the same tokens used in the training set.

Let's transform the count vectorizer on the validation set and see the review vectors:

valid_vectors = count_vect.transform(validation_reviews)
valid_vectors_array = valid_vectors.toarray()
valid_vectors_array
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0]]

As you can see, the train vectors array and the valid vectors array have the same number of features.

Both validation reviews contain the words 'joker' and 'movie', which is why the columns at indexes 13 and 14 are 1's.
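The same dictionary lookup can also encode the validation reviews into frequency indexes; out-of-vocabulary words are simply dropped. A self-contained sketch, with the index values copied from the word_freq_df output above (only the entries the validation reviews need are reproduced):

```python
# Frequency indexes learned from the train reviews (copied from word_freq_df).
freq_index = {'joker': 1, 'is': 2, 'joaquin': 3, 'movie': 4,
              'phoenix': 5, 'watch': 21}

validation_reviews = ['I am big fan of joker movie',
                      'I will watch the joker movie again for Joaquin Phoenix']

# Encode each review, skipping words absent from the train vocabulary.
encoded = [[freq_index[w] for w in review.lower().split() if w in freq_index]
           for review in validation_reviews]
print(encoded)  # [[1, 4], [21, 1, 4, 3, 5]]
```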

The full code for word frequency indexing is:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

def seq_freq_index_encoder(reviews, count_vect=False):
    '''
    count_vect = False, if you are encoding the training data
    count_vect = vectorizer of the train data if encoding cv or test data
    '''
    returnable = 2
    if count_vect == False:
        returnable = 3
        count_vect = CountVectorizer()
        count_vect.fit(reviews)
    vectorizer = count_vect
    count_vect_xtrain = count_vect.transform(reviews)
    word_frequencies = count_vect_xtrain.sum(axis=0)
    word_count_list = [(word, count) for word, count in zip(count_vect.get_feature_names(), np.array(word_frequencies)[0])]
    word_freq_df = pd.DataFrame(sorted(word_count_list, key=lambda x: x[1], reverse=True), columns=['word', 'frequency'])
    word_freq_df['freq_index'] = np.array(word_freq_df.index) + 1
    print(word_freq_df.head())

    ax = sns.barplot(data=word_freq_df[:20], y='word', x='frequency')
    ax.set_title("top 20 words")
    plt.tight_layout()
    plt.show()

    # creating the vocab_dict from the top 5000 words
    vocab_dict = {}
    for row in word_freq_df[:5000].iterrows():
        vocab_dict[row[1]['word']] = [row[1]['frequency'], row[1]['freq_index']]

    train_reviews_list = []
    for review in reviews:
        review_list = []
        for word in review.lower().split():
            try:
                review_list.append(vocab_dict[word][1])
            except KeyError:
                pass
        train_reviews_list.append(np.array(review_list))

    if returnable == 3:
        return train_reviews_list, word_freq_df, vectorizer
    else:
        return train_reviews_list, word_freq_df

Similarly, we can encode the test data, if we have it, by passing the trained vectorizer as count_vect.

2. Keras Tokenizer text to matrix converter:

from keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(reviews[:2])
tok.texts_to_matrix(reviews, mode='count')
# mode can be one of "binary", "count", "tfidf", "freq" (default: "binary")

By using the above code we can convert text data into vectors.

We can get the word indexes of the vocabulary using:

tok.word_index
{'a': 6, 'an': 4, 'awesome': 5, 'is': 2, 'joker': 1, 'movie': 3, 'must': 7, 'watch': 8}

If we have eight words in the vocabulary, we get 9 features in the final matrix: Keras starts word indexes from 1 and reserves index 0, so the first column is always all zeros. The final matrix for the 4 reviews will therefore have shape (4, 9).
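To make that extra column concrete, here is a pure-Python sketch that mimics what texts_to_matrix(..., mode='count') produces — a hypothetical helper for illustration, not the Keras implementation — using the word_index shown above:

```python
# word_index as shown above; Keras starts word indexes at 1 and reserves 0.
word_index = {'joker': 1, 'is': 2, 'movie': 3, 'an': 4,
              'awesome': 5, 'a': 6, 'must': 7, 'watch': 8}

def texts_to_count_matrix(texts, word_index):
    """Mimic Tokenizer.texts_to_matrix(texts, mode='count')."""
    n_cols = len(word_index) + 1          # column 0 is reserved and stays zero
    matrix = []
    for text in texts:
        row = [0] * n_cols
        for word in text.lower().split():
            if word in word_index:        # unknown words are ignored
                row[word_index[word]] += 1
        matrix.append(row)
    return matrix

m = texts_to_count_matrix(['joker is an awesome movie',
                           'joker is a must watch movie'], word_index)
print(len(m), len(m[0]))  # 2 9 -- 8 vocabulary words + the reserved column 0
```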

3. Converting text data into vectors using keras tokenizer text to sequence:

tok = Tokenizer()
tok.fit_on_texts(reviews[:2])
tok.texts_to_sequences(reviews[:2])
[[1, 2, 4, 5, 3], [1, 2, 6, 7, 8, 3]]

The difference between texts_to_matrix and texts_to_sequences is:

Both encode using the word index, which we can easily get from tok.word_index.

The main difference is that tok.texts_to_matrix has one feature per word in the vocabulary, so all reviews have the same number of features, fixed by the training vocabulary.

But in tok.texts_to_sequences, the vectors do not have a fixed number of features: each word is simply encoded by its corresponding word index, so the length varies with the review.
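The contrast is easy to see in a small plain-Python sketch that mimics the two calls with the word_index from section 2 (an illustration only, not the Keras code): sequences vary in length per review, while matrix rows are fixed by the vocabulary size.

```python
word_index = {'joker': 1, 'is': 2, 'movie': 3, 'an': 4,
              'awesome': 5, 'a': 6, 'must': 7, 'watch': 8}
texts = ['joker is an awesome movie', 'joker is a must watch movie']

# texts_to_sequences analogue: one index per word, variable length per text.
sequences = [[word_index[w] for w in t.lower().split() if w in word_index]
             for t in texts]
print(sequences)            # [[1, 2, 4, 5, 3], [1, 2, 6, 7, 8, 3]]

# texts_to_matrix analogue (binary mode): one fixed-width row per text.
matrix = [[1 if i in seq else 0 for i in range(len(word_index) + 1)]
          for seq in sequences]
print([len(row) for row in matrix])  # [9, 9] -- width fixed by the vocabulary
```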

The problem with the custom word frequency indexing vectorizer and Keras Tokenizer texts_to_sequences vectorization is:

Both vectorization methods output a different number of features for each review, based on the number of known tokens present in the review.

This problem can be solved by using padding.

In padding, zeros are added to each review vector so that all reviews have the same number of features, which both deep learning and machine learning models require.

from keras.preprocessing import sequence
max_review_length = 20
X_train = sequence.pad_sequences(train_encoded_reviews, maxlen=max_review_length)
X_cv = sequence.pad_sequences(validation_encoded_reviews, maxlen=max_review_length)

By using the above code, both X_train and X_cv will have the same 20 features.
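Under its defaults, pad_sequences prepends the zeros (padding='pre') and also truncates longer sequences from the front; a minimal pure-Python equivalent of that behavior (a sketch, not the Keras implementation):

```python
def pad_pre(seqs, maxlen):
    """Mimic keras pad_sequences defaults: zeros are added at the front,
    and sequences longer than maxlen are truncated from the front."""
    return [([0] * max(maxlen - len(s), 0) + list(s))[-maxlen:] for s in seqs]

padded = pad_pre([[1, 2, 6, 7, 4], [1, 4]], maxlen=7)
print(padded)  # [[0, 0, 1, 2, 6, 7, 4], [0, 0, 0, 0, 0, 1, 4]]
```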

I have applied all 3 vectorization methods while building an LSTM model on the Amazon Fine Food Reviews dataset for sentiment classification.

https://kali-ai.com/
