Understanding the role of vectors in natural language processing

Bagavathy Priya
Published in Analytics Vidhya
5 min read · Oct 9, 2020

Natural Language Processing:

Natural language processing (NLP) is a subfield of artificial intelligence concerned with making machines process, understand, and analyze human language. It has an enormous range of applications, such as machine translation, speech-to-text, text-to-speech, spam message classification, etc.

In this blog, we are going to see the role of vectors in natural language processing.

Before diving into NLP, let’s see what a vector is in terms of linear algebra.

Vector:

A vector is a list of attributes of an object. In simple terms, it is a list of numbers. It is a way of identifying a point in space (which may be two- or higher-dimensional).

Vector (a1, a2) in 2-D coordinates

Matrix:

A matrix is an object that contains a set of vectors.

If I have two vectors given by the coefficients of
2x + 3y
and
4x + 5y,
the corresponding matrix will be
[ 2 3
  4 5 ]
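As a quick sketch in Python (using NumPy, which is the usual choice for this kind of thing, though the text above doesn’t require it):

```python
import numpy as np

# Coefficient vectors of 2x + 3y and 4x + 5y
v1 = np.array([2, 3])
v2 = np.array([4, 5])

# Stacking the vectors row-wise gives the matrix
m = np.vstack([v1, v2])
print(m)
# [[2 3]
#  [4 5]]
```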

Now we will see why the concept of vector spaces is used in natural language processing.

Why vectors in NLP?

We cannot feed raw words into a machine learning or deep learning algorithm; the text data first has to be converted into numerical data. This is where the vector concept comes into play: for this purpose, text data is converted into numbers in the form of vectors. There are many preprocessing techniques in NLP for getting text data into a form the algorithms can consume.

Feature representation:

In machine learning, before we feed the data into an algorithm, we preprocess it in many ways, such as imputation, normalization, standardization, and feature engineering (we will see these in detail later). Likewise, in NLP we have to preprocess the data by extracting the important words and removing words of less importance from the sentences (words like and, of, are, is, etc.).

After all this processing, we have a collection of extracted words called a corpus. From the corpus, we construct an embedding matrix, which looks like the image below.

Embedding matrix

In the picture above, the values form a matrix and each column is a vector. Each column represents a word in the corpus (a word vector), and the rows represent the features. The number below each word (e.g., Man, 5391) is the index position of that word in the vocabulary.

If we look closely at the matrix: in the first row, gender, man and woman (and likewise king and queen) have opposite signs, indicating their relationship and gender, while no such value similarity occurs between apple and orange. In the next row, royalty, king and queen have higher values than man and woman. In the last row, food, apple and orange have higher values than the rest of the words. In this way, the embedding matrix converts text data into numerical form without losing the actual significance of, and relationships between, the words.
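A toy version of such a matrix can be written out by hand; the words and feature values below are illustrative, loosely following the image, not learned by any model:

```python
import numpy as np

words = ["man", "woman", "king", "queen", "apple", "orange"]
# Rows are features: gender, royalty, food (made-up illustrative values)
embedding = np.array([
    [-1.00, 1.00, -0.95, 0.97, 0.00, 0.01],  # gender: opposite signs for man/woman
    [ 0.01, 0.02,  0.93, 0.95, 0.00, 0.00],  # royalty: high for king/queen
    [ 0.04, 0.01,  0.02, 0.01, 0.95, 0.97],  # food: high for apple/orange
])

# Each column is that word's vector
king = embedding[:, words.index("king")]
print(king)  # gender, royalty, food values for "king"
```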

2-D representation

If we plot these vectors in 2-dimensional space, we get a better idea of how the vectors in the embedding matrix are located relative to one another.

It will be like,

2D representation of embedding matrix

In the diagram, we can see that king and queen are nearby, and at a certain distance there are man and woman. Apple, orange, and grape are grouped together, and similarly the numbers and the animals each form well-separated groups in the plot.
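This “nearness” is often measured with cosine similarity; the 2-D vectors below are made up purely for illustration:

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, -1 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative 2-D word vectors, roughly matching the idea of the plot
king  = np.array([0.90, 0.80])
queen = np.array([0.85, 0.90])
apple = np.array([0.10, -0.70])

print(cosine(king, queen))  # close to 1: king and queen point the same way
print(cosine(king, apple))  # much lower: apple lies in a different direction
```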

In this way, we preprocess the text data in a document or paragraph using vectors, in order to feed it into various algorithms for purposes like spam classification, tweet sentiment analysis, etc.

Application:

Let’s see a practical application of NLP in order to understand the implementation in Python.

We will look into the example of fake news classification using a text dataset.

Once we have loaded the CSV data into a pandas DataFrame, we proceed with the text preprocessing. It is done like this:

from nltk.corpus import stopwords
import re

corpus = []
for i in range(0, len(message)):
    # keep only letters, then lowercase and tokenize the title
    review = re.sub('[^a-zA-Z]', ' ', message['title'][i])
    review = review.lower()
    review = review.split()
    # drop English stopwords
    review = [word for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

NLTK is a Python library used in NLP to preprocess text data, and re is Python’s regular-expression library.

In the above code, we first keep only the alphabetic text using a regular expression inside the for loop:

review=re.sub('[^a-zA-Z]',' ',message['title'][i])

It keeps only characters in a–z and A–Z, replacing everything else with a space, where message is the DataFrame containing the data.
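For example, applying the same substitution to a hypothetical headline (the string below is made up for illustration):

```python
import re

title = "Trump's 2020 rally: 5 things to know!"  # hypothetical headline
review = re.sub('[^a-zA-Z]', ' ', title)
# Apostrophes, digits, and punctuation all become spaces
print(review.split())  # ['Trump', 's', 'rally', 'things', 'to', 'know']
```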

Next, we use the lower function to lowercase all the text:

review=review.lower()
review=review.split()

This way the algorithm will not treat lowercase and uppercase versions of a word differently (GOOD = good). Then we split each sentence into its words.
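On a hypothetical string, these two steps look like:

```python
review = "GOOD News Today"   # made-up example text
review = review.lower()      # 'good news today'
tokens = review.split()      # break the sentence into individual words
print(tokens)                # ['good', 'news', 'today']
```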

Then we remove the words of less importance, which are called stopwords, using the code below:

review = [word for word in review if word not in stopwords.words('english')]

Some of the stopwords are of, is, and, are, was, and were, which are considered words that add no significance to a sentence. Then we append each cleaned title to a list to form the corpus.
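As a tiny self-contained illustration of this filtering (using a small hand-picked stopword set here rather than NLTK’s much longer English list):

```python
# Hand-picked stopword set for illustration; NLTK's list is far larger
STOPWORDS = {'of', 'is', 'and', 'are', 'was', 'were', 'the', 'in'}

words = ['the', 'president', 'is', 'in', 'the', 'white', 'house']
filtered = [w for w in words if w not in STOPWORDS]
print(filtered)  # ['president', 'white', 'house']
```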

Next, we construct the embedding matrix using the code below:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000, ngram_range=(1, 3))
x = cv.fit_transform(corpus).toarray()

where x is the embedding matrix

x.shape
# OUTPUT: (18285, 5000)

where 18285 is the number of documents in the corpus and 5000 is the number of features, which we set with max_features in the code above.

We can also see what those features are. We do not need to specify them ourselves; they are built by CountVectorizer from the corpus.

cv.get_feature_names()[50:75]
# OUTPUT:
['administration',
'admiral',
'admit',
'admits',
'admitted',
'ads',
'adults',
'advance',
'advice',
'adviser',
'advisor',
'advocates',
'affair',
'affairs',
'affordable',
'afghan',
'afghanistan',
'africa',
'african',
'african american',
'ag',
'age',
'agencies',
'agency',
'agenda']

Then we feed the matrix x and the labels y into the Naive Bayes algorithm for classification.
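A minimal sketch of that final step, assuming MultinomialNB from scikit-learn and a tiny made-up corpus and labels in place of the real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny synthetic corpus standing in for the real fake-news data
corpus = ["breaking news president wins",
          "aliens land in city",
          "stocks rise after report",
          "miracle cure found doctors shocked"]
y = [0, 1, 0, 1]  # hypothetical labels: 0 = real, 1 = fake

# Same vectorization as in the post
cv = CountVectorizer(max_features=5000, ngram_range=(1, 3))
x = cv.fit_transform(corpus).toarray()

# Multinomial Naive Bayes works well with word-count features
model = MultinomialNB()
model.fit(x, y)
preds = model.predict(x)
print(preds)
```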

You can view the entire Python code for the fake news classifier here.

Thanks for reading, and feedback is welcome.

Regards,
Bagavathy Priya N

Reference:

https://www.coursera.org/learn/classification-vector-spaces-in-nlp
