Recurrent Neural Networks- An intuitive approach Part 2

Niketh Narasimhan
12 min read · Jul 31, 2020


Please find the link for part 1:

https://medium.com/@nikethnarasimhan/recurrent-neural-networks-an-intuitive-approach-part-1-ed5b2fec5722

Contents:

  1. Word embeddings
  2. Sentiment analysis

Word embeddings:

Word embedding means converting words or text into numbers or vectors that machines can use to perform various tasks such as regression, classification, etc., since, as we are aware, most ML/DL algorithms cannot read raw text in its most basic form.

For example, let us take the sentence "We will become successful data scientists".

We list all the unique words, as shown below, to form a dictionary:

[‘We’ ‘will’ ‘become’ ‘successful’ ‘data’ ‘scientists’]

The vector representation of a word can then be written in one hot encoded form, with 1 at the position of the word and 0 elsewhere. For example, 'successful' becomes [0,0,0,1,0,0] and 'data' becomes [0,0,0,0,1,0].

The above is just a simple example to show how text can be converted to numbers.
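
If you prefer to see this in code, here is a minimal Python sketch of the one hot scheme, using the toy vocabulary above:

```python
# A minimal sketch of one hot encoding; the vocabulary mirrors the toy example above.
vocab = ['We', 'will', 'become', 'successful', 'data', 'scientists']

def one_hot(word, vocab):
    """Return a vector with 1 at the word's position in the vocabulary and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot('successful', vocab))  # [0, 0, 0, 1, 0, 0]
print(one_hot('data', vocab))        # [0, 0, 0, 0, 1, 0]
```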

Broadly, there are two approaches:

One hot encoding: a naive approach that places each word along its own axis, thus capturing no similarity between words.

Word embeddings: dense, lower-dimensional vectors in which words with similar context end up close to each other; building such representations is the focus of the rest of this section.

Different types of Word Embeddings

The different types of word embeddings can be broadly classified into two categories-

  1. Frequency based Embedding
  2. Prediction based Embedding

Let us try to understand each of these methods in detail.

Frequency based Embedding

There are generally three types of vectors that we encounter under this category.

Count Vector: the count-based method characterizes a target word by the words that co-occur with it across multiple contexts, determined using some form of co-occurrence estimation. In other words, the meaning of a word is conveyed by the words that co-occur with it in a variety of scenarios. Consider a quick Google search for 'basketball coach'.

We can therefore assume that words such as 'division 1' and 'high school' are related to 'basketball coach'; other terms such as 'NBA' might also turn up. 'Sacking', 'firing', 'hiring', and remuneration terms such as 'how much' may also show up, depending on the context.

How this co-occurrence estimation is derived can be understood using the example below.

Let us take two sentences/documents.

D1: He is a good scientist. She is a good scientist.

D2: Niketh is a good scientist

The dictionary created is the list of unique tokens/words in D1 and D2 (excluding stop words such as 'is' and 'a', as they are not very useful for analysis):

= ['He', 'She', 'good', 'scientist', 'Niketh']

Here, D = 2 and N = 5.

The count matrix M of size 2 x 5 will then be:

          He   She   good   scientist   Niketh
    D1     1     1      2           2        0
    D2     0     0      1           1        1

Each cell simply denotes the frequency of occurrence of a word in a document; for example, 'good' appears twice in D1.

Note: We can chart such a frequency map across different documents, recording either the frequency of each word or simply its presence (yes or no).

More generally, such a document-term matrix has one row per document and one column per term in the corpus.

Note: Documents can be compared for similarity based on the frequency of the terms they share; for example, if Document 5 and Document 1 both contain Term 4, that term is common between them.
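
As a rough illustration, the count matrix for D1 and D2 can be built with scikit-learn's CountVectorizer. This is a sketch (scikit-learn 1.x assumed); the stop word handling is chosen to mirror the dictionary above:

```python
# A sketch of building the count matrix for D1 and D2 with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["He is a good scientist. She is a good scientist.",
        "Niketh is a good scientist"]

# 'is' is filtered as a stop word; single-character tokens like 'a' are
# already dropped by the default tokenizer.
vectorizer = CountVectorizer(stop_words=['is'], lowercase=False)
M = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the dictionary, e.g. ['He' 'Niketh' 'She' 'good' 'scientist']
print(M.toarray())                         # the D x N count matrix
```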

TF-IDF Vector:

Let’s understand tf-idf intuitively by considering the use of tf-idf in text clustering.

We can represent documents (text) as mutually comparable vectors of terms (words). Consider that we use these vectors to compare documents for similarity and put them into clusters (categories). One way to populate these vectors for each document is to use the raw document-term matrix (with a row for each document and a column for every word), which contains the count or frequency of each term with a global ordering throughout the corpus of documents. This has problems: longer documents contain many terms while shorter documents contain few or none, leading to sparse matrices and poor comparability.

A better approach is to assign a weighting to each term in the document and put that weighting in the vector. tf-idf is one such weighting that allows you to normalize the count/frequency of the term:

  • tf stands for term frequency. It is computed by dividing the number of times a term occurs in the document by the total number of terms in the document. This division by the document length prevents a bias towards longer documents by normalizing the raw frequency of the term onto a comparable scale.
  • idf stands for inverse document frequency. It is computed by taking the logarithm of the total number of documents in the corpus divided by the number of documents in which the term occurs. This up-weighs the rare terms in the corpus.

The above definitions will become clearer with the example below.

Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear quite frequently in comparison to the words which are important to a document. For example, a document A on Barack Obama is going to contain more occurrences of the word “Obama” in comparison to other documents. But common words like “the” etc. are also going to be present in higher frequency in almost every document.

Therefore, we would like to assign less importance to words such as 'the', 'a', etc.

TF-IDF works by penalizing these common words by assigning them lower weights while giving importance to words like Obama in a particular document.

Let us understand the terms. For this example, assume Document 1 is a short text about Barack Obama containing 8 terms, in which 'Obama' appears 4 times and 'This' appears once, and Document 2 contains 5 terms, in which 'This' appears once.

TF = (Number of times term t appears in a document) / (Number of terms in the document)

So, TF(This, Document1) = 1/8

TF(This, Document2) = 1/5

IDF = log(N/n), where N is the number of documents and n is the number of documents in which term t has appeared.

So, IDF(This) = log(2/2) = 0.

IDF intuition:

Generally, if a word has appeared in all the documents, there is a high probability that it is a non-relevant word like 'This'. But if it has appeared in only a subset of documents, the word is probably relevant.

Let us compute IDF for the word ‘Obama’.

IDF(Obama) = log(2/1) = 0.301.

Now, let us compare the TF-IDF for a common word ‘This’ and a word ‘Obama’ which seems to be of relevance to Document 1.

TF-IDF(This,Document1) = (1/8) * (0) = 0

TF-IDF(This, Document2) = (1/5) * (0) = 0

TF-IDF(Obama, Document1) = (4/8) * 0.301 = 0.15

As you can see, for Document 1 the TF-IDF method heavily penalizes the word 'This' but assigns greater weight to 'Obama'. So tf-idf helps you rank the importance of a term to a document within its document corpus.
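
For reference, here is a tiny Python sketch that reproduces the arithmetic above. It assumes the IDF uses log base 10, which is what gives the 0.301 value in the example, and the document counts are the ones assumed earlier:

```python
# Reproducing the TF-IDF arithmetic from the example above.
# Document contents are assumed from the counts used in the example.
import math

def tf(term_count, doc_length):
    return term_count / doc_length

def idf(num_docs, docs_with_term):
    return math.log10(num_docs / docs_with_term)

print(tf(1, 8) * idf(2, 2))   # TF-IDF(This, Document1)  = 0.0
print(tf(1, 5) * idf(2, 2))   # TF-IDF(This, Document2)  = 0.0
print(tf(4, 8) * idf(2, 1))   # TF-IDF(Obama, Document1) ~ 0.15
```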

Co-Occurrence Matrix with a fixed context window

Intuition: similar words tend to occur together and have similar contexts. For example, 'Dog is a pet.' and 'Cat is a pet.' give dog and cat a similar context, i.e. pet. To see how co-occurrence counts arise, compare the two sentences 'A penny saved is a penny earned.' and 'Penny wise pound foolish.'

As can be seen, 'penny' is followed by 'earned', 'saved', and 'wise' once each, while 'a' is followed by 'penny' twice. Thus, based on these counts alone, 'earned', 'saved', and 'wise' are each equally likely (1/3) to follow 'penny'.

Let us define the two terms precisely.

Co-occurrence: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window.

Context window: a window of fixed size (and direction) around a word; with a window of size 2, for instance, the two words on either side of a word count as its context.

In the example corpus used in the original figure, 'he' and 'is' appear together 4 times within such a window.

If the corpus vocabulary contains V words, the resulting V x V co-occurrence matrix becomes very large. Therefore, it is usually decomposed with techniques such as PCA or SVD (singular value decomposition) to obtain dense, lower-dimensional word vectors.
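
To make the idea concrete, below is a minimal sketch of counting co-occurrences with a symmetric context window of 2. The toy corpus is illustrative, and the counts are kept in a dictionary rather than a full V x V matrix:

```python
# A minimal co-occurrence count with a fixed (symmetric) context window of 2.
from collections import defaultdict

corpus = ["he is a good scientist", "she is a good scientist"]  # illustrative corpus
window = 2

cooc = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # every word within `window` positions of `word` counts as its context
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(word, tokens[j])] += 1

print(cooc[('he', 'is')])     # how often 'he' and 'is' co-occur within the window
print(cooc[('good', 'is')])
```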

Advantages of Co-occurrence Matrix

  1. It preserves the semantic relationship between words, e.g. 'dog' and 'cat' tend to be closer than 'dog' and 'mango'.
  2. It uses PCA at its core, which leads to computational efficiency and dimensionality reduction.

Disadvantages of Co-Occurrence Matrix

  1. It requires huge memory to store the co-occurrence matrix. However, this problem can be circumvented by factorizing the matrix outside the main system (for example, on Hadoop clusters) and saving the resulting factors.

Prediction based Vector (word2vec)

Before we proceed any further let us introduce the concept of cosine similarity:

Let us consider the sentences "have a good day" and "have a great day". While these sentences mean the same to us, they can look entirely different to a machine.

If we build the dictionary of the two sentences (also known as the exhaustive vocabulary), say V = {Have, a, good, great, day}, and one hot encode each word, we get:

Have = [1,0,0,0,0], a = [0,1,0,0,0], good = [0,0,1,0,0], great = [0,0,0,1,0], day = [0,0,0,0,1]

These 5 words represent 5 different dimensions if we were to plot them. The vectors also have no projection on each other, as each is '1' in one dimension and '0' in all the others.

This means 'good' and 'great' are as different as 'day' and 'have', which is not true. Our objective is to have words with similar context occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. an angle close to 0.
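
A short numpy sketch of cosine similarity, using the one hot vectors from the vocabulary above (the dense vectors in the second call are made up purely for illustration):

```python
# Cosine similarity: 1 means same direction (angle 0), 0 means orthogonal (angle 90 degrees).
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

good = np.array([0, 0, 1, 0, 0])    # one hot vectors for V = {Have, a, good, great, day}
great = np.array([0, 0, 0, 1, 0])

print(cosine_similarity(good, great))     # 0.0: one hot vectors share no direction
print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([1.1, 1.9, 3.2])))   # close to 1 for similar dense vectors
```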

Here comes the idea of generating distributed representations. Intuitively, we introduce some dependence of one word on the other words. The words in context of this word would get a greater share of this dependence. In one hot encoding representations, all the words are independent of each other, as mentioned earlier.

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words. It does so without human intervention.

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.

The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.

When measuring cosine similarity, no similarity is expressed as a 90 degree angle, while total similarity of 1 is a 0 degree angle (complete overlap); i.e. Sweden equals Sweden, while Norway has a cosine similarity of 0.760124 with Sweden, the highest of any other country.

A list of the words associated with "Sweden" under Word2vec, ordered by proximity, is dominated by the other Scandinavian countries, whose cosine similarity to "Sweden" is closest to 1.
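
As a hedged sketch of how such neighbours are obtained in practice, gensim's Word2Vec (4.x API assumed) can be trained and queried as below. With a toy corpus like this the numbers are meaningless, but on a large corpus the nearest neighbours of 'sweden' would look like the list described above:

```python
# Training Word2vec on a toy corpus and querying similarities (gensim 4.x API assumed).
from gensim.models import Word2Vec

sentences = [["sweden", "norway", "denmark", "are", "scandinavian", "countries"],
             ["norway", "borders", "sweden"]]            # illustrative toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv.most_similar("sweden", topn=3))   # nearest words by cosine similarity
print(model.wv.similarity("sweden", "norway"))   # cosine similarity between two words
```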

CBOW method (Continuous Bag of Words):

This method takes the context of each word as the input and tries to predict the word corresponding to the context. Consider our example: Have a great day.

Let the input to the neural network be the word 'great'. Notice that here we are trying to predict a target word ('day') using a single context word, 'great'. More specifically, we use the one hot encoding of the input word and measure the output error against the one hot encoding of the target word ('day'). In the process of predicting the target word, we learn the vector representation of the target word.

Below is the architecture of the CBOW model.

Left: a single context word. Right: multiple context words (3 input layers, representing 3 different context words, fed into the same hidden layer).

For multiple context words the steps remain the same; only the calculation of the hidden activation changes. Instead of copying the single corresponding row of the input-hidden weight matrix to the hidden layer, an average is taken over all the corresponding rows of the matrix, as shown in the figure above, and this average vector becomes the hidden activation. So, if we have three context words for a single target word, we have three initial hidden activations, which are then averaged element-wise to obtain the final activation.

  1. The objective function in an MLP is MSE (mean squared error), whereas in CBOW it is the negative log likelihood of a word given its context, i.e. -log(p(wo | wi)), where wo is the output (target) word, wi are the context words, and p(wo | wi) is given by a softmax over the vocabulary.

  2. The gradients of the error with respect to the hidden-output weights and the input-hidden weights are different, since an MLP (generally) has sigmoid activations while CBOW has linear activations. The method for calculating the gradients, however, is the same as for an MLP. A toy forward pass illustrating this loss is sketched below.
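
The numpy sketch below walks through a single CBOW forward pass under these definitions: the context rows of the input-hidden matrix are averaged into the hidden activation, a softmax over the vocabulary gives p(wo | wi), and the loss is its negative log likelihood. Dimensions and word indices are illustrative:

```python
# A toy CBOW forward pass: averaged context rows -> softmax -> negative log likelihood.
import numpy as np

V, N = 5, 3                           # vocabulary size, embedding size (illustrative)
rng = np.random.default_rng(0)
W_in = rng.random((V, N))             # input -> hidden weights (rows are word vectors)
W_out = rng.random((N, V))            # hidden -> output weights

context_ids = [0, 1, 3]               # e.g. indices of 'Have', 'a', 'great'
target_id = 4                         # e.g. index of 'day'

h = W_in[context_ids].mean(axis=0)    # hidden activation: element-wise average of context rows
scores = h @ W_out                    # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: p(w_o | context)

loss = -np.log(probs[target_id])      # negative log likelihood of the target word
print(loss)
```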

SkipGram:

Skip-gram is essentially CBOW in reverse: given a word, predict its context. In this case, given a word, we try to predict its neighbouring words, which give us an idea of its "context".

Given the sentence:
“I will have orange juice and eggs for breakfast.”
and a window size of 2, if the target word is 'juice', its neighbouring words will be (have, orange, and, eggs). Our input and target word pairs would be (juice, have), (juice, orange), (juice, and), (juice, eggs).
Also note that, within the sampled window, the proximity of a word to the source word plays no role, so 'have', 'orange', 'and', and 'eggs' are treated the same while training.
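
A small Python sketch of generating these (target, context) training pairs with a window of 2:

```python
# Generating skip-gram (target, context) pairs with a symmetric window of 2.
sentence = "I will have orange juice and eggs for breakfast".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((target, sentence[j]))

# Pairs whose target word is 'juice':
print([p for p in pairs if p[0] == "juice"])
# [('juice', 'have'), ('juice', 'orange'), ('juice', 'and'), ('juice', 'eggs')]
```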

The input vector for skip-gram is similar to that of a 1-context CBOW model, and the calculations up to the hidden layer activation are the same. The difference is in the target variables: since we have defined a context window of 2 on both sides, there are four one hot encoded target variables and four corresponding outputs.

Four separate errors are calculated with respect to the four target variables, and the four error vectors obtained are added element-wise to get a final error vector, which is propagated back to update the weights.

The weights between the input and the hidden layer are taken as the word vector representations after training. The loss function (the objective) is of the same type as in the CBOW model.

Skip Gram Architecture

Sentiment analysis:

Sentiment analysis is a machine learning technique that detects polarity (e.g. a positive or negative opinion) within text, whether a whole document, paragraph, sentence, or clause.

Purpose:

Since customer feedback is of primary importance to a corporation, proper sentiment analysis of its reviews can guide its managers in implementing their strategy.

  1. Sentiment analysis helps businesses process huge amounts of data in an efficient and cost-effective way. For example, millions of tweets and posts are generated on Twitter and Facebook in a day; it is impossible to gauge them manually.
  2. Real-time analysis: sentiment analysis can help identify very sensitive issues, such as an escalating PR crisis or a wave of negative customer reviews, which if not nipped in the bud can lead to a severe crisis.
  3. Consistent criteria: since sentiment can be highly subjective and dependent on the person evaluating it, a centralized algorithm helps the company minimize errors due to localized misperceptions.

Basic sentiment analysis of text documents follows a straightforward process:

  1. Break each text document down into its component parts (sentences, phrases, tokens and parts of speech).
  2. Identify each sentiment-bearing phrase and component.
  3. Assign a sentiment score to each phrase and component (-1 to +1), as in the rule-based sketch after this list.
  4. Optional: combine scores for multi-layered sentiment analysis.
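
As a hedged illustration of step 3, NLTK's rule-based VADER analyzer assigns each text a 'compound' score in the -1 to +1 range (the example sentences are made up):

```python
# Rule-based sentiment scoring with NLTK's VADER; 'compound' lies between -1 and +1.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')        # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

print(sia.polarity_scores("The product is great and support was very helpful."))
print(sia.polarity_scores("Terrible experience, the app keeps crashing."))
```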

Sentiment analysis through machine learning:

The Training and Prediction Processes

In the training process, our model learns to associate a particular input (i.e. a text) with the corresponding output (tag) based on the training samples. The feature extractor transforms the text input into a feature vector. Pairs of feature vectors and tags (e.g. positive, negative, or neutral) are fed into the machine learning algorithm to generate a model.

In the prediction process, the feature extractor is used to transform unseen text inputs into feature vectors. These feature vectors are then fed into the model, which generates predicted tags (again: positive, negative, or neutral).

Feature Extraction from Text

This can be done using word embeddings (also known as word vectors). This kind of representation makes it possible for words with similar meanings to have similar representations, which can improve the performance of classifiers.
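
Tying this back to recurrent networks, here is a hedged Keras sketch of a sentiment classifier that learns word embeddings jointly with an LSTM. The vocabulary size, layer sizes, and dummy data are illustrative; in practice the integer-encoded, padded reviews would come from a real dataset such as IMDB:

```python
# A sketch of an embedding + LSTM sentiment classifier (TensorFlow/Keras assumed).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 10000, 100      # illustrative sizes

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64),  # learns the word vectors
    LSTM(32),                                        # reads the sequence of embeddings
    Dense(1, activation='sigmoid')                   # positive vs. negative
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# X: integer-encoded, padded reviews; y: 0 = negative, 1 = positive (dummy data here)
X = np.random.randint(1, vocab_size, size=(8, max_len))
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)
```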
