End-to-End Memory Network: Highlights

Shwetank Sonal · Published in Analytics Vidhya · Mar 1, 2020

Neural networks have been gaining popularity thanks to Siri, Alexa, Cortana and Google Assistant. More and more applications are combining neural networks with natural language processing to create products that will define the 21st century.

Background

Neural networks are algorithms loosely modeled on the functioning of an animal brain. A network consists of an interconnected set of neurons, connections, weights and propagation functions, which together allow a model to be trained. There are two broad categories of neural network: feedforward networks and recurrent networks. Feedforward networks are directed acyclic graphs of neurons, whereas recurrent networks allow neurons to have connections to the same or previous layers.
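As a rough illustration (a toy sketch, not from the original article), the difference shows up in how an input sequence is processed: a feedforward layer maps each input independently, while a recurrent layer carries a hidden state from one step to the next.

```python
import numpy as np

# Toy illustration: the same sequence processed by a feedforward layer
# versus a recurrent layer that keeps a hidden state across time steps.
rng = np.random.default_rng(0)
W_ff = rng.standard_normal((4, 3))                       # feedforward weights
W_in, W_rec = rng.standard_normal((4, 3)), rng.standard_normal((4, 4))

x_seq = rng.standard_normal((5, 3))                      # a sequence of 5 inputs

# Feedforward: each input is mapped independently, with no memory of the past.
ff_out = [np.tanh(W_ff @ x) for x in x_seq]

# Recurrent: the hidden state h carries information from one step to the next.
h = np.zeros(4)
rnn_out = []
for x in x_seq:
    h = np.tanh(W_in @ x + W_rec @ h)
    rnn_out.append(h)
```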

FeedForward vs Recurrent Network

Recurrent Neural Network

One of the biggest advantages of an RNN is that it can store information as memory, which helps it leverage contextual information. Long Short-Term Memory (LSTM) is one of the most popular RNN architectures. Each LSTM unit consists of a cell and three gates: an input gate, an output gate and a forget gate.
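A minimal sketch of a single LSTM step in the standard textbook formulation (the variable names and shapes below are illustrative, not taken from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4H x D), U (4H x H) and b (4H,) hold the stacked
    parameters for the input, forget and output gates plus the cell update."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                  # candidate cell update
    c = f * c_prev + i * g                          # cell state carries memory
    h = o * np.tanh(c)                              # exposed hidden state
    return h, c

# Toy usage with hidden size 4 and input size 3.
rng = np.random.default_rng(0)
W, U, b = rng.standard_normal((16, 3)), rng.standard_normal((16, 4)), np.zeros(16)
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, U, b)
```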

LSTM network

However, as the name suggests, the memory an LSTM maintains is short-term. Several memory networks address this limitation by providing a comparatively long-term memory; the end-to-end memory network is one such example.

End to end Memory Network

The model architecture for this network was published in the paper "End-To-End Memory Networks" by Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. To make it easier to understand, let's begin with an example where the solution lies in using memory as part of the architecture:

Story :

  1. Mary got the milk.
  2. John moved to the bedroom.
  3. Sandra went back to the kitchen.
  4. Mary travelled to the hallway.

Query : Where is the milk?

Answer: hallway
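In code, such a training instance (story, query, answer) might simply be represented as plain Python data; the variable names below are illustrative:

```python
# One training instance in the style of the example above
# (variable names are illustrative, not from the article).
story = [
    "Mary got the milk.",
    "John moved to the bedroom.",
    "Sandra went back to the kitchen.",
    "Mary travelled to the hallway.",
]
query = "Where is the milk?"
answer = "hallway"
```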

As this example shows, we need to take the context of the entire story into consideration before answering the query. This is exactly the use case where an end-to-end memory network becomes important. The architecture of this memory network is as follows:

Image source: End-to-end memory network

Let's break it down into two parts for easier understanding.

The first part of the architecture finds the sentence most relevant to the query. It starts with the query being converted into a word embedding of size k. As part of this process, the query q is first converted into a bag-of-words vector of size V (V is the size of the vocabulary used). For example, for the query "where is the milk" (after parsing), this vector has 1s at the vocabulary positions of the query's words and 0s everywhere else.

Then, using an embedding matrix B (k × V), this bag-of-words vector is converted into a word embedding of size k. Call the resulting vector u (of size k).
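A small sketch of this step (the vocabulary, shapes and initialisation below are toy assumptions, not the paper's setup):

```python
import numpy as np

# Toy vocabulary covering the example story and query.
vocab = ["where", "is", "the", "milk", "mary", "got", "john", "moved",
         "to", "bedroom", "sandra", "went", "back", "kitchen",
         "travelled", "hallway"]
V, k = len(vocab), 8
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(sentence):
    """Encode a sentence as a V-dimensional word-count vector."""
    x = np.zeros(V)
    for w in sentence.lower().replace(".", "").replace("?", "").split():
        x[word_to_idx[w]] += 1
    return x

rng = np.random.default_rng(0)
B = rng.standard_normal((k, V)) * 0.1     # query embedding matrix B (k x V)

q = "Where is the milk?"
u = B @ bag_of_words(q)                    # query embedding u, of size k
```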

In the next step, the story needs to be converted into memory. Similar to the approach above, each sentence is parsed and then encoded into a vector of size k using an embedding matrix A (k × V); these memory vectors are commonly denoted mᵢ.
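Continuing the toy sketch above, the story can be encoded the same way:

```python
# Encode the story into memory vectors m_i with embedding A (k x V),
# reusing the toy vocabulary and bag_of_words() from the previous sketch.
A = rng.standard_normal((k, V)) * 0.1

story = ["Mary got the milk.",
         "John moved to the bedroom.",
         "Sandra went back to the kitchen.",
         "Mary travelled to the hallway."]

M = np.stack([A @ bag_of_words(s) for s in story])   # shape: (num_sentences, k)
```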

Now we have both the story and the query embedded as vectors. We take the inner product between the query vector u and each memory vector mᵢ, followed by a softmax operation, to obtain a probability pᵢ for each sentence, i.e. pᵢ = softmax(uᵀmᵢ), and thereby find the best match.

At this stage, we have found the sentence most relevant to the query. For example, if the query was "where is the milk", the calculated probability vector should concentrate on the sentence "Mary got the milk".
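Continuing the sketch, the matching step is just an inner product followed by a softmax:

```python
# Match the query against memory: inner products followed by a softmax
# give a probability p[i] for each sentence of the story.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p = softmax(M @ u)                 # p[i] ~ relevance of sentence i to the query
best = story[int(np.argmax(p))]    # with trained weights: "Mary got the milk."
```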

In the second part, the final answer to the query is calculated. We start again by encoding the sentences of the story as vectors, this time using a second embedding matrix C (k × V); call these output vectors cᵢ.

Using the probability vector p from the previous part, we compute the output o as the probability-weighted sum of the output vectors: o = Σᵢ pᵢ cᵢ.
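In the running toy sketch, this weighted sum is a single line:

```python
# Output memory representation: a second embedding C produces output
# vectors c_i, and the response o is their p-weighted sum.
C = rng.standard_normal((k, V)) * 0.1
C_out = np.stack([C @ bag_of_words(s) for s in story])   # c_i, shape (num_sentences, k)

o = p @ C_out     # o = sum_i p_i * c_i, of size k
```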

For easier interpretation, think of o + u, the combination of the query and the relevant sentence found before (i.e. "where is the milk" combined with "Mary got the milk"), as a new query. Evaluating this new representation against the story is how the model infers the answer from its context.

Finally, the predicted answer is obtained with the help of a matrix W (V × k) through the equation â = softmax(W(o + u)), which gives a probability distribution over the vocabulary; the word with the highest probability is the predicted answer.
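And the final step of the sketch:

```python
# Final prediction: project o + u back onto the vocabulary with W and take
# a softmax; the highest-probability word is the predicted answer.
W = rng.standard_normal((V, k)) * 0.1

a_hat = softmax(W @ (o + u))                 # distribution over the vocabulary
predicted = vocab[int(np.argmax(a_hat))]     # with trained weights: "hallway"
```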

Conclusion

We can train this network end-to-end on a large number of stories with their corresponding queries and answers. As with LSTMs, the complexity of the architecture can be increased to include multiple layers (hops), which generally improves the accuracy of the results.
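As a rough illustration of that idea, the single-layer computation above can be stacked into multiple hops, with the output of one hop feeding the query of the next. The sketch below reuses the toy definitions from the earlier snippets and, for brevity, gives every hop its own A and C matrices (the paper's weight-tying schemes are omitted, so this is an assumption for illustration only):

```python
# Compact multi-hop forward pass built from the single-layer pieces above.
def memn2n_forward(story, query, B, hops):
    u = B @ bag_of_words(query)
    for A_h, C_h in hops:                      # hops = [(A1, C1), (A2, C2), ...]
        M = np.stack([A_h @ bag_of_words(s) for s in story])
        C_out = np.stack([C_h @ bag_of_words(s) for s in story])
        p = softmax(M @ u)
        o = p @ C_out
        u = u + o                              # the output feeds the next hop
    return u

hops = [(rng.standard_normal((k, V)) * 0.1, rng.standard_normal((k, V)) * 0.1)
        for _ in range(3)]
u_final = memn2n_forward(story, "Where is the milk?", B, hops)
answer_probs = softmax(W @ u_final)            # distribution over the vocabulary
```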
