The working of RNNs (the box that generates outputs that you receive from voice assistants like Alexa and Siri)

Rishit Dholakia
Published in DataSeries · Oct 6, 2019

In my previous blog, What makes Alexa, Siri and Google Assistant respond intelligently, I gave an overview of how these voice assistants work and how they generate their intelligent output. In this blog, I will dig into how the box actually works to produce the relevant output. The box does not magically understand our natural language; there is math and logic behind the output it gives. Before I get into the working, I strongly recommend you go through the basics of neural networks and the different types of loss functions.

So let’s get right into it :).

Recurrent Neural Networks

Most sequence-to-sequence models are based on a deep learning architecture called the recurrent neural network. One reason this network is preferred is its ability to store a memory of the previous parts of the sequence and produce output with respect to them. As shown in the figure, it works as a feedback network: the output generated is passed back into the same network as input. In theory, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

Some of the other advantages of RNNs are:

1) They allow outputs of variable length, which means they are not tied to fixed input and output sizes like a standard DNN or CNN. For example, ‘hi guys’ has length 2 and ‘this is me’ has length 3, and sequences of such varying lengths in a dataset can all be passed to the same recurrent neural network to get outputs of varying lengths.

2) Other networks do not learn the meaning of a word across different positions of the text, the way an RNN does. For example, this helps it decide whether the sentence should be “the apple and pair salad” or “the apple and pear salad”.

The recurrent neural network stores the state (activation) of the network at each timestep during training and then uses these values to generate new sequences. We give an input x(t) and pass it through the network, whose weights Wxh are updated after every iteration. The activation h(t) is what stores the information of the previous parts of the sequence, and the output at each timestep is y(t). In short, if the input is, for example, “The box is red”, then the input word “The” is x(t) and the network has to predict the output “box”.

Note: keep in mind that the RNN on the left of the figure is just one recurrent box that gives the output, and the one on the right is the same box unrolled over multiple timesteps.
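To make that concrete, here is a minimal sketch of one such recurrent box looping over the example sentence in plain numpy. The toy vocabulary, the hidden size and the random weights Wxh, Wh and Why are all assumptions for illustration, and the weights are untrained, so the predictions are meaningless until the network is trained with the loss described below.

```python
import numpy as np

np.random.seed(0)

# Toy vocabulary and one-hot encoding (an assumption for illustration).
vocab = ["<EOS>", "the", "box", "is", "red"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    x = np.zeros((len(vocab), 1))
    x[word_to_idx[word]] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden_size = 8
Wxh = np.random.randn(hidden_size, len(vocab)) * 0.1   # input -> hidden weights
Wh  = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden weights (the memory path)
Why = np.random.randn(len(vocab), hidden_size) * 0.1   # hidden -> output weights

h = np.zeros((hidden_size, 1))   # h(0): no memory before the first word

for word in ["the", "box", "is", "red"]:
    x = one_hot(word)                 # x(t): the current input word
    h = sigmoid(Wxh @ x + Wh @ h)     # h(t): mixes the input with the previous activation
    y = softmax(Why @ h)              # y(t): distribution over the vocabulary for the next word
    print(word, "-> most likely next word:", vocab[int(y.argmax())])
```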

This network follows the principles of forward and backward propagation to reduce the loss of the prediction made on the input. The following is the series of equations used in the network.

Forward Propagation:

To summarise what the formulas below represent: the input at time t, multiplied by its weight, is added to the activation at time t−1 (this is essentially the stored memory), and together they are used to predict the output at time t. In mathematical terms, the equations ask: given the previous and current words, what is the probability of the correct output?

The current activation is represented as:

h(t) = σ( Wxh · x(t) + Wh · h(t−1) )

Here Wh · h(t−1) is the product of the hidden weight Wh and the previous activation h(t−1), which carries the memory forward.

The σ (sigma) represents the sigmoid function applied to this sum.

The output is represented as:

ŷ(t) = g( Why · h(t) )

W(hy) is the weight required to calculate the output, and g is the output activation, typically a softmax that turns the result into a probability distribution over the vocabulary.

The loss used is the cross-entropy loss (the error of whether the right word was predicted or not):

L( ŷ(t), y(t) ) = − Σ (i = 1 to M)  yᵢ(t) · log( ŷᵢ(t) )

Here M represents the number of vocabulary words.
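As a quick worked example (the numbers are made up): because the target y(t) is one-hot, the sum over the M vocabulary words collapses to minus the log of the probability the network assigned to the correct word.

```python
import numpy as np

# Hypothetical prediction over an M = 5 word vocabulary, e.g. the softmax output y(t) above.
y_hat = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

# One-hot target: the correct next word sits at index 2.
y_true = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

loss = -np.sum(y_true * np.log(y_hat))   # cross-entropy over all M words
print(round(loss, 4))                    # 0.3567, i.e. -log(0.70)
```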

Back Propagation:

Back propagation follows the same idea as in ordinary artificial neural networks: it adjusts the weights to get the least-error predictions, here by propagating the error back through the timesteps. One option is to have the network stop at a given timestep and produce no more of the sequence after it, since the weight values and other parameters are trained on sequences up to that limit. The other option is to have the values run until an <EOS> token, known as the end-of-sentence tag, which when produced stops the network from generating more values. For example, “<EOS> this is me <EOS>” would be padded with these tags during the pre-processing phase.
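Here is a minimal sketch of what “generate until <EOS>” looks like, reusing the same kind of toy, untrained weights as before (so the sampled words are gibberish); the point is only the stopping condition and the hard limit on timesteps.

```python
import numpy as np

np.random.seed(1)

# Toy, untrained setup (assumptions for illustration).
vocab = ["<EOS>", "this", "is", "me"]
hidden_size = 8
Wxh = np.random.randn(hidden_size, len(vocab)) * 0.1
Wh  = np.random.randn(hidden_size, hidden_size) * 0.1
Why = np.random.randn(len(vocab), hidden_size) * 0.1

h = np.zeros((hidden_size, 1))
x = np.zeros((len(vocab), 1))          # start from an empty input
generated = []

for _ in range(20):                    # hard limit, so an untrained model cannot loop forever
    h = 1.0 / (1.0 + np.exp(-(Wxh @ x + Wh @ h)))      # h(t)
    logits = (Why @ h).ravel()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(np.random.choice(len(vocab), p=probs))   # sample the next word
    if vocab[idx] == "<EOS>":          # the end-of-sentence tag stops generation
        break
    generated.append(vocab[idx])
    x = np.zeros((len(vocab), 1))
    x[idx] = 1.0                       # feed the sampled word back in as the next input

print(" ".join(generated))
```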

There are two levels of RNNs that can be formed. These are:

a) Character level RNNs

b) Word-level RNNs

The vocabulary of a character-level RNN consists of all characters: a, b, c, d, e, etc. The vocabulary of a word-level RNN consists of words: this, the, about, etc. Using a character-level RNN is computationally expensive, because the RNN has to predict every character of every word before it can predict the rest of the sentence; the word-level RNN is usually the more viable choice.
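A small illustration of the trade-off, using the earlier salad sentence (the sentence itself is just example data): the character-level view needs far fewer vocabulary entries but many more prediction steps.

```python
sentence = "the apple and pear salad"

# Word-level view: a short sequence, but the vocabulary must hold every distinct word.
word_tokens = sentence.split()
word_vocab = sorted(set(word_tokens))

# Character-level view: a tiny vocabulary, but a much longer sequence to predict step by step.
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

print("word level:", len(word_tokens), "steps,", len(word_vocab), "vocabulary entries")
print("char level:", len(char_tokens), "steps,", len(char_vocab), "vocabulary entries")
```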

Problem with this network

Not everything is perfect in this world, and the same applies to the RNN box represented above. It faces the problem of the vanishing gradient. So what is the vanishing gradient?

The vanishing gradient problem is when the gradients shrink exponentially as they are propagated back through the timesteps, so the earlier weights receive almost no updates and lose significance. Consider the two sentences “The cat who ate apples was full today” and “The cats who ate apples were full today”. The words cat and cats need to have was and were as the correct verbs respectively, but normal RNNs do not capture this long-range information because of the vanishing gradient problem. Sometimes the opposite, an exploding gradient, occurs, but that is rectified by using gradient clipping.
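Gradient clipping is simple to sketch: whenever the overall size of the gradients crosses a threshold, rescale them back down before the weight update. The threshold of 5.0 and the gradient shapes below are arbitrary assumptions for illustration.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Hypothetical exploding gradients coming out of back propagation.
np.random.seed(2)
dWxh = np.random.randn(8, 5) * 100.0
dWh  = np.random.randn(8, 8) * 100.0
dWhy = np.random.randn(5, 8) * 100.0

dWxh, dWh, dWhy = clip_gradients([dWxh, dWh, dWhy])
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in (dWxh, dWh, dWhy)))
print(norm_after)   # at most 5.0 after clipping
```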

To cope with this problem, two variations of the RNN are used. Only the cell inside the network is different; the rest of the structure remains the same. These are known as gated recurrent units (GRUs) and long short-term memory (LSTM) units.

I guess this is a lot to take in with all this math, so my next blog will talk about how GRUs and LSTMs help remove this problem.

Stay tuned :)
