Siri, what is an RNN?

By Fatima Ezzeddine, published in Zaka. 10 min read. Nov 30, 2020.

- Siri: “Check out this blog by Zaka on Medium. I’m sure they can explain it better than I can.”

- Wait, Siri, you can actually understand me?


Yes, Siri can actually understand me, answer you, and basically converse with anyone who speaks one of the many languages Apple supports for Siri's language understanding. And not to be biased towards the iProducts: Amazon's Alexa and Google's Assistant can do just as good a job, if not better.

This is what Natural Language Processing scientists and engineers call "speech recognition". If you think about it, speech is a sequence of words, or more precisely, phonemes grouped together to form what we call a sentence, a "sequence" of words (baby spoiler). And today we live in a world of data abundance, an ocean of data we are all obliviously swimming in, where many types of data prevail. The days of numeric and tabulated data ruling the scene are over. Move over old dogs, time for new tricks.

Modules of the blog:

  1. Smooth Introduction — Why we need Recurrent Neural Networks (RNNs)
  2. Applications
  3. Working Principles and Common Confusions around RNNs
  4. Training RNN Models
  5. Types of RNN Architectures
  6. Advantages & Disadvantages of using RNNs
  7. Next Generation RNNs — LSTMs and GRUs

So, how do we handle this diversity and these new types of data? Well, different machine learning (ML) techniques need to be applied.

Sequential data challenged the average data scientist (whom we will call "Joe", just for ease of access) to find new ways of processing and forecasting it, because it is simply very different in nature from canonical regression or classification problems on classical datasets.

The main characteristic of sequential data is that its elements depend on the order in which they occur. This dependency is the reason we call the data 'sequential'. Let's take an example to better understand this concept. Imagine Joe is dealing with two sentences:

  1. RNN, what Siri is an?
  2. Siri, what is an RNN?

Saying order is crucial to text processing and understanding is an understatement. However, sequential challenges are not just restricted to text data. More on that later.

So moving on from this latter point, let's grab our left ear with our right hand over our head, do a little literature back-propagation, and start from the back (sentence too long? I made my point about the importance of order; case closed) by asking: why did Recurrent Neural Networks show up in the literature?

Recurrent Neural Networks (RNNs) were created because traditional feed-forward neural networks face several issues:

  • they struggle to handle sequential data
  • they cannot take a whole sequence of inputs into account
  • they have no way of memorizing information across time steps

What does the Oxford dictionary have to say about the word ‘Recurrent’?

(adj.) Occurring often or repeatedly.

‘she had a recurrent dream about falling’

The name fits: an RNN performs the same task repeatedly for every element of the sequence.

Applications of RNNs

Continuing with our back-propagation in this topic, let’s look at some common applications where RNNs have taken over before we actually dive into the techy techy.

Image Captioning, Visual Search, OCR, Image Recognition

Given an image, the goal is to generate a caption, analyzing the content by identifying the objects and recognizing the actions taking place.


Time-Series Prediction

Predict future values based on previously observed values, by tracking the movement and pattern of the chosen data.


Natural Language Processing (NLP)

Sentiment analysis and other problems related to extracting information from sentences, documents, or spoken language…


Machine Translation and Content Localization

Translating from one language to another (e.g., French to English); good translation goes beyond word-for-word substitution by understanding the content and the important points.


Conversational UI, Speech-to-text, RNN Speech recognition

Creating chatbots to interact with people, communicate like humans, and comprehend what is being said to generate the right response or action.


So, how do RNNs work?

We spoke a lot about these fuzzy neural networks but have yet to introduce their algorithm and working principles. Let’s dig in!

This is a common structure for a single neuron recurrent neural network. If you have no previous idea about neurons or neural networks, this blog might help you.


RNN unrolling: expanding the network's cyclic diagram into a chain of identical copies, one per time step, to illustrate the transformations that occur as the sequence is processed step by step.

At any given time t, the current state is a combination of the current input x(t) and the information carried over from x(t-1), or indeed any earlier x(t-n)!

Confusion: when an RNN is unrolled, many beginning practitioners think that many new cells are created, but it is actually just one cell being applied repeatedly. As mentioned above, it is a cycle.


The current state h(t) is calculated according to the following formula, as a function f of the previous hidden state h(t-1) and the current input x(t):

h(t) = f( h(t-1), x(t) )

As for the output, it is yielded from the following equation, where Y(t) represents the output and W(hy) the weight of the output layer:

Y(t) = W(hy) · h(t)

Finally, applying the activation function (here tanh) gives the concrete form of the state update, where W(hh) is the weight of the recurrent connection and W(xh) the weight at the input of the neuron:

h(t) = tanh( W(hh) · h(t-1) + W(xh) · x(t) )
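To make these formulas concrete, here is a minimal NumPy sketch of a single recurrent step, plus the "unrolling" loop over a toy sequence. The sizes and the exact weight names are illustrative choices for this sketch, not something prescribed by the post:

import numpy as np

# Illustrative dimensions (arbitrary choices)
input_size, hidden_size, output_size = 3, 5, 2

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden (recurrent) weights
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One recurrent step: h(t) = tanh(W_xh·x(t) + W_hh·h(t-1)); Y(t) = W_hy·h(t)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, y_t

# "Unrolling": the very same cell is applied at every time step of the sequence.
h = np.zeros(hidden_size)
for x_t in rng.standard_normal((4, input_size)):  # a toy sequence of 4 time steps
    h, y = rnn_step(x_t, h)

Note that the same weight matrices are reused at every step; that is exactly the weight sharing we list among the advantages further down.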

Training an RNN model

Training an RNN model is like learning to walk: you take it one step at a time. We start by feeding a single time-step of the input to the network, which computes a current state from that input and the state carried over from the previous steps.

This current state then becomes part of the input for the next time step, together with the next element of the sequence.

We can go through as many time-steps as dictated by the application, the network, and the parameters specified by Joe, our data scientist. Then, we join the information from all the previous states.

Once all the time steps are completed and the activations have been passed on through the layers until reaching the output layer, the final current state is used to calculate the final output.

This output is then compared to the actual output and an estimation of the error takes place, just like in any other Neural Network backpropagation algorithm.

This same error is then backpropagated to the network to update the weights.

And voilà! You got your RNN model trained.
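As a concrete illustration, here is one way such a training run might look in Keras on toy data (assuming the TensorFlow build of Keras); the shapes, layer sizes, and hyperparameters below are arbitrary placeholders:

import numpy as np
from tensorflow import keras

# Toy dataset: 100 sequences of 10 time steps with 8 features each, one target per sequence.
X = np.random.rand(100, 10, 8).astype("float32")
y = np.random.rand(100, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(10, 8)),   # (time_steps, features)
    keras.layers.SimpleRNN(16),   # the recurrent cell, unrolled over the 10 time steps
    keras.layers.Dense(1),        # output layer
])

# Backpropagation (through time) happens inside fit(): the error between the
# prediction and the target is propagated back to update the shared weights.
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=16, verbose=0)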

Types of RNN Architectures

Let's walk through the four types of sequence prediction problems that define the different types of RNN architectures.


One-to-one: single input and single output (SISO). This is the traditional form of neural network, mapping one fixed-size input to one output (e.g., plain classification).

One-to-many: single input and multiple outputs (SIMO), producing a sequence of outputs from one input. Example: image captioning.

Many-to-one: a sequence of inputs (multi-input) generating a single output (MISO); the whole sequence is mapped to a single prediction. Example: text classification.

Many-to-many: a sequence of inputs generating a sequence of outputs (MIMO), with one prediction per time step or per chunk of the sequence. Examples: text translation, named entity recognition, …

You choose the type of architecture based on the problem you are trying to solve.
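To get a feel for how that choice shows up in code, here is a hedged Keras sketch (again assuming tensorflow.keras, with arbitrary sizes): the very same recurrent layer behaves as many-to-one or many-to-many depending on the return_sequences flag.

from tensorflow import keras

# Many-to-one: read a whole sequence, emit a single prediction (e.g., text classification).
many_to_one = keras.Sequential([
    keras.Input(shape=(None, 8)),
    keras.layers.SimpleRNN(32),                         # only the last hidden state is returned
    keras.layers.Dense(1, activation="sigmoid"),
])

# Many-to-many: emit one prediction per time step (e.g., named entity recognition).
many_to_many = keras.Sequential([
    keras.Input(shape=(None, 8)),
    keras.layers.SimpleRNN(32, return_sequences=True),  # the full sequence of hidden states is returned
    keras.layers.Dense(5, activation="softmax"),        # applied at every time step
])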

Advantages and Disadvantages of RNNs

Nothing in life comes without its disadvantages, and most times, those very disadvantages are what lead to innovation and better solutions, as we will see a bit further ahead in the blog.

Let’s start with the positives!

Advantages

  • They can process inputs of any length
  • The model size does not grow with the length of the input
  • The computation takes previous information into account
  • The network weights are shared across time steps (and the hidden state lets the network carry information from previous inputs)

Disadvantages

  • Computation is slow
  • The current state cannot take any future input into account; only previous values are considered
  • They suffer from the vanishing gradient and exploding gradient problems
  • They are hard to stack into very deep models
  • They cannot keep track of long-term dependencies in time

Vanishing & Exploding Gradient Problem

For deep networks, the back-propagation algorithm can lead to one of the following issues:

  1. Vanishing Gradients: the gradients become very, very small (almost 0) and carry no useful signal, which means we cannot extract information correctly; the weights barely change and are effectively no longer "updatable".
  2. Exploding Gradients: the gradients become far too large during back-propagation and blow up (!) towards infinite values, making the weight updates unstable and, again, useless.
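A quick way to build intuition for both problems: back-propagation through time multiplies the gradient by roughly the same recurrent factor at every step, so it shrinks or grows geometrically with the sequence length. A tiny toy sketch (0.5 and 1.5 are just stand-ins for "small" and "large" recurrent factors):

steps = 50
for factor in (0.5, 1.5):
    gradient = 1.0
    for _ in range(steps):   # one multiplication per time step we backpropagate through
        gradient *= factor
    print(f"factor={factor}: gradient after {steps} steps is about {gradient:.3e}")

# factor=0.5 collapses towards 0 (vanishing); factor=1.5 blows up into the hundreds of millions (exploding).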

Solutions

Now, you didn't think I'm the type of person who brings up a problem without a solution, did you? Let's enumerate some workarounds for the gradient issues mentioned above:

  • Gradient Clipping (a short sketch follows this list)
  • Identity Initialization
  • Truncated Back-propagation
  • Choosing the right activation function
  • Weight initialization
  • LSTM and GRU neural networks, which we expand on below.
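As promised, here is what gradient clipping can look like in practice: Keras optimizers accept clipnorm / clipvalue arguments for exactly this purpose. The threshold of 1.0 and the tiny model are arbitrary examples, not recommendations:

from tensorflow import keras

# Rescale each gradient whenever its norm exceeds 1.0, so updates cannot "explode".
optimizer = keras.optimizers.Adam(clipnorm=1.0)
# Alternative: clip each gradient component into [-0.5, 0.5]
# optimizer = keras.optimizers.Adam(clipvalue=0.5)

model = keras.Sequential([
    keras.Input(shape=(None, 8)),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(1),
])
model.compile(optimizer=optimizer, loss="mse")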

Long Short-Term Memory Networks (LSTM)

Long Short Term Memory networks (LSTMs) are a special breed of RNNs capable of handling and learning long-term dependencies. They do this through introducing the concept of “gates”. In particular, LSTMs contain 3 gates: forget, input, and output gates.


Forget Gate

Decides how much of the past data should be remembered and which information should be removed from the cell state at a particular time step. A sigmoid activation function carries out this task.


Input Gate

Decides how much the current input adds or contributes to the cell state. A sigmoid function and a tanh function together determine the "how much".


Cell State

Updated by combining the old state (scaled by the forget gate) with the new candidate values (scaled by the input gate), using point-wise operations.


Output Gate

Decides what part of the current cell state makes it to the output. The sigmoid and tanh functions are applied here.

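Putting the four pieces together, a single LSTM step can be sketched in a few lines of NumPy. The weight names, the dictionary layout, and the sizes below are illustrative choices for this sketch, not notation from the post:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold per-gate parameters keyed 'f', 'i', 'o', 'c'."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate: how much new information to add
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate: what part of the cell state to expose
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell content
    c_t = f * c_prev + i * c_tilde                              # point-wise update of the cell state
    h_t = o * np.tanh(c_t)                                      # new hidden state / output
    return h_t, c_t

# Toy usage with random parameters
n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in "fioc"}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in "fioc"}
b = {k: np.zeros(n_hid) for k in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)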

GRU: Gated Recurrent Units

The GRU is another, newer generation of RNN, similar to an LSTM but with more efficient cell computations.

The cool thing about GRUs is that they have fewer parameters. They only have two gates, a reset gate and an update gate, and consequently, fewer tensor operations which makes them faster to train compared to LSTMs.


Update Gate

The update gate acts similarly to the forget and input gates of an LSTM. It decides what information to throw away and what information should be passed to the next step or layer.

Reset Gate

The reset gate is another gate used to decide how much past information should be “forgotten”, as in unwanted or not needed anymore.
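Using the same illustrative notation as the LSTM sketch above, a single GRU step combines these two gates roughly as follows. This is a hedged sketch of one common formulation, not the exact implementation inside any particular library:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b hold per-gate parameters keyed 'z' (update), 'r' (reset), 'h' (candidate)."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate: keep the old state vs. take the new one
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate: how much of the past to forget
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # candidate state built from a "reset" past
    return (1 - z) * h_prev + z * h_tilde                             # blend the old state with the candidate

Note there is no separate cell state here, which is part of why the GRU gets away with fewer parameters and fewer tensor operations.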

We can say that when we move from a plain RNN to an LSTM or a GRU, we gain more flexibility in controlling the flow of information, in terms of how inputs are mixed with previous states, and that is what gives us the most control over the outputs.

So, LSTMs and GRUs give us more control and achieve better results, but in a more complex and costly way: the gate calculations require more mathematical operations than a traditional RNN.

Bonus!

Misunderstanding by Practitioners

As we all know, time steps are the steps the RNN runs through, so they are not features themselves. A common mistake is to feed observations from previous time steps to the model as extra input (or output) features, the way you would build lag features for other machine learning algorithms; with an RNN, past observations should instead be presented as earlier steps of the input sequence.
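In other words, past observations enter an RNN as extra time steps in a 3-D input of shape (samples, time_steps, features), not as extra feature columns. A small, hypothetical reshaping example:

import numpy as np

# A univariate toy series and a window of 3 past observations per sample (illustrative numbers).
series = np.arange(10, dtype="float32")
window = 3

# Tabular "lag feature" framing, as you might prepare it for a classical ML algorithm: shape (samples, window)
lag_features = np.stack([series[i:i + window] for i in range(len(series) - window)])

# RNN framing: the same windows, but shaped (samples, time_steps, features) = (7, 3, 1)
rnn_input = lag_features[..., np.newaxis]
print(lag_features.shape, rnn_input.shape)  # (7, 3) (7, 3, 1)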

Keras for building RNN, LSTM, GRU models

Keras is an ML framework that offers us the ability to train any type of neural network by including all types of layers: RNN, LSTM, GRU, Dense, Convolutional, etc.

All you have to do is to stack your specific layers depending on your needs to build your network.

Below is a small demo on how to create a single RNN, LSTM, and GRU layer using Keras.

RNN Layer

from tensorflow import keras  # assuming the TensorFlow build of Keras
rnn = keras.layers.SimpleRNN(n_cells, return_sequences=True, return_state=True)  # n_cells = number of hidden units

(Refer to the documentation for more details.)

LSTM Layer

lstm = keras.layers.LSTM(4, return_sequences=True, return_state=True)  # 4 hidden units; returns the full sequence plus the final hidden and cell states

(Refer to the documentation for more details.)

GRU Layer

gru = keras.layers.GRU(4, return_sequences=True, return_state=True)  # 4 hidden units; a GRU has no separate cell state to return

(Refer to the documentation for more details.)
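And to tie the three snippets together, here is one hypothetical way such layers can be stacked into a full model with keras.Sequential (assuming tensorflow.keras; the layer sizes and the final Dense layer are arbitrary examples):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 8)),                 # sequences of any length, 8 features per step
    keras.layers.GRU(32, return_sequences=True),  # GRU layer passing its full sequence onward
    keras.layers.LSTM(16),                        # LSTM layer keeping only its final state
    keras.layers.Dense(1),                        # output layer
])
model.compile(optimizer="adam", loss="mse")
model.summary()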

Conclusion

RNNs, LSTMs, and GRUs hold great promise for Artificial Intelligence applications that leverage history and time patterns in unique data types such as human language and time-series data.

Don’t forget to support with a clap!

You can join our efforts at Zaka and help democratize AI in your city! Reach out and let us know.

To discover Zaka, visit www.zaka.ai

Subscribe to our newsletter and follow us on our social media accounts to stay up to date with our news and activities:

LinkedIn · Instagram · Facebook · Twitter · Medium
