Deep Sequence Modeling

Whether it’s an audio waveform of your voice or a sentence, we can think of this type of data as sequential because it consists of related data points that occur in order, like words or waves. When we feed a sequence into a model and output a new sequence (e.g. input words and output a new sentence of words), we call that sequence-to-sequence modeling.

Machine translation, summarization, and question answering are all sequence-to-sequence tasks, and the Encoder-Decoder model has been the go-to architecture for sequence-to-sequence modeling.

These notes follow the Sequence Models Coursera course taught by Andrew Ng. In this course I learned about different types of RNNs, language modeling and sequence generation, Gated Recurrent Units, LSTMs, Bidirectional RNNs, Deep RNNs, word embeddings, beam search, attention models and speech recognition.

Why Sequence Models?

  • Speech recognition — Input audio clip, output text transcript.
  • Music Generation — Input nothing (or genre), output a sequence of notes.
  • Sentiment classification — Input a phrase, output the number of stars the review deserves.
  • DNA sequence analysis — Input your DNA sequence, output which part of sequence belongs to a protein.
  • Machine translation — Input French, output English.
  • Named Entity Recognition — Input a sentence, output which words in the sentence are people’s names.
Examples of sequence data: https://www.coursera.org/learn/nlp-sequence-models/lecture/0h7gT/why-sequence-models

Coursera Week 1

Recurrent Neural Networks

Why not use a standard neural network?

  1. Inputs and outputs can be different lengths in different examples.
  2. Doesn’t share features learned across different positions of text.
Forward Propagation of an RNN
RNN Forward Propagation Equations

In the notation Wax, the second subscript x means that Wax will be multiplied by an x-like quantity, and the first subscript a means it will be used to compute an a-like quantity.

However, we can compress the two weight matrices down to just one matrix, Wa. The advantage is that we don’t have to carry around two separate parameter matrices Waa and Wax: we stack them side by side into a single parameter matrix Wa, and stack a<t-1> and x<t> into a single vector. Here’s the new notation:
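
To make the two notations concrete, here’s a minimal numpy sketch (not from the course; parameter names and shapes are illustrative) of one forward step in both forms:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One forward step with separate parameter matrices Waa and Wax."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a<t>
    y_hat_t = softmax(Wya @ a_t + by)              # y_hat<t>
    return a_t, y_hat_t

def rnn_step_compressed(x_t, a_prev, Wa, Wya, ba, by):
    """Same step with Waa and Wax stacked side by side into one matrix Wa,
    and a<t-1>, x<t> stacked into one vector."""
    a_t = np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t
```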

  • One downfall of basic RNNs is that they only take into account information from earlier in the sequence, not later. For example, if you’re trying to determine whether Teddy is a person’s name in the sentence He said, “Teddy Roosevelt was a great president.”, knowing only the first three words He said “Teddy” isn’t enough. Bidirectional Recurrent Neural Networks address this problem.
  • Tanh is the most common activation function used in RNNs; sometimes ReLU is also used.
  • The choice of activation function for your output y depends on your classification problem: binary classification (sigmoid function), k-way classification (softmax function).

Suppose that y<t> indicates whether the word at time step t is a person’s name, and y_hat<t> is the probability your neural network outputs for that word being a person’s name. In order to compute backpropagation you need a loss function. Here’s the standard element-wise cross-entropy loss:

Element-wise Cross-entropy loss at a single time step, t, for a single word
Loss for the entire sequence. Notice the L has no time stamp, t.
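
As a rough numpy sketch of these two formulas for the name-detection example (y<t> is a 0/1 label and y_hat<t> the predicted probability; not the course’s starter code):

```python
import numpy as np

def loss_at_t(y_hat_t, y_t):
    """Element-wise cross-entropy at a single time step t.
    y_t is 1 if the word at step t is a person's name, else 0;
    y_hat_t is the network's predicted probability of that."""
    return -y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)

def sequence_loss(y_hat, y):
    """Loss for the entire sequence: the sum of the per-time-step losses (no t on L)."""
    return sum(loss_at_t(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hat, y))
```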

Examples of RNN Architecture

  1. Many-to-many (Tx = Ty) — Used for Named Entity Recognition.
  2. Many-to-many (Tx ≠ Ty) — Used in Machine Translation (Encoder-Decoder).
  3. Many-to-one — Text reviews to star ratings.
  4. One-to-many — Used in music generation.
The first many-to-many architecture can be used for Named Entity Recognition. The second many-to-many, where Tx and Ty are not the same, is the Encoder-Decoder used for Machine Translation.

Language Modeling & Sequence Generation with RNN

A speech recognition system should choose “The apple and pear salad” over “The apple and pair salad”; to do this, a language model estimates the probability of a sequence of words.

Below is an RNN language model for predicting the next upcoming word.

At the first step, a softmax outputs the probability of what the first word is, y_hat<1>. For step 2 we tell it that the correct word was “cat” and pass that as the input to the next layer. We keep passing the previous (correct) words forward as inputs into the following steps. Essentially, this RNN learns one word at a time, going from left to right.
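
A minimal numpy sketch of that left-to-right, teacher-forced pass (parameter names are hypothetical, not the course’s notation for code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_forward(word_indices, Waa, Wax, Wya, ba, by, vocab_size):
    """Left-to-right RNN language model with teacher forcing:
    at each step the *correct* previous word is fed in as the next input."""
    n_a = Waa.shape[0]
    a = np.zeros(n_a)                       # a<0>: zero vector
    x = np.zeros(vocab_size)                # x<1>: zero vector (no previous word yet)
    y_hats = []
    for idx in word_indices:                # indices of the training sentence's words
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y_hat = softmax(Wya @ a + by)       # P(next word | words so far)
        y_hats.append(y_hat)
        x = np.zeros(vocab_size)            # teacher forcing: feed the true word
        x[idx] = 1.0
    return y_hats
```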

Vanishing Gradients with RNNs

Disadvantage:

The sentence The cat, which already ate……., was full contains the word was, which depends on cat even though it’s located far away from it. Was should be singular because cat is singular (rather than the plural cats … were). Basic RNNs like the ones above are not very good at capturing such long-term dependencies. This is due to the vanishing gradient problem in very deep neural networks: the gradients of the errors from later time steps have a hard time propagating back to affect the computations of earlier time steps.

Gated Recurrent Unit (GRU)

GRUs help solve the vanishing gradient problem by providing a memory cell c.

At every time step we are going to consider overwriting the memory cell with a value č, the candidate new value.

Gamma with subscript u, Γu, is the update gate, a value between 0 and 1 (in practice it is usually very close to either 0 or 1); u stands for update. It decides whether we will update the memory cell c with č.

Simplified GRU https://www.coursera.org/learn/nlp-sequence-models/lecture/agZiL/gated-recurrent-unit-gru
Full GRU with relevance gamma, Γr.

For the full GRU there is also Gamma with subscript r, Γr, the relevance gate, another value between 0 and 1; r stands for relevance. It tells us how relevant c<t-1> is for computing the candidate č.
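
A minimal numpy sketch of one full GRU step following these equations (parameter names are my own, not the course’s):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, Wr, Wu, Wc, br, bu, bc):
    """One step of the full GRU."""
    concat = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)                       # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                       # update gate
    c_candidate = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_candidate + (1.0 - gamma_u) * c_prev    # gated update of the memory cell
    return c_t                                                # in the GRU, a<t> = c<t>
```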

The LSTM is another common variant, but there isn’t a consensus on when to use an LSTM vs a GRU. The GRU is simpler and computationally faster; the LSTM is more powerful and flexible since it has more gates.

LSTM Unit — Long Short Term Memory

Even more powerful than the GRU. It has 3 gates total.

Γu is the update gate.

Γf is the forget gate.

Γo is the output gate.
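
And a matching numpy sketch of one LSTM step with its three gates (again with hypothetical parameter names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, Wu, Wf, Wo, Wc, bu, bf, bo, bc):
    """One LSTM step with update, forget and output gates."""
    concat = np.concatenate([a_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                 # update gate
    gamma_f = sigmoid(Wf @ concat + bf)                 # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)                 # output gate
    c_candidate = np.tanh(Wc @ concat + bc)             # candidate memory value
    c_t = gamma_u * c_candidate + gamma_f * c_prev      # separate forget gate, unlike the GRU
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```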

Bidirectional RNN

BRNNs allow you to take information from both earlier and later in the sequence.

The green blocks with backward arrows represent the backward recurrent layer; these backward blocks are connected to each other. First the forward activations are computed all the way to the end of the sequence. Then the (green) backward activations are computed from the end back to the beginning. Only then can you make your predictions.

For example, to make a prediction y_hat<3> at time step 3, information from x<1> flows through forward a<1> to forward a<2> to forward a<3> and into y_hat<3>, so information from x<1>, x<2> and x<3> is all taken into account. Information from x<4> flows through backward a<4> to backward a<3> and into y_hat<3>.

Follow the yellow lines to see how the data flows from the example we described above when trying to predict y_hat<3>.

NOTE: These blocks don’t have to be just RNN blocks, they can also be LSTM or GRU blocks as well!

A BRNN with LSTM blocks seems to be used a lot in NLP. If you have an NLP problem where you have a complete sentence and want to label things in it, a Bidirectional Recurrent Neural Network with LSTM blocks is a reasonable first thing to try.
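
As a sketch of that “reasonable first thing to try”, here’s what a word-labeling model with Bidirectional LSTM blocks might look like in Keras (layer sizes here are illustrative assumptions):

```python
import tensorflow as tf

# Hypothetical sizes for a word-labeling task (e.g. Named Entity Recognition).
vocab_size, embedding_dim, num_tags = 10_000, 100, 9

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # The Bidirectional wrapper runs one LSTM forward and one backward over the sequence.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    # One softmax prediction per time step (per word).
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```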

Disadvantage

The disadvantage to BRNN is that you need the entire sequence of data before you can make predictions anywhere. So in speech recognition you’d need to wait for a person to stop talking before processing it.

Coursera Week 3

Machine Translation

The motivating example here is translating a French sentence (x) to an English sentence (y_hat). Let’s assume we have an English dictionary/vocabulary of 10,000 different words.

Jane visite l’Afrique en Septembre.

should translate to

Jane is visiting Africa in September.

Machine Translation (MT) can be posed as a “conditional language modeling problem” where the encoder’s output is fed as the input into a language model, instead of the usual zero vector initialization.

Notice how the Decoder portion looks just like the Language model above it.

But the major difference between MT and the earlier language model is that rather than generating a sentence at random, we want to find the best, most likely English translation.

And because the set of all possible English sentences is far too large to enumerate one by one, we’ll look into the Beam Search algorithm next.

Beam Search

Given an input French sentence, we don’t want to output a random English translation, we want to output the best/most likely translation. Beam Search is the most widely used algorithm to do this.

Step 1

The first thing Beam Search does is it tries to pick the first word of the English translation that it’s going to output, P(y_hat<1>|x).

It can consider multiple alternatives. There’s a parameter, B the beam width, that determines how many different choices it will consider. Let’s set B=3, so it will pick 3 top words and keep them in memory.

Let’s say in step 1 of the beam search algorithm the beam width is B=3. Run the input sentence through the Encoder network (green). The first step of the Decoder network (purple) has a softmax output over all 10,000 English vocabulary words, and it keeps the top 3 of those 10,000 outputs in memory.

Let’s assume that it picked “in”, “jane” and “september” as the top 3 most likely words.

Step 2

It will iterate over each of the 3 words “in”, “jane” and “september”. First it looks at “in” and tries to pick the best second word, from the English dictionary, to come after “in”.

It will assess all 10,000 vocabulary words again; however, this time it is trying to maximize the probability of the first and second words together, P(y<1>, y<2> | x). It will assess the probabilities of pairs starting with (“in”, “a”), then (“in”, “aaron”), and so on until the end of the vocabulary.

Notice that y_hat<1> is hard-set to “in” and fed into the next block, which tries to pick the most likely y_hat<2>, i.e. P(y<2> | x, “in”). Source.

We ultimately care about finding the pair of first and second words that is most likely (not just the first word alone). From the rules of conditional probability, we can express this as the probability of the first word multiplied by the probability of the second word given the first: P(y<1>, y<2>|x) = P(y<1>|x)·P(y<2>|x, y<1>).

It will repeat these exact steps for “jane” and then “september”.

This step considers 30,000 possibilities, because (beam width)(vocabulary size) = (3)(10,000) = 30,000.

You only need to instantiate 3 copies of the network to efficiently compare “in”, “jane” and “september”. No need to instantiate 30,000 copies of the network.

Step 3

It selected the top 3 most probable pairs and stored the probabilities in memory:

  • in september
  • jane is
  • jane visits

In step 3 it keeps going, trying to pick the most likely three-word combinations. See the image below.

NOTE that if you set the beam width B = 1, this reduces to greedy search, which does not work as well, as demonstrated in this previous video.

Watch here, at 7:20, for tips on how to choose B.

Length normalization is a refinement you can apply to improve beam search; it involves summing log probabilities and normalizing by the translation length raised to a power α (e.g. α = 0.7) [source: video].

NOTE that beam search is an approximate (heuristic) search algorithm, so it doesn’t always output the most likely translation; it can make mistakes.
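
Putting the steps above together, here’s a minimal sketch of beam search with length normalization. It assumes a hypothetical next_word_probs(x, prefix) function that returns the decoder’s softmax, {word: P(word | x, prefix)}, with nonzero probabilities; the eos token and alpha default are also assumptions.

```python
import math

def beam_search(x, next_word_probs, vocab, beam_width=3, max_len=30,
                eos="<eos>", alpha=0.7):
    """Approximate search for the most likely translation of input x."""
    beams = [([], 0.0)]                            # (prefix, sum of log probabilities)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, log_p in beams:
            probs = next_word_probs(x, prefix)     # softmax over the whole vocabulary
            for word in vocab:
                candidates.append((prefix + [word], log_p + math.log(probs[word])))
        # Keep only the top `beam_width` prefixes (3 in the example above).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, log_p in candidates[:beam_width]:
            if prefix[-1] == eos:
                completed.append((prefix, log_p))
            else:
                beams.append((prefix, log_p))
        if not beams:
            break
    completed.extend(beams)
    # Length normalization: divide by Ty^alpha so short sentences aren't unfairly favored.
    return max(completed, key=lambda c: c[1] / (len(c[0]) ** alpha))[0]
```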

Error Analysis in Beam Search

Let’s look at error analysis that helps us determine whether beam search made a mistake or whether our RNN model is to blame.

The short version explanation is:

  • Have a human translate the sentence, y*.
  • Case 1: If P(y*|x) > P(y_hat|x) then beam search is at fault.
  • Case 2: If P(y*|x) ≤ P(y_hat|x) then the RNN model is at fault.

For more detailed explanation watch this video.
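
A tiny sketch of that decision rule for a single dev-set example (you’d tally the two cases across many examples to decide what to work on):

```python
def attribute_error(p_human, p_beam):
    """p_human = P(y*|x) for the human translation y*;
    p_beam  = P(y_hat|x) for beam search's output y_hat."""
    if p_human > p_beam:
        return "beam search is at fault"    # the model prefers y*, but search missed it
    return "the RNN model is at fault"      # the model scores its own output at least as high as y*
```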

Bleu Score — Bilingual Evaluation Understudy

One of the challenges of machine translation is that, given a French sentence, there could be multiple English translations that are equally good translations of that French sentence. So how do you evaluate a machine translation system if there are multiple equally good answers?

If there are multiple great answers, how do you measure accuracy?

The way this is done conventionally is through something called the BLEU score.

The BLEU score is a popular automatic evaluation technique.

First you compute the n-gram precisions: for each n, the fraction of n-grams in the machine output that also appear in the reference translation.

Then calculate the brevity penalty by dividing the length of the output by the length of the reference, and take min(1, that ratio); this penalizes translations that are too short.

Finally, calculate the BLEU score by multiplying the n-gram precisions together, taking the 4th root (the geometric mean), and then multiplying by the brevity penalty.

Example:

Calculating the BLEU score. [Source]
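
Here’s a rough Python sketch of the simplified recipe described above, for a single reference translation (real BLEU implementations add refinements such as multiple references and an exponential brevity penalty):

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference (lists of words)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((cand_counts & ref_counts).values())    # n-grams that also occur in the reference
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    # Geometric mean of the n-gram precisions: multiply them and take the 4th root.
    geo_mean = 1.0
    for p in precisions:
        geo_mean *= p
    geo_mean **= 1.0 / max_n
    brevity = min(1.0, len(candidate) / len(reference))       # brevity penalty
    return brevity * geo_mean

# Example: candidate vs. a single reference translation.
print(bleu("jane visits africa in late september".split(),
           "jane visits africa in september".split()))
```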

Attention Model

It is very difficult for an MT system to memorize a very long sentence, and the BLEU score drops significantly as sentences get longer. The attention model looks at part of the sentence at a time, which helps keep the BLEU score stable.

The green line is what the Attention model achieves.

The question we’re trying to answer is: when generating each word of the (output) English translation, how much attention should the model pay to each part of the input sentence? It uses attention weights (α) to achieve this.
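
A minimal numpy sketch of how attention weights produce a context vector for one output step; the score function standing in for the small alignment network is a hypothetical placeholder:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, a_encoder, score):
    """s_prev:    previous decoder state s<t-1>
    a_encoder: encoder activations a<t'> for every input position t'
    score:     a (hypothetical) small network, score(s_prev, a_tp) -> scalar energy"""
    energies = np.array([score(s_prev, a_tp) for a_tp in a_encoder])
    alphas = softmax(energies)                     # attention weights, sum to 1
    context = sum(alpha * a_tp for alpha, a_tp in zip(alphas, a_encoder))
    return context, alphas                         # context feeds the decoder at step t
```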

Speech Recognition — Audio Data

One of the most exciting developments with sequence to sequence models has been the rise of very accurate speech recognition.

A common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram (a plot of frequency intensities over time), which is pretty similar to the pre-processing the human ear does.
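
For example, a spectrogram can be computed with scipy (the clip here is just random noise standing in for real audio, and the sampling rate is an assumption):

```python
import numpy as np
from scipy.signal import spectrogram

# Hypothetical 10-second clip at a 16 kHz sampling rate.
fs = 16_000
audio = np.random.randn(10 * fs)

# frequencies x time frames -> intensity: the usual input to a speech model.
freqs, times, intensity = spectrogram(audio, fs=fs)
print(intensity.shape)   # (n_frequency_bins, n_time_frames)
```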

In this example we want to transform audio data to a text transcript. We have an audio clip where someone is saying: “the quick brown fox”. The number of inputs being fed into the model is usually very large with audio data, especially compared to the much smaller output of words.

Let’s imagine this is a 10-second audio clip sampled at a frequency of 100 Hz; that would be 1,000 inputs being fed into the model (a BRNN built from LSTM or GRU blocks).

Connectionist temporal classification (CTC)

The CTC cost function’s basic rule is to collapse repeated characters that are not separated by a “blank” which I’ll denote as an underscore _. So the CTC cost function allows the RNN to generate an output like:

t t t _ h _ e e e _ _ _ q q q_ _ …

This ultimately ends up as “the quick brown fox”, which is only 19 characters long (including spaces). So a network with upwards of 1,000 outputs can still represent a 19-character transcript, because it is allowed to insert blanks and repeated characters that are then collapsed.
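
A minimal sketch of that collapsing rule (note the blank _ is distinct from an actual space character, which is part of the transcript):

```python
def ctc_collapse(output_chars, blank="_"):
    """Collapse a CTC-style output: merge repeated characters that are not
    separated by a blank, then drop the blanks."""
    collapsed = []
    prev = None
    for ch in output_chars:
        if ch != prev:              # repeats not separated by a blank collapse into one
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != blank)

# The space before "qqq" is a real space character, not the blank.
print(ctc_collapse("ttt_h_eee___ qqq"))   # -> "the q"
```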

https://www.coursera.org/learn/nlp-sequence-models/lecture/sjiUm/speech-recognition

Google Neural Machine Translation

https://www.inf.ed.ac.uk/teaching/courses/anlp/lectures/30/

Google Neural Machine Translation is an example of Interlingual translation.

https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html

Here’s how it works:

Train a system on a few language pairs, like English, Japanese and Korean, using the encoder-decoder architecture.

The encoder encodes your sentence into some hidden representation, and the decoder takes that hidden representation and decodes it into the target sentence.

The amazing thing is that if, for example, you take your encoder for Japanese and your decoder for Korean, it somehow works, even though the system has never seen Japanese-to-Korean translations. This is zero-shot translation.
