Language Processing with Recurrent Models

Bidirectional RNNs, Encoding, Word Embedding and Tips

Jake Batsuuri
Computronium Blog
8 min read · Mar 16, 2021


What's a Bidirectional RNN?

A bidirectional RNN is an RNN variant that can sometimes increase performance. It is especially useful for natural language processing tasks.

A bidirectional RNN uses two regular RNNs: one processes the sequence in its original order, the other processes it reversed, and their representations are then merged.

This method doesn’t work as well for time series data, because chronological order carries real meaning there. For example, it makes sense that more recent events should have more weight in predicting what happens next.

Whereas in language-related problems, it’s clear that the direction of “cat in the hat” versus “tah eht ni tac” carries no deeper meaning: “tah” and “hat” both refer to the same object. Hopefully it’s easy to see that reversing an image of a cat, or flipping it upside down, still gives you an image of a cat.

It’s also kind of funny when we talk about palindromes, like “Bob” or “racecar”.

When we run the same model on the reversed text, we get very similar results in terms of accuracy. That’s great to see, but what’s even cooler is that the model accomplishes the task by learning very different representations than the one trained on the forward text.

Thankfully there’s a dedicated layer object that creates the second instance, reverses the data, trains it, and merges the representations for us, so we don’t have to write that code ourselves.
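In Keras this is the Bidirectional wrapper. A minimal sketch (the layer sizes here are illustrative assumptions, not a prescribed model):

```python
from tensorflow.keras import layers, models

# The Bidirectional wrapper creates a second copy of the inner layer, feeds it
# the reversed sequence, and merges the two outputs (concatenation by default).
model = models.Sequential([
    layers.Embedding(10000, 32),           # vocabulary of 10,000 tokens, 32-dim embeddings
    layers.Bidirectional(layers.LSTM(32)), # forward LSTM + backward LSTM, merged
    layers.Dense(1, activation='sigmoid')  # binary classification head
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```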

And with some regularization this model can approach 90% accuracy, which is awesome.

Let’s do a simple text processing example:

Our model can only work with numeric data, so we first need to convert our text data into vectors and tensors. We can do this at three different levels:

  • Character level
  • Word level
  • N-gram level

At whichever level we choose, we take each unit and assign it a unique vector. That numeric vector now encodes the unit, and we can encode and decode between the two. Each such unit is called a token, and the process is called tokenization.

For example, a large corpus in English will contain the 26 letters of the alphabet. You can build a frequency table for the characters. Each of the 26 characters is now a token.

At the word level, the same corpus may have thousands of words. Common words like “the” and “in” will occur many times, but nevertheless we encode every occurrence as the same vector.

At the n-gram level, with n = 2, we create a two-word phrase from every consecutive pair of words. From these we can again build a frequency table, and some bigrams will occur more than once. We encode each unique bigram as a token and assign it a numeric vector. The frequency table isn’t important here; it just illustrates the nature of the data.
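As a quick illustration, here is a rough sketch of building bigrams and their frequency table in plain Python (the sentence is made up):

```python
from collections import Counter

text = "the cat in the hat sat on the mat"
words = text.split()

# Every consecutive pair of words is a bigram; count how often each occurs.
bigrams = list(zip(words, words[1:]))
frequencies = Counter(bigrams)

# Assign each unique bigram an integer token id.
token_index = {bigram: i for i, bigram in enumerate(frequencies)}
```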

Once we have decided on the abstraction level (characters, words, n-grams) and completed tokenization, we can decide how to vectorize the tokens. We can either:

  • One hot encode
  • Token embed

For one-hot encoding, we simply count all the unique words in the text, call this N, and assign each word a unique integer below N. As long as there are no collisions, we’re good. We can do this at the word and n-gram levels too.

At the word level, you can naively implement it yourself, or use the prebuilt Keras methods to do it:
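A naive word-level implementation might look roughly like this (the two sample sentences are made up):

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a word -> index mapping, reserving index 0.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# One-hot tensor of shape (samples, max sequence length, vocabulary size + 1).
max_length = 10
results = np.zeros((len(samples), max_length, len(token_index) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.
```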

In Keras:
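Something along these lines, using the built-in Tokenizer on the same toy samples:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Keep only the 1,000 most frequent words.
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)                  # lists of integer indices
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
```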

One additional thing to consider: sometimes we cut off the lower tail of the word distribution and, say, take only the 1,000 most frequent words (that’s what the num_words argument above does), because this saves us compute time.

To decode a one-hot encoded index:
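We can simply invert the tokenizer’s word_index dictionary (this assumes the tokenizer and sequences from the snippet above):

```python
# Map indices back to words so we can decode a sequence of token ids.
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}

decoded = ' '.join(reverse_word_index.get(i, '?') for i in sequences[0])
print(decoded)  # e.g. "the cat sat on the mat"
```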

With word embeddings, instead of our vectors looking like [0, 0, 0, … 1, 0], we create vectors that look more like [0.243, 0.33454, … 0.5553].

While a one-hot encoded vector might have a size of 1,000, embedded vectors can be much, much smaller.

How do we learn these fractional elements of the vector though?

We can learn them at the same time as our main task, on the data that we have, or we can use pretrained word embeddings. What’s nice about the embeddings is that they learn the meanings of words.

How do we know this?

Remember that vectors can be mapped to a geometric space. If you plot the embedded word vectors in that space, you start to see geometric relations between related words.

Why is it theoretically better to train the word embeddings on your own training data, or in a context that is closer to the task you have at hand?

Well, languages aren’t isomorphic: English and Russian don’t have the same mappings. Features that exist in one language may not exist at all in the other.

Furthermore, two English speakers might not agree on the definition of a word, and therefore on that word’s semantic relationship to other words.

Even further, the same person might use a word differently in different contexts. So context matters a lot in semantics.

Let’s embed some words:
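In Keras this is the Embedding layer; a minimal sketch:

```python
from tensorflow.keras.layers import Embedding

# 1,000 possible tokens (vocabulary size), each mapped to a 64-dimensional vector.
embedding_layer = Embedding(1000, 64)
```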

The 1000 and the 64 signify, roughly, how big your one-hot vectors would have been and how big they are now. One-hot encoding is like a digital signal, while word embeddings are more like a continuous, analog signal, except that the embedding is trained to take on those continuous values and is then frozen and used as the unique signal for each token.

We can just use the word embeddings with a dense classifier on top and see what kind of accuracy we get:
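A sketch of such a model, assuming the IMDB review dataset that ships with Keras and a few illustrative hyperparameters (10,000-word vocabulary, reviews cut to 20 words, 8-dimensional embeddings):

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

max_features = 10000   # vocabulary size
maxlen = 20            # keep only the first 20 words of each review

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

model = Sequential([
    Input(shape=(maxlen,), dtype='int32'),
    Embedding(max_features, 8),            # learn 8-dim embeddings jointly with the task
    Flatten(),                             # (samples, maxlen * 8)
    Dense(1, activation='sigmoid')         # binary sentiment classifier
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=32, validation_split=0.2)
```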

This gives us 76% accuracy. Not bad.

What if we used pretrained word embeddings?

Before we do that, we need to get the raw texts and their labels:
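A sketch, assuming the raw IMDB reviews have been downloaded and unpacked into an aclImdb/train directory with pos and neg subfolders (the directory layout is an assumption):

```python
import os

imdb_dir = 'aclImdb'                       # assumed location of the raw dataset
train_dir = os.path.join(imdb_dir, 'train')

texts, labels = [], []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)
```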

Word2vec is one of the first and most successful pretrained word embeddings. Another great one is GloVe. Next we vectorize our text so that we can plug in the pretrained GloVe embeddings:
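Roughly like this, cutting reviews to 100 words, keeping only 200 training samples (the figure mentioned at the end of the article) and the 10,000 most common words; the exact cutoffs are assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 100               # cut reviews after 100 words
training_samples = 200     # train on only 200 samples
validation_samples = 10000
max_words = 10000          # consider only the top 10,000 words

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)

# Shuffle, because the raw samples are ordered (all negative, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:training_samples + validation_samples]
y_val = labels[training_samples:training_samples + validation_samples]
```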

We need to manually download the GloVe embeddings here, then:
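A sketch of parsing the downloaded file into a dictionary of word vectors (this assumes the 100-dimensional glove.6B.100d.txt file sitting in a glove.6B directory):

```python
import os
import numpy as np

glove_dir = 'glove.6B'     # assumed download location

embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
```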

To illustrate some points we first define our model:
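Something like the following, reusing maxlen and max_words from the vectorization sketch and using 100-dimensional embeddings to match the GloVe file:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

embedding_dim = 100

model = Sequential([
    Input(shape=(maxlen,), dtype='int32'),
    Embedding(max_words, embedding_dim),   # will receive the GloVe weights below
    Flatten(),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.summary()
```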

We need to load the GloVe vectors into the embedding layer and then freeze it, because we don’t want training to mess with their nice pretrained structure. To do that, we first build an embedding matrix that maps each word index from our tokenizer to its GloVe vector:
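Roughly like this, using word_index, embeddings_index, max_words, embedding_dim, and model from the earlier sketches:

```python
import numpy as np

# Build a (max_words, embedding_dim) matrix: row i holds the GloVe vector
# for the word with index i in our tokenizer. Words missing from GloVe stay zero.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# Load the pretrained vectors into the Embedding layer (layers[0]) and freeze it.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
```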

Now we are ready to train:
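A sketch of compiling and fitting, using the training and validation splits prepared earlier:

```python
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_val, y_val))
```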

The validation accuracy reaches about 50%, which we can improve with an LSTM or GRU, and maybe even by fine-tuning the word embeddings after the LSTM has been trained. Remember, we can do this by unfreezing layers, just like the last layers of a convolutional model.

Finally, remember that we picked only 200 training samples. That’s too few.

Other Articles

Up Next…

Coming up next is probably LSTM Text Generation. If you would like me to write another article explaining a topic in depth, please leave a comment.

For the table of contents and more content click here.
