Sampling Strategies for Recurrent Neural Networks
Recurrent Neural Networks are currently among the most powerful Machine Learning models. They are the method behind many advances in speech recognition, machine translation and natural language processing.
What makes Recurrent Neural Networks great is that they can not only make predictions, you can also use them to generate sequences. (For example, check my Python Code Prediction API or this nice blog post by Andrej Karpathy.)
A Recurrent Neural Network learns to predict the next word or character given some input. At a high level, this works like in the following picture:
In the example, the word **The** is the most likely first word. Then, given **The**, the word **Cat** is most likely. Then, given **The Cat**, the word **Is** is most likely, and so on.
However, the output of an RNN is a probability distribution over all words, not a single word.
When generating text we choose one of the words ourselves, given those probabilities, and feed it back into the network. This is called sampling.
The simplest form of sampling is greedy search. At each step, it picks the single most likely next word. For example The ⇾ The Cat ⇾ The Cat Is ⇾ The Cat Is Home. You’ll see it in the next picture:
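The greedy loop above can be sketched in a few lines of Python. The `next_word_probs` function below is a hypothetical stand-in for the RNN’s softmax output, using a made-up toy distribution instead of a trained model:

```python
def next_word_probs(prefix):
    # Toy "model": maps a prefix to a next-word distribution.
    # A real RNN would compute this from its hidden state.
    table = {
        (): {"The": 0.8, "A": 0.2},
        ("The",): {"Cat": 0.9, "Dog": 0.1},
        ("The", "Cat"): {"Is": 0.5, "The": 0.3, "Sat": 0.2},
        ("The", "Cat", "Is"): {"Home": 0.7, "Away": 0.3},
    }
    return table.get(tuple(prefix), {"<end>": 1.0})

def greedy_search(max_len=10):
    words = []
    for _ in range(max_len):
        probs = next_word_probs(words)
        word = max(probs, key=probs.get)  # always take the argmax
        if word == "<end>":
            break
        words.append(word)
    return words

print(greedy_search())  # ['The', 'Cat', 'Is', 'Home']
```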
However, this method has a few drawbacks. First, it generates only one sample; if you would like more samples to choose from: bummer! Furthermore, it cannot look ahead: it could, for example, have created The Cat The Cat instead of The Cat Is Home, because The had the highest probability after The Cat. But it couldn’t see that The Cat Is Home was better than anything it could create after The Cat The. In a real-world example you can get something like this as output:
and the senstic and the some the senses and the some and the senses and and the sense the such and the such there as a sore and all the some and there and and and and and the senses of the such the sence of the some and and the some and and the such and the sense and the some the more and the some and the sinces with the socrence and the some and a concertions and the some and as
Not very original, is it? Note the repetitions of, for example, the word **and**, and the many occurrences of **some**, **such** and **senses**.
A popular way to avoid this problem is beam search. It works like this: instead of selecting only the single best next word, it keeps the k best partial sequences at each step. k is called the beam width of this method.
This could be an example of a beam search:
As you can see, it can generate multiple samples, and the best of those are often better than the sample created using greedy search.
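A minimal beam search might look like the sketch below. It is self-contained: `next_word_probs` is a hypothetical toy model (the same kind of made-up distribution as before, not real RNN output), and log probabilities are summed to avoid numerical underflow on longer sequences:

```python
import heapq
import math

def next_word_probs(prefix):
    # Toy "model": maps a prefix to a next-word distribution.
    table = {
        (): {"The": 0.8, "A": 0.2},
        ("The",): {"Cat": 0.9, "Dog": 0.1},
        ("The", "Cat"): {"Is": 0.5, "The": 0.3, "Sat": 0.2},
        ("The", "Cat", "Is"): {"Home": 0.7, "Away": 0.3},
    }
    return table.get(tuple(prefix), {"<end>": 1.0})

def beam_search(k=2, max_len=5):
    beams = [(0.0, [])]  # (log probability, word sequence)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words and words[-1] == "<end>":
                candidates.append((logp, words))  # keep finished sequences
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((logp + math.log(p), words + [word]))
        # Keep only the k highest-scoring sequences.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return [(words, math.exp(logp)) for logp, words in beams]

for words, p in beam_search(k=2):
    print(" ".join(w for w in words if w != "<end>"), round(p, 3))
```

With k=2 this returns the two best sequences it found, ranked by probability, instead of the single greedy result.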
Now on to random sampling. Random sampling works by selecting the next word according to the probabilities from the output of the model (the probability distribution). For example: if the word The has probability 0.8, then 8 out of 10 times the next word will be The, and 2 out of 10 times it will generate another word. Look at this picture for example:
The nice thing about random sampling is that the results are very diverse. However, often the results are too diverse, they look too random. To control this randomness, we increase the probability of the most probable words, and decrease the probabilities of the less probable ones, using temperature. The higher the temperature, the more diverse the results it gives, and vice versa.
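One common way to apply temperature is to divide the model’s logits (the pre-softmax scores) by a temperature T before the softmax: T below 1 sharpens the distribution toward the most probable words, T above 1 flattens it toward uniform. A minimal sketch, with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical pre-softmax scores
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At T=0.5 the top word dominates (closer to greedy search), at T=2.0 the probabilities move closer together (closer to uniform random sampling).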
I walked you through some popular sampling strategies for Recurrent Neural Networks. It would be interesting to see whether there are better techniques than these commonly used methods. Probably we are currently missing some part and also need to learn a grader, or something like Generative Adversarial Networks. Who knows, maybe I’ll write about it in a next article!