Take a peek at Deep Reinforcement Learning for NLP

Aditya Tiwari · Published in Analytics Vidhya · Sep 27, 2019

This blog focuses on developing a basic understanding of deep neural network architectures designed to handle state and action spaces characterized by natural language. The experimental results discussed here come from the Microsoft Research paper on Deep Reinforcement Learning with a Natural Language Action Space.

What is Reinforcement Learning?

As always, this blog will start from scratch: we'll begin by understanding what Reinforcement Learning is.

Reinforcement Learning is an area of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

This is just how humans learn, how we figure out what is right and what is wrong. Take a newborn baby: he has no clue what is going on or what to do; he might burn down the whole house and think it's fun, if only he could :p. As he grows, his parents stop him from doing things that shouldn't be done. He receives negative feedback from his mother when he pees on the couch, and positive feedback when he greets the guests. That is what reinforcement learning is!

The kid is the agent and his parents are the environment. The agent performs a task and receives a corresponding reward (either negative or positive). The idea is to train machines so that they think and behave like a human.

Reinforcement Learning allows the machine or software agent to learn its behavior based on feedback from the environment. This behavior can be learned once and for all, or it can keep adapting as time goes by. If the problem is modeled with care, some Reinforcement Learning algorithms can converge to the global optimum; this is the ideal behavior that maximizes the reward.

Reinforcement learning model (a minimal interface sketch follows this list):

  • a set of environment states, S
  • a set of actions, A
  • rules of transition between states
  • rules that determine the immediate reward for a state transition
  • rules that describe what the agent observes
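
To make these ingredients concrete, here is a minimal environment interface sketch in Python; the class and method names are illustrative assumptions rather than any particular library's API.

```python
# Minimal environment interface sketch: each method corresponds to one of the
# ingredients listed above. Names and signatures are illustrative assumptions.
from typing import Any, Iterable, Tuple

class Environment:
    def reset(self) -> Any:
        """Return an initial state drawn from the environment state set S."""
        ...

    def actions(self, state: Any) -> Iterable[Any]:
        """Return the actions from the action set A available in this state."""
        ...

    def step(self, action: Any) -> Tuple[Any, float, bool]:
        """Apply the transition and reward rules: return what the agent
        observes next, the immediate reward, and whether the episode ended."""
        ...
```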

Q-Learning:

  • It is used to learn the policy in Reinforcement Learning.
  • Policy: a rule that the agent follows to select actions given the current state.
  • Q-Learning: finds the optimal policy for the decision process.
  • Approach: learn an action-value function, a.k.a. the Q-function, which (once training converges) gives the expected utility of taking an action in a given state.
  • Q-function Q(s, a): returns the Q-value for action a in state s.

Q-Value:

In reinforcement learning, the Q-value Q(s, a) is the expected cumulative (discounted) reward the agent will collect by taking action a in state s and acting well afterwards. Q-learning estimates this value from experience with the update rule

Q(s, a) ← Q(s, a) + α [ r + γ max over a' of Q(s', a') − Q(s, a) ]

where r is the immediate reward, s' is the next state, α is the learning rate, and γ is the discount factor. A minimal tabular sketch of this update follows below.
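
Here is a minimal tabular Q-learning sketch in Python implementing this update; the hyperparameter values and the shape of the environment interaction are assumptions for illustration, not taken from the paper.

```python
# Minimal tabular Q-learning sketch. Hyperparameters are illustrative.
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy selection over the currently available actions."""
    actions = list(actions)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """One Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```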

I am pretty sure some of you have heard of AlphaGo, or of bots beating world champions at games. Much of that success is built on deep reinforcement learning; one of its landmark algorithms is the Deep Q-Network (DQN).

Now, the thing to understand is that when training an agent for games like Go and chess, the agent has a very small action space but a very large state space. For example, in chess the state space covers all configurations of the 8x8 board, yet the set of legal moves in any position is small (a pawn, for instance, can only move in one or two ways). In this setting the Deep Q-Network has proven very effective.
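
To see why a small, fixed action space matters, here is a minimal DQN sketch (using PyTorch; the layer sizes and dimensions are illustrative assumptions): the network maps a state to one Q-value per discrete action, which only works when the action set is small and enumerable.

```python
# Minimal DQN sketch: one output unit per discrete action.
# Layer sizes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)

# Action selection is an argmax over the fixed, enumerable action set:
# q_values = dqn(state); action = q_values.argmax(dim=-1)
```

When every possible action is simply one column of the output layer, an unbounded set of natural-language actions clearly cannot be represented this way, which is the problem the next sections address.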

But what if we talk about Reinforcement Learning for language understanding?

Reinforcement Learning for language understanding

Sequential decision-making problem for text understanding:

  • E.g., conversation, task completion, text-based games…
  • The agent observes the state as a string of text at time t, e.g., the state-text s(t).
  • The agent also knows a set of possible actions, each described as a string of text, i.e., the action-texts.
  • The agent tries to understand the "state text" and all possible "action texts", and takes the right action, where "right" means maximizing the long-term reward.
  • Then the environment transitions to a new state and the agent receives an immediate reward (a minimal sketch of this loop follows below).
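
Here is a minimal sketch of that interaction loop in Python; `env` and `q_relevance` are hypothetical placeholders for a text-game environment and for a model that scores a (state text, action text) pair, such as the DRRN described below.

```python
# Minimal interaction loop for a text-based environment (hypothetical API).
def play_episode(env, q_relevance):
    state_text, action_texts = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Score every candidate action text against the current state text and
        # take the one with the highest estimated long-term reward.
        action = max(action_texts, key=lambda a: q_relevance(state_text, a))
        state_text, action_texts, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```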

Unbounded action space in RL for NLP:

Not only is the state space huge, the action space is huge too: an action is characterized by an unbounded natural-language description. For example, if I say to the model "Hey! how are you doing? I was just waiting for F.R.I.E.N.D.S to stream but the power is gone", this input text is the state for the model (quite heavy already), and the action space is every possible text reply (effectively infinite). Such a huge action space remained a problem for the Deep Q-Network, which needs one output per discrete action. This is why the Deep Reinforcement Relevance Network (DRRN) was proposed.

Deep Reinforcement Relevance Network(DRRN):

The idea of the DRRN is to project both the state and the action into a continuous space (as vectors). The Q-function is then a relevance (interaction) function of the state vector and the action vector, for example their inner product.
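
Below is a minimal DRRN-style sketch (using PyTorch, with assumed bag-of-words text inputs and a 100-dimensional hidden layer, the size used in the paper's experiments): two separate networks embed the state text and each candidate action text into the same space, and Q(s, a) is the inner product of the two embeddings.

```python
# Minimal DRRN-style sketch: separate state and action embedding networks,
# with Q(s, a) given by the inner product of their outputs.
# The bag-of-words input representation and layer sizes are assumptions.
import torch
import torch.nn as nn

class DRRN(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 100):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh())
        self.action_net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh())

    def forward(self, state_vec: torch.Tensor, action_vecs: torch.Tensor) -> torch.Tensor:
        """state_vec: (vocab,); action_vecs: (num_actions, vocab) -> one Q-value per action."""
        s = self.state_net(state_vec)      # (hidden,)
        a = self.action_net(action_vecs)   # (num_actions, hidden)
        return a @ s                       # inner-product relevance: (num_actions,)
```

Because each action is scored by its relevance to the state embedding rather than by a dedicated output unit, the same network can handle action sets that change from step to step and actions it has never seen before.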

The paper's Figure 2 illustrates learning with an inner-product interaction function (an application of the DRRN). The authors used Principal Component Analysis (PCA) to project the 100-dimensional last-hidden-layer representation (before the inner product) onto a 2-D plane. The vector embeddings start with small values, and after 600 episodes of experience-replay training they are already very close to the converged embeddings (4000 episodes). The embedding vector of the optimal action (Action 1) converges to a positive inner product with the state embedding vector, while Action 2 converges to a negative inner product.

The paper's learning curves compare the different models, with the dimension of the hidden layers in the DQNs and the DRRN all set to 100. After around 4000 episodes of experience-replay training, all methods converge. The DRRN converges much faster than the other three baselines and achieves a higher average reward. The authors hypothesize this is because the DRRN architecture is better at capturing the relevance between state text and action text. The faster convergence on "Saving John" may be due to its smaller observation space and/or the deterministic nature of its state transitions.

Things we discussed:

  • We discussed Reinforcement Learning and how the Deep Q-Network (DQN) performs extremely well in tasks with a small action space (e.g., board games).
  • Why and how Deep Reinforcement Learning for NLP (e.g., text-based games) differs from a regular game with a small action space.
  • We discussed the unbounded action space in Reinforcement Learning for NLP and how the Deep Reinforcement Relevance Network (DRRN) converged faster in the experiments on two text-based games (Saving John and Machine of Death).
