LeDeepChef 👨🏻‍🍳 — Deep Reinforcement Learning Agent for Families of Text-based Games
We developed a deep reinforcement learning agent that generalizes across multiple text-based games sharing a common theme. The agent participated in Microsoft Research’s First TextWorld Problems: A Language and Reinforcement Learning Challenge, achieved the high score 🥇 on the (hidden) validation set, and beat all but one competitor 🥈 on the final test set.
What are text-based games?
Text-based games (TBGs) are computer games played in the terminal where the sole modality of interaction is text. In an iterative process, the player issues commands in natural language and, in return, receives a textual response that (partially) describes the environment.
Games like Infocom’s Zork were hugely popular in the 1980s. Today, this type of game is less hyped among gamers but of great interest in AI research for advancing the state of reinforcement learning (RL) agents. TBGs lie at the promising, yet relatively unexplored, intersection of RL and natural language processing (NLP). Due to their limited vocabulary and relatively restricted setting, they are considered a useful proxy for learning real-world open-dialogue agents.
I like to see them as the Atari games for NLP: much like Atari games, Go, and StarCraft, solving them was never DeepMind’s end goal but was always considered a stepping-stone toward developing a more general AI. The idea shared by research on all of these games is to come up with RL agents that perform well in a restricted world and then gradually move to more complex environments. This is precisely why TBGs could prove important for NLP research.
First TextWorld Problems
Even though games like Zork are fun to play, they are not ideal as a test-bed for modern RL agents. Why is that? Mainly because it’s always the same game. There is no variation in how the rooms are connected or in the task to complete. Moreover, to be interesting for a human player, the task at hand requires complex problem-solving abilities and a good amount of reasoning.
A more promising setting, from a research perspective, is a framework in which an agent learns a specific skill and then demonstrably generalizes it to an unseen environment. This is much closer to human skill acquisition: you probably learned to prepare pasta in your kitchen at home, but would have no trouble transferring this skill and cooking pasta at your friend’s house.
With Microsoft’s TextWorld framework, it is easy to construct such families of TBGs. They share the same theme and similar goals but differ enough that the agent cannot simply learn the steps by heart; it must generalize across multiple scenarios. The First TextWorld Problems: A Language and Reinforcement Learning Challenge was designed specifically to foster research at the intersection of RL and NLP with agents that generalize across a whole family of similar TBGs. The following video gives an overview of the competition.
The competition is based on TBGs from the TextWorld framework that all share a similar theme, namely cooking in a modern house environment. The agent is placed in a random initial room and is asked to find the cookbook, read it, and then prepare the specified recipe.
The recipe contains several ingredients, which need to be collected from the rooms, and directions, which need to be executed to prepare the meal.
Throughout the game, there are multiple obstacles to overcome, including navigating through different connected rooms, dealing with closed doors and objects, finding the right tools, e.g., a knife or a stove, as well as collecting the right ingredients.
Below you see the purest example of a game from the competition. The basic steps are:
- Find the cookbook & read the recipe.
- Find & take all the necessary ingredients (+ tools).
- Execute the missing recipe directions with the right tools.
If, by now, you are curious to play it yourself, go ahead and try it: https://www.microsoft.com/en-us/research/project/textworld/#!try-it.
Now that we have an intuition about the problem we’re facing, let’s cut to the chase — how did we build an RL agent to solve TBGs?
First of all, we separate command generation from command ranking. Our system has two parts: the first is a model that generates a set of reasonable commands (on the order of 3–15 commands) given the context at any step, and the second, the agent, is trained to rank the presented commands by their expected future reward. Let’s first look into the details of the agent.
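To make the split concrete, here is a minimal, self-contained sketch of the two-part pipeline. All names, the fixed candidate list, and the toy word-overlap scoring function are hypothetical; the real generator and ranking model are the neural components described in the paper.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class CommandGenerator:
    """Hypothetical stand-in for the command-generation module:
    given the current context, it proposes a small set of
    reasonable commands (3-15 in the real system)."""
    def propose(self, context):
        # In the real system these come from context-derived features
        # and high-level grouped actions; here they are fixed.
        return ["go north", "open fridge", "take knife"]

class RankingAgent:
    """Hypothetical ranking agent: scores each candidate command
    and samples one according to the resulting policy."""
    def score(self, context, command):
        # Toy score: word overlap between context and command.
        # The real model is a neural network over encoded features.
        return float(len(set(context.split()) & set(command.split())))

    def act(self, context, commands):
        probs = softmax([self.score(context, c) for c in commands])
        return random.choices(commands, weights=probs, k=1)[0]

generator = CommandGenerator()
agent = RankingAgent()
context = "you are in the kitchen, there is a fridge and a knife"
commands = generator.propose(context)
chosen = agent.act(context, commands)
```

Sampling from the ranking (rather than always taking the arg-max) is what lets an actor-critic method explore during training.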
We train an agent to select, at every step in the game, the most promising command (in terms of discounted future reward) from a list of possible commands, given the observed context. The observed context is a proxy for the (non-observable) game state, which we construct from a set of purely textual features that are either direct responses from the environment or derived features. A detailed list of all eight input features can be found in the paper.
From the list of textual input features (Observation, …, Location) and the list of reasonable commands, we build a model that (i) scores the current game’s state and (ii) ranks the set of commands. The following figure illustrates the architectural design of the model.
Again, a detailed explanation of the individual steps is out of scope for this post and can be found in the paper. For now, we can treat the model as a black box that, given a set of commands (of varying length), outputs a probability of execution for each of them and evaluates the overall state of the game.
If you want to see the agent in action, I highly recommend you check out the notebook in my FirstTextWorldProblems repository on GitHub. There, you can play through a game and see the ranking done by the agent at every step:
Training with Actor-Critic
The training signal for our model comes in the form of an increased score upon completion of a (sub-)task. Not every action results in an immediate reward, so we face the problem of long-term credit assignment. To cope with it, we use an online actor-critic algorithm that computes the reward at time step t over a session of length T using the temporal-difference method.
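As a rough illustration of temporal-difference credit assignment (not the exact formulation from the paper), one-step TD advantages over a session can be computed from per-step rewards and the critic's value estimates; both inputs here are assumed:

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step temporal-difference advantages over a session:
    A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` holds V(s_0)..V(s_T); the last entry bootstraps
    the final step."""
    T = len(rewards)
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(T)]

# Example: a session of length 3 with a single delayed reward at the end.
adv = td_advantages(rewards=[0.0, 0.0, 1.0],
                    values=[0.5, 0.5, 0.5, 0.0])
```

Even though only the last step pays out, earlier steps receive a (small) learning signal through the critic's value estimates, which is exactly what mitigates the long-term credit-assignment problem.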
If you want to learn more about how actor-critic learning works, I recommend the excellent blog post by Chris Yoon.
Let me summarize the RL training approach at a high level. The objective to optimize is a linear combination of three individual terms, namely the policy loss, value loss, and entropy loss. The policy term updates the weights of the actor (the parameters involved in computing the command rankings); it encourages (penalizes) the current policy if it led to a better (worse) than “average” reward. The value term, loosely speaking, tries to drive the game’s state value, predicted by the model, close to the discounted “long-term” reward. Finally, the entropy loss encourages exploration by penalizing a low-entropy ranking of the commands through the agent.
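A minimal sketch of such a combined objective for a single step, with illustrative coefficient values; the exact weighting and formulation in our system may differ:

```python
import math

def a2c_loss(log_prob, advantage, value, ret, probs,
             value_coef=0.5, entropy_coef=0.01):
    """Linear combination of the three actor-critic loss terms.
    log_prob:  log-probability of the chosen command under the policy.
    advantage: better/worse-than-average signal (treated as a constant).
    value, ret: predicted state value and discounted long-term reward.
    probs:     full command distribution, used for the entropy bonus."""
    policy_loss = -log_prob * advantage       # reinforce good choices
    value_loss = (ret - value) ** 2           # regress value to return
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Subtracting the entropy term rewards high-entropy (exploratory)
    # rankings; entropy_coef trades exploration against exploitation.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

loss = a2c_loss(math.log(0.5), advantage=1.0, value=0.2, ret=1.0,
                probs=[0.5, 0.3, 0.2])
```

Note that, all else equal, a more peaked (low-entropy) command distribution yields a higher loss, which is what drives the agent to keep exploring.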
One of the major challenges in TBGs is the construction of possible — or rather reasonable — commands in any given situation. Due to the combinatorial nature of the actions, the size of the search space is vast. Thus, brute-force learning approaches are infeasible, and RL optimization is extremely difficult. To solve this problem, we draw on ideas from hierarchical RL and effectively reduce the size of the action space by combining multiple actions into “high-level” actions. Moreover, we train a helper model specialized in predicting the remaining cooking actions needed to complete the recipe. To make this module resilient to unseen recipes and ingredients, we train it on a dataset augmented with the most popular food items from the Freebase database.
One crucial part of the agent is determining which tasks it still needs to perform, i.e., which cooking directions remain. To this end, we train a model that, given the recipe instructions and the agent’s inventory, predicts which actions to perform. The image on the left illustrates the process. Each recipe direction, as well as the whole inventory, is encoded using a GRU. We then concatenate the respective final hidden states and use an MLP to predict the ‘likelihood’ of the action.
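A PyTorch-style sketch of this encoder; the vocabulary, embedding, and hidden sizes are illustrative, not the hyperparameters from the paper:

```python
import torch
import torch.nn as nn

class DirectionPredictor(nn.Module):
    """Sketch of the recipe helper model: one GRU encodes a recipe
    direction, another encodes the inventory; the concatenated final
    hidden states feed an MLP that outputs the 'likelihood' that the
    direction still needs to be executed."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.direction_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.inventory_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, direction_ids, inventory_ids):
        # The final hidden state of each GRU summarizes its sequence.
        _, h_dir = self.direction_gru(self.embed(direction_ids))
        _, h_inv = self.inventory_gru(self.embed(inventory_ids))
        features = torch.cat([h_dir[-1], h_inv[-1]], dim=-1)
        return torch.sigmoid(self.mlp(features)).squeeze(-1)

model = DirectionPredictor()
direction = torch.randint(0, 100, (1, 5))   # token ids of one direction
inventory = torch.randint(0, 100, (1, 8))   # token ids of the inventory
likelihood = model(direction, inventory)
```

Running the same direction encoder over every recipe line, with the shared inventory encoding, yields one likelihood per direction.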
Moreover, we train an additional model that categorizes, for every ingredient, whether or not it is necessary for the successful completion of the task and still needs to be picked up. We use this predicted information to reduce the size of the action space by grouping commands. For example, instead of individual pick-up commands like take red hot pepper or take water, we present the agent with the ‘high-level’ command take all required items, and, when it is chosen, execute both take commands. This approach makes the agent operate on a higher level of abstraction, makes it more resilient to unseen ingredients, and massively shortens exploration time.
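A small sketch of this grouping step; the function name and its inputs are hypothetical, with `required` standing in for the helper model's predictions:

```python
def group_take_commands(required, inventory, visible):
    """Collapse individual 'take <item>' commands into one high-level
    'take all required items' action. `required` is the set of
    ingredients predicted as necessary and not yet collected."""
    missing = [item for item in sorted(required)
               if item not in inventory and item in visible]
    if not missing:
        return {}
    # The agent sees a single abstract command; choosing it expands
    # into the underlying low-level take commands.
    return {"take all required items": [f"take {item}" for item in missing]}

expanded = group_take_commands(
    required={"red hot pepper", "water", "flour"},
    inventory={"flour"},
    visible={"red hot pepper", "water", "knife"},
)
```

The agent thus ranks one command where it previously had to rank (and explore) one per ingredient, which is where the reduction in action-space size comes from.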
While this explains the central concept of how we generate the commands, we refer the interested reader to the main paper for details.
In the TextWorld challenge, our agent was evaluated against more than 20 competitors. First, on a validation set of more than 200 unseen games. Here our agent achieved the highest score.
Then, after the end of the competition, it was evaluated on a more extensive and harder unseen set of games, where our agent came in 2nd.
Moreover, in our paper, we show how using previously proposed methods for TBGs ‘out-of-the-box’ does not work for this new task. The change in environment, as well as unseen ingredients and tasks, leads to the inferior performance of baselines like LSTM-DQN or DRRN. Again, for more extensive experimental results and a more thorough comparison, we refer to the main paper.
In my opinion, Microsoft’s TextWorld challenge was a massive success, as it raised awareness of the underexplored research challenge of solving families of text-based games. Moreover, it led to multiple agents improving upon standard baseline methods.
Bringing RL methods to NLP tasks holds great promise for the future, and the development of agents that generalize to unfamiliar environments is an essential step on this path.
With our agent, we showed how to cope with huge action-spaces and designed a model that learns to generalize to never-before-seen games of the same family. To achieve this result, we designed a model that effectively ranks a set of commands based on the context and context-derived features. By incorporating ideas from hierarchical RL, we significantly reduced the size of the action-space and were able to train the agent through an actor-critic approach.