# Reinforcement Learning

Today’s blog post is about Reinforcement Learning (RL), a concept that is very relevant to Artificial General Intelligence. The goal of RL is to create an agent that can learn to behave optimally in an environment by observing the consequences — rewards — of its own actions. By incorporating deep neural networks into RL agents, computers have accomplished some astonishing feats in recent years, such as beating humans at Atari games and defeating the world champion at Go. In this blog post we will briefly cover the technical insights behind some of the most recent and exciting developments in RL. If you’re already familiar with RL you can skip straight to the “Recent extensions to DQN” section.

#### Key concepts

Reinforcement learning is essentially learning by interacting with the environment. In an RL scenario, a task is specified implicitly through a scalar reward signal. An RL agent learns from the consequences of its actions, rather than from being explicitly given ideal actions. Much as humans learn by trial and error, an RL agent selects its actions based on its past experiences (exploitation) and also by trying new choices (exploration). The challenge of determining which of the preceding actions contributed to a reward is known as the credit assignment problem.

The most common way to formalise an RL problem is as a Markov Decision Process (MDP), where the agent fully observes its state, or a Partially Observable Markov Decision Process (POMDP), where the agent maintains beliefs over possible states. Q-learning is a technique used in RL to find an optimal action-selection policy over these states. It works by learning a Q-function *Q(s,a)* that represents the expected future reward (i.e. the long-term reward, subject to a discount factor that accounts for the uncertainty of future actions and state transitions) of taking a given action *a* in a given state *s*, and following the optimal policy thereafter. In other words, it computes the Quality, or expected value, of taking action *a* in state *s* and then following the best policy afterwards. In equation form, the function looks like this:

*Q(s,a) = r + γ max(Q(s’,a’))*

where *r* is the immediate reward and *γ* is the discount factor. This is called the Bellman equation. Intuitively, it says that the maximum future reward for the current state and action is the immediate reward plus the discounted maximum future reward obtainable from the next state.
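To make the equation concrete, here is a tiny worked example with made-up numbers (the values of *r*, *γ* and the next state's best Q-value are purely illustrative):

```python
# One numeric application of the Bellman equation.
# Suppose taking action a in state s pays an immediate reward r = 1,
# the discount factor is γ = 0.9, and the best Q-value available in
# the next state s' is max(Q(s', a')) = 2.
r = 1.0
gamma = 0.9
max_next_q = 2.0

q = r + gamma * max_next_q  # Q(s,a) = r + γ max(Q(s',a'))
print(q)
```

So the long-term value of this state-action pair is 2.8: the reward collected now plus 90% of the best value reachable afterwards.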

#### From shallow to deep Q-learning

The simplest way for an agent to decide which action to choose is by consulting a look-up table. In this implementation, the Q-function is a table or matrix with states as rows and actions as columns; each cell holds the Q-value (i.e. the quality) of the given action in the given state. This table-based Q-learning is a straightforward lookup process that answers the question: “when I am in state *s*, what is the best action to take?”. The Q-table is updated throughout the agent’s lifetime as it learns by interacting with its environment.
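A minimal sketch of tabular Q-learning, on a made-up two-state environment (the environment, hyperparameters and variable names are all illustrative, not from any particular library):

```python
import random

# Toy deterministic chain: states 0 and 1, actions 0 ("stay") and
# 1 ("move right"). Taking action 1 in state 0 pays reward 1 and
# ends the episode; everything else pays 0.
N_STATES, N_ACTIONS = 2, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1

# The Q-table: rows are states, columns are actions, all zeros to start.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Toy environment dynamics: returns (next_state, reward, done)."""
    if state == 0 and action == 1:
        return 1, 1.0, True
    return state, 0.0, False

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # ε-greedy selection: mostly exploit the table, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Nudge the cell toward the Bellman target r + γ max Q(s', ·).
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q[0])  # Q(0, "move right") approaches 1.0
```

After training, the table correctly prefers "move right" in state 0, and "stay" settles near 0.9 (the discounted value of moving right one step later).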

However, what if our state space is too big? The number of possible states in a real-world scenario can be almost infinitely large, making it impossible to learn Q-value estimates for each state-action pair independently. Tables simply won’t work in such cases: most states will never be visited, or will be revisited too rarely, leaving incorrect or sub-optimal Q-values in the majority of cells. We need a more efficient way to generalise from the Q-values we can feasibly collect. This is where neural networks help. By representing the Q-function with a neural network, we can take a state, encoded as an input vector, and learn to map it to Q-values. The parameters of the network can be trained by gradient descent to minimise a suitable loss function, e.g. the difference between predicted Q-values and target Q-values derived from observed rewards. A big advantage of using neural networks is the possibility of expanding the network with additional (e.g. convolutional) layers, giving it a high degree of representational flexibility and superior generalisation capabilities. Thanks to the advantages afforded by depth (Eldan and Shamir wrote an excellent paper on the power of depth in neural networks [1]), deep RL networks, of which the Deep Q-Network (DQN) is the canonical example, are the dominant approach today. Google DeepMind’s self-taught AI system AlphaGo Zero [2], for example, is built on deep RL.

Figure 1. Two implementations of DQN. On the left, the network takes state and action as input and outputs the corresponding Q-value. On the right, the network takes only states as input and outputs the Q-values for each possible action. (Image credit: Tambet Matilsen)
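To make the shift from table to function approximator concrete, here is a minimal sketch in which the Q-table is replaced by a parameterised model trained by gradient descent on the TD error. A real DQN uses a deep (convolutional) network plus a replay buffer and target network; here a linear model on a small feature vector keeps the idea visible, and the synthetic transitions are made up:

```python
import numpy as np

# Q is now a parameterised function of state features, not a table.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
W = np.zeros((n_actions, n_features))  # one weight row per action

def q_values(phi):
    """Predicted Q-values for all actions, given state features phi."""
    return W @ phi

def td_update(phi, action, reward, phi_next, gamma=0.9, lr=0.1):
    """One semi-gradient descent step on the squared Bellman error."""
    target = reward + gamma * np.max(q_values(phi_next))
    error = target - q_values(phi)[action]
    W[action] += lr * error * phi  # gradient of 0.5 * error**2 w.r.t. W[action]

# Train on synthetic transitions where action 0 always pays reward 1.
for _ in range(500):
    phi = rng.random(n_features)
    phi_next = rng.random(n_features)
    action = int(rng.integers(n_actions))
    reward = 1.0 if action == 0 else 0.0
    td_update(phi, action, reward, phi_next)

phi = np.full(n_features, 0.5)
print(q_values(phi))  # action 0 now looks better than action 1
```

The same update rule carries over to DQN unchanged; only the model behind `q_values` becomes a deep network, trained on minibatches from a replay buffer against a periodically frozen target network.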

#### Recent extensions to DQN

Various improvements on deep RL have been made in recent years. Below we list some of the most significant.

#### 1. Double DQN

One known problem with conventional Q-learning algorithms is that they often overestimate the Q-values of the potential actions in a given state, which harms the quality of the resulting policies. To correct for this, van Hasselt et al. developed Double DQN [3]. The idea is to decouple the action choice from the action evaluation. Instead of taking the max over the target network’s Q-values when computing the target, the primary network is used to select an action and the target network is used to generate the target Q-value for that action. No additional networks or parameters are needed. Their results showed that Double DQN reduced the overestimation of Q-values and was able to train faster and more reliably.
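The two target computations can be sketched side by side, using made-up Q-values for a single next state (`online` stands for the primary network’s estimates, `target` for the target network’s):

```python
import numpy as np

online_next_q = np.array([1.0, 3.0, 2.0])  # primary net,  Q(s', ·)
target_next_q = np.array([1.5, 2.0, 4.0])  # target net,   Q(s', ·)
reward, gamma = 1.0, 0.99

# Standard DQN: the target network both selects and evaluates the
# action, so the max tends to pick out overestimated values.
dqn_target = reward + gamma * np.max(target_next_q)

# Double DQN: the primary network selects the action...
best_action = np.argmax(online_next_q)
# ...and the target network evaluates that action.
double_dqn_target = reward + gamma * target_next_q[best_action]

print(dqn_target, double_dqn_target)
```

Here standard DQN chases the target network’s largest (possibly inflated) estimate of 4.0, while Double DQN evaluates the primary network’s preferred action and produces a lower, less biased target.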

#### 2. Prioritised DQN

One important technique when implementing DQN is experience replay, where experiences gained during gameplay are stored in a replay buffer and the network is trained on samples drawn from that buffer. In conventional DQN, experience transitions are sampled uniformly from the replay buffer. Ideally, we want the more significant transitions to be sampled more frequently. Schaul et al. developed a framework for prioritising experience in order to replay important transitions more frequently [4]. The proxy they use for importance is the magnitude of a transition’s TD error, which indicates how ‘surprising’ the transition is. Their prioritised experience replay network outperformed DQN in 41 out of 49 Atari games.
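Proportional prioritisation can be sketched in a few lines, following the usual formulation P(i) ∝ (|δᵢ| + ε)^α; the buffer contents and hyperparameter values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
td_errors = np.array([0.1, 0.5, 2.0, 0.05])  # |δ| for 4 stored transitions
alpha, eps = 0.6, 1e-6  # α < 1 softens the prioritisation; ε avoids zero priority

# Priority of each transition, then normalised sampling probabilities.
priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()

# Draw a minibatch of buffer indices according to these probabilities.
batch = rng.choice(len(td_errors), size=32, p=probs)
print(probs)  # transition 2 (largest TD error) dominates the sampling
```

In the full method, the non-uniform sampling introduces bias, which Schaul et al. correct with importance-sampling weights on the loss; that correction is omitted in this sketch.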

#### 3. Dueling DQN

The idea behind Wang et al.’s Dueling DQN is a network architecture with two separate representations: one of the state value and one of the (state-dependent) action advantages [5]. The value and advantage functions are computed separately and combined only at the final layer, via a special aggregating layer. The key benefit of such a dueling network is that it can learn which states are (or are not) valuable without having to learn the effect of every action in every state. By decoupling the state value from specific actions, more robust estimates can be achieved.
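The aggregating layer itself is a one-liner. It combines the streams as Q(s,a) = V(s) + (A(s,a) − meanₐ A(s,a)); subtracting the mean advantage keeps the decomposition identifiable. The stream outputs below are made-up numbers standing in for what the two network heads would produce:

```python
import numpy as np

V = 2.0                          # output of the state-value stream
A = np.array([0.5, -0.3, 1.1])   # output of the advantage stream (3 actions)

# Dueling aggregation: centre the advantages, then add the state value.
Q = V + (A - A.mean())
print(Q)
```

A useful sanity check on this form: the mean of the resulting Q-values equals V exactly, and the ranking of actions is inherited from A, so the value stream can shift all actions up or down without having to be relearned per action.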

#### 4. Distributional DQN

In a Distributional DQN, the model learns to approximate the distribution of returns instead of only the expected return. Bellemare et al. argue that learning distributions matters in the presence of randomness, which often causes instability that may prevent the policy from converging. They proposed a model that applies a variant of Bellman’s equation to the learning of approximate value distributions [6]. The authors wrote an excellent blog post explaining the intuition behind their idea. Their model obtained state-of-the-art results when evaluated on the Arcade Learning Environment, while providing evidence of the importance of the value distribution in approximate RL.
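The core representational change can be sketched with the categorical parameterisation from [6]: the return is modelled as a probability distribution over a fixed support of “atoms”, and the scalar Q-value of conventional DQN is recovered as that distribution’s mean. The atom count, support range and probabilities below are illustrative:

```python
import numpy as np

# Fixed support of 51 atoms between -10 and +10 (the "C51" setup).
atoms = np.linspace(-10, 10, 51)

# A stand-in for the network's output head: logits over atoms,
# softmaxed into a probability distribution, with extra mass near +2.
logits = np.zeros(51)
logits[30] = 2.0                  # atoms[30] == 2.0
probs = np.exp(logits) / np.exp(logits).sum()

# Conventional DQN's scalar Q-value is just the distribution's mean.
q_value = np.sum(atoms * probs)
print(q_value)
```

The full algorithm also projects the Bellman-updated distribution back onto the fixed support before computing a cross-entropy loss; that projection step is beyond this sketch.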

#### 5. Asynchronous learning

In asynchronous learning, the agent learns from parallel copies of the environment. The Asynchronous Advantage Actor-Critic (A3C) model introduced by Mnih et al. maintains a global network, with multiple worker agents that each interact with their own copy of the environment simultaneously and independently [7]. Training becomes more diverse because the experience of each agent is independent of that of the others. A3C achieved state-of-the-art results in half the training time of comparable methods, and also succeeded on a broad range of continuous motor control problems.

Figure 2. An Asynchronous DQN implementation in Atari Learning Environment (Image credit: Corey Lynch)
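The asynchronous update pattern can be sketched as follows. This is a heavy simplification: real A3C workers compute actor and critic gradients from their own rollouts, whereas here the “gradient” is a random stand-in, so only the concurrency structure is shown (shared parameters, independent per-worker experience, locked updates). All names are illustrative:

```python
import threading
import numpy as np

shared_params = np.zeros(4)  # the global network's parameters
lock = threading.Lock()

def worker(worker_id, n_steps=100):
    """One worker: independent experience stream, updates shared params."""
    global shared_params
    rng = np.random.default_rng(worker_id)  # each worker's own experience
    for _ in range(n_steps):
        # Stand-in for a gradient computed from this worker's rollout.
        fake_grad = rng.normal(size=4) * 0.01
        with lock:  # apply the local update to the shared parameters
            shared_params += fake_grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)
```

Each worker would also periodically copy the shared parameters into a local network before generating its next rollout; the original paper applies updates without a lock (the “hogwild” style), trading exactness for speed.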

#### 6. Rainbow

Hessel et al. combined six DQN extensions into one single ‘Rainbow’ model: the aforementioned Double, Prioritised, Dueling and Distributional DQN, multi-step learning (as used in A3C), and NoisyNets [8]. They demonstrated that the extensions are largely complementary, and their integration produced new state-of-the-art results on the benchmark suite of 57 Atari 2600 games.

#### 7. Deep Recurrent Q-Network

Conventional DQN relies on the agent being able to perceive the entire game screen at any given moment. However, real-world observations are often incomplete or imperfectly measured, e.g. spatially or temporally limited, and this introduces uncertainty: instead of a Markov Decision Process (MDP), the task becomes a Partially Observable MDP. Hausknecht and Stone first observed that DQN’s performance suffers when given incomplete state observations and proposed combining Long Short-Term Memory (LSTM) with DQN to form a Deep Recurrent Q-Network (DRQN) [9]. Building on this baseline DRQN, Lample and Chaplot augmented the model with additional game-feature information and presented an enhanced DRQN that tackles 3D environments in first-person shooter games [10].

#### 8. Imagination-Augmented Agents (I2A)

Weber et al. attempted to bridge the gap between model-based and model-free learning by designing Imagination-Augmented Agents (I2A) [11]. Their approach uses approximate environment models and learns to interpret the models’ imperfect predictions. These environment models simulate ‘imagined trajectories’, which are then interpreted by a neural network and given as additional information to a policy network. Compared with several baselines, I2A demonstrated better data efficiency and performance.

#### 9. Hybrid Reward Architecture (HRA)

Generalisation is one of the main challenges of applying RL to real-world problems with a large state space. This is typically dealt with in deep RL by approximating the optimal value function with a low-dimensional representation. However, learning often becomes unstable when the value function is very complex and cannot readily be reduced to a low-dimensional representation. Van Seijen et al. proposed a new Hybrid Reward Architecture (HRA) to tackle this problem [12]. HRA decomposes the reward function into separate component reward functions, each of them assigned to a different RL agent. The agents learn in parallel using off-policy learning (learning about a target policy from experience generated by a different behaviour policy), and each agent has its own policy with its own parameters. To obtain the final policy from these multiple agents, an aggregator combines them into one single policy, for example by averaging over all agents. Their model beat human performance on a toy problem and on the Atari game Ms. Pac-Man.
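The aggregation step can be sketched directly. Each reward component gets its own Q-head; an aggregator (here a sum over heads, as in the paper) produces the Q-values that drive action selection. The per-head values below are made up:

```python
import numpy as np

# Q-values of 3 reward components ("heads") over 4 actions.
# In Ms. Pac-Man, a head might correspond to a single pellet or ghost.
head_q = np.array([
    [0.2, 0.9, 0.1, 0.0],   # head for reward component 1
    [0.5, 0.1, 0.3, 0.0],   # head for reward component 2
    [0.0, 0.4, 0.2, 0.6],   # head for reward component 3
])

# Aggregator: combine the heads into one set of Q-values, then act greedily.
aggregate_q = head_q.sum(axis=0)
action = int(np.argmax(aggregate_q))
print(aggregate_q, action)
```

Note that the chosen action (index 1 here) need not be any single head’s favourite: each head only has to learn its own simple component, and the aggregator resolves the trade-off between them.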

#### 10. Exploration-exploitation strategies

Another major challenge in RL is the exploration-exploitation dilemma, which involves the trade-off between two conflicting objectives: investing effort in finding new states and learning accurate Q-values (exploration), versus utilising what has already been learned (exploitation). Various techniques have been applied to approach this issue. Fortunato et al. developed NoisyNet, which induces stochasticity in the agent’s policy through perturbations of the network weights to improve exploration efficiency [13]. Osband et al. introduced a method that implements randomised value functions to incentivise experimentation with actions whose outcomes are highly uncertain, thereby encouraging deep exploration [14]. Lin et al. sought to reduce the number of training episodes by introducing human feedback into deep RL: their model balances listening to human feedback, exploiting the current policy model, and exploring the environment [15]. Nair et al. studied the problem of exploration in environments with sparse rewards (e.g. robotics) by combining RL with imitation learning, i.e. demonstrations [16].

There are also approaches that deal with the nature of the reward itself, known in cognitive psychology as *intrinsic motivation*. Pathak et al., for example, used curiosity as an intrinsic reward signal to prompt the agent to explore its environment [17]. They formulated curiosity as the error in the agent’s ability to predict the consequences of its own actions. Kulkarni et al. implemented a hierarchical DQN (h-DQN) that likewise utilises intrinsic motivation [18]. Their model learns over two levels of hierarchy: a top-level meta-controller that learns about the environment and intrinsically generates new goals, and a lower-level controller that learns a policy to satisfy the chosen goals.
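A curiosity-style intrinsic reward in the spirit of [17] can be sketched as follows: the bonus is the error of a forward model that predicts the next state from the current state and action, so transitions the agent cannot yet predict become rewarding in themselves. The forward model, states and scaling factor here are illustrative stand-ins:

```python
import numpy as np

def intrinsic_reward(predicted_next_state, actual_next_state):
    """Curiosity bonus: squared prediction error of the forward model."""
    return float(np.sum((predicted_next_state - actual_next_state) ** 2))

# A familiar transition the forward model predicts well -> tiny bonus.
well_known = intrinsic_reward(np.array([1.0, 2.0]), np.array([1.05, 2.0]))

# A surprising transition the model predicts poorly -> large bonus.
surprising = intrinsic_reward(np.array([1.0, 2.0]), np.array([3.0, -1.0]))

# The agent is trained on extrinsic reward plus the scaled curiosity term,
# so even with zero extrinsic reward it is driven toward surprising states.
extrinsic = 0.0
total_reward = extrinsic + 0.1 * surprising
print(well_known, surprising)
```

As the forward model improves, once-surprising transitions stop paying a bonus, which naturally pushes the agent toward the parts of the environment it has not yet mastered. (Pathak et al. additionally predict in a learned feature space rather than raw pixels, so that unlearnable noise is not mistaken for novelty.)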

These are just a few significant enhancements to deep RL. The list is by no means complete, and somewhat subjective! It is also worth mentioning that although they have mostly been developed in game-play scenarios, deep RL algorithms have increasingly found application in many other fields. Conversational AI based on deep RL models, for example, is an active area of research [19, 20, 21]. Any exciting DQN applications that you know of? Have we missed any important RL developments? Comment below and let us know!

#### REFERENCES

[1] Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. In Conference on Learning Theory (pp. 907–940).

[2] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., … & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354–359.

[3] Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. In AAAI (pp. 2094–2100).

[4] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

[5] Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.

[6] Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.

[7] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928–1937).

[8] Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., … & Silver, D. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv preprint arXiv:1710.02298.

[9] Hausknecht, M., & Stone, P. (2015). Deep recurrent q-learning for partially observable mdps. CoRR, abs/1507.06527.

[10] Lample, G., & Chaplot, D. S. (2017). Playing FPS Games with Deep Reinforcement Learning. In AAAI (pp. 2140–2146).

[11] Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., … & Pascanu, R. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. arXiv preprint arXiv:1707.06203.

[12] van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., & Tsang, J. (2017). Hybrid Reward Architecture for Reinforcement Learning. arXiv preprint arXiv:1706.04208.

[13] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., … & Blundell, C. (2017). Noisy Networks for Exploration. arXiv preprint arXiv:1706.10295.

[14] Osband, I., Russo, D., Wen, Z., & Van Roy, B. (2017). Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608.

[15] Lin, Z., Harrison, B., Keech, A., & Riedl, M. O. (2017). Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds. arXiv preprint arXiv:1709.03969.

[16] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2017). Overcoming Exploration in Reinforcement Learning with Demonstrations. arXiv preprint arXiv:1709.10089.

[17] Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv:1705.05363.

[18] Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (pp. 3675–3683).

[19] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., & Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

[20] Aggarwal, M., Arora, A., Sodhani, S., & Krishnamurthy, B. (2017). Reinforcement Learning Based Conversational Search Assistant. arXiv preprint arXiv:1709.05638.

[21] Serban, I. V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., … & Mudumba, S. (2017). A Deep Reinforcement Learning Chatbot. arXiv preprint arXiv:1709.02349.

*Originally published at **Project AGI**.*