Deep Reinforcement Learning: Value Functions, DQN, Actor-Critic Methods, and Back-propagation Through Stochastic Functions

Vishnu Vijayan PV
11 min read · Aug 3, 2020


Deep Reinforcement Learning is a new research track within the field of Machine Learning.
While neural networks are responsible for recent breakthroughs in problems like computer vision, machine translation, and time series prediction, they can also be combined with reinforcement learning algorithms to create something astounding like AlphaGo [9].

Reinforcement learning algorithms that incorporate deep learning can beat world champions at the game of Go as well as human experts at numerous Atari video games [8]. While games may sound like a narrow domain, this is a vast improvement over previous accomplishments, and the state of the art is progressing rapidly.

Like a human, our agents learn for themselves to discover successful strategies that lead to the greatest long-term rewards. This paradigm of learning by trial and error, solely from rewards or punishments, is known as Reinforcement Learning (RL). Also like a human, our agents construct and learn their own knowledge directly from raw inputs, such as vision, without any hand-engineered features or domain heuristics. This is achieved by deep learning with neural networks.

Many of the successes in DRL have been based on scaling up prior work in RL to high-dimensional problems. This is possible because neural networks learn compact, low-dimensional feature representations and are powerful function approximators. By means of representation learning, DRL can deal efficiently with the curse of dimensionality, unlike tabular and traditional nonparametric methods [7]. For instance, convolutional neural networks (CNNs) can be used as components of RL agents, allowing them to learn directly from raw, high-dimensional visual inputs. In general, DRL is based on training deep neural networks to approximate the optimal policy π* and/or the optimal value functions V*, Q*, and A*.
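
For reference, these quantities have the standard definitions from Sutton and Barto [1] (γ is the discount factor and expectations are taken over trajectories generated by a policy π):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a \right]

V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)

with the optimal policy recovered as \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).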

Value Functions

The well-known function approximation properties of neural networks led naturally to the use of deep learning to regress functions for use in RL agents. Indeed, one of the earliest success stories in RL is TD-Gammon, a neural network that reached expert-level performance in backgammon in the early 1990s [15]. Using temporal-difference (TD) learning, the network took in the state of the board and predicted the probability of black or white winning. Although this simple idea has been echoed in later work, progress in RL research has favoured the explicit use of value functions, which can capture the structure underlying the environment. From early value function methods in DRL, which took simple states as input, current methods are now able to tackle visually and conceptually complex environments [8], [15], [16].
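
As a reminder of what TD learning does here, the simplest (tabular) TD(0) prediction update nudges the value estimate of the current state toward a bootstrapped target built from the next state (TD-Gammon itself used the TD(λ) variant with a neural network, but the principle is the same):

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the TD error δ_t and α is a learning rate.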

Function approximation and the DQN

Value-function-based DRL began in earnest with the DQN [8], which achieved scores across a wide range of classic Atari 2600 video games [17] comparable to those of a professional video game tester. The inputs to the DQN are four grey-scale frames of the game, concatenated over time, which are initially processed by several convolutional layers to extract spatiotemporal features, such as the movement of the ball in Pong or Breakout. The final feature map from the convolutional layers is processed by several fully connected layers, which more implicitly encode the effects of actions. This contrasts with more traditional controllers that use fixed pre-processing steps and are therefore unable to adapt their processing of the state in response to the learning signal.
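
As a rough sketch of this architecture in PyTorch (the layer sizes below follow the configuration reported for 84 × 84 grey-scale inputs in [8]; treat the exact numbers as illustrative rather than definitive):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: a stack of grey-scale frames in, one Q-value per action out."""

    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        # Convolutional layers extract spatiotemporal features from the frame stack.
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected layers map the final feature map to Q(s, a) for every action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 is the spatial size for 84x84 inputs
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_frames, 84, 84) with pixel values scaled to [0, 1]
        return self.head(self.conv(x))
```

A single forward pass through such a network yields one Q-value per action, which is what makes the greedy action selection discussed below cheap to compute.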

A forerunner of the DQN, neural fitted Q-iteration (NFQ), involved training a neural network to return the Q-value given a state-action pair [18]. NFQ was later extended to train a network to drive a slot car using raw visual inputs from a camera over the race track, by combining a deep autoencoder to reduce the dimensionality of the inputs with a separate branch to predict Q-values [19]. Although that network could have been trained for both the reconstruction and RL tasks simultaneously, it was more reliable and computationally efficient to train the two parts of the network sequentially.

The DQN [8] is closely related to the model proposed by Lange et al. [19] but was the first RL algorithm demonstrated to work directly from raw visual inputs and on a wide variety of environments. It was designed such that the final fully connected layer outputs Q^π(s, ·) for every action in a discrete set of actions (in this case, the various directions of the joystick and the fire button). This not only enables the best action, argmax_a Q(s, a), to be chosen after a single forward pass of the network, but also allows the network to more easily encode action-independent knowledge in the lower, convolutional layers. With merely the goal of maximizing its score on a video game, the DQN learns to extract salient visual features, jointly encoding objects, their movements, and, most importantly, their interactions. Using techniques originally developed for explaining the behaviour of CNNs in object recognition tasks, we can also inspect which parts of its view the agent considers important.

The true underlying state of the game is contained within 128 bytes of Atari 2600 random-access memory. However, the DQN was designed to learn directly from visual inputs (210 × 160 pixel 8-bit RGB images), which it takes as the state s. It is impractical to represent Q^π(s, a) exactly as a lookup table: with three 8-bit colour channels per pixel and 18 possible actions, the table would have |S| × |A| = 18 × 256^(3×210×160) entries. Even if it were feasible to create such a table, it would be sparsely populated, and information gained from one state-action pair could not be propagated to other state-action pairs. The strength of the DQN lies in its ability to compactly represent both high-dimensional observations and the Q-function using deep neural networks. Without this ability, tackling the discrete Atari domain from raw visual inputs would be impractical.

The DQN addressed the fundamental instability problem of using function approximation in RL by the use of two techniques: experience replay and target networks. The experience replay memory stores transitions of the form (s_t, a_t, s_{t+1}, r_{t+1}) in a cyclic buffer, enabling the RL agent to sample from and train on previously observed data offline. Not only does this massively reduce the number of interactions needed with the environment, but batches of experience can be sampled, reducing the variance of learning updates. Furthermore, by sampling uniformly from a large memory, the temporal correlations that can adversely affect RL algorithms are broken. Finally, from a practical perspective, batches of data can be efficiently processed in parallel by modern hardware, increasing throughput. While the original DQN algorithm used uniform sampling [8], later work showed that prioritizing samples based on TD errors is more effective for learning.
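
A minimal sketch of such a cyclic replay memory with uniform sampling might look as follows (the class and parameter names are illustrative, not the implementation used in [8]):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "next_state", "reward"])

class ReplayBuffer:
    """Cyclic buffer of transitions; old experience is overwritten once capacity is reached."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        self.buffer.append(Transition(state, action, next_state, reward))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the temporal correlations of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```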

The second stabilizing method, introduced by Mnih et al. [8], is the use of a target network that initially contains the weights of the network enacting the policy but is kept frozen for a period of time. Rather than calculating the TD error based on its own rapidly fluctuating estimates of the Q-values, the policy network uses the fixed target network. During training, the weights of the target network are updated to match the policy network after a fixed number of steps. Both experience replay and target networks have gone on to be used in subsequent DRL work.
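
Concretely, writing θ for the online (policy) network parameters and θ⁻ for the frozen target network parameters (copied from θ every fixed number of steps), the DQN minimises a squared TD error of roughly the form

L(\theta) = \mathbb{E}_{(s, a, s', r) \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]

where \mathcal{D} denotes the experience replay memory.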

The DQN

The network takes the state (a stack of grey-scale frames from the video game) and processes it with convolutional and fully connected layers, with ReLU nonlinearities between each layer. At the final layer, the network outputs a Q-value for each discrete action, where each action corresponds to one of the possible control inputs for the game; the action to execute is chosen greedily with respect to these values. Given the current state and chosen action, the game returns a new score. The DQN uses the reward (the difference between the new score and the previous one) to learn from its decision. More precisely, the reward is used to update its estimate of Q, and the error between its previous estimate and its new estimate is backpropagated through the network.
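
Putting these pieces together, a single DQN learning step could be sketched as below (this assumes the hypothetical DQN and ReplayBuffer classes from the earlier snippets and a PyTorch optimiser, and omits details such as terminal-state handling and ε-greedy exploration):

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """One gradient step on the squared TD error, using the frozen target network."""
    batch = buffer.sample(batch_size)
    states = torch.stack([t.state for t in batch])      # assumes states are tensors
    actions = torch.tensor([t.action for t in batch])   # integer action indices
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])

    # Q(s, a; theta) for the actions that were actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r + gamma * max_a' Q(s', a'; theta^-), computed with the frozen target network.
    with torch.no_grad():
        td_target = rewards + gamma * target_net(next_states).max(dim=1).values

    # Backpropagate the error between the current estimate and the TD target.
    loss = F.mse_loss(q_values, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```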

Back-propagation through Stochastic Functions

The workhorse of DRL, however, remains back-propagation. The REINFORCE rule allows neural networks to learn stochastic policies in a task-dependent manner, such as deciding where to look in an image to track or caption objects. In these cases, the stochastic variable would determine the coordinates of a small crop of the image and hence reduce the amount of computation needed. This usage of RL to make discrete, stochastic decisions over inputs is known in the deep-learning literature as hard attention and is one of the more compelling uses of basic policy search methods in recent years, having many applications outside of traditional RL domains.
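
The REINFORCE (likelihood-ratio) estimator underlying this is, in its simplest single-step form,

\nabla_{\theta} \, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}[R] = \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ R \, \nabla_{\theta} \log \pi_{\theta}(a \mid s) \right]

so the sampled return R simply weights the gradient of the log-probability of the sampled decision, which is what allows back-propagation to be applied even though the sampling step itself is not differentiable.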

Actor-Critic Methods

Actor-critic approaches have grown in popularity as an effective means of combining the benefits of policy search methods with learned value functions, which are able to learn from full returns and/or TD errors. They can benefit from improvements in both policy gradient methods, such as generalized advantage estimation (GAE), and value function methods, such as target networks [8]. In the last few years, DRL actor-critic methods have been scaled up from learning simulated physics tasks to real robotic visual navigation tasks, directly from image pixels.
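
For concreteness, generalized advantage estimation combines one-step TD errors δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t) into an exponentially weighted advantage estimate,

\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \, \delta_{t+l}

with λ = 0 recovering the one-step TD error and λ = 1 the full return minus the value baseline.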

One recent development in the context of actor-critic algorithms is deterministic policy gradients (DPGs), which extend the standard policy gradient theorems for stochastic policies to deterministic policies. One of the major advantages of DPGs is that, while stochastic policy gradients integrate over both state and action spaces, DPGs only integrate over the state space, requiring fewer samples in problems with large action spaces. In the initial work on DPGs, Silver et al. introduced and demonstrated an off-policy actor-critic algorithm that vastly improved upon a stochastic policy gradient equivalent in high-dimensional continuous control problems. Later work introduced deep DPG (DDPG), which utilizes neural networks to operate on high-dimensional, visual state spaces. In the same vein as DPGs, Heess et al. devised a method for calculating gradients to optimize stochastic policies by "re-parameterizing" the stochasticity away from the network, thereby allowing standard gradients to be used (instead of the high-variance REINFORCE estimator). The resulting stochastic value gradient (SVG) methods are flexible and can be used both with (SVG(0) and SVG(1)) and without (SVG(∞)) value function critics, and with (SVG(∞) and SVG(1)) and without (SVG(0)) models. Later work proceeded to integrate DPGs and SVGs with RNNs, allowing them to solve continuous control problems in POMDPs, learning directly from pixels. Together, DPGs and SVGs can be considered algorithmic approaches for improving learning efficiency in DRL.
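
In symbols, for a deterministic policy a = μ_θ(s) the deterministic policy gradient (under the usual regularity conditions) is

\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}} \left[ \nabla_{\theta} \mu_{\theta}(s) \, \nabla_{a} Q^{\mu}(s, a) \big|_{a = \mu_{\theta}(s)} \right]

while the re-parameterization used by SVG methods writes a stochastic action as a deterministic function of the state plus independent noise, for example a = μ_θ(s) + σ_θ(s) · ε with ε ~ N(0, I), so that standard gradients can flow through μ_θ and σ_θ.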

An orthogonal approach to speeding up learning is to exploit parallel computation. By keeping a canonical set of parameters that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently distributed over both processing cores in a single central processing unit (CPU), and across CPUs in a cluster of machines. Using a distributed system, a framework was developed for training multiple DQNs in parallel, achieving both better performance and a reduction in training time. However, the simpler Asynchronous Advantage Actor-Critic (A3C) algorithm, developed for both single and distributed machine settings, has become one of the most popular DRL techniques in recent times. A3C combines advantage updates with the actor-critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads. The use of multiple agents, situated in their own, independent environments, not only stabilizes improvements in the parameters, but conveys an additional benefit in allowing for more exploration to occur.
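
Sketching the core A3C update for one worker thread: with an n-step return R_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i+1} + \gamma^{n} V(s_{t+n}; \theta_v) and advantage estimate \hat{A}_t = R_t − V(s_t; θ_v), each thread accumulates

\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \, \hat{A}_t \quad \text{(actor)}, \qquad \nabla_{\theta_v} \left( R_t - V(s_t; \theta_v) \right)^{2} \quad \text{(critic)}

and applies the accumulated gradients asynchronously to the shared parameters (in practice an entropy bonus on the policy is usually added to encourage exploration).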

There have been several major advancements on the original A3C algorithm that reflect various motivations in the field of DRL. The first is actor-critic with experience replay, which adds off-policy bias correction to A3C, allowing it to use experience replay to improve sample complexity. Others have attempted to bridge the gap between value and policy-based RL, utilizing theoretical advancements to improve upon the original A3C. Finally, there is a growing trend toward exploiting auxiliary tasks to improve the representations learned by DRL agents and, hence, improve both the learning speed and final performance of these agents.

Applications

There are only a few real-world applications so far, as Deep Reinforcement Learning is still largely in an experimental phase. Some of them include:

  • DeepMind’s algorithms that learn to play Atari games [8]
  • AlphaGo: In October 2015, AlphaGo became the first computer program to defeat a professional human Go player. In March 2016, AlphaGo defeated Lee Sedol (winner of 18 world titles and widely considered the strongest player of the past decade) by 4 games to 1, in a match watched by an estimated 200 million viewers.
  • Gaming

Scope of Deep Reinforcement Learning

  • DeepMind’s Neural Scene Representation and Rendering, which could eventually be applied to self-driving cars.
[Figure: DeepMind’s neural rendering of a maze]

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Proc. IEEE Int. Conf. Robotics and Automation, 2004, pp. 2619–2624.
[3] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” in Proc. Int. Symp. Experimental Robotics, 2006, pp. 363–372.
[4] S. Singh, D. Litman, M. Kearns, and M. Walker, “Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system,” J. Artificial Intell. Res., vol. 16, pp. 105–133, Feb. 2002.
[5] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, “PAC model-free reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2006, pp. 881–888.
[6] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[7] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[10] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” J. Mach. Learning Res., vol. 17, no. 39, pp. 1–40, 2016.
[11] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in Proc. Int. Symp. Experimental Robotics, 2016, pp. 173–184.
[12] www.indiegogo.com/projects/pillo-your-personal-home-health-robot#/
[13] www.indiegogo.com/projects/alpha-2-the-first-humanoid-robot-for-the-family
[14] www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/
[15] G. Tesauro, “Temporal difference learning and TD-gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.
[16] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. Int. Conf. Machine Learning, 2015, pp. 1889–1897.
[17] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: an evaluation platform for general agents,” in Proc. Int. Joint Conf. Artificial Intelligence, 2015, pp. 253–279.
[18] M. Riedmiller, “Neural fitted q iteration — First experiences with a data efficient neural reinforcement learning method,” in Proc. European Conf. Machine Learning, 2005
[19] S. Lange, M. Riedmiller, and A. Voigtlander, “Autonomous reinforcement learning on raw visual input data in a real world application,” in Proc. Int. Joint Conf. Neural Networks, 2012
[20] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,” Behavioral Brain Sci., pp. 1–101, 2016
