Deep Reinforcement Learning — Driving AlphaGo

Arnab Ghosh
4 min read · Mar 29, 2016


Deep Reinforcement Learning has been one of the coolest concepts to emerge in recent years.

It was this paper by DeepMind that introduced the concept and made a computer learn to play Atari games by starting from scratch, making errors, and slowly getting better as it observed the rewards.

Atari Playing Software

REINFORCEMENT LEARNING (Q-Learning)

The major components of reinforcement learning are:

  • Set of Environment States: such as the different configurations the game can be in at a point in time.
  • Set of Actions: such as Up, Down, Left, Right and a Fire button.
  • Rules of transitioning between states: how taking an action moves the agent from its current state to the next one.
  • Rules that determine the scalar immediate reward of a transition: every transition the algorithm decides to take carries an associated reward (for example, killing an opponent yields a positive reward and getting hurt yields a negative one). A minimal code sketch of these components follows this list.
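To make these components concrete, here is a minimal sketch of a toy environment in Python. The class ToyEnv, its corridor layout and its reward scheme are hypothetical illustrations, not taken from the DeepMind paper.

```python
import random

class ToyEnv:
    """A tiny, hypothetical corridor environment illustrating the four
    components: states, actions, transition rules, and scalar rewards."""

    ACTIONS = ["left", "right"]            # set of actions

    def __init__(self, size=6):
        self.size = size                   # states are positions 0 .. size-1
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Transition rule: move one cell left or right, clipped to the corridor.
        move = -1 if action == "left" else 1
        self.state = min(max(self.state + move, 0), self.size - 1)

        # Reward rule: +1 for reaching the goal cell, 0 otherwise.
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
state, reward, done = env.step(random.choice(ToyEnv.ACTIONS))
```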

Q-Learning

Q-Learning works by learning a state-action table Q(S, A). For every state S and action A, it tries to estimate the maximum expected reward obtainable by taking action A from state S.

This table Q(S, A) is learnt over many steps at the beginning to get an idea of the terrain and the different positions in the table. The algorithm starts as a toddler and, after a few games, based on the rewards it receives, it learns favorable actions at the various states, slowly getting better.
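As a rough sketch of how such a table is filled in, here is the standard tabular Q-learning update with epsilon-greedy exploration; the hyperparameter values are arbitrary choices for illustration.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1      # illustrative hyperparameters
Q = defaultdict(float)                      # the Q(S, A) table, default 0

def choose_action(state, actions):
    # Epsilon-greedy: mostly pick the action with the highest Q value,
    # occasionally explore at random.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    # Standard Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```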

Deep Q-Learning

In the Deep Q-Learning framework, the table Q(S, A) is replaced by a neural network, possibly multilayered, parameterized by theta.

The crux of Deep Q-Learning is that the state-action pair is encoded into a vector and passed through the multilayered network, and the output of the network is the estimated Q value. In this manner, at every state all possible actions are considered and the best action is selected based on the maximum Q value from the network.
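Below is a minimal sketch of that idea, assuming a small fully connected network in PyTorch that scores one encoded (state, action) vector at a time; the sizes and names are made up for illustration. (The network in the original DQN paper actually takes only the state as input and outputs one Q value per action, but the pairwise version described above works the same way in principle.)

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an encoded (state, action) vector to a single estimated Q value."""

    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x)

def best_action(q_net, state_vec, action_vecs):
    # Evaluate Q(s, a) for every candidate action encoding and pick the argmax.
    with torch.no_grad():
        q_values = [q_net(torch.cat([state_vec, a]).unsqueeze(0)).item()
                    for a in action_vecs]
    return max(range(len(action_vecs)), key=lambda i: q_values[i])
```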

Training

A technique known as experience replay is employed, in which the experience at each time step is pooled over many episodes into a replay memory. During training, the replay memory is sampled and the samples are used to train the multilayered neural network with gradient descent.
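Here is a hedged sketch of what one experience-replay training step might look like, reusing the hypothetical QNetwork and an assumed encode(state, action) helper from the sketch above; the buffer size, batch size, optimizer and hyperparameters are illustrative only.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

replay_memory = deque(maxlen=100_000)   # experiences pooled across many episodes

def store(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def train_step(q_net, optimizer, encode, actions, gamma=0.99, batch_size=32):
    if len(replay_memory) < batch_size:
        return
    batch = random.sample(replay_memory, batch_size)
    losses = []
    for state, action, reward, next_state, done in batch:
        # Current estimate Q(s, a).
        q_sa = q_net(encode(state, action).unsqueeze(0)).squeeze()
        # Bootstrapped target: r + gamma * max_a' Q(s', a'), or just r at the end.
        with torch.no_grad():
            target = torch.tensor(reward, dtype=torch.float32)
            if not done:
                best = max(q_net(encode(next_state, a).unsqueeze(0)).squeeze()
                           for a in actions)
                target = target + gamma * best
        losses.append(F.mse_loss(q_sa, target))
    # One gradient-descent step on the sampled mini-batch.
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```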

Benefits of Deep Q-Learning

  • The traditional Q table blows up even for medium-sized games, since the number of possible states is enormous in games such as Chess and Go, and even in video games such as Atari.
  • The traditional Q table requires several passes through all the states and many initial dummy games before it captures the favorable states and actions.
  • Deep Q-Learning offers a much more compressed representation in the form of a single neural network, which speeds up learning and also avoids storing the huge Q table in memory.

Tying Up Loose Ends

  • The system uses a CNN to extract features of the game state from the raw screen of the Atari arcade game (a sketch of such a network follows this list).
  • They were able to show performance not as competitive as humans, but orders of magnitude better than that of traditional methods such as SARSA.
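Here is a rough sketch of what such a convolutional Q network looks like; the layer sizes loosely follow the description in the original DQN paper (stacked 84x84 frames, two convolutional layers, then fully connected layers), but the exact numbers should be treated as illustrative.

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Convolutional network: stacked game frames in, one Q value per action out."""

    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),          # 20x20 -> 9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, frames):             # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))

q_net = AtariQNet(n_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))   # one Q value per action
```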

How AlphaGo Uses Deep Reinforcement Learning

It uses four networks:

  • Fast Rollout Policy Network (P-Network): a lightweight policy used to quickly roll out games to the end during search.
  • Supervised Learning Policy Network (SL-Network): the P-Network and the SL-Network are trained to predict human expert moves in a data set of positions.
  • Reinforcement Learning Policy Network (RL-Network): the RL-Network is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network (see the sketch after this list). A new data set is generated by playing games of self-play with the RL policy network.
  • Value Network (V): a value network is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set.
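For a flavour of the policy gradient step mentioned above, here is a hedged sketch of a REINFORCE-style update from one finished self-play game, where the outcome (+1 for a win, -1 for a loss) weights the log-probabilities of the moves that were played. The policy_net, the game object and its fields are placeholders for illustration, not AlphaGo's actual architecture or training code.

```python
import torch
import torch.nn.functional as F

def policy_gradient_update(policy_net, optimizer, game):
    """One update from a finished self-play game.

    `game.moves` is a list of (state_tensor, move_index) pairs played by the
    network, and `game.outcome` is +1 if that player won, -1 otherwise.
    """
    log_probs = []
    for state, move in game.moves:
        logits = policy_net(state.unsqueeze(0))           # scores over moves
        log_probs.append(F.log_softmax(logits, dim=-1)[0, move])

    # REINFORCE: raise the probability of moves from won games,
    # lower it for moves from lost games.
    loss = -(game.outcome * torch.stack(log_probs)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```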

Final Words

The Reinforcement Learning framework is the gold standard we are striving for in Artificial Intelligence: an artificially intelligent agent that learns on its own in a new environment, based on the feedback it gets from that environment. This is exactly how humans and other animals learn from their mistakes and failures.

I believe Deep Q-Learning is another step forward in this direction, using a neural network to predict the estimated Q-value of any state-action pair, analogous to how we humans evaluate all the possible moves at a particular stage. If this can be extended to more general tasks, then systems such as Jarvis won't be far in the future.
