Reinforcement learning with A3C

Nihal Das
Published in Analytics Vidhya
May 22, 2020

This article’s motivation comes from a recent competition I was part of, where we had to train a model on the Atari SpaceInvaders game and maximize the score the agent achieves across 100 runs.

Given this was my first experience with Reinforcement learning, I started out with Deep Q-Networks and the variations around them. Although the test score was very pleasing for a beginner, training was quite unstable and took a considerable amount of time to reach a good score.

Next in line was A3C, a reinforcement learning algorithm developed by Google DeepMind that outperforms most algorithms like Deep Q-Networks (DQN) in the scores it can achieve in a short period of time.

A3C stands for Asynchronous Advantage Actor-Critic, where

Asynchronous

means multiprocessing. Here, multiple agents work together on the same problem and share with each other what they have learnt. With many heads trying to solve the problem, the solution is reached faster.

Each of these agents interacts with its own copy of the environment at the same time. This works better than having a single agent because each agent’s experience is independent of the others’, which gives us a diverse set of experiences.
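To make this concrete, here is a minimal sketch of how the parallel agents could be launched, assuming PyTorch’s multiprocessing utilities; the tiny stand-in model and the worker() helper are illustrative choices of mine, not code from the competition.

```python
import torch.nn as nn
import torch.multiprocessing as mp

def worker(global_model, rank):
    # Each agent keeps a local copy of the model and its own environment;
    # it periodically syncs with the shared global weights and pushes its
    # gradients back (the training loop is omitted to keep the sketch short).
    local_model = nn.Linear(4, 2)                  # stand-in for the real network
    local_model.load_state_dict(global_model.state_dict())
    # ... interact with the environment, compute losses, update global_model ...

if __name__ == "__main__":
    global_model = nn.Linear(4, 2)                 # stand-in for the real network
    global_model.share_memory()                    # weights live in shared memory
    workers = [mp.Process(target=worker, args=(global_model, r)) for r in range(4)]
    for w in workers:                              # 4 agents learning in parallel
        w.start()
    for w in workers:
        w.join()
```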

Actor-Critic

The actor-critic model is basically a deep convolutional Q-learning model in which the agent implements Q-learning. Here the input is an image (a snapshot of the current state), which is fed into a deep convolutional neural network.

Credits: Anish Phadnis

In a basic deep convolutional Q-learning model, the output would be the Q-values for the possible actions the agent could take in a given state. In A3C, however, there are two outputs: one head scores the different actions the agent can take (the actor) and the other estimates the value of being in the state the agent is actually in (the critic).
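As a rough sketch of that two-headed network, assuming PyTorch and a common Atari preprocessing of four stacked 84x84 grayscale frames (an assumption of mine, not stated in the article):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        # Convolutional layers turn the raw frames into a feature vector.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        # Two heads: the actor scores every possible action,
        # the critic estimates the value of the current state.
        self.actor = nn.Linear(512, num_actions)
        self.critic = nn.Linear(512, 1)

    def forward(self, x):
        features = self.fc(self.conv(x))
        return self.actor(features), self.critic(features)

# One dummy frame stack in, action scores and a state value out.
scores, value = ActorCritic(num_actions=6)(torch.zeros(1, 4, 84, 84))
```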

Advantage

The advantage is the value that tells us how much better a certain action is compared to the expected average value of that state: A(s, a) = Q(s, a) - V(s).

Q(s, a) refers to the Q-value, the expected future reward of taking a particular action in a certain state. V(s) refers to the value of being in that state. The objective of the model is to maximize this advantage.
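As a small illustration (my own, not from the article), the advantage over a short rollout can be computed by letting the observed discounted return stand in for Q(s, a):

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    # Walk backwards through the rollout: R approximates Q(s, a),
    # values[t] is the critic's V(s), and their gap is the advantage.
    R = bootstrap_value
    advantages = []
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        advantages.insert(0, R - values[t])
    return advantages

print(n_step_advantages([1.0, 0.0, 2.0], [0.5, 0.4, 1.0], 0.0))
```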

Now that we have established the basics, let’s join these pieces together to understand the complete working of the model. One of the major components that brings them all together is the shared memory.

Memory

We make use of a Long Short-Term Memory (LSTM) cell to achieve this. The output derived from the deep convolutional Q-network is passed to an LSTM layer, which in turn passes its values to a fully-connected layer. The LSTM layer gives the model a memory, so it can remember past experience and make decisions based on it.
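A rough sketch of where the LSTM cell sits, assuming the 3136-dimensional flattened conv features from the network sketched above; the 256-unit hidden size and the 6 actions are my assumptions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTMCell(input_size=3136, hidden_size=256)  # 3136 = flattened conv output
fc_actor = nn.Linear(256, 6)    # fully-connected head: one score per action
fc_critic = nn.Linear(256, 1)   # fully-connected head: value of the state

h, c = torch.zeros(1, 256), torch.zeros(1, 256)  # memory carried between frames
conv_features = torch.zeros(1, 3136)             # stand-in for one frame's features
h, c = lstm(conv_features, (h, c))               # fold the new frame into the memory
action_scores, state_value = fc_actor(h), fc_critic(h)
```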

The final output of the fully-connected layer is where an action is selected for the actor neural network. The value is also passed to the critic neural network, where the value estimate is updated. The network’s weights are updated by calculating the value loss for the critic and the policy loss for the actor, and then back-propagating the errors through the network.
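A minimal sketch of those two losses, assuming PyTorch tensors holding the log-probabilities of the chosen actions, the critic’s values, and the observed returns for one rollout; the 0.5 weighting of the value loss is a common but assumed choice:

```python
import torch

def a3c_loss(log_probs, values, returns):
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).sum()  # actor: favour good actions
    value_loss = advantages.pow(2).sum()                     # critic: predict returns better
    return policy_loss + 0.5 * value_loss

# loss = a3c_loss(log_probs, values, returns)
# loss.backward()   # back-propagate both errors through the shared network
```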

This algorithm is among the state-of-the-art approaches in the field of Reinforcement learning. It has proven successful on a variety of game environments, achieving scores that would be very difficult for a human player alone to reach in a short period of time.

An A3C-trained model playing SpaceInvaders

One of the major advancements is AlphaGo, an AI that beat the world’s best player at the ancient board game of Go.

If you are fascinated by games and would love to see an AI beat one, definitely check out more on reinforcement learning. It is a really interesting field that keeps growing, with people coming up with different strategies and ideas to tackle problems. Soon Reinforcement learning will tackle real-world scenarios; until then, keep learning and keep exploring!

Nihal Das
Analytics Vidhya

Engineer @ Qualcomm | Machine Learning | Deep Learning | AI | Python | LinkedIn: https://www.linkedin.com/in/nihal-das/