POLICY GRADIENTS IN DEEP REINFORCEMENT LEARNING

Astarag Mohapatra · Analytics Vidhya · Jun 13, 2021

In 2016, the deep reinforcement learning agent AlphaGo beat Lee Sedol, a professional Go player of 9-dan rank (the highest honor in the game of Go). Go had long been regarded as out of reach for computer algorithms because strong play has an intuitive, almost artistic quality to it. But AlphaGo, developed by Google DeepMind, beat Lee Sedol 4–1 in a five-match series. The event sent a chill through the world of human intelligence; the pressure was apparent when Lee Sedol took a cigarette break during one of the matches.

[Figure: the 3 branches of deep learning]

Reinforcement learning, unlike supervised learning, learns from a reward signal rather than from labeled data. In reinforcement learning we have states, actions, and rewards, and the agent has to come up with a strategy that maximizes its cumulative reward. An RL agent has two components: i) a value function that describes how good each state is, and ii) a policy distribution over actions. The first component underlies algorithms like Q-learning, DQN, and DDQN; the second underlies the algorithms we are going to discuss in this article.
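
To make the state-action-reward loop concrete, here is a minimal sketch of the agent-environment interaction. The Gym-style interface (the gymnasium package and the CartPole task) is an illustrative choice on my part, not something from the article:

import gymnasium as gym

# A small toy control task; any Gym-style environment would do
env = gym.make("CartPole-v1")

state, _ = env.reset()
total_reward, done = 0.0, False

while not done:
    # A real agent would sample an action from its policy pi(a|s);
    # a random action stands in as a placeholder here
    action = env.action_space.sample()

    # The environment responds with the next state and a scalar reward
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Episode return:", total_reward)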

LIMITATIONS OF VALUE-BASED METHODS

  • A policy is simply a mapping from a state to the possible actions: given our current state, it tells us which action to take so that, going forward, we maximize our reward (see the sketch after this list).
  • In the value-function method, first, we randomly choose a policy…
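
The rest of the article builds on this idea of a policy as a learned mapping from states to action probabilities. As a reference point, here is a minimal PyTorch sketch of such a stochastic policy; the network shape and names are illustrative assumptions, not the article's code:

import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.net(state)
        # Categorical applies a softmax to the logits, giving pi(a|s)
        return Categorical(logits=logits)

# Sample an action for a 4-dimensional state with 2 possible actions
policy = PolicyNetwork(state_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()            # a ~ pi(a|s)
log_prob = dist.log_prob(action)  # needed later for the policy-gradient update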

