Reinforcement Learning

Harshit Awasthi
4 min read · Feb 28, 2018


Machine Learning is generally of three types: Supervised Learning, Unsupervised Learning and lastly Reinforcement Learning. I’ll discuss what reinforcement learning is, why it is needed and when to apply it.

Reinforcement learning is a class of algorithms based on learning through trial and error. A typical reinforcement learning setting involves an environment state (s), an action (a), a reward (r), a state-transition probability (T), a state-value function (v) and an action-value function (q). An episode is defined as a complete sequence of interactions from the start to the end of a task.

An agent (which is supposed to learn the required behaviour) is presented a certain state by the environment and chooses an action according to some policy. It then receives a reward based on its action, along with the next state of the environment. The goal of the agent is to maximize the total expected reward it gets by the end of an episode; that is, it seeks to choose actions that are likely to maximize the expected cumulative reward.
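
A minimal sketch of this loop in Python may help. Here `env` and `policy` are hypothetical stand-ins: `env` is assumed to follow the common `reset()`/`step(action)` convention, and `policy` is any function mapping a state to an action.

```python
def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()                          # environment presents the first state
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # agent picks an action via its policy
        state, reward, done = env.step(action)   # environment returns reward and next state
        total_reward += reward                   # accumulate the episode's return
    return total_reward
```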

For instance, consider a cleaning robot that is supposed to pick up a few objects and place each one in the correct box out of the two boxes present. Initially the robot has no idea which box is correct; it only knows the current state of the surrounding environment. Based on a certain policy it chooses an action, such as putting the object in box 1. If that is the correct box it receives a positive reward, while if it is the wrong one it receives a negative reward. The goal of the agent is to maximize the total return it gets at the end. At the end of each episode it updates its policy for choosing an action in a given state. We repeat this until we converge to an optimal policy, along with the corresponding optimal value functions v and q.
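
To make this concrete, here is a toy sketch of the two-box task. The correct box, the reward values and the epsilon-greedy choice rule are illustrative assumptions; because there is effectively a single state, the robot only needs a bandit-style running estimate of each box's value.

```python
import random

CORRECT_BOX = 0          # hidden from the agent; it must discover this by trial and error
values = [0.0, 0.0]      # running estimate of the reward for choosing each box
counts = [0, 0]          # how many times each box has been chosen
epsilon = 0.1            # small chance of exploring a random box

for episode in range(1000):
    if random.random() < epsilon:
        box = random.randrange(2)                      # explore
    else:
        box = max(range(2), key=lambda b: values[b])   # exploit the best-looking box
    reward = 1.0 if box == CORRECT_BOX else -1.0
    counts[box] += 1
    values[box] += (reward - values[box]) / counts[box]   # incremental average

print(values)   # the estimate for CORRECT_BOX should approach +1.0
```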

Generally there are two types of reinforcement learning tasks:

  1. Episodic Tasks
  2. Continuing Tasks

Episodic tasks are those for which we can define a certain ending point. For instance, consider an agent that drives a self-driving car: the episode ends when the car crashes. When the episode ends, the agent starts from scratch in the same environment, but it now carries the added knowledge of what happened in its past episodes. In this way it makes better decisions over time, picking a strategy for each new state it is presented with so as to maximize the cumulative reward it receives.

Not all tasks are episodic. For instance, in share markets the prices of the available stocks change continuously, so an AI that buys and sells stocks in the market would be labelled a continuing task. Equivalently, we can say that there is only one episode, and it never ends.
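
For a continuing task the cumulative reward would be an infinite sum, so in practice each future reward is weighted by a discount factor gamma (a standard device not discussed above): episodic tasks can use gamma = 1, while continuing tasks need gamma < 1 to keep the sum finite. A small sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r0 + gamma*r1 + gamma^2*r2 + ...
    Episodic tasks may use gamma = 1; continuing tasks need gamma < 1."""
    g = 0.0
    for r in reversed(rewards):   # fold from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, -1.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*(-1.0) = 0.19
```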

When to use Reinforcement Learning

In practice, we often don’t have labels available to train our machine learning models, and this is where reinforcement learning comes to the rescue. In reinforcement learning, labels are time delayed, and we call them rewards instead. So we use reinforcement learning when we don’t have labels, or when a trial-and-error approach works better than generating labels ourselves.

Reinforcement Learning methods can be classified into two categories:

  1. Model-free methods
  2. Model-based methods

Model-based learning is when the agent has knowledge of the one-step dynamics of the environment. In other words, when an agent knows, before taking an action, the probabilities of each next state and reward, it is said to be using a model-based learning approach. In contrast, when the agent does not know the one-step dynamics of the environment beforehand, it is said to be using model-free reinforcement learning.
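
One way to picture “knowing the one-step dynamics” is as a lookup table from each state-action pair to its possible outcomes. The shape below mirrors a common convention (OpenAI Gym’s toy-text environments expose a similar table), but the states, actions and numbers are made up for illustration:

```python
# P[state][action] -> list of (probability, next_state, reward) tuples
P = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],                     # deterministic transition
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],   # stochastic transition
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "s1", 0.0)],
    },
}

# A model-based agent can compute the expected reward before acting:
expected_reward = sum(p * r for p, _next_state, r in P["s0"]["right"])
print(expected_reward)   # 0.8
```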

Examples of model-based methods:

Dynamic programming, which uses the known one-step dynamics (T and R) to iteratively perform policy evaluation and policy iteration.
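
As a flavour of the idea, here is a compact sketch of iterative policy evaluation. It reuses the hypothetical dynamics table `P` from the sketch above and assumes a uniform random policy and a discount factor gamma:

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Estimate v(s) for a fixed policy, given known one-step dynamics
    P[s][a] = [(prob, next_state, reward), ...]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup for state s
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop once the values have converged
            return V

uniform = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
print(policy_evaluation(P, uniform))
```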

Examples of model-free methods:

Monte Carlo prediction, TD learning, Q-learning, SARSA, etc.

In reinforcement learning we also distinguish between on-policy methods, which attempt to evaluate or improve the policy that is used to make decisions, and off-policy methods, which evaluate or improve a policy different from the one used to generate the data.
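
The distinction shows up directly in the one-step TD update rules. In the schematic sketch below, `Q` is a hypothetical dict-of-dicts of estimated action values; SARSA bootstraps from the action the agent actually takes next (on-policy), while Q-learning bootstraps from the greedy action regardless of what the behaviour policy does (off-policy):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action actually taken in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```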

Here is one of my implementations of a blackjack environment, using the Monte Carlo technique to estimate the optimal state-value function and action-value function.
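
To give a flavour of the technique, here is a minimal first-visit Monte Carlo prediction sketch (a generic illustration, not the implementation referred to above). It assumes each episode arrives as a list of (state, action, reward) tuples:

```python
from collections import defaultdict

def mc_prediction_q(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of the action-value function q(s, a)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = defaultdict(float)
    for episode in episodes:
        # Record the index of the first visit to each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        # Walk backwards, accumulating the return G after each time step.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:   # average returns from first visits only
                returns_sum[(s, a)] += g
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```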

This was a brief overview of reinforcement learning basics. Obviously there is a lot more to cover. I’ll try to cover the maths and algorithms involved in the various methods in the next post. Till then, stay awesome and keep learning!
