Introduction to Reinforcement Learning

Meetakoti Kirankumar
6 min read · Dec 14, 2018


Reinforcement Learning (RL) came into the limelight with the defeat of Lee Sedol by a deep reinforcement learning algorithm developed by Google DeepMind. RL is all about learning from experience. If we were to formalize how a child learns to walk: the child is an agent who tries to manipulate the environment (the surface on which the child walks) by taking actions (walking), and the child tries to go from one state (each step the child takes) to another. The child gets a reward (it gets closer to the toy it is trying to reach) when it accomplishes the task (taking a couple of steps to reach the toy) and gets a negative reward (it does not reach the toy) when it is not able to walk.

Reinforcement Learning (RL) is a kind of machine learning in which an agent learns by interacting with an environment. RL is different from supervised and unsupervised learning, where the model learns from the data it is provided without affecting that data. In RL the agent gathers data through a trial-and-error process: each time it acts, it gets feedback (rewards) from the environment and can adjust its behavior based on its own observations.

RL Formalisms and Relationships

The agent performs an action and receives a reward and the next state from the environment.

Agent: An agent is someone/something that interacts with the environment by executing actions, taking observations, and receiving rewards.

Environment: It is everything outside of the agent (the environment can be a real or a simulated world). Example: a robot learns how to walk in the real world, whereas a video game runs in a simulated world.

RL problems are solved using a framework called the Markov Decision Process (MDP), where transitions between states follow the Markov property.

Markov Property: A process satisfies the Markov property if, at any point in the process, the future state depends only on the present state. In other words, the conditional probability distribution of future states (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it, so the agent can decide what action to take based on the current state alone.

An MDP is a tuple of five components: (States, Actions, Transition probabilities, Rewards, Gamma (discount factor)).

Let us go through an example to understand these five components in detail.

In the grid, our agent can start anywhere and gets a reward (+1 or -1) based on the state it ends up in. Our objective is to maximize the reward.

States (S): Here the states are the cells ((0,0), (0,1), (0,2), (0,3), (1,0), (1,2), (1,3), (2,0), (2,1), (2,2), (2,3)). In this list, (0,3) and (1,3) are terminal states. Terminal states are the final states of an episode; once the agent reaches one of them, the episode ends.

Actions (A): The actions the agent can take in a given state: left, right, up, or down.

State transition probabilities (P(S′|S)): The probability of moving from one state to another, for example the probability of moving from state (0,0) to state (0,1).

Reward (R): A scalar value obtained from the environment for every action the agent takes. It can be positive or negative. In our example the grid is the environment and +1 or -1 are the rewards.

Gamma (discount factor): A measure of how far into the future we look when estimating future returns. Gamma ranges from 0 to 1.
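
As a rough sketch of how these five components could be written down for the grid example (illustrative Python; the variable names are mine, not from any library, and gamma is set to the 0.9 used later in the article):

```python
# Illustrative sketch of the grid MDP's five components.
states = [(0, 0), (0, 1), (0, 2), (0, 3),
          (1, 0), (1, 2), (1, 3),          # (1, 1) is not a usable cell
          (2, 0), (2, 1), (2, 2), (2, 3)]
terminal_states = [(0, 3), (1, 3)]          # the episode ends in these cells
actions = ["left", "right", "up", "down"]

# Reward: +1 for ending up in (0, 3), -1 for ending up in (1, 3), 0 elsewhere.
rewards = {s: 0.0 for s in states}
rewards[(0, 3)], rewards[(1, 3)] = 1.0, -1.0

gamma = 0.9                                 # discount factor in [0, 1]

# Transition probabilities P(S'|S, A): in this simple grid the moves are
# deterministic, so the probability is 1 for the neighbouring cell an action
# points to and 0 for every other cell.
```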

Policy: The policy is the most important and central concept in MDPs and RL. It is a function that specifies what action should be taken in each state.

A deterministic policy always returns the same action for a given state (the action it considers best), whereas a stochastic policy models a distribution over actions and draws an action according to this distribution.
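
To make the distinction concrete, here is a minimal sketch; the states, actions, and probabilities below are made up purely for illustration:

```python
import random

actions = ["left", "right", "up", "down"]

# Deterministic policy: a fixed lookup from state to a single action.
deterministic_policy = {(0, 0): "right", (1, 0): "up"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {(0, 0): [0.1, 0.7, 0.1, 0.1],   # P(action | state)
                     (1, 0): [0.1, 0.1, 0.7, 0.1]}

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(actions, weights=probs, k=1)[0]
```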

Value Function: It is the expected return starting from a state, and it depends on the agent’s policy. It determines how good a state is.

Let us say we have a sequence of states and rewards s1, r1, s2, r2, s3, r3, s4, r4.

The value function V(S1) can then be calculated as follows:

V(S1)=r2+γr3+γ²r4

V(S2)=r3+γr4

V(S1)=r2+γ(r3+γr4)=r2+γ(V(S2))

V(S)=r+γV(S′), where S′ is the next state.
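
A quick numerical check of this recursion (the reward values are made up here, since r2, r3, r4 are left symbolic in the text):

```python
gamma = 0.9
r2, r3, r4 = 1.0, 0.5, 2.0   # illustrative reward values, not from the article

# Direct expansion: V(S1) = r2 + gamma*r3 + gamma^2 * r4
v_s1_direct = r2 + gamma * r3 + gamma ** 2 * r4

# Recursive form: V(S2) = r3 + gamma*r4, then V(S1) = r2 + gamma*V(S2)
v_s2 = r3 + gamma * r4
v_s1_recursive = r2 + gamma * v_s2

assert abs(v_s1_direct - v_s1_recursive) < 1e-9   # both forms agree
```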

Bellman’s Equation:

A Bellman equation, named after Richard E. Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. The key property of Bellman equations is that they let us express the values of states in terms of the values of other states. This means that if we know the value of S′, we can easily calculate the value of S.

Bellman’s Equation
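
Written out for a general policy π, the Bellman expectation equation is usually given as V(s)=Σ_a π(a|s) Σ_{s′} P(s′|s,a)[R(s,a,s′)+γV(s′)]; with a fixed action and deterministic transitions it reduces to the V(S)=r+γV(S′) form used above.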

Let us go through an example to calculate the state values using Bellman’s equation.

Value of a state: V(S) = immediate reward + discounted future reward

State transition probability from state 1 to state 2: P(S2|S1)=0.5

State transition probability from state 1 to state 3: P(S3|S1)=0.5

V(S)=r+γV(S′), where S′ is the next state.

γ is the discount factor, with values in [0,1]. Let us take γ=0.9 in our example.

The state value of a terminal state is always zero.

V(S3)=V(S4)=0

V(S2)=1+γ*V(S4)=1+0.9*0=1

V(S1)=P(S3|S1)*(-0.5+γ*V(S3))+P(S2|S1)*(-0.2+γ*V(S2))

=0.5*(-0.5+0.9*0)+0.5*(-0.2+0.9*1)=-0.25+0.35=0.10
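
A small sketch that reproduces this hand calculation (state names and numbers are taken from the example above):

```python
gamma = 0.9

# Terminal states have value zero.
v_s3 = 0.0
v_s4 = 0.0

# V(S2) = 1 + gamma * V(S4)
v_s2 = 1.0 + gamma * v_s4             # -> 1.0

# V(S1) = P(S3|S1)*(-0.5 + gamma*V(S3)) + P(S2|S1)*(-0.2 + gamma*V(S2))
v_s1 = 0.5 * (-0.5 + gamma * v_s3) + 0.5 * (-0.2 + gamma * v_s2)
print(v_s1)                           # 0.10 (up to floating-point rounding)
```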

Let us solve the grid problem using Bellman’s equation. Before solving the problem, let us define the policy (the action to take in each state).

Policy

Here we determine the actions to be taken in each state. The grid shown here has 3 rows and 4 columns; imagine it as a matrix. The top left corner is position (0,0); when the agent is in this state, it will take the Right action.

States and actions: {(0,0)-R, (0,1)-R, (0,2)-R, (1,0)-U, (1,2)-U, (2,0)-U, (2,1)-L, (2,2)-U, (2,3)-L}

The agent can start the game from any state. Let us see how the agent performs actions when it starts from, say, state (2,0).

Initially the values of all states are zero. We will calculate each state value using the Bellman equation. To reach the end state by following the policy from state (2,0), the agent performs the actions Up, Up, Right, Right, Right. Once the agent reaches the end state, the game is over.

If the agent starts at state (0,0), it can perform two actions (Down and Right). As per the policy the agent takes the action Right, but in real scenarios the agent will try all possible actions and take the one that gives the larger cumulative reward. Let us calculate the value of each state. The agent’s objective is to reach the terminal state (0,3). To calculate the value at state (0,0):

V(0,0) is calculated as the immediate reward plus the discounted value of the next state, i.e. r+γ*V(0,1). In our example the reward is 0 for all states except the (0,3) and (1,3) states.

V(0,0)=r+γ*V(0,1)=0+0.9*0.9=0.81

V(0,1)=r+γ*V(0,2)=0+0.9*1=0.9

Agent Starting at (0,2)

V(0,2)=1+0.9*0=1

V(1,0)=0+0.9*0.81=0.729

V(2,0)=0+0.9*0.729=0.6561

Agent Starting at (2,2)

Similarly, we can calculate the values of all the states. Once we have the values of all states, wherever the agent starts the game it always chooses the action that gives the maximum reward. In this example the agent purely follows the policy, meaning it takes only the one action in each state determined by the policy. If the policy is stochastic, the agent can take random actions in each state and then choose an action that maximizes the reward.
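
For reference, here is a minimal sketch that computes all of these grid values by repeatedly applying the Bellman backup under the deterministic policy listed above. It assumes, as in the hand calculation, that the +1/-1 reward is received when the agent steps into (0,3) or (1,3) and that all other rewards are 0:

```python
gamma = 0.9

# Deterministic policy from the article: one action per non-terminal state.
policy = {(0, 0): "R", (0, 1): "R", (0, 2): "R",
          (1, 0): "U", (1, 2): "U",
          (2, 0): "U", (2, 1): "L", (2, 2): "U", (2, 3): "L"}
terminal_reward = {(0, 3): 1.0, (1, 3): -1.0}   # reward for entering these cells
moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

V = {s: 0.0 for s in policy}                    # terminal values stay at zero

# Sweep until the values settle (transitions are deterministic, so a handful
# of sweeps is enough; 100 is a safe upper bound).
for _ in range(100):
    for s, a in policy.items():
        dr, dc = moves[a]
        ns = (s[0] + dr, s[1] + dc)             # next state under the policy
        r = terminal_reward.get(ns, 0.0)
        V[s] = r + gamma * V.get(ns, 0.0)       # terminal states contribute 0

print(V[(0, 2)], V[(0, 1)], V[(0, 0)])          # ~1.0, ~0.9, ~0.81
print(V[(1, 0)], V[(2, 0)])                     # ~0.729, ~0.6561
```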

In my next article I will discuss three methodologies for solving the Markov Decision Process:

1. Dynamic Programming (Policy Iteration, Value Iteration)

2. Monte Carlo Learning (Monte Carlo Prediction and Monte Carlo Control)

3. Temporal Difference Learning (SARSA and Q-Learning)
