Reinforcement Learning for beginners

Mehul Gupta
Data Science in your pocket
6 min read · Jan 8, 2020


I have been associated with Data Science for about 2 years now, and one thing that I can guarantee is that it always has surprises for you.

Reinforcement learning is one of those surprises you come across after you are done with supervised learning (all those KNNs, SVMs & Random Forest stuff) & unsupervised learning (K-Means, hierarchical clustering & others).

Let's take an example:

Suppose we want a Robo to drive a car. There are a number of steps to follow to drive a car from one point to another.

Open the door, start the engine, change gears, accelerate at times, apply brakes at times, and so on.

What we need here is a methodology that helps us learn the optimal sequence of steps to reach our destination.

Reinforcement Learning helps us learn an optimal sequence of steps to achieve our final aim (reaching our destination) using trial & error.

Moving on with the basic terminologies and drawing references from the above car drive example:

1. Environment: It is the place where the entire setup is present. For us, it can be the car & the city roads where our Robo is driving.

2. Agent: The agent is the one interacting with the Environment. Robo is our agent in the car drive example.

3. Action: Actions are the moves taken by the agent (Robo) in the environment (the car & roads). Applying brakes is an action.

4. State: When an action is taken (applying brakes), changes are observed. The moving car (State_1) changes to a stopped car (State_2).

The starting point is the initial state, & the final point, where the goal is achieved OR you can't move ahead any further, is the terminal state.

5. Reward: The reward is the instant prize given when an action is performed. This reward can be either positive (when you reach the final state, in our case the final destination) or negative (you haven't reached the destination yet).

The reward scheme can be designed according to the problem as well. If you wish to give small rewards when getting closer to the destination, a bigger reward on reaching the destination, and no reward/a negative reward when moving away from the destination, it's your wish.
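To make these first five terms concrete, here is a minimal sketch in Python of a toy drive-to-destination setup. The environment, the state (blocks left to the destination), the actions and the reward values are all made up purely for illustration, not taken from any real RL library.

```python
import random

# A toy "drive to destination" environment, purely illustrative.
# State: number of blocks remaining to the destination.
# Actions: "accelerate" moves the car one block closer, "brake" keeps it where it is.

class DriveEnvironment:
    def __init__(self, start_distance=5):
        self.state = start_distance              # initial state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        if action == "accelerate":
            self.state -= 1                      # one block closer
        done = (self.state == 0)                 # terminal state: destination reached
        reward = 10 if done else -1              # big reward at the goal, small penalty otherwise
        return self.state, reward, done

# The agent (our Robo) interacting with the environment
env = DriveEnvironment()
done = False
while not done:
    action = random.choice(["accelerate", "brake"])   # a (for now) completely random agent
    state, reward, done = env.step(action)
    print(f"action={action}, new state={state}, reward={reward}")
```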

6. Policy: A policy defines the learning agent's way of behaving at a given time. It is the rule/algorithm we follow to choose the next action when in any state S. Epsilon-greedy (point 17 below) is an example of a policy.

7. Episode: The entire cycle where the agent (Robo) starts from the initial state and, by taking many actions and changing states, reaches the terminal state is called one episode. To explore an optimal path, many such episodes are performed over & over again. It is similar to epochs in deep learning.

RL tasks can be classified into two broad categories on the basis of whether the task ever ends: episodic tasks, which do end (like Ludo or driving to a destination), and continuing tasks, which go on forever and have no terminal state (like swimming to & fro in a river).
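As a rough sketch of what "many episodes" looks like in code (again using a made-up drive-to-destination task, not any particular library):

```python
import random

NUM_EPISODES = 100                        # many episodes, similar in spirit to epochs

for episode in range(NUM_EPISODES):
    distance = 5                          # reset to the initial state: 5 blocks from the destination
    total_reward = 0
    while distance > 0:                   # episodic task: the loop ends at the terminal state
        action = random.choice(["accelerate", "brake"])
        if action == "accelerate":
            distance -= 1
        total_reward += 10 if distance == 0 else -1
    # A continuing task (like swimming to & fro) has no terminal state,
    # so its inner loop would never naturally end.
```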

8. Exploit: When we take the best action (the highest-rewarding action best known to the model at that point of training) given a state, we call this Exploiting.

Example: Let's suppose we are on Road_A (State_1). The Robo knows a shortcut through another road, Road_B, on the right. Hence, by taking a right turn (Action), we will reach the destination faster than usual.

9. Explore: When we randomly choose an action (without considering rewards) given a state, we call it Exploring.

Why is exploring required?

It might be the case that there exists an even shorter route to the destination that is unknown to the agent. If we take the known shortcut every time, we might never discover it. Hence an explore-exploit trade-off exists.

10. Stateless Reinforcement Learning: In some situations, the problem statement doesn't require any states (any sequence of events) to reach the goal. In such cases, the concept of states is omitted. The best example is the Multi-Armed Bandit problem.
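Here is a minimal bandit sketch to show what "stateless" means in practice: there are only arms with hidden win probabilities (the numbers below are invented), and the agent just learns which arm pays the most on average; no sequence of states is involved.

```python
import random

true_probs = [0.2, 0.5, 0.8]       # hidden (made-up) win probability of each arm
estimates = [0.0, 0.0, 0.0]        # the agent's current value estimate for each arm
pulls = [0, 0, 0]

for t in range(1000):
    if random.random() < 0.1:                        # explore (see points 8, 9 & 17)
        arm = random.randrange(3)
    else:                                            # exploit the best-known arm
        arm = estimates.index(max(estimates))
    reward = 1 if random.random() < true_probs[arm] else 0
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]   # incremental average

print(estimates)   # should end up close to the true probabilities
```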

11. Temporal difference (TD) learning: It is an approach to learning how to predict a quantity that depends on the future values of a given signal.

In RL, rewards from the future have to be considered when we want to build a robust policy, so as to make sure that by performing this action at a given state we can (or cannot) reach the final destination. Hence TD learning is an integral part of reinforcement learning.

Example: When the Robo applies brakes (action), the moving car (state) stops, and hence we can't reach our destination. We would want our system not to repeat such actions that can't take us to the end goal, and we need to predict this using future rewards. In such cases, TD learning helps us predict the right action for future states.
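A small sketch of the core TD(0) update for a state-value table: the value of the current state is nudged towards the immediate reward plus the discounted value of the next state. The states, reward and parameter values are placeholders chosen just to mirror the braking example.

```python
alpha, gamma = 0.1, 0.9                  # learning rate & discount factor (illustrative values)

V = {"moving": 0.0, "stopped": 0.0}      # value estimates for two toy states

# Suppose braking takes us from "moving" to "stopped" with an immediate reward of -1.
state, next_state, reward = "moving", "stopped", -1

# TD target: immediate reward + discounted estimate of what the future holds from the next state.
td_target = reward + gamma * V[next_state]
V[state] += alpha * (td_target - V[state])
print(V)
```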

12. Markov property: The Markov property, or memorylessness property, states that the future state depends only on the present state & not on the past states of the agent.

Example: When tossing a coin over and over, even after you have had 3 back-to-back heads, the probability of a head or a tail on the next toss remains 0.5, no matter what the results were in the past.
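A quick simulation (purely illustrative) makes the same point: the frequency of heads right after three heads in a row still comes out around 0.5.

```python
import random

tosses = [random.choice("HT") for _ in range(1_000_000)]
# Look at every toss that follows three back-to-back heads
after_three_heads = [tosses[i] for i in range(3, len(tosses))
                     if tosses[i - 3:i] == ["H", "H", "H"]]
print(after_three_heads.count("H") / len(after_three_heads))   # ~0.5: the past doesn't matter
```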

13. Markov Process: A Markov process is basically a process consisting of a sequence of random states that satisfy the Markov property. It must be noted that, in this framing, a Markov process has an infinite number of states.

14. Markov Chain: It is similar to the Markov process but with a finite number of states.

Below is a good example.

[Figure: a student Markov chain with 7 states (Class_1, Class_2, Class_3, Facebook, Pub, Pass & Sleep)]

It must be noted that it has a finite number of states (7). Also, the future state (Sleep) depends only on Class_2 & Pass and not on the other states of the environment (Pub, Class_1, Class_3, Facebook), showing the Markov property.
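To give a feel for how such a chain behaves, here is a small random walk over those 7 states; the transition probabilities below are assumptions added for illustration, not the exact numbers from the figure.

```python
import random

# Student Markov chain with 7 states; the probabilities are illustrative assumptions.
transitions = {
    "Class_1":  {"Class_2": 0.5, "Facebook": 0.5},
    "Facebook": {"Facebook": 0.9, "Class_1": 0.1},
    "Class_2":  {"Class_3": 0.8, "Sleep": 0.2},
    "Class_3":  {"Pass": 0.6, "Pub": 0.4},
    "Pub":      {"Class_1": 0.2, "Class_2": 0.4, "Class_3": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Sleep":    {},                                  # terminal state
}

state, path = "Class_1", ["Class_1"]
while transitions[state]:                            # walk until the terminal state
    state = random.choices(list(transitions[state]),
                           weights=list(transitions[state].values()))[0]
    path.append(state)
print(" -> ".join(path))
```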

15. Markov Decision Process: An MDP is an environment where all states obey the Markov property.

Then how is it different from a Markov chain & a Markov process?

This environment has some special properties (S, A, P, R, 𝛾), written out in code after the list below, where

  • S is a finite set of states (as in the above example)
  • A is a set of actions that can be taken
  • P is the probability of transitioning from one state to another
  • R represents the reward received on transitioning from one state to another
  • 𝛾 is the discount factor.
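One simple way to write this tuple down in code is with plain dictionaries. The states, actions, probabilities and rewards below are made up only to show the shape of an MDP, not taken from the figure.

```python
# A tiny, made-up MDP written out as the tuple (S, A, P, R, gamma).
S = ["moving", "stopped", "destination"]             # finite set of states
A = ["accelerate", "brake"]                          # finite set of actions

# P[state][action] -> {next_state: transition probability}
P = {
    "moving":  {"accelerate": {"destination": 0.7, "moving": 0.3},
                "brake":      {"stopped": 1.0}},
    "stopped": {"accelerate": {"moving": 1.0},
                "brake":      {"stopped": 1.0}},
}

# R[state][action] -> reward received on making that transition
R = {
    "moving":  {"accelerate": 10, "brake": -1},
    "stopped": {"accelerate": -1, "brake": -1},
}

gamma = 0.9                                          # discount factor
```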

16. Discount Factor (𝛾): It is a constant fraction between 0 & 1 that helps us set a balance on how much future rewards should play a role in defining the policy at a given state when an action is taken.

We must understand that in Reinforcement Learning, the major goal is to maximize the total accumulated reward from time T (when the action is taken) to T+K (when the terminal state is reached; see TD learning). But rewards assumed from future states shouldn't be taken at face value due to the presence of uncertainty & should have a lower influence on current decision-making. Hence, whenever we consider future rewards to update the policy, they are discounted using the discount factor.
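A tiny worked example of discounting, with a made-up reward sequence: the return from time T is the sum of future rewards weighted by increasing powers of 𝛾, so rewards further away count for less.

```python
gamma = 0.9
future_rewards = [-1, -1, -1, 10]        # made-up rewards from time T up to the terminal state

# G_T = r_T + gamma*r_{T+1} + gamma^2*r_{T+2} + ...
G = sum(gamma ** k * r for k, r in enumerate(future_rewards))
print(G)   # ~4.58: the big final reward still matters, but less than an immediate reward would
```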

17. Epsilon-greedy approach: It is a methodology to decide which action to take in a given state. Here epsilon is a constant fraction between 0 & 1. This approach helps us maintain the explore-exploit trade-off: a random number is generated, and if it falls below epsilon, a random action is taken; otherwise, the greedy (best-known) action for the given state is taken.
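In code, the rule typically looks something like the sketch below; Q holds the current value estimate of each (state, action) pair, and all names and numbers are placeholders.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore: random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit: greedy action

# Placeholder value estimates for the actions available on Road_A
Q = {("Road_A", "turn_right"): 5.0, ("Road_A", "go_straight"): 2.0}
print(epsilon_greedy(Q, "Road_A", ["turn_right", "go_straight"]))
```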

18. Value Functions: A value function is the value we keep updating for each state to know how much we should prefer that state. This ‘preference’ is generally formed by estimating the future reward we might get if we are in that state.
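As a last tiny sketch (with invented numbers), a state-value table and how it expresses that ‘preference’ between states:

```python
# A state-value table: a higher value means a more preferred state.
V = {"Road_A": 2.0, "Road_B": 5.0, "Stuck_in_traffic": -3.0}

# Given the states reachable from here, prefer the one with the highest estimated future reward.
reachable = ["Road_A", "Stuck_in_traffic", "Road_B"]
best_next = max(reachable, key=V.get)
print(best_next)    # "Road_B"
```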
