An intuitive introduction to Reinforcement Learning

Intisar Tahmid
intelligentmachines
6 min read · Jun 19, 2020

In recent years, Reinforcement Learning (RL) has been used extensively in many real-life problems. Its applications range from resource management, medical treatment design, tutoring systems, and robotics to several other fields. Tesla’s self-driving car, which has been making headlines recently, is one of the most talked-about examples associated with Reinforcement Learning. RL has also been used to teach computers to control robots in simulation.

In one such system, called Dactyl, a human-like robot hand has been trained to manipulate physical objects with unprecedented dexterity; Reinforcement Learning was used to train this robot hand. Some other applications of Reinforcement Learning are:

● Defeating the world champion at chess

● Managing an investment portfolio

● Controlling a power station

● Playing many different Atari games better than humans

What is Reinforcement Learning?

Formally, Reinforcement Learning, in the context of artificial intelligence, is a type of machine learning that trains algorithms using a system of reward and punishment.

Now let’s explain it in simple terms with an example from one of my favorite shows. If you are an avid watcher of The Big Bang Theory, as I am, you might recall an episode from season 3 where Sheldon tried to rectify Penny’s behavior by giving her chocolates. Whenever Penny did something Sheldon considered good, he rewarded her with a chocolate; this is a positive reward. And when Leonard didn’t agree with him, Sheldon punished him by spraying water on him. This is essentially how reinforcement learning works. Obviously, we don’t train a human being here 😜. Rather, a machine termed an agent (for example, a self-driving car or a robot) is trained so that it learns to collect the maximum positive reward.

Comparison between Supervised, Unsupervised and Reinforcement Learning

RL typically falls between supervised and unsupervised learning. As many of us know, supervised learning works with labeled data, where the correct output for each input is known to the system. Unsupervised learning, on the other hand, deals with unlabeled data and looks for patterns in the observations themselves. In RL, you are not given exact labels, but the data is not completely unlabeled either: the agent receives a reward for every action it takes. An intuitive example: suppose you want to choose a university and there are three options, ‘A’, ‘B’, and ‘C’. You know for sure that university ‘A’ is the right one for you. You have no right or wrong label at all for university ‘B’. You feel good about university ‘C’, but you are not sure whether it is right or not. In this example, ‘A’ corresponds to supervised learning, ‘B’ to unsupervised learning, and ‘C’ to Reinforcement Learning.

Markov Decision Process (MDP) and some important terms of RL

Reinforcement learning problems are mathematically described using a framework called Markov Decision Processes (MDPs). The word Markov refers to the Markov property, which means that, given the current state and action, the future state is independent of the entire history of previous states. In other words, the current state encapsulates everything needed to determine the next state once an action is received; for example, if a robot’s current position and velocity are part of the state, knowing where it was ten steps ago adds no further information. An MDP can be illustrated by the following figure:

Agent-environment interaction in a typical Markov decision process.

MDPs are usually described via an agent-environment interaction. When talking about systems with control, it is necessary to distinguish between the agent and the environment: the environment reacts to the agent’s actions and gives the agent a reward. In our Big Bang Theory example, Penny and Leonard are the agents, their behaviors are the actions, Sheldon plays the role of the environment, and the chocolate and the water spray are the rewards.

A Markov Decision Process (MDP) is defined by a tuple (S, A, T, R, γ), whose elements are described below; a small code sketch of such a tuple follows the list.

  • S is the state space. At each discrete time step t, the agent receives some representation of the environment’s state sₜ ∈ S, where t = 0, 1, 2, …, N. For a finite-horizon MDP, N is finite; for an infinite-horizon MDP, N is infinite.
  • A is the action space: the set of actions the agent can take at each time step to interact with the environment, with aₜ ∈ A.
  • T is the transition function. T(s, a, s′) is the probability that taking action aₜ = a in state sₜ = s at time step t leads to the next state sₜ₊₁ = s′ at time step t + 1: T(s, a, s′) = Pr[sₜ₊₁ = s′ | sₜ = s, aₜ = a].
  • R is the reward function; rₜ is the immediate reward the agent receives at time step t, and it quantifies how good the agent’s behavior was at that step. The agent’s goal is to maximize the long-term expected discounted reward.
  • γ is the discount factor and t is the discrete time step. The value of γ lies between 0 and 1.
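
To make the tuple concrete, here is a minimal sketch of how such an MDP could be written down in Python. The two states, two actions, and all the numbers are hypothetical and chosen only for illustration.

```python
# A tiny, made-up two-state MDP spelled out as plain Python data.
states = ["cold", "hot"]      # S: the state space
actions = ["wait", "heat"]    # A: the action space
gamma = 0.9                   # γ: the discount factor, between 0 and 1

# T[(s, a)] maps each next state s' to Pr[s_{t+1} = s' | s_t = s, a_t = a].
T = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "hot": 0.8},
    ("hot", "wait"):  {"hot": 0.7, "cold": 0.3},
    ("hot", "heat"):  {"hot": 1.0},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("cold", "wait"): -1.0,
    ("cold", "heat"): -2.0,
    ("hot", "wait"):  +1.0,
    ("hot", "heat"):   0.0,
}
```

Note that every row of T sums to 1, which is exactly what makes it a valid transition function.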

We also need to understand a few other terms that are important for grasping the basics of RL.

  • π is a policy, and it describes the agent’s behavior. It is a map from states to actions. A policy can be deterministic, a = π(s), or stochastic, π(a|s) = Pr[aₜ = a | sₜ = s].
  • V is the value function. The value function is a prediction of future reward: it evaluates how good or bad a state is and is therefore used to select between actions. For a policy π it can be written as vπ(s) = Eπ[rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + … | sₜ = s].
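
The value function above is the expected value of a discounted sum of future rewards. As a minimal sketch, here is how that discounted sum can be computed in Python for a single observed reward sequence; the reward numbers are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for one sequence of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A hypothetical episode: three steps costing -1 each, then a final reward of 10.
print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # ≈ 4.58
```

The value vπ(s) is simply the average of such returns over all the episodes the agent could experience when it starts in state s and follows policy π.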

Now that we know the basic terms, let us discuss an RL environment. Here we take a simple maze environment as our example.

Maze Example

A maze environment can be illustrated by the following figure:

A Maze Environment

In this environment, the white boxes are the states that an agent (a robot, for example) can occupy. It begins from the state named Start, and its target is to reach the state named Goal. In each state, the agent can take any of four actions: go North, go South, go East, or go West. For each time step spent in the environment, the agent receives a reward of -1. The idea is to find the optimal set of actions (the policy) with which the agent reaches the goal while collecting the maximum reward. The agent starts with some random policy, meaning that at the beginning it takes arbitrary actions. With each action it collects some reward, and from these rewards it estimates the value function. Based on these value functions, the agent starts to learn which actions are best for reaching the goal state, and finally it learns to select the optimal actions that yield the maximum discounted reward.

Arrows represent the policy π(s) for each state s
Numbers represent the value of each state s

Using these value functions, the agent selects the optimal policy. Now there might be questions about how these value functions are actually calculated; that is something I am planning to write about in my next article.
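
In the meantime, here is a minimal sketch of a maze like the one above and of what a random-policy rollout looks like in Python. The grid layout, the start cell, and the goal cell are hypothetical and do not match the figure exactly.

```python
import random

# A small, made-up grid maze: '.' is a free cell, '#' is a wall, 'G' is the goal.
GRID = ["....",
        ".##.",
        ".#..",
        "...G"]
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(state, action):
    """Move to the neighbouring cell if it is inside the grid and not a wall.
    Every time step costs a reward of -1; the episode ends at the goal."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(GRID) and 0 <= new_col < len(GRID[0]) and GRID[new_row][new_col] != "#":
        state = (new_row, new_col)
    done = GRID[state[0]][state[1]] == "G"
    return state, -1, done

# A random policy: at every step the agent picks an arbitrary action.
state, total_reward, done = (0, 0), 0, False
while not done:
    action = random.choice(list(ACTIONS))
    state, reward, done = step(state, action)
    total_reward += reward
print("Return collected by the random policy:", total_reward)
```

Because every step costs -1, a policy that reaches the goal in fewer steps collects a larger (less negative) return; finding the policy that maximizes this return is exactly the learning problem described above.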

Let’s end the article with a demo video of an agent that learns to play an Atari game, trained with RL.

In this video, we can see that the agent initially performs pretty badly, taking arbitrary actions. Gradually, as the agent learns more and improves its value function, it gets better and better, and finally it finds the optimal policy and learns to play like a pro. 😎

I have tried to explain the basics of RL in this article in a simple way, and I hope it gives you a good intuition for how RL works.

