Reinforcement Learning — Machines learning by interacting with the world
In the past few years, the field of Artificial Intelligence has been booming. It has made impressive advances and has enabled computers to repeatedly challenge human performance in various domains. Many of us are familiar with AlphaGo, the first computer program to beat a professional human player at the game of Go without handicaps. Its successor, AlphaZero, is widely regarded as the strongest player in Go, and possibly in chess as well.
But Reinforcement Learning (RL) is not just good at games. It has applications in finance and cyber security, and it can even teach machines to paint. This post is the first in a series explaining important concepts of Reinforcement Learning.
I am writing this series of articles to cement my understanding of RL concepts as I work through the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. I will write up my understanding of the various concepts, complete with the associated programming tasks.
What is Reinforcement Learning?
Suppose we have an agent in an environment whose dynamics are completely unknown to the agent. The agent can interact with the environment by taking actions, and the environment, in turn, returns a reward for each action. The agent’s aim is to maximize the total reward accumulated during its episode of interaction with the environment. For example, a bot playing a game, or a robot at a restaurant that is rewarded for cleaning tables after the customers leave.
The goal is for the agent to learn, from these trials and the feedback it receives, a strategy that maximizes the reward.
Key Concepts
Before proceeding, let us define some key concepts. The agent acts in an environment. How the environment reacts to the agent’s actions is defined by a model of the environment. At any given point in time, the agent is in a state (s∈S) and can take any action from a set of actions (a∈A). Upon taking an action, the agent transitions from state s to s’. The probability of transitioning from s to s’ is given by the transition function. The environment returns a reward drawn from a set of rewards (r∈R). The strategy by which the agent chooses an action in a state is called the policy π(s).
While designing an RL agent, the agent might be or might not be familiar with the model of the environment. Hence, there arise two different circumstances:
1. Model-based RL: The agent either knows the complete model of the environment or learns it during its interactions with the environment. If the complete model is known, the optimal solution can be found using Dynamic Programming.
2. Model-free RL: The agent learns a strategy to interact with the environment without any knowledge of the model and does not try to learn a model of the environment.
The goal of the agent is to take actions so as to maximize the total reward. Each state is associated with a value function V(s) predicting the expected amount of future reward we can receive from that state by acting according to the corresponding policy. In other words, the value function quantifies how good a state is.
A sequence of interactions between the agent and the environment is known as an episode (also called a “trajectory” or “trial”). An episode is composed of states, actions and rewards at time steps t = 1, 2, …, T. The state, action and reward at time t are denoted by Sₜ, Aₜ and Rₜ, respectively. Thus an episode looks like: S₁, A₁, R₂, S₂, A₂, R₃, …, S_T, where T marks the end of the episode.
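To make this concrete, here is a tiny Python sketch of an agent interacting with an environment and recording an episode. The CorridorEnv class and random_policy below are made-up toys for illustration only (a 1-D corridor where the agent is rewarded for reaching position 3); they are not from the book.

import random

class CorridorEnv:
    """A made-up 1-D corridor: start at position 0, episode ends at position 3."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):            # action is -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        reward = 1.0 if self.pos == 3 else 0.0
        done = self.pos == 3
        return self.pos, reward, done

def random_policy(state):
    return random.choice([-1, +1])     # pick a direction uniformly at random

def collect_episode(env, policy):
    """Run one episode and return it as a list of (S_t, A_t, R_{t+1}) tuples."""
    trajectory, state, done = [], env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

print(collect_episode(CorridorEnv(), random_policy))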
Some other key terms used commonly:
1. On-policy: The algorithm is trained on outcomes or samples generated by the target policy itself. The target policy is the policy that will actually be used when the agent is deployed, i.e. when it is acting rather than being trained.
2. Off-policy: The algorithm is trained on a distribution of transitions or episodes produced by a different behaviour policy, rather than by the target policy.
What is a model of the environment?
Suppose we are training a robot to walk to a faraway point. Formulating it as a very basic task, suppose we want the agent to control the various parts of the robot so that it walks upright. The robot should not deviate more than 20° from the vertical axis. The robot is rewarded for every time step it stays within this angular criterion, and the closer it gets to the destination, the higher the reward. Here, the model of the environment reacts to every action taken by the robot, incorporating the effects of gravity, momentum, etc., and then returns the next state to the agent. Hence, all the factors that determine the next state the agent transitions to, and the reward it gets, are part of the model of the environment.
A model has two major parts: the transition function P and the reward function R.
Let’s say we are in state s and decide to take action a, arriving in the next state s’ and obtaining reward r. This is known as one transition step, represented by the tuple (s, a, s’, r).
The transition function P records the probability of transitioning from state s to s’ after taking action a while obtaining reward r: P(s’, r | s, a) = ℙ[Sₜ₊₁ = s’, Rₜ₊₁ = r | Sₜ = s, Aₜ = a].
From this, we can determine the state-transition function by summing over all possible rewards: P(s’ | s, a) = Σᵣ P(s’, r | s, a).
The reward function R is the expected reward received on taking action a in state s: R(s, a) = 𝔼[Rₜ₊₁ | Sₜ = s, Aₜ = a].
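To see how these definitions fit together, here is a hypothetical Python sketch of a tabular model. The states, actions, rewards and probabilities in the model dictionary are invented for illustration; the point is only that P(s’ | s, a) and R(s, a) both fall out of the joint probabilities P(s’, r | s, a).

# model[(s, a)] = list of (next_state, reward, probability) entries
model = {
    ("s0", "a0"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "a0"): [("s1", 0.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """P(s' | s, a): marginalize the joint probabilities over rewards."""
    return sum(p for (sp, r, p) in model[(s, a)] if sp == s_next)

def expected_reward(s, a):
    """R(s, a): expected reward after taking action a in state s."""
    return sum(r * p for (sp, r, p) in model[(s, a)])

print(transition_prob("s0", "a0", "s1"))  # 0.8
print(expected_reward("s0", "a0"))        # 1.0*0.8 + 0.0*0.2 = 0.8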
Policy — The agent’s strategy
The policy (π) is what determines the agent’s behaviour, i.e. the action a the agent takes in a state s. The policy can be:
1. Deterministic: For every state, there is a single action defined that the agent will take in that state: π(s) = a.
2. Stochastic: The policy gives the probability of taking each possible action in state s (seems like Neural Networks might be useful here?): π(a|s) = ℙ[A=a|S=s]. See the sketch below.
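A minimal sketch of both kinds of policy, with made-up states and actions:

import random

deterministic_policy = {"s0": "a1", "s1": "a0"}    # pi(s) = a

stochastic_policy = {                               # pi(a|s)
    "s0": {"a0": 0.3, "a1": 0.7},
    "s1": {"a0": 1.0},
}

def sample_action(policy, state):
    """Draw an action with probability pi(a|state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy["s0"])              # always 'a1'
print(sample_action(stochastic_policy, "s0"))  # 'a1' about 70% of the time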
Value Function — How good is the state I am in
For every state, there is a value function that estimates the total future reward that can be obtained from it. The future reward, also known as the return, is the sum of discounted rewards going forward and is denoted by Gₜ: Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … (a small code sketch of this follows the list below).
The discounting factor γ∈[0,1] penalizes rewards in the future, because:
- The future rewards may have higher uncertainty (e.g. the stock market).
- The future rewards do not provide immediate benefits.
- Discounting provides mathematical convenience; i.e., we don’t need to track future steps forever to compute return.
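Computing the return is just a discounted sum; here is the sketch mentioned above (the reward values and γ below are arbitrary):

def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71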
State-value is the expected return that can be obtained when we are in state s at time t: Vπ(s) = 𝔼π[Gₜ | Sₜ = s].
Similarly, we also have an action-value, also called the Q-value: the expected return from taking action a in state s at time t: Qπ(s, a) = 𝔼π[Gₜ | Sₜ = s, Aₜ = a].
There is a way to determine V(s) from Q(s, a). What if we take the action-values of all the possible actions in a state and weight each by the probability of taking that action under the policy? That is exactly what we do: Vπ(s) = Σₐ π(a|s) Qπ(s, a).
Another cool thing is the advantage function: the difference between the Q-value of an action a in state s and the value of the state s, Aπ(s, a) = Qπ(s, a) − Vπ(s). You can think of it like this: I know that in my current state I can expect a certain return; if I take this particular action, how much better off does it leave me?
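A short sketch of both ideas, using a made-up Q-table and policy:

q = {("s0", "a0"): 1.0, ("s0", "a1"): 3.0}   # Q(s, a), invented numbers
pi = {"s0": {"a0": 0.5, "a1": 0.5}}          # pi(a|s)

def state_value(s):
    """V(s) = sum over a of pi(a|s) * Q(s, a)."""
    return sum(p * q[(s, a)] for a, p in pi[s].items())

def advantage(s, a):
    """A(s, a) = Q(s, a) - V(s): how much better action a is than the average."""
    return q[(s, a)] - state_value(s)

print(state_value("s0"))      # 0.5*1.0 + 0.5*3.0 = 2.0
print(advantage("s0", "a1"))  # 3.0 - 2.0 = 1.0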
Optimal Value and Policy
Since we are talking about learning a policy that maximizes our rewards, there must be some “optimal” form, right? Well, there is.
The optimal value function is the one that yields the maximum return over all policies: V*(s) = maxπ Vπ(s).
And similarly, the optimal action-value function is: Q*(s, a) = maxπ Qπ(s, a).
And, of course, the optimal policy π* is what the agent tries to learn: the policy that takes the best possible action in every state so as to maximize the return, i.e. π*(s) = argmaxₐ Q*(s, a).
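As a small illustration, suppose we somehow had the optimal Q-values (the table below is made up); then the optimal policy is simply the greedy one with respect to them:

q_star = {
    "s0": {"a0": 1.2, "a1": 0.4},
    "s1": {"a0": 0.0, "a1": 2.5},
}

def greedy_policy(q_table):
    """pi*(s) = argmax over a of Q*(s, a)."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}

print(greedy_policy(q_star))  # {'s0': 'a0', 's1': 'a1'}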
Ending notes:
That is it for this tutorial. See you in the next one!
References:
1. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto