Basics — Reinforcement Learning

NancyJemimah · Published in Analytics Vidhya · Apr 24, 2020
Image Courtesy: DataCamp

As you know, machine learning is a sub-category of AI, and various learning paradigms in turn fall under machine learning. In this series, I am going to cover the basics of Reinforcement Learning (RL) at an intermediate level. When we talk about machine learning, many people think in terms of features and of predicting a target; Reinforcement Learning is framed differently. To give a brief outline, RL is a technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Although both supervised and reinforcement learning map inputs to outputs, in supervised learning the feedback given to the model is the correct action to take for a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior.

Application of Reinforcement Learning

To make this clearer, here are some practical applications of Reinforcement Learning:

  • Robotics for industrial automation
  • Business strategy planning
  • Machine learning and data processing
  • Self-driving cars
  • Aircraft control and robot motion control
  • Stock market trading

The architecture of Reinforcement Learning

From the image above, we can understand what the basic architecture of Reinforcement Learning looks like. Our RL setup consists of an agent, which we train to interact with the environment by making a sequence of decisions (actions). The environment can be anything; it is not confined to a gaming platform. RL is even used in stock market trading, in which case the stock market is the environment the agent interacts with. The environment tells the agent its state, and it also gives the agent feedback in the form of a reward.
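
To make that interaction concrete, here is a minimal sketch of the agent-environment contract in Python. The class and method names are illustrative, not from any particular library:

```python
# Illustrative sketch of the agent-environment loop described above.
class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply the agent's action and return (next_state, reward, done)."""
        raise NotImplementedError


class Agent:
    def act(self, state):
        """Choose an action for the current state (this is where the policy lives)."""
        raise NotImplementedError
```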

Goal of Reinforcement learning: To maximize our total future rewards.

Reward hypothesis: The agent learns about its interaction with the environment through the reward (scalar feedback) it receives from the environment. Rewards are therefore the central signal we use to make the agent behave better in the environment.

In real-world scenarios, we have to define the rewards for our environment ourselves.

I hope you got the essence of what Reinforcement Learning is and what its main goal is. Let's dive a little deeper into the terminology used in the field of Reinforcement Learning.

Components of Reinforcement Learning

These are the most common terms that come up when we start talking about RL:

State: The current situation of the agent.

Action: The decision or move made by the agent to go from its current state to the next state.

Environment: The place where the agent interacts and performs actions; the environment can be physical or virtual.

Agent: The one that we train in our Reinforcement Learning problem to tackle the environment.

Reward: The feedback given to the agent after the move from the current state to the future state. The feedback signal is scalar.

Policy: The strategy the agent follows to decide which action to take in each state; the optimal policy is the one that maximizes the expected return.

Value function: The expected cumulative future reward the agent would obtain from a state (or from taking an action in a state) and then continuing from there.

Markov Decision Process

All our problems in Reinforcement Learning can be formulated as a Markov Decision Process (MDP). An MDP consists of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s), and a transition model P(s′ | s, a). However, real-world environments are likely to lack any prior knowledge of these dynamics.

The MDP and the agent together give rise to a sequence, or trajectory, that begins like this:

S0, A0, R1, S1, A1, R2, S2, A2, R3,…..

Conditional Probability to determine future rewards

The above function gives us the probability of landing in state s′ and receiving a given reward, conditioned on taking the specific action a in state s.
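
Written out in the standard MDP notation (as in Sutton and Barto), this one-step dynamics function is:

```latex
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}
```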

Here is an illustration to understand more about how a Reinforcement Learning algorithm works.

FrozenLake game

In the image above, the stick figure is our agent. We need to train the agent to get the frisbee, which is pink in color, while being careful not to fall into a hole. Our environment is the 4 x 4 grid the agent interacts with.

The actions an agent can take are: move left, move right, move forward, and move backward. The states are the observations of the environment; in this example, the position of the agent is the state, so we have 16 states in our environment. We can also formulate a reward: if the agent falls into a hole, the game is over and the reward is -1, while reaching the goal gives +1.

  • Our agent receives state S0 from the environment (in our case, the first position of our agent (state) from FrozenLake (the environment))
  • Based on that state S0, the agent takes an action A0 (our agent moves left)
  • The environment transitions to a new state S1 (a new frame)
  • The environment gives some reward R1 to the agent (staying safely on the ice: +1)

The RL loop outputs → (States, Actions, Rewards)
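
To see this loop in code, here is a minimal sketch using OpenAI Gym's FrozenLake environment. It assumes the classic Gym API, where step returns a 4-tuple; newer gym/gymnasium versions return reset and step results slightly differently, and older versions name the environment FrozenLake-v0:

```python
import gym  # assumes the classic OpenAI Gym API (pre-0.26)

env = gym.make("FrozenLake-v1", is_slippery=False)

state = env.reset()                 # S0: the agent's starting position
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()                  # stand-in for the policy: a random action
    next_state, reward, done, info = env.step(action)   # environment returns S_{t+1} and R_{t+1}
    total_reward += reward
    state = next_state

print("Episode finished with cumulative reward:", total_reward)
```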

The main goal is to maximize our expected cumulative reward.

Sum of future rewards at each time step: G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T

At each time step, we sum up the rewards that follow. In the real world, though, we cannot simply add all rewards with equal weight, since some of them do not arrive immediately. To control how much we care about future rewards, we introduce a discount factor called gamma, which is used to discount future rewards. The value of gamma lies in the range 0 to 1, and it is considered one of the important hyperparameters.

  1. The larger the gamma value, the more the agent cares about long-term rewards.
  2. The smaller the gamma value, the more the agent cares about immediate rewards.

So it is important to tune gamma so that we strike the right balance and achieve our goal of maximizing the discounted cumulative future reward.

Summation of all the rewards with the discount factor: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
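
As a quick sanity check, here is a tiny Python sketch (illustrative only) that computes this discounted return for a list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards weighted by increasing powers of gamma."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A reward of 1 that arrives three steps in the future is worth 0.9**3 = 0.729 today.
print(discounted_return([0, 0, 0, 1], gamma=0.9))
```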

Approaches to reinforcement learning

The following are the approaches that are widely used in Reinforcement Learning

Value-based (state-action pairs): In a value-based approach, the aim is to maximize a value function V(s): we estimate how good it is to be in a certain state or to take a specific action (i.e. value learning).

From all these value estimates, the agent then takes the action with the maximum state-action value.

Policy-based: In policy-based RL, we directly optimize the policy function π(s) without using a value function; we derive an optimal policy directly to maximize rewards. Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy π at time t, then π(a|s) is the probability that A_t = a given S_t = s.

We have two types of Policy:

  1. Deterministic: For a given state, the policy always returns the same action, so repeating the policy in the same state yields the same behavior at every time step.
  2. Stochastic: The policy outputs a probability distribution over actions. When the agent is in a certain state, the action it takes (and therefore the resulting next state) is not necessarily the same every time; this uncertainty makes the optimal policy harder to find. A small sketch of both kinds of policy follows below.
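
Here is a minimal Python sketch contrasting the two; the Q-table and the softmax temperature are illustrative assumptions, not part of the article:

```python
import numpy as np

n_states, n_actions = 16, 4          # FrozenLake-sized example
Q = np.zeros((n_states, n_actions))  # placeholder state-action values

def deterministic_policy(state):
    """Always returns the same action for a given state: greedy with respect to Q."""
    return int(np.argmax(Q[state]))

def stochastic_policy(state, temperature=1.0):
    """Samples an action from a softmax distribution over the Q-values."""
    prefs = Q[state] / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(np.random.choice(n_actions, p=probs))
```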

Monte-Carlo vs TD learning:

The Monte-Carlo method in Reinforcement Learning is one in which the agent interacts with the environment but only receives its feedback from the environment at the end of the episode, unlike TD learning, where the value estimate is updated at each time step to reflect how good the current state is.

In the Monte-Carlo method, the agent generates sample experience, and at the end of each episode the average return is calculated and used to estimate state or state-action values. This can only be used for episodic problems.
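
As a rough sketch, here is an every-visit Monte-Carlo value update in Python, assuming each episode is stored as a list of (state, reward received after leaving that state) pairs:

```python
from collections import defaultdict

gamma = 0.99
returns = defaultdict(list)   # every sampled return observed from each state
V = defaultdict(float)        # Monte-Carlo value estimates

def mc_update(episode):
    """episode: list of (state, reward) pairs for one complete episode, in time order."""
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g                        # return G_t, accumulated backwards
        returns[state].append(g)
        V[state] = sum(returns[state]) / len(returns[state])
```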

TD (Temporal Difference) Learning, on the other hand, does not wait until the end of the episode to update its estimate of the maximum expected future reward: it updates its value estimate V for the non-terminal states S_t encountered during that experience.

This method is called TD(0) or one step TD (update the value function after any individual step).
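
A minimal tabular TD(0) update might look like this; the step size alpha and the 16-state table are illustrative assumptions:

```python
alpha, gamma = 0.1, 0.99
V = {s: 0.0 for s in range(16)}   # tabular state values, e.g. for the 4 x 4 FrozenLake grid

def td0_update(state, reward, next_state, done):
    """One-step TD: move V(state) toward the target reward + gamma * V(next_state)."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])
```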

Model-based vs Model-free Reinforcement learning

In model-based learning, the agent exploits a previously learned model of the environment to accomplish the task at hand, whereas in model-free learning the agent simply relies on trial-and-error experience for action selection. Q-learning is considered model-free, as it learns directly from each experience.
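
To make the model-free idea concrete, here is a minimal tabular Q-learning update sketch; the table shape and learning rate are illustrative, and nothing about the environment's dynamics is used:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q = np.zeros((16, 4))   # 16 states x 4 actions, as in FrozenLake

def q_learning_update(s, a, r, s_next, done):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```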

Taxonomy of Reinforcement Learning

Flowchart

Looking at the image above, we can see that there are various areas to explore and research under Reinforcement Learning. In this tutorial series, as a beginner, I will be playing with the concepts under model-free algorithms.

Under model-free algorithms, there are two divisions:

  1. Policy Optimization
  2. Q-learning

I would like to focus on Q-learning first and then dive deeper into Deep Q-Learning.


Hoorayyyyy!!! You reached the end of my first blog. Thank you for reading, and I hope you now have a basic understanding of what Reinforcement Learning is and of the nitty-gritty details behind each of the terms above.

Check your basic understanding with the questions below:

  1. What is the central idea from which the agent learns about its actions? (environment, rewards)
  2. When the value of gamma is higher, the agent cares more about? (long-term reward, short-term reward)
  3. In which of the following approaches does the agent learn about the environment purely by trial and error? (model-based, model-free, TD learning, Monte-Carlo)
  4. Which of these methods can only be used for episodic tasks? (Temporal-Difference learning, Monte-Carlo method)

Thank you for reading my article. Please email me at nancyjemimah@gmail.com for any suggestions and questions.

For working code from my other two series, please visit my GitHub.
