Difference between Supervised Learning and Reinforcement Learning

Srikanth Kb
Analytics Vidhya
May 1, 2020

To explain the subtle difference between the two disciplines with mathematics

Given the abundance of resources, content and examples on machine learning and deep learning, there are always questions in this discipline that can frustrate our plans of mastering these skills. One such question is the difference between supervised learning and reinforcement learning. Answering it adds clarity and intuition, and deepens our understanding of both disciplines. The purpose of this post is to explain this difference at the level of mathematical equations.

In pursuing this question, you have probably stumbled upon the basics of machine learning, deep learning, supervised and unsupervised learning, and reinforcement learning. Assuming basic knowledge of these, we will tackle the question directly.

To begin with, there are a lot of cosmetic differences between Supervised Learning (SL) and Reinforcement Learning (RL). However, these differences operate at a higher level and provide little intuition about the subtler ones underneath. ( Care to revise? … Click here )

Supervised Learning

Now, consider supervised learning:

  1. We have a dataset, with a label annotated for each member of the dataset.
  2. We use the dataset to train a neural network (NN), so that it learns to map each data point to its label.
  3. We use the trained model on test (or real-life) data and evaluate its performance.

In the above explanation, the most important step (the second one) decides the efficiency of the model. In this training step, we use the Gradient Descent Algorithm. ( Why is this used? … Check here for details )

Mathematically, the algorithm looks like this:

Gradient Descent Representation equation [1]

Where wᵢ represents the weights of the network,
and α represents the learning rate.
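For reference, a plausible reconstruction of the update that equation [1] represents, written in LaTeX from the definitions above (E denotes the error function being minimized, an assumption since the original image is not shown):

w_i \leftarrow w_i - \alpha \, \frac{\partial E}{\partial w_i}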

Focusing on the Error function, it describes the difference between two quantities:

  1. Value ŷ : Output value of the neural network.
  2. Value y : The ground true label value (obtained from the labelled dataset)

In the simplest case of supervised learning, if the error function used is “Mean Absolute Error” (Check other types here), then the error would look like:

Error Equation for Supervised Learning [2]

Where b is the bias parameter for the network.
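A plausible reconstruction of equation [2] in LaTeX, assuming m training examples and, so that the bias appears explicitly, a single-layer network ŷᵢ = W xᵢ + b (the exact form in the original image may differ):

E = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|, \qquad \hat{y}_i = W x_i + b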
With an understanding of the equation represented by [2], we are halfway to the answer.
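To make the supervised case concrete, here is a minimal, illustrative sketch of this training step in Python with NumPy. It assumes a toy single-weight linear model and hypothetical data, and uses the MAE error from [2] with the gradient descent step from [1]; it is not the exact setup from the post.

import numpy as np

# Toy labelled dataset (hypothetical values, for illustration only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])    # inputs
y = np.array([1.0, 3.0, 5.0, 7.0])            # ground-truth labels yᵢ

w = np.zeros(1)     # network weights wᵢ
b = 0.0             # bias parameter b
alpha = 0.05        # learning rate α

for _ in range(500):
    y_hat = X @ w + b                           # network output ŷ
    error = np.mean(np.abs(y - y_hat))          # Mean Absolute Error, as in [2]
    grad = np.sign(y_hat - y)                   # dE/dŷ for MAE
    w -= alpha * (X.T @ grad) / len(y)          # gradient descent step, as in [1]
    b -= alpha * grad.mean()

print(w, b, error)   # w ≈ 2, b ≈ 1 (it oscillates slightly because of the sign gradient)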

Reinforcement Learning

Now consider the case of Reinforcement Learning below:

The RL Framework (Source) [3]
  1. We have an agent, a neural network (the model) and an environment.
  2. We use (state, action, reward, next_state) transitions to train the model, represented by a neural network.
  3. The agent is trained and performs actions based on a (nearly) optimal policy that maximizes the cumulative reward from the environment.

In the crude explanation of Reinforcement Learning above, the most important step (obviously) is training the network. Well, the irony is that we use the Gradient Descent Algorithm here too. But how?

The mathematical expression for the gradient descent algorithm remains the same (as in [1]). But the catch is the error function. In Reinforcement Learning, we can compute the error based on the state-value and action-value functions. In both cases, the difference looks like this:

Error Equations based on State-value, Action-value functions [4]
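A plausible reconstruction of the two error expressions behind equation [4] in LaTeX, assuming the squared difference between the true value functions and the network's estimates (the original image may use a slightly different form):

E = \bigl( v_\pi(S) - \hat{v}(S, w) \bigr)^2 \qquad \text{or} \qquad E = \bigl( q_\pi(S, A) - \hat{q}(S, A, w) \bigr)^2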

The Difference

Surprisingly, from the equations ([2] & [4]), these two types of machine learning look more than similar ( Aren’t they? )
Actually, no. Even though they are described by the same mathematics, the two equations differ on one particular point. The subtlest difference between SL and RL, based on the equations above, is described below…

In supervised learning, we have a ground-truth label value (yᵢ) to compare our network’s result (ŷ) with, but in reinforcement learning, we do not actually have the correct state-value (vπ) and action-value (qπ) functions. There is NO entity that would inform us about the optimal state-value and action-value functions. It is the objective of the agent to interact with the environment and estimate the correct value functions used in the equations.

Then, are we using value functions to calculate value functions from the network? ( Is this an equation paradox?! )
This is where the Epsilon-greedy policy (What’s this ?) comes in to help us reach the optimal values. Under most of the popular learning algorithms in reinforcement learning (SARSA, Q-Learning), the network is initialized with random weights w, and the policy (π) used is the ε-greedy policy based on the action-values q(s,a,w) from the neural network.
So, we use the ε-greedy policy to take actions and update the values based on the reward (R) and the action-value of the next state q(s′,a,w).
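As a rough illustration of this loop, here is a minimal tabular Q-Learning sketch in Python with an ε-greedy policy. The tiny 1-D corridor environment is hypothetical, and a Q-table stands in for the neural network q(s, a, w) to keep the sketch short; it is not the exact setup from the post.

import numpy as np

# Toy 1-D corridor (hypothetical): states 0..4, start at 2,
# action 0 = left, action 1 = right; reaching state 4 gives reward +1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = np.zeros((n_states, n_actions))   # stands in for q(s, a, w)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

def epsilon_greedy(state):
    # Explore with probability ε, otherwise act greedily w.r.t. q(s, a, w) (ties broken randomly)
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(np.random.choice(best))

for episode in range(500):
    state, done = 2, False
    while not done:
        action = epsilon_greedy(state)
        next_state, reward, done = step(state, action)
        # Move q(s, a) towards the target R + γ·max_a q(s′, a), as in equation [5]
        target = reward + gamma * np.max(Q[next_state]) * (0.0 if done else 1.0)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # the greedy policy should prefer "right" (action 1) in non-terminal states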

Gradient descent for Deep learning [5]

Where γ is the discount rate, which decides how important future rewards are for the agent.
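A plausible reconstruction of the update behind equation [5] in LaTeX, assuming the Q-Learning target R + γ maxₐ q(s′, a, w) (SARSA would use the action actually taken in s′ instead of the max):

w \leftarrow w + \alpha \bigl[ R + \gamma \max_{a'} \hat{q}(s', a', w) - \hat{q}(s, a, w) \bigr] \nabla_w \hat{q}(s, a, w)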

In summary, supervised learning and reinforcement learning differ greatly in their requirements, operating environments and the variables needed to run them. But the mathematical equations capture the most subtle and important difference between these two types of machine learning. Is it us, or is it math that deserves credit for such an explanation?!

Well, with mathematics, the intuition for designing any system only gets better and more interesting!
