Reinforcement Learning — what rewards you, makes you stronger


Reinforcement Learning refers to an entity learning by trial and error, rather than being explicitly taught, in order to maximize the likelihood of taking the best actions. Like any good definition, let’s break it down. To ‘reinforce’ means to strengthen or augment something. So if a student puts in hard work to improve their grades, is that reinforcement? Well, it depends. Technically, RL involves an agent (our hero) receiving a quantitative reward that encodes how successful its actions were, and then learning an optimal policy that maximizes its chances of earning a higher reward in the future.
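
To make the vocabulary concrete, here is a toy sketch of that loop in Python. Nothing in it comes from a real RL library; the environment, the reward numbers, and the student-themed actions are all invented for illustration.

```python
import random

ACTIONS = ["study_more", "slack_off"]

def environment(action):
    """Returns a numeric reward encoding how well the action worked out."""
    return random.gauss(1.0, 0.3) if action == "study_more" else random.gauss(0.1, 0.3)

# The agent tries actions (trial and error), records the rewards it receives,
# and its policy ends up preferring whichever action paid off most on average.
totals = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for _ in range(500):
    action = random.choice(ACTIONS)   # trial
    reward = environment(action)      # numeric feedback
    totals[action] += reward
    counts[action] += 1

policy = max(ACTIONS, key=lambda a: totals[a] / max(counts[a], 1))
print(policy)   # almost always "study_more"
```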

You’ve probably read about RL in a technology news blog, an AI startup cover story, or a sci-fi movie. So let’s go over the pros and cons of RL:

Pros: Mistakes are not repeated endlessly. Unlike supervised ML models, RL models learn for themselves and are less likely to make the same error twice. Moreover, they maintain a balance between exploring new behavior and exploiting what already works. Unlike many other algorithms, RL can strike a policy that both discovers new territory and capitalizes on the correct actions of the past.
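
That exploration-exploitation balance is commonly handled with an epsilon-greedy rule. A minimal sketch, where the value table `Q` and the decay schedule are arbitrary choices for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon try a random action (explore new territory);
    otherwise pick the action with the best value estimate so far (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# epsilon is typically decayed over training: explore a lot early on,
# then lean more and more on the correct actions of the past.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(1000):
    epsilon = max(min_epsilon, epsilon * decay)
```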

Cons: RL models may be slow to converge to a good policy, so they need an environment that does not change too rapidly, which is often not true of the real world where they are eventually deployed. Rewards that arrive long after the actions that earned them (delayed rewards) also make it harder for the policy to converge.
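
To see why delayed rewards hurt, recall that the agent optimizes a discounted sum of future rewards: a reward that arrives many steps after the action that caused it contributes only a small, heavily discounted amount to that action’s value. A small illustration (gamma = 0.9 is an arbitrary choice here):

```python
GAMMA = 0.9   # discount factor, chosen arbitrarily for this example

def discounted_return(rewards, gamma=GAMMA):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

immediate = [1.0] + [0.0] * 19       # the reward arrives right away
delayed   = [0.0] * 19 + [1.0]       # the same reward, 19 steps late

print(discounted_return(immediate))  # 1.0
print(discounted_return(delayed))    # ~0.135, a much weaker learning signal
```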

When should we consider applying reinforcement learning, and when not?

RL finds great application in situations where you want to simulate a certain process, like a business trying to figure out how consumers will react to its product’s new UI. It also helps where optimizing a task by hand is impractical because the state space (the set of possible situations the agent can be in) is large and there are many options to choose from. RL algorithms can ease that labor.

Now, in some cases it is very difficult to define a reward function for your agent, because a numerical value is needed. For example, a self-driving car may be rewarded every time it detects an obstacle and stops. But what if somebody moves the obstacle out of the way before the car stops? Should the numeric reward still be the same for this partial success? Even though this problem can be addressed with inverse RL, hazy reward functions are not the best RL territory. Lastly, if you cannot afford to make mistakes in your simulation, RL models can cost you a lot, as they are bound to err while learning.
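
To show how fuzzy this gets, here is a purely hypothetical reward function for the obstacle example. The specific numbers, including the half credit for the partial success, are made up; deciding them is exactly the hard part.

```python
def obstacle_reward(detected_obstacle, came_to_stop, obstacle_removed_first):
    """Hypothetical reward for the self-driving-car example; values are made up."""
    if detected_obstacle and came_to_stop:
        return 1.0      # clean success
    if detected_obstacle and obstacle_removed_first:
        return 0.5      # partial success: same credit? half credit? who decides?
    if not detected_obstacle:
        return -1.0     # missed the obstacle entirely: clearly bad
    return 0.0          # detected it but the episode ended some other way
```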

What’s the difference between supervised learning and reinforcement learning?

Supervised ML uses a set of labelled training data to learn a concept and apply it to similar examples, while RL uses direct interaction with an environment to figure out the best actions to take, given a reward criterion. This means RL algorithms have no idea of the environment before they start acting, but they improve as they train. In fact, neural networks combined with RL have shown great success at approximating Q-values (estimates of the reward an action will eventually earn).
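
For concreteness, here is the standard tabular Q-learning update that such interaction drives. Deep RL methods like DQN replace the table `Q` with a neural network; this sketch assumes a plain Python dict.

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One interaction step updates the value estimate:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return Q
```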

Offline Reinforcement Learning

This refers to an agent that learns everything by ingesting data in bulk rather than one observation at a time. The logged interactions (states, actions, and rewards) are handed over together instead of being streamed continuously. This can make offline RL drastically cheaper, and it can perform better. The approach is often referred to as ‘data-driven’ RL.
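
In code terms, the difference is that the learner consumes a fixed log of past interactions instead of stepping a live environment. A minimal sketch, where the fake log stands in for interactions recorded from some previously deployed system:

```python
import random
from dataclasses import dataclass

@dataclass
class Transition:              # one logged interaction
    state: int
    action: int
    reward: float
    next_state: int

def fake_logged_transition():
    """Stand-in for one record produced by a previously deployed policy."""
    s = random.randint(0, 4)
    a = random.randint(0, 1)
    r = 1.0 if a == s % 2 else 0.0          # arbitrary reward rule
    return Transition(s, a, r, (s + 1) % 5)

# The whole dataset is collected up front and handed over in bulk;
# the learner never interacts with the environment again.
log = [fake_logged_transition() for _ in range(10_000)]

Q, ACTIONS, alpha, gamma = {}, (0, 1), 0.1, 0.9
for epoch in range(5):
    random.shuffle(log)
    for t in log:                            # learn purely from the fixed batch
        best_next = max(Q.get((t.next_state, a), 0.0) for a in ACTIONS)
        target = t.reward + gamma * best_next
        Q[(t.state, t.action)] = Q.get((t.state, t.action), 0.0) \
            + alpha * (target - Q.get((t.state, t.action), 0.0))
```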

Pros and Cons: One important ability that offline RL promises over other approaches is to ingest large, varied datasets and produce solutions that generalize broadly to new situations: for example, policies that are effective at recommending YouTube videos to new users, or policies that can execute robotic tasks in unfamiliar settings. This ability to generalize is essential in almost any machine learning system we might build, yet typical RL benchmark tasks do not test it, which is what makes offline RL feel more ‘common-sensical’. On the downside, because of extrapolation errors, standard deep reinforcement learning algorithms such as DQN and DDPG struggle to learn from data whose distribution of states, actions, and rewards does not match the current policy. Fixing the batch up front can therefore cause problems if the collected data lacks diversity or coverage.

When should we use offline RL, and when not?

Robots and self-driving cars need a large amount of data to acquire their skills, so offline RL with a pre-trained model and batched data is a great fit. It is also a good fit for robotic operators where precision and accuracy of actions cannot be compromised. Online RL, which relies on a partially trained policy and whatever data arrives during deployment, can lead to poor performance when the agent is sensitive to that online data. In summary, offline RL tends to produce more trustworthy agents.

An interesting example of offline RL is news recommendation, a hot topic today, where we can predict the return behavior of users from their reading history and from news features such as the publisher, the length of the article, and so on. Context features like timing, relations with other news, and the order in which articles are displayed can also be adjusted to recommend articles that boost engagement in a positive manner and reduce misinformation. Remember, information is only as good as how it is interpreted, which makes offline RL a trustworthy method to deploy on news platforms.

That is all for now. I hope this has excited you enough to learn more about Reinforcement Learning at https://ai.googleblog.com/2021/04/evolving-reinforcement-learning.html
