A non-technical introduction to Reinforcement Learning (Part 1)

There is no math. I promise. Ok, just realized, there is some.

Utkarsh Garg
7 min read · Oct 1, 2021
Photo by Lenin Estrada on Unsplash

Yesterday I binged the latest season of Silicon Valley.
This show has consistently taken a comical look at all the recent tech (the AI, crypto kind): the “Hot Dog — Not Hot Dog” classifier, crypto trading bots, a decentralized internet, chatbots that learned to talk to other bots, super-compression algorithms, and even an incubated click farm.
In the latest season, they gave us a taste of reinforcement learning. Again in a comical way, it showed how setting up a wrong “Reward Function” led an RL agent to learn that the best and fastest way to remove a bug from the code is to delete the whole code itself! 😂
This is actually very true. Like in the movies “I, Robot” and “The Terminator”, the machines that were made to serve humanity learn that the biggest threat to humans is humans themselves!

That is all RL is. Learning ways to destroy humans. No, wait. That’s wrong 😆.

Reinforcement learning is learning the “best possible way” (or policy) to do a particular task. Here, “best possible” needs to be quantified, and that is what the “Reward Function” does.

Alright, we’ll get to it in a few paragraphs.

1. How do we learn to do “Stuff”? (All roads lead to Mr. Darwin)

We Sense
You are moving across the hall. There is a table. It has legs. You have a pinky toe. They meet.

We Consider
The “touch” (excruciating pain aka the negative reward) runs through our body to our mighty brains.

We Relate and Connect
Actually, our brain relates. It remembers this pain from our glory days as nappy kids (and, I guess, from evolution too).

We Learn
Our brain uses this reinforcement (or punishment) to modify the likelihood of repeating this mistake. Math alert 1!

P(toe_hit|north_remembers 🧠) < 1

And yet, the next day we hit that pinky toe again, but this time the likelihood of this happening in the future diminishes. Math alert 2!

P(toe_hit|north_remembers🧠) <<< 1 [Still not zero you see 😟]
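
If you like seeing that as (toy) code, here is a completely made-up sketch: the learning rate and numbers mean nothing, it only illustrates that punishment keeps shrinking the probability without ever reaching zero.

```python
# A made-up sketch: a negative reward ("pain") repeatedly shrinks the
# probability of hitting the pinky toe, but never drives it to zero.
p_toe_hit = 1.0                 # before any painful lessons
learning_rate = 0.5             # invented number, purely for illustration
for lesson in range(1, 4):
    reward = -1                 # excruciating pain, i.e. punishment
    p_toe_hit *= (1 + learning_rate * reward)
    print(f"after lesson {lesson}: P(toe_hit) = {p_toe_hit:.3f}")
# after lesson 3: P(toe_hit) = 0.125 -> smaller and smaller, still not zero
```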

2. Successes of RL in industry

With the advancements in robotic arm manipulation, Google DeepMind’s AlphaGo beating a professional Go player (who has since retired, by the way), and the OpenAI team beating professional Dota 2 players, the field of reinforcement learning has really exploded in recent years.
Here is a fun application (Agent Vs Agent) created at OpenAI:

Before we understand how these systems were able to accomplish something like the above, let’s first learn about the building blocks of Reinforcement learning.

Let’s learn to crawl before we run! — pun intended ( ͡° ͜ʖ ͡°)

3. Hello World!

We’ll start with Grid World, the “hello world” of RL. This example is mainly based on Berkeley’s awesome class, CS 188 | Introduction to Artificial Intelligence, and it will be used to understand concepts in the rest of the article. So let us understand the rules of this deceptively complex game.

The Rules of the Game

  • A maze-like problem
    - The agent lives in a grid.
    - Walls block the agent’s path.
  • Noisy movement: actions do not always go as planned
    - 80% of the time, the action North takes the agent North
    (if there is no wall there)
    - 10% of the time, North takes the agent West; 10% East
    - If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards for each time step
    - Small “living” reward each step (can be negative)
    - Big rewards come at the end (good or bad)
  • Goal: maximize the sum of rewards
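
Before moving on, here is a minimal Python sketch of the noisy movement rule above. The grid size, wall position, and coordinate convention are assumptions made just for illustration; only the 80/10/10 noise and the “stay put at a wall” rule come from the list.

```python
import random

# A minimal sketch of Grid World's noisy movement (assumed 4x3 grid;
# the exact layout is not specified in the article).
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}
WALLS = {(1, 1)}                      # example wall cell, invented
WIDTH, HEIGHT = 4, 3

def step(state, action):
    """Apply an action with 80/10/10 noise; stay put if a wall (or edge) blocks."""
    roll = random.random()
    if roll < 0.8:
        actual = action                      # 80%: intended direction
    elif roll < 0.9:
        actual = PERPENDICULAR[action][0]    # 10%: veer to one side
    else:
        actual = PERPENDICULAR[action][1]    # 10%: veer to the other side
    dx, dy = ACTIONS[actual]
    nxt = (state[0] + dx, state[1] + dy)
    blocked = nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT)
    return state if blocked else nxt

print(step((0, 0), "N"))   # usually (0, 1), sometimes (1, 0), sometimes stays put
```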

OK. Great. This game looks like an example of a Stochastic process. Wait, Stochah.. what? 😕

4. Stochastic Vs Deterministic processes

Quoting Wikipedia:
A deterministic system is a system in which no randomness is involved in the development of future states of the system. A deterministic model will thus always produce the same output from a given starting condition or initial state.

So, in a nutshell: the agent takes the action North, and it goes North 100% of the time. No randomness.

In a stochastic system, there is randomness. There is unpredictability. Let’s take another more intuitive example.

Shortest Path

Deterministic problem

  • Need to reach from point A to B
  • Each segment shows time in mins. A to C takes 4 mins. We need to find the shortest path
  • The shortest path in this problem is ACDEGHB every time we run the simulation.

Stochastic Problem

  • Let’s say we introduce some traffic with some probabilities in each path
  • There is a 25% chance it will take 10 mins and a 75% chance it will take 3 mins to reach point C from point A. We have some more probabilities for other segments as well
  • Now, if we run the simulation multiple times, the shortest time path would be different for each iteration due to randomness in traffic introduced in the system. This is called a Stochastic process
  • Finding the shortest-time route is not straightforward anymore. In the real world, we may not even know these probabilities. Our goal is now to find the route that is shortest in expectation, and that is what we do using RL (a small simulation sketch follows this list).
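
To make the stochastic version concrete, here is a tiny Monte-Carlo sketch in Python. Only the A→C probabilities (25% → 10 mins, 75% → 3 mins) come from the example above; the other segments and the little graph are invented purely for illustration.

```python
import random

# Hypothetical segment times: each entry is a list of (probability, minutes).
# Only A->C comes from the article; the rest are made-up numbers.
SEGMENT_TIMES = {
    ("A", "C"): [(0.25, 10), (0.75, 3)],
    ("C", "D"): [(0.50, 6), (0.50, 2)],
    ("D", "B"): [(0.30, 8), (0.70, 4)],
    ("A", "B"): [(1.00, 12)],          # a deterministic direct road
}

def sample_time(segment):
    """Draw one random travel time for a segment."""
    r, cumulative = random.random(), 0.0
    for prob, minutes in SEGMENT_TIMES[segment]:
        cumulative += prob
        if r <= cumulative:
            return minutes
    return SEGMENT_TIMES[segment][-1][1]

def estimate_route_time(route, n_runs=10_000):
    """Monte-Carlo estimate of a route's expected travel time."""
    total = 0.0
    for _ in range(n_runs):
        total += sum(sample_time(seg) for seg in zip(route, route[1:]))
    return total / n_runs

print(estimate_route_time(["A", "C", "D", "B"]))  # noisy, roughly the expected time
print(estimate_route_time(["A", "B"]))            # always ~12, the deterministic road
```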

5. Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

A real-world example of the above system:

A theme park in France taught crows to pick up trash. How? Well, every time the crows bring in a piece of litter and drop it in the box, they get a piece of food. It becomes a game for them. Nat Geo coverage:

Let’s fit this action and reward system into an RL setting.

Source: cbc
  • Imagine an area with some crows, some cigarette butts, and a box with 2 compartments, one for food and one for trash. This whole system is our environment
  • The crow (agent) will first observe the setting of food and the surrounding area state (where the box is, where the trash is etc.)
  • Then the ingenious crow will take certain actions, like flying directly for the food without bringing in a cigarette butt (action), and observe how the box responds (next state)
  • With no trash being brought in, the door for food does not open leaving the crow frustrated (receiving a negative reward). Still wanting the food, the crow tries different tactics (updating the policy) to get the food.
  • The crow will repeat the process until it finds a policy (pick butt -> get food) that makes it happy and full (maximizing the total (discounted) rewards).
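
In code, the loop described above looks roughly like this. It is only a toy sketch: the CrowEnv and CrowAgent classes, the action names, and the crude “add the reward to a preference” update are all invented for illustration and are not a real RL algorithm.

```python
# A toy sketch of the agent-environment loop: the crow tries an action,
# the box responds, and the crow updates its preferences from the reward.

class CrowEnv:
    def step(self, action):
        """Return (next_state, reward): food (+1) only if trash was dropped in."""
        if action == "drop_trash_then_eat":
            return "fed", +1
        return "door_closed", -1       # flying straight for the food gets nothing

class CrowAgent:
    def __init__(self):
        self.preference = {"fly_for_food": 0.0, "drop_trash_then_eat": 0.0}

    def act(self):
        # Pick whichever action currently looks best (ties go to the lazy option).
        return max(self.preference, key=self.preference.get)

    def learn(self, action, reward):
        # Reinforce or punish the chosen action (a crude policy update).
        self.preference[action] += reward

env, crow = CrowEnv(), CrowAgent()
for episode in range(5):
    action = crow.act()
    state, reward = env.step(action)
    crow.learn(action, reward)
    print(episode, action, state, reward)
```

Run it and you can watch the frustrated crow abandon “fly_for_food” after one closed door and settle on dropping trash, exactly the policy the theme park wanted.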

Mapping the above architecture to our problem statement

In the case of Grid World,

  • The grid is our environment,
  • The robot is our agent,
  • The position of the agent at any particular time is our state
  • The direction in which we decide to move is our action,
  • The state of the environment after we take that action is our next state,
  • The reward is something we get if we successfully exit (+1) or fall in the ditch (-1). Also, for every step, there can be a small living reward (-0.01). This pushes the agent to learn the best path in a small number of steps (see the sketch after this list)
  • The policy is the final learned strategy that gives the agent maximum reward
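
To see why that small negative living reward nudges the agent toward short paths, here is a tiny sketch. The paths and state names are made up; only the reward values (+1 exit, -1 ditch, -0.01 per step) come from the list above.

```python
# A toy sketch of how the rewards above add up over one episode.
LIVING_REWARD = -0.01
TERMINAL_REWARDS = {"exit": +1.0, "ditch": -1.0}

def episode_return(path):
    """Sum of rewards along a path that ends in a terminal state."""
    *steps, terminal = path
    return LIVING_REWARD * len(steps) + TERMINAL_REWARDS[terminal]

short_path = ["start", "a", "b", "exit"]            # 3 steps, then exit
long_path  = ["start", "a", "b", "c", "d", "exit"]  # 5 steps, then exit
print(episode_return(short_path))   # 0.97
print(episode_return(long_path))    # 0.95 -> shorter paths earn slightly more
```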

Cool beans

In the next article, we’ll use the above maze system to get a deeper understanding of how Rewards and Policy are learned. We’ll also go into Markov Decision Processes and Q-learning methods to learn an optimal policy.

Homework

Till then, have a look at this fun experiment, the Marshmallow Experiment, and try to relate it to this image in the comments:

Source: Self

Liked it? Give a 👏 😄 and follow for more!
