Reinforcement Learning: How Tech Teaches Itself

Vedaant Varshney
Published in The Startup · Nov 3, 2019 · 9 min read
Image 1- Robot Playing Piano, taken by Franck V. on Unsplash

We, as humans, have been fascinated by the concept of artificial intelligence since the 1950s. Sure, we can program computers and our devices to do things for us, but computers don’t have much of a purpose without instructions. With Reinforcement Learning, we are getting ever closer to something that can mimic the human mind.

What Reinforcement Learning Is + Other Important Terms

Artificial Intelligence is the idea of being able to recreate human intelligence, and machine learning (ML) is a field of computer science that uses statistics and algorithms to teach computers how to do things without being told exactly what to do.

One area of ML that shows significant promise is Reinforcement Learning (RL). In RL, an algorithm aims for the highest possible reward over a series of steps, learning to repeat "good" actions and avoid "bad" ones. This reward-and-punishment mechanism is called reinforcement, which is where the name Reinforcement Learning comes from.

The Baby Analogy

By nature, RL learns through a kind of trial and error, so to better explain it, let's use the analogy of a baby learning new things on its own. The baby will constantly try new things: touching them, smelling them, even tasting them. As it experiments, it learns that certain actions have positive effects and others have negative ones.

Image 2- Photo by Priscilla Du Preez on Unsplash

For example, a baby might learn that holding its parents' hands makes them happy, but pulling their hair has the opposite effect. Over time, the baby should be able to distinguish between these positive and negative actions and adjust its behaviour to complete tasks.

This, in a nutshell, is how reinforcement learning works. Like babies, RL algorithms don't have predefined notions of how the world around them works, and have to use their own means to learn about their surroundings (minus the parents, of course).
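To make the idea a little more concrete, here is a minimal sketch of what "reinforcement" can look like in code. The actions, rewards, and learning rate are all made up for illustration; each action simply keeps a score that gets nudged up or down by the rewards it produces.

```python
# A toy illustration of reinforcement: every action keeps a score, and each
# reward nudges that score toward what was just experienced.
# The actions and reward values here are invented for illustration only.
action_scores = {"hold_hands": 0.0, "pull_hair": 0.0}

def reinforce(action, reward, learning_rate=0.1):
    """Shift the action's score a little toward the reward it produced."""
    action_scores[action] += learning_rate * (reward - action_scores[action])

# Positive experiences strengthen an action, negative ones weaken it.
reinforce("hold_hands", reward=+1.0)   # parents smile
reinforce("pull_hair", reward=-1.0)    # parents definitely do not smile
print(action_scores)  # hold_hands drifts above zero, pull_hair drifts below
```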

Delayed Return

One thing that makes RL interesting is that it operates in a delayed return environment, meaning it focuses on obtaining success in the long term, not just on making good short-term decisions. This helps reduce the number of times an RL algorithm goes down the wrong path and keeps it from making too many short-sighted choices.

Image 3- The Marshmallow Experiment, testing delayed gratification

This works similarly to the marshmallow test, in which children were shown a marshmallow (the instant reward) and given the option to wait 15 minutes for a second marshmallow as well (the delayed reward). The study then tracked how the participants did later in life, and those who favored the delayed reward seemingly accomplished more. While not a perfect parallel, RL likewise works to maximize the delayed value rather than the short-term reward.
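One common way this "care about the long term" idea is formalized is with a discount factor (usually written gamma, a number between 0 and 1) that weights future rewards slightly less than immediate ones. Here is a small sketch of how a discounted return could be computed; the reward sequences are made-up examples.

```python
def discounted_return(rewards, gamma=0.9):
    """Total long-term value of a sequence of rewards, where rewards
    further in the future are weighted a little less."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Waiting for a bigger reward later can beat grabbing the instant one,
# much like in the marshmallow test.
print(discounted_return([0, 0, 0, 10]))  # ~7.29: patience pays off
print(discounted_return([1, 0, 0, 0]))   # 1.0: instant gratification
```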

How are we using RL today?

So, before we delve much deeper into RL, let's look at one of the more prominent ways it's being tested today: game playing. From Pong to Dota 2, reinforcement learning is being used to learn how to play loads of different games, and RL agents have actually gotten quite good at them, sometimes even outranking their human counterparts. The following demo, created by OpenAI, uses reinforcement learning to teach agents to play hide and seek effectively.

Video 1- Multi-Agent Hide and Seek, by OpenAI

Key Terms:

There are a few important terms used in reinforcement learning that describe the different components of an RL system. While I won't be covering all of them, here are some of the main ones (a short code sketch after the list shows how they fit together):

  • Agent: This is the actual algorithm in RL. To go back to the baby analogy, the baby making decisions would be the agent in that scenario.
  • Action: These are the decisions the algorithm can make. At any given moment, the agent typically has many possible actions and has to choose the best one from that set.
  • State: A situation in which the RL algorithm has to take an action, like a baby having to choose between two toys.
  • Environment: If the agent is a character in a video game, the environment would be the game’s world.
  • Reward: What an RL algorithm is constantly trying to maximize. The more of the reward it obtains, the better it’s doing.
Image 4- Agent and Environment relationship through states, actions, and rewards
  • Policy: The agent's strategy; the mapping it uses to go from an input state to an output action.
  • Value: Essentially the long-term reward the algorithm can expect from a given state while following its policy. As was stated before, RL works in a delayed return environment, meaning it cares more about future effects than immediate ones.
  • Q-Value: Like value, but for a state-action pair: the long-term reward the algorithm can expect from taking a specific action in a given state and following its policy afterwards.
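To see how these pieces fit together, here is a bare-bones sketch of the agent-environment loop pictured above. The tiny environment, the random policy, and the reward scheme are all invented for illustration (loosely in the style of toolkits like OpenAI Gym); a real agent would learn the policy instead of acting randomly.

```python
import random

class ToyEnv:
    """A tiny made-up environment: the agent starts at position 0 and
    only gets a reward when it reaches position 3."""
    def reset(self):
        self.position = 0
        return self.position                         # the initial state

    def step(self, action):                          # action is -1 or +1
        self.position = max(0, self.position + action)
        done = self.position == 3
        reward = 1.0 if done else 0.0                # sparse reward at the goal
        return self.position, reward, done           # new state, reward, done flag

def policy(state):
    """The policy: maps a state to an action. Random here; a real agent
    would learn this mapping to maximize reward."""
    return random.choice([-1, +1])

env = ToyEnv()                                       # the environment
state = env.reset()
total_reward, done = 0.0, False
while not done:                                      # one episode
    action = policy(state)                           # the agent acts...
    state, reward, done = env.step(action)           # ...the environment reacts
    total_reward += reward                           # the signal being maximized
```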

Types of RL

There are many different ways to approach reinforcement learning, but three stood out to me as the most interesting.

Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) uses deep neural networks in conjunction with RL, which generally makes mapping a state to an action more efficient, as values and Q-values can be estimated more quickly.

A traditional reinforcement learning system spends a lot of processing power turning each combination of policy, state, and action into a Q-value, and ends up handling huge lists of numbers. This is where a neural network can come into play: a set of functions that takes a series of inputs and approximates an output. Rather than computing the Q-value exactly for every case, we can use a neural network to approximate it with high accuracy, reducing the total amount of work the RL algorithm needs to do.
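As a rough illustration, here is what a small Q-network might look like in PyTorch. The state size, number of actions, and layer widths are arbitrary choices for this sketch, not values from any of the systems mentioned here.

```python
import torch
import torch.nn as nn

# A small Q-network: a state goes in, one estimated Q-value per action
# comes out. All sizes below are arbitrary, picked just for the sketch.
state_size, num_actions = 4, 2
q_net = nn.Sequential(
    nn.Linear(state_size, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),
)

state = torch.rand(state_size)             # a stand-in observation
q_values = q_net(state)                    # approximate Q-value for each action
best_action = int(torch.argmax(q_values))  # greedy choice under the network
```

The point is that the network replaces the enormous table of Q-values a traditional system would have to store and update one entry at a time.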

For an example of what can be done with DRL, Google's DeepMind trained agents to play Atari games through this process.

Video 2- Google's DeepMind playing Atari Breakout using DRL

Inverse Reinforcement Learning

Normally, reinforcement learning starts from a given reward function and works out the right policy from it. Inverse Reinforcement Learning (IRL) flips this around: it uses a set of training data, typically expert demonstrations, to work out the reward function instead.

Image 5- Photo by Pablo Franco on Unsplash

Let's say we wanted to capture what makes a good Formula 1 driver in order to build an autonomous driving algorithm. It's very hard to manually define what makes a driver good or bad, and this is where IRL shines. By analyzing footage and other data from expert human drivers, IRL can determine what counts as a good or bad action, approximating the underlying reward function.
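Real IRL algorithms (such as max-margin or maximum-entropy IRL) are considerably more involved, but the core idea can be sketched simply: assume the reward is a weighted sum of hand-picked features of a state, then choose weights so that expert behaviour scores higher than random behaviour. Everything below (the features, the "data", the update) is a made-up toy, not a production IRL method.

```python
import numpy as np

# Toy IRL sketch: assume reward(state) = weights . features(state), and pick
# weights that separate expert behaviour from random behaviour.
# The features and data below are invented purely for illustration.
rng = np.random.default_rng(0)
expert_features = rng.normal(loc=1.0, size=(100, 3))   # e.g. smooth steering,
random_features = rng.normal(loc=0.0, size=(100, 3))   # staying on the racing line...

# Point the reward in the direction that expert behaviour differs from random
# behaviour (a crude stand-in for feature-expectation matching).
weights = expert_features.mean(axis=0) - random_features.mean(axis=0)

def estimated_reward(features):
    """Approximate reward of a state, recovered from demonstrations."""
    return features @ weights

# Expert-like behaviour scores higher under the recovered reward: True
print(estimated_reward(expert_features.mean(axis=0)) >
      estimated_reward(random_features.mean(axis=0)))
```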

Apprenticeship Learning

Apprenticeship Learning, or AL (yes, I know, there are a lot of acronyms), ties in well with inverse reinforcement learning, as it uses the product of IRL to determine the best actions for achieving the highest rewards. It's essentially a traditional RL system applied on top of an IRL model. One key property of AL is that it doesn't perfectly imitate the expert it's trying to replicate; it can also learn which actions have little influence on the reward, which can reduce the required computing power significantly.

It might be hard to visualize the capabilities of IRL and AL together, but in this video, these methods of reinforcement learning were used to teach a car to drive while staying in its own lane!

Video 3- Wayve AI teaching a car basic driving in just a day using AL

Problems with Reinforcement Learning

As with everything, there are limitations and problems with current methods of reinforcement learning.

Credit Assignment Problem

As was stated before, reinforcement learning works in a delayed return environment, meaning that it values long-term reward more than any short-term benefits. One issue with this is that the model may have a hard time determining which action(s) in a series of actions led to the highest reward. This results in the algorithm not knowing which actions to repeat.

Image 6- Photo by Green Chameleon on Unsplash

To use an analogy, it's like completing a multiple-choice quiz on a subject and getting your grade back without knowing which questions you got right. As you'd expect, this makes it much harder to improve on any future quizzes on the topic.
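One standard way to soften this problem (a common practice, not something specific to the systems above) is to credit each action with the "reward-to-go": the discounted sum of everything that happened after it, rather than the whole episode's score.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Credit each action with the (discounted) rewards that came after it,
    instead of the whole episode's score. A standard, if partial, remedy
    for the credit assignment problem."""
    credits = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        credits[t] = running
    return credits

# A single reward at the very end gets spread backwards over earlier steps,
# with earlier actions receiving progressively less credit.
print(rewards_to_go([0, 0, 0, 1.0]))  # [~0.97, ~0.98, 0.99, 1.0]
```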

Exploration and Exploitation

Another hurdle with RL is finding the best trade-off between exploration and exploitation, which define how often the model tries new things in hopes of finding new rewards (exploration) vs. how often it goes for the tried and tested (exploitation). If the system exploits too much, it could be missing out on the new reward paths, but if it explores too much, it may end up foregoing the obvious/guaranteed rewards.

Image 7- Exploitation or Exploration
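A very common (and very simple) way to strike this balance is an epsilon-greedy rule: with a small probability, try a random action; otherwise go with the best-known one. The action names and value estimates below are hypothetical.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the best known value."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration
    return max(action_values, key=action_values.get)     # exploitation

# Hypothetical value estimates the agent has learned so far.
values = {"left": 0.2, "right": 0.8, "wait": 0.5}
print(epsilon_greedy(values))  # usually "right", occasionally something new
```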

Sample Data Inefficiency

Lastly, there's the issue of sample inefficiency: reinforcement learning needs a very large number of samples to learn a task effectively. This ties in with exploration and exploitation, as RL systems can churn through huge amounts of data while searching for the best policy. For reference, if reinforcement learning were used to learn a video game, it could take hundreds of thousands of times more gameplay data than a human would need to reach a comparable level of mastery.

The Future of Reinforcement Learning

Although RL has some problems that still need to be worked out, it has enormous potential and could provide significant benefits in the future. The first step is solving the aforementioned problems, and luckily, there's been some pretty solid progress on that front.

Solving the Problems

  • When it comes to the credit assignment problem, there has been some progress using something similar to backpropagation: going backwards through a policy function and returning values for how influential each action was toward a reward or loss.
  • To solve for the exploration vs. exploitation balance, we can implement a system which analyzes an agent’s degrees of freedom — meaning the various routes of actions it can take — and increase or decrease the exploration/exploitation levels. The more degrees of freedom (DoF) there are, the more the agent will explore, and when the DoFs are reduced, the more the agent will exploit.
Image 8- Reinforcement Learning must be able to find a balance between both exploration and exploitation to optimize the system
  • Sample data inefficiency is a complex problem, and with the current effectiveness of RL systems, large gains in sample efficiency can seem like a far-fetched goal. However, by using something called an auxiliary reward function, we can simplify the conditions required to obtain a reward and make it easier for the model to improve. The auxiliary reward function would share features with the true reward function and could speed up learning significantly (a rough sketch of the idea follows this list).
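The details of auxiliary rewards vary a lot between systems, but the basic shape of the idea looks something like the sketch below: add a denser, easier-to-earn bonus on top of the true (often sparse) reward so the agent gets feedback more often. The "progress" signal and the bonus weight are made-up examples, not a recipe.

```python
def shaped_reward(true_reward, progress_toward_goal, bonus_weight=0.1):
    """Combine the sparse true reward with a denser auxiliary bonus that is
    correlated with it (here, a made-up "progress" signal), so the agent
    gets useful feedback even on steps where the true reward is zero."""
    return true_reward + bonus_weight * progress_toward_goal

# Most steps give no true reward, but the auxiliary term still nudges
# the agent in roughly the right direction.
print(shaped_reward(true_reward=0.0, progress_toward_goal=0.3))  # 0.03
print(shaped_reward(true_reward=1.0, progress_toward_goal=1.0))  # 1.1
```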

What might we see from RL in the coming years?

Reinforcement Learning is growing quickly as a field of machine learning, and in the coming years, we may start to see it being used in everyday items and products. As of now, we don’t have to worry too much about AI taking over, but here’s what we could have in the future:

  1. We could see more advanced robots that were trained using RL
  2. There may be large repositories of digital knowledge that could be easily transferred in between RL systems
  3. Self-Driving Cars trained using RL
  4. Intelligent Traffic Lights (for when we still have to drive ourselves)
Image 9- Robot being trained using RL

Summary/Key Takeaways

  1. Reinforcement Learning uses the concept of reinforcement: being rewarded for positive actions and punished for negative ones in order to learn how to accomplish a specific task
  2. The task is defined using a reward function, and the RL algorithm uses a policy function to determine the best way to maximize reward
  3. Reinforcement learning uses a delayed return environment, meaning that it gives preference to a long-term value, rather than short-term reward
  4. RL can incorporate neural networks, work backwards to find the reward function instead, and even learn directly from human expert data.
  5. Current problems with RL include balancing risk-taking vs. certain rewards, using sample data inefficiently, and not knowing which actions were responsible for each reward.

Thank you for reading! For any questions, comments, or corrections, feel free to contact/connect with me from any of the sources below. Feedback is much appreciated as well.

LinkedIn: Vedaant Varshney

e-mail: vedaant.varshney@gmail.com
