Basics of Reinforcement Learning | Towards AI

The Fundamentals of Reinforcement Learning

Sam Bell
May 20 · 7 min read

If men were made in the image of God, robots were certainly made in the image of men. Our insights into how we think, how we learn, and even how the networks of neurons in our brains communicate with each other have led to the development of artificial intelligence, machine learning, and deep learning, the cornerstones of data science.

Today, robots can do more than ever before. Travelmate is a suitcase robot that will follow your phone’s geolocation. Moley Robotics has created a robot chef. The Grillbot and the BratWurst Bot were designed for your BBQ parties. Kobi can take care of your yard, and WinBot will wash your windows for you.

All of these robots are programmed to do fairly simple repetitive tasks, and that is the extent of their capability. Their ability to adapt and improve their performance in new settings is severely limited. They can’t learn. The problem is that predicting a robot’s every potential need and programming its every possible move is not feasible. It is more efficient to program a robot to learn, but in order to learn, it must be programmed to perceive. Robots cannot see or hear the way we do, and recreating these senses can be difficult. Machines perceive the world as a series of zeroes and ones, and if anything new is introduced into their field of vision or too many things are changing at once, their ability to compute these changes is compromised. The time needed to keep up with the input can grow very quickly.

The limit on what robots can do is less a hardware problem than a software problem. Boston Dynamics has made great strides in giving robots gross motor skills.

The da Vinci surgical system, made by Intuitive Surgical, brings robotics to medical procedures. It has even been used to do surgery on a grape.

Machine learning (ML) is opening the door for robots to become multi-functional and more useful. Rather than creating machines that can perform logic and compute complex calculations, ML focuses on developing a computer’s ability to learn and perform without constant explicit direction. ML allows computers to learn from data and adjust their own behavior without being reprogrammed by hand.

I studied psychology during my undergraduate years, so I must reference B.F. Skinner when talking about reinforcement learning. Skinner defined the idea of operant conditioning in his theory of learning. Basically, if you want to increase the likelihood of a specific behavior, there must be a consistent and timely reward system. Positive reinforcement will increase a targeted action, and this is how we can shape behavior.

Reward-and-penalty training has also been described for tasks such as speech recognition and handwriting recognition, although learning from labeled examples like this is usually classed as supervised learning rather than reinforcement learning. The computer is first fed labeled data that is positive or negative: is this the letter “A,” or not? Wrong answers are penalized, so the program can learn from its mistakes. Once the computer can differentiate between “A”s and “non-A”s, the training phase is complete and the computer is tested with new data. Combined with deep learning, reinforcement learning has allowed computers to master Atari games, Dota 2, and the Chinese board game Go. You can watch a human repeatedly lose to a machine in the documentary AlphaGo on YouTube and Netflix. Reinforcement learning is also one of the techniques applied to personalized recommendation lists like those on Netflix or Amazon.

While reinforcement learning has allowed for large developments in artificial intelligence, it is not without its shortcomings.

Sample Efficiency

Montezuma’s Revenge

Sample efficiency describes how much experience a learner needs before it performs well, and here there is a large discrepancy between human learning and computer learning. A human can look at a simple video game challenge and quickly figure out which objects to avoid and what the goal of the game is. A computer takes much longer to catch on. A modern reinforcement learning algorithm needs about 4 million frames to solve a level like Montezuma’s Revenge successfully and consistently. With the engine running at 30 frames per second, 4 million frames is about 37 hours of uninterrupted gameplay. This is about 2,000x slower than a human being.
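The arithmetic behind the 37-hour figure is easy to check. The short sketch below converts frames to hours; the function name is made up for illustration, and the "human minutes" line is only an inference from the 2,000x figure quoted above, not a number from the original research.

```python
def frames_to_hours(frames: int, fps: int = 30) -> float:
    """Convert a number of game frames into hours of continuous play."""
    seconds = frames / fps
    return seconds / 3600


hours = frames_to_hours(4_000_000, fps=30)
print(f"{hours:.1f} hours")  # 37.0 hours

# Inference only: if the algorithm is ~2,000x slower than a person,
# the implied human time for the same level is roughly one minute.
human_minutes = hours * 60 / 2000
print(f"{human_minutes:.1f} minutes")  # 1.1 minutes
```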

Screenshots: the game with priors removed, and with priors.

Researchers at Berkeley argue that humans have the advantage because of prior object knowledge. We know the shape of ladders, doors, and keys, and we know that we should probably avoid fire and angry skulls because they indicate danger. To test this, the researchers created different versions of the same video game: one version with object priors intact, and others with these priors removed individually. In the regularly designed game with object priors intact, humans were able to complete a level in 1.8 minutes with an average of 3.3 deaths before reaching the goal. With all of the object priors removed, the time humans needed rose to 20 minutes and average deaths rose to 40. The level with all of the object priors removed is actually pretty challenging, but this is how we can imagine playing as the computer. You can try them out yourself here. In fact, while removing all priors makes the level much harder for humans to learn, computers do not require any additional time to solve it consistently.

Credit Assignment Problem

Machine learning and reinforcement learning are not new ideas. Minsky addressed these topics in his innovative paper from 1961. In this paper, Minsky discusses many concepts that were ahead of their time, including the credit assignment problem. Credit assignment is the question of which actions deserve credit for a reward and which deserve blame for a penalty.

Temporal credit assignment is a big problem for reinforcement learning and, if addressed, may reduce the length of time it takes for a computer to “learn” a video game. Suppose a computer is playing Pong pretty well, volleying the ball back and forth for a while. It makes smart and quick moves throughout, but in the end it narrowly misses the ball, which results in a loss. Despite a strong performance in this round, a computer trained with reinforcement learning will be less likely to use those sequences of moves in the future because it associates them with a loss. The computer is unable to determine which few moves preceding the loss actually caused the negative result, so its good performance is thrown out. Of the credit assignment problems, this one has received the most attention in the reinforcement learning community.
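To make the Pong example concrete, here is a minimal sketch of one common way credit is spread backward through an episode: discounted returns. This technique is not named in the article, and the rally length, reward values, and discount factor below are hypothetical choices for illustration. Notice that every move in the rally ends up with a negative score, even the good ones; discounting only shrinks the blame for moves further from the miss.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backward so each step can reuse the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical rally: 20 moves with no reward, then a missed ball (-1).
rewards = [0.0] * 20 + [-1.0]
returns = discounted_returns(rewards, gamma=0.9)

print(round(returns[-1], 3))  # -1.0   (the move that missed the ball)
print(round(returns[0], 3))   # -0.122 (the very first move of the rally)
```

A smaller discount factor pushes the blame closer to the final moves, but it cannot tell a good early move from a bad one, which is the heart of the temporal credit assignment problem.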

Structural credit assignment involves generalizing which sequence of actions will result in the same outcome. Transfer credit assignment is generalizing which sequence of moves can be applied to different tasks successfully. Big picture pattern recognition would be critical for this skill. Quantifying and addressing these discrepancies in reinforcement learning would greatly reduce the disparity of learning time between computers and humans.

Multi-Armed Bandit

The multi-armed bandit problem is a classic problem in reinforcement learning. A fixed, limited set of resources must be allocated among a set of competing choices in order to maximize the expected gain. The properties of each choice are hidden at the beginning of the trial. Information about each choice is revealed over time or by allocating resources to that choice. This is the dilemma of exploration vs. exploitation: spend resources learning about the options, or spend them on the option that looks best so far. The only true way to move through this problem is with trial-and-error exploration.

This problem is often conceptualized as 10 different slot machines. Each machine either pays out or it doesn’t, but some machines pay out more often than others. The goal is to find the machine with the highest win rate.

One solution to this problem, often called the epsilon-greedy strategy, allocates 10% of resources to exploration and dedicates the rest to exploiting the machine with the highest perceived payout rate.
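The 10% exploration strategy can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the win rates, the number of pulls, and the function name are all made-up assumptions, and each "pull" stands in for one unit of resource.

```python
import random


def epsilon_greedy_bandit(win_rates, pulls=10_000, epsilon=0.1, seed=0):
    """Play slot machines, exploring at random a fraction `epsilon` of the time."""
    rng = random.Random(seed)
    n_arms = len(win_rates)
    counts = [0] * n_arms       # how often each machine was played
    estimates = [0.0] * n_arms  # running estimate of each machine's win rate
    total_reward = 0

    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any machine
        else:
            # exploit: pick the machine that currently looks best
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The machine either pays out (1) or it doesn't (0).
        reward = 1 if rng.random() < win_rates[arm] else 0
        counts[arm] += 1
        # Incremental average: nudge the estimate toward the new result.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


# Hypothetical hidden win rates for 10 slot machines (unknown to the player).
win_rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]
estimates, counts, total = epsilon_greedy_bandit(win_rates)
best_guess = max(range(len(win_rates)), key=lambda a: estimates[a])
print("Estimated best machine:", best_guess)
print("Pulls per machine:", counts)
```

With these settings most pulls should end up on the machine with the highest true win rate, while the random 10% keeps refining the estimates for the others. Raising `epsilon` learns about every machine faster but wastes more pulls on bad ones.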


There is no shortage of applications for reinforcement learning. It can do well in video games but also be used for energy consumption optimization. There is a website that showcases different algorithms of AI using reinforcement learning. Check it out here! Google’s DeepMind had a computer teach itself how to walk.

And if a robot can teach itself how to flip pancakes, I’m all in.

Machine learning is where data science and robotics meet, and they all have a symbiotic relationship. Teaching computers how to learn as we do is the next big hurdle to jump.

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Written by

Sam Bell

Student at the Flatiron School for Data Science
