Reinforcement Learning: How Intelligent Bots are Built

Ambuj Agrawal
Published in DataSeries
Aug 18, 2020 (4 min read)

You may have heard of AlphaGo, the computer program that beat a world champion at the game of Go. You may also have come across AlphaZero, the AI that taught itself to play chess in about four hours and came up with some truly remarkable gameplay. In this blog post, we’re going to explore the technology behind these programs.

Reinforcement learning (RL) is an area of machine learning concerned with how software or robotic agents ought to take actions in an environment in order to maximize their cumulative reward. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
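To make "cumulative reward" concrete, here is a minimal sketch of how the total reward of one episode can be computed. The discount factor `gamma` is an assumption that the article does not mention: it is a common way to make rewards that arrive sooner count more than rewards that arrive later, and setting it to 1.0 gives the plain sum.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum the rewards of one episode, weighting the reward received
    k steps in the future by gamma ** k (gamma is an assumed parameter)."""
    total = 0.0
    # Walk backwards so that each earlier step multiplies everything
    # after it by one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A hypothetical episode: no reward until the game is won on the last step.
episode_rewards = [0, 0, 0, 1]
print(discounted_return(episode_rewards, gamma=1.0))  # plain sum of rewards
print(discounted_return(episode_rewards, gamma=0.5))  # late reward counts less
```

With a discount below 1.0, the same win is worth less the longer it takes to reach, which nudges an agent towards shorter paths to the reward.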

It is important to distinguish between problems and their solutions, or in other words, between the tasks we wish to solve and the algorithms we design to solve the tasks. Deep learning algorithms can be applied to various problem types. Image classification and prediction tasks are common applications of deep learning because automated image processing before deep learning was very limited, given the complexity of images. But there are many other kinds of tasks we might wish to automate, such as driving a car or balancing a portfolio of stocks and other assets. Driving a car includes some amount of image processing, but more importantly the algorithm needs to learn how to act appropriately, not merely to classify or predict. These kinds of problems, where decisions must be made or some behaviour must be enacted, are collectively called control tasks.

The standard framework for RL algorithms is a loop between an agent and its environment:

The agent takes an action in the environment, such as taking a turn while driving a car, which then updates the state of the environment. The environment state consists of a set of variables whose values define the current conditions. For every action that the agent takes, it receives a reward (for example, +1 for winning the game, –1 for losing the game, and so on). The RL algorithm repeats this process with the objective of maximizing rewards in the long term, and it gradually learns which actions pay off in which states.
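The loop described above can be sketched in a few lines of Python. Everything here is illustrative rather than taken from the article: the `Corridor` environment, its `reset`/`step` methods, and the hyperparameters are hypothetical, and tabular Q-learning is just one simple RL algorithm chosen to show the action, state, reward cycle.

```python
import random


class Corridor:
    """Hypothetical toy environment with states 0..4, starting at 2.
    Reaching state 4 wins (+1); reaching state 0 loses (-1)."""

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        # action 0 moves left, action 1 moves right
        self.state += 1 if action == 1 else -1
        if self.state == 4:
            return self.state, 1.0, True
        if self.state == 0:
            return self.state, -1.0, True
        return self.state, 0.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: q[state][action] estimates long-term reward."""
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore occasionally (or on ties); otherwise act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Move the estimate towards reward + discounted future value.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


q_table = train()
for s in (1, 2, 3):
    print(s, "right" if q_table[s][1] > q_table[s][0] else "left")
```

The agent is never told that moving right is good. It only sees states and rewards, and the preference for moving right emerges from repeating the loop many times.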

AlphaGo was a computer program that combined an advanced tree search with deep neural networks. The neural networks took a description of the Go board as input and processed it through a number of network layers containing millions of neuron-like connections.

One neural network, the “policy network”, selects the next move to play. Policy networks are common in other reinforcement learning systems as well: given the current state, they predict the action that maximizes the expected reward. The other neural network, the “value network”, predicts the winner of the game.

AlphaGo was first shown numerous amateur games to help it develop an understanding of reasonable human play. Then it played against different versions of itself thousands of times, each time learning from its mistakes. Over time, AlphaGo became increasingly strong at learning and decision-making. This process is the very essence of reinforcement learning. AlphaGo went on to defeat Go world champions and arguably became the strongest Go player of its time. AlphaZero was a more general successor: it learned chess, shogi, and Go from self-play alone, without studying human games.
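The division of labour between a policy network and a value network can be illustrated with a deliberately tiny sketch. This is not AlphaGo's architecture: the single linear layer, the 9-cell board, and the random weights are all assumptions made for illustration. The point is only the shape of the two outputs, a probability for every move versus a single number estimating who is winning.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, features):
    """One dense layer without bias: a weighted sum per output unit."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def policy_network(weights, board):
    """Returns one probability per candidate move."""
    return softmax(linear(weights, board))


def value_network(weights, board):
    """Returns a score in (-1, 1): near +1 means a likely win."""
    return math.tanh(linear([weights], board)[0])


# Hypothetical setup: a 9-cell board encoded as -1 / 0 / +1 per cell,
# with untrained random weights standing in for learned parameters.
rng = random.Random(0)
NUM_CELLS = 9
board = [rng.choice([-1, 0, 1]) for _ in range(NUM_CELLS)]
policy_weights = [[rng.uniform(-1, 1) for _ in range(NUM_CELLS)]
                  for _ in range(NUM_CELLS)]
value_weights = [rng.uniform(-1, 1) for _ in range(NUM_CELLS)]

move_probs = policy_network(policy_weights, board)
best_move = max(range(NUM_CELLS), key=lambda i: move_probs[i])
win_estimate = value_network(value_weights, board)
print("preferred move:", best_move, "win estimate:", round(win_estimate, 3))
```

In a real system both sets of weights would be trained, and the outputs would guide a search over future positions rather than being used directly.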

Reinforcement learning has tremendous scope and applications in the real world. Here are some of its use-cases:

Resource Management in Computer Clusters

Designing algorithms that allocate limited resources to different tasks is challenging and typically relies on human-generated heuristics. RL can be used to learn automatically how to allocate and schedule computer resources to waiting jobs, with the objective of minimizing the average job slowdown, as shown in the paper “Resource Management with Deep Reinforcement Learning”.
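To see why average job slowdown is a useful objective, consider this small sketch. It assumes a simplified setting that is not taken from the paper: a single resource, all jobs waiting at time 0, and slowdown defined as a job's completion time divided by its own duration. The function name and the job list are hypothetical.

```python
def average_slowdown(durations):
    """Run jobs back to back in the given order on one resource and
    return the mean of (completion time / job duration)."""
    clock = 0.0
    total = 0.0
    for duration in durations:
        clock += duration          # this job finishes at the current clock
        total += clock / duration  # 1.0 would mean the job never waited
    return total / len(durations)


# Hypothetical queue: one long job arrives ahead of two short ones.
jobs = [10, 1, 2]
first_come_first_served = average_slowdown(jobs)
shortest_job_first = average_slowdown(sorted(jobs))

print(round(first_come_first_served, 2))
print(round(shortest_job_first, 2))

# An RL scheduler could be rewarded with the negative of this quantity,
# so that maximizing reward means minimizing slowdown.
reward = -shortest_job_first
```

Running the short jobs first lowers the average slowdown considerably, which is the kind of ordering heuristic a learned scheduler would have to discover from the reward signal alone.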

Process Automation

RL can be used to teach computers how work is done by observing people on their work computers. The AI learns end-to-end processes and generates complete automations for software robots, cutting the deployment time of even the most complex processes.

Traffic Light Control

Researchers have designed a traffic light controller to reduce congestion at major crossroads. So far it has only been tested in a simulated environment, but it produced far better results than traditional methods and shed light on the potential of multi-agent RL for designing traffic systems.

Robotics

Learning-based algorithms have the potential to enable robots to acquire complex behaviors adaptively in unstructured environments, by leveraging data collected from the environment. In particular, with reinforcement learning, robots learn novel behaviors through trial and error interactions. This unburdens the human operator from having to pre-program accurate behaviors.

Personalized Recommendations

Previous work on news recommendation faced several challenges: news changes rapidly, users get bored easily, and click-through rate does not reflect user retention. A group of researchers applied RL to news recommendation systems and published the work in a paper titled “DRN: A Deep Reinforcement Learning Framework for News Recommendation”.


Ambuj is a published author and industry expert in Artificial Intelligence and Enterprise Automation (https://www.linkedin.com/in/ambujagrawal/)