Beginner's Guide to Deep Reinforcement Learning

Abacus.AI
Abacus.AI Blog (Formerly RealityEngines.AI)
6 min read · Mar 23, 2020

Before we begin, it's important to make a distinction between the two techniques in question: Deep Learning (DL) and Reinforcement Learning (RL). By imitating processes in the human brain, DL uses neural networks to discover patterns in data and make decisions. RL, on the other hand, is about software agents learning how to act in an environment in order to maximize a certain reward. When the two are brought together, the magic begins. The combination, called Deep Reinforcement Learning (DRL), has produced algorithms capable of feats such as beating world champions at Go.

How does Reinforcement Learning work?

RL is a subset of Machine Learning, along with Supervised and Unsupervised Learning. Supervised Learning trains on data that has both the inputs and the correct outputs already in place; the goal is then to predict outputs for future cases. Unsupervised Learning, on the other hand, is used to detect hidden patterns and structures in data when little is known about it.

RL takes a third approach which, simply put, works like using a carrot and stick with a computer program. The program can make certain “moves” that bring rewards, while others take them away. To be precise, the mathematical framework describing RL is called a Markov Decision Process. In this framework, an agent (our software) is located in an environment where it can make certain actions. By observing its surroundings, the agent performs actions that change the current state and grant or take away rewards.

Actions can mean things like moving a chess piece on the board or turning left in an autonomous car. Rewards are the values that we want our algorithm to maximize. In chess, rewards are the captured pieces and, ultimately, checkmate. For an autonomous car, the values are completely different — the main goal is to reach point B from point A in the shortest amount of time while ensuring the safety of the passengers.
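
To make that loop concrete, here is a minimal, self-contained sketch of the agent-environment interaction described above. The toy environment, its two states, its reward scheme, and the random placeholder policy are all illustrative assumptions rather than any real library or task.

```python
import random

# A toy environment with two states and two actions, purely illustrative.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 0 stays put; action 1 flips the state. A small reward is
        # paid whenever the agent ends up in state 1.
        if action == 1:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

env = ToyEnvironment()
total_reward = 0.0
state = env.state

# The agent-environment loop: observe the state, pick an action,
# receive a reward and the next state, repeat.
for t in range(10):
    action = random.choice([0, 1])      # a placeholder policy
    state, reward = env.step(action)
    total_reward += reward

print("cumulative reward:", total_reward)
```

A real agent would replace the random choice with a learned policy, which is exactly what the approaches below aim to find.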

Generally speaking, there are three main ways to approach RL problems:

  • Policy based. A policy is the mapping from states to actions that the agent follows while performing a task, so in this approach the focus is on searching directly for the optimal policy among all available ones.
  • Value based. The focus is on learning the optimal value function, which estimates the cumulative reward obtainable from each state or state-action pair, and then acting on it to maximize that reward (a minimal sketch of this idea follows the list).
  • Action based. The focus is on learning which optimal action to take at each step.
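
As an illustration of the value-based approach, the sketch below shows the textbook tabular Q-learning update, which incrementally estimates the cumulative reward of each state-action pair. The specific states, actions, and numbers are invented for illustration; only the update rule itself is standard.

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated cumulative reward.
Q = defaultdict(float)

alpha = 0.1    # learning rate
gamma = 0.9    # discount factor for future rewards
actions = [0, 1]

def q_update(state, action, reward, next_state):
    """One step of the standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Example: after taking action 1 in state 0, receiving reward 1.0 and
# landing in state 1, update the value estimate for that pair.
q_update(state=0, action=1, reward=1.0, next_state=1)
print(Q[(0, 1)])   # 0.1 after the first update, since all estimates start at 0
```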

RL techniques are powerful on their own; however, the addition of DL makes it possible to solve a much broader set of problems. A DRL agent, for example, can successfully learn from visual perceptual inputs made up of thousands of pixels. This opens up the possibility of mimicking human problem-solving capabilities, even in high-dimensional spaces.

For example, researchers over at Google DeepMind applied DRL to the domain of Atari 2600 games. Receiving only the pixels from the game and the current score as inputs, their agent was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games.
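
The sketch below, written with PyTorch as an assumed framework, shows the kind of convolutional Q-network such an agent uses: stacked pixel frames go in, one estimated value per joystick action comes out. The layer sizes follow the architecture DeepMind published for 84x84 grayscale frames, but the code is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Convolutional Q-network in the spirit of DeepMind's DQN:
    a stack of game frames in, one Q-value per action out."""

    def __init__(self, n_actions):
        super().__init__()
        # Convolutional feature extractor for 4 stacked 84x84 frames.
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected head producing one Q-value per action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):
        return self.head(self.features(frames))

# One batch of 4 stacked frames -> one Q-value per possible joystick action.
net = AtariQNetwork(n_actions=18)
q_values = net(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 18])
```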

Important applications of Deep Reinforcement Learning

Due to its nature, DRL finds pretty obvious applications in playing games, both classical board games and video games. Among the top accomplishments are the AlphaGo and OpenAI Five systems, which beat world champions at Go and at the esport Dota 2, respectively. However, the applications go far beyond what is considered "fun".

For example, DRL was used to improve the online environment of Taobao, a Chinese retail platform owned by Alibaba. It is one of the largest e-commerce websites in the world, with over 600 million monthly active users. DRL algorithms were unleashed on a virtual copy of the Taobao website and set to learn from hundreds of millions of customers' records in order to improve commodity search. As a result, new policies were trained that improved Taobao's online performance significantly.

Royal Dutch Shell is another example of successful DRL implementation, this time from the oil and gas industry. As part of its quest to find cleaner sources of power and stay on top of the evolving energy market, Shell has made significant investments in the research and development of AI. By using DRL, Shell has managed to bring down the cost of gas extraction, along with other improvements across its supply chain. Imagine the Earth's crust as a playing field and the drilling machine as a player. By using historical drilling data and DRL, the machine can optimize its path in search of gas, minimizing power consumption and steering clear of potential hazards.

Deep Reinforcement Learning for Recommender Systems

One DRL application of particular interest is in Recommender Systems (RS). When a user clicks between news articles, music tracks, or shows on Netflix, each click is treated as a move in a game. Correct moves increase user satisfaction, incorrect ones decrease it: a simple model, but an effective one. There are multiple approaches to building RS, including neural networks, but two primary limitations can be alleviated specifically by using DRL.

1. Most RS approaches treat the user experience as static

According to one study, users who consecutively received satisfactory recommendations tended to give increasingly positive ratings. That is, if a user kept seeing what they expected, that feeling of satisfaction would lead them to leave better marks; the opposite is also true, with unsatisfactory recommendations leading to lower ones. This effect was demonstrated by researchers on public datasets from MovieLens and Yahoo! Music. Over consecutive satisfactory recommendations, the average rating on MovieLens grew from 3.5 to 3.9, while consecutive unsatisfactory recommendations pushed it down to 3.0. DRL is capable of capturing this dynamic aspect of user-system interaction and taking the bias into account.

2. There is a tendency to focus on immediate feedback

By looking several moves ahead, DRL systems are capable of assessing which items will bring greater long-term satisfaction. If two items offer the same immediate reward, but one leads to greater user engagement, that item will be considered the better recommendation. Imagine a case where you are offered two news articles: one about the opening of the 2020 Olympics, the other about a local cat being saved from a tree. After reading about the cat, you will probably be done with the story. Reading about the Olympics, on the other hand, can lead you to click on all sorts of related articles. It's important to capture this difference in behavior when building a good RS. Just as it's important to save cats from trees!
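
A tiny sketch of the arithmetic behind that intuition: with a discount factor, the article that triggers follow-up clicks earns a much larger cumulative return even though both choices pay the same immediate reward. The reward sequences here are invented purely for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, with each step further in the future
    weighted down by an extra factor of gamma."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

cat_story = [1.0]                        # one click, then the session ends
olympics_story = [1.0, 1.0, 1.0, 1.0]    # one click that leads to three more

print(discounted_return(cat_story))       # 1.0
print(discounted_return(olympics_story))  # 1.0 + 0.9 + 0.81 + 0.729 = 3.439
```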

Currently, there is a lot of activity around building decent DRL recommender systems for news websites. Given the dynamic nature of news and of user preferences, there are many issues such systems need to tackle, and many new approaches are being tested. Researchers at Microsoft Research Asia have successfully run their Deep Q-Learning framework on a commercial news website for one month, demonstrating significant improvements in both recommendation accuracy and recommendation diversity.
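
To give a feel for how Q-values can drive a news recommender, here is a toy sketch in PyTorch: a summary of the user's recent reading history is concatenated with each candidate article's embedding, and a small network predicts the expected long-term reward of recommending that article. Every name and dimension here is an illustrative assumption, not the Microsoft Research system.

```python
import torch
import torch.nn as nn

STATE_DIM, ITEM_DIM = 32, 16   # hypothetical embedding sizes

# A small Q-network scoring (user state, candidate article) pairs.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM + ITEM_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

user_state = torch.randn(STATE_DIM)      # summary of recent reading history
candidates = torch.randn(5, ITEM_DIM)    # embeddings of 5 candidate articles

# Score every candidate and recommend the one with the highest Q-value.
inputs = torch.cat([user_state.expand(5, -1), candidates], dim=1)
q_values = q_net(inputs).squeeze(1)
best_item = torch.argmax(q_values).item()
print("recommend candidate", best_item)
```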

RealityEngines.AI and Deep Recommender AI

You can also quickly try deep-learning-based recommendation and personalization models and see how they work by requesting an invite to our service: http://realityengines.ai
