The unknown origin and EC application of Reinforcement Learning

Masaya Mori 森正弥
Jul 20, 2020 · 7 min read


Thanks to the recent boom in “deep learning,” AI is often thought of as something that finds laws in data. That is precisely what we call “supervised learning”: a branch of machine learning that identifies and derives laws from data given in advance, regarded as “training examples from a teacher.”

In contrast, those who are not familiar with machine learning may think of something more like “reinforcement learning” when they hear the word “AI.” It is another branch of machine learning that improves its behavior through experience: not by learning from a prepared dataset, but by first taking an action and then learning from the result. Reinforcement learning differs from supervised learning in that it receives no teacher information in advance; instead, it obtains feedback in the form of rewards after taking action, and uses that feedback as a cue for further learning. The premise is that reinforcement learning assumes an uncertain environment. Rewards also differ in nature from teacher information: they can be noisy or delayed. This makes reinforcement learning well suited to reflecting each result and each user’s reaction step by step, gradually optimizing a system or service.

In the past, reinforcement learning was most often used in the field of robotics, but nowadays it is best known as the method AI uses to win games.

In 2016, AlphaGo defeated the Go player Lee Sedol; it learned heavily from historical game records through supervised learning (further refined by reinforcement learning in self-play). In 2017, AlphaGo Zero, which beat AlphaGo decisively, used no historical game records at all and improved its skills purely through deep reinforcement learning in self-play.

In a new effort, DeepMind, the creator of AlphaGo and AlphaGo Zero, built AlphaStar, which aims to be the strongest player of StarCraft II, a real-time strategy game. In January 2019, it played two top professional players and won ten games in a row. AlphaStar is an advanced combination of supervised learning and reinforcement learning in a deep neural network: starting from data, it took actions on its own and used the feedback to grow stronger. Its model comes closer to the way humans behave intelligently and grow.

The unknown origin

By the way, such “reinforcement learning” actually has its roots in psychological theory derived from experiments in animal physiology.

Ivan Pavlov, a physiologist of Imperial Russia and later the Soviet Union, began this line of research in 1901, and in 1903 he published his theory of the “conditioned reflex,” based on the experiment commonly referred to as “Pavlov’s dog”:

1. Let the dog listen to a metronome.

2. Give the dog some food. The dog eats the food.

3. Repeat steps 1 and 2.

4. Then, when the dog merely hears the metronome, it salivates.

This famous experiment demonstrated the existence of a “reflex that predicts, at the unconscious level, what will happen next.” It became a source of intellectual inspiration for the psychologist and founder of behavior analysis, Burrhus Frederic (B. F.) Skinner, who was born in 1904, the year after Pavlov published his dog experiments. Skinner admired Pavlov’s experiments, helped introduce them to the West, and used them as the basis for his own ideas. He believed that human behavior depends on the results of past actions.

If the outcome of a past behavior was bad, there is a high probability the behavior will not be repeated; if the outcome was good, it is likely to be repeated again and again. Summing up his studies, he formulated the principle of reinforcement: behavior can be controlled by the degree of reward or punishment that directly follows it. In other words, behavior is “reinforced” by feedback from its results. This concept was adopted directly as reinforcement learning in machine learning, which evolved into the third field of machine learning, after supervised learning and unsupervised learning.

The goal of reinforcement learning

We mentioned in the previous section that reinforcement learning assumes an uncertain environment. The goal is to learn how to cope with that uncertainty in order to maximize rewards.

Let’s take the example of playing soccer.

You are attacking toward the opponent’s goal and holding the ball. Opposing players are running beside you and could take the ball at any moment. A teammate is running diagonally in front of you, but he is marked. The opposing goalkeeper seems very aware of both of your movements.

You can shoot yourself, dribble and then shoot, or pass to your teammate. Whichever action you take, you take the next one hoping it will give you an advantage and bring you closer to scoring. In short, you act with the aim of getting a point. Depending on how you play, your surroundings change as well: if you dribble, the goalkeeper might rush out at you, or your teammate might run to a spot where he can receive the ball more easily. The game goes on like that, and you receive feedback on whether your actions actually led to a goal or not. Through this trial-and-error process, you improve your soccer skills and your scoring ability.

Of course, you have to follow the rules of soccer. If you beat up the opposing player who is marking you, you might score a goal, but you must not do that. Within the rules, or constraints, you hone your skills in practice to maximize your chances of scoring.

If you follow this through, you actually end up with a model that looks like this:

Based on the surroundings you are aware of (the state), your action is the input, and the output is whether or not the play caused by your action scores a goal. You play through the game, and at the end, the final reward is how many points you scored by the time the game is over. (This explanation is a bit oversimplified, though.)

Reinforcement learning takes feedback from these results to learn and acquire better decision-making.
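To make that loop concrete, here is a minimal sketch in Python: tabular Q-learning on a made-up “shooting practice” toy, where states represent how close you are to the goal, the actions are dribble or shoot, and the reward is 1 for a goal. The environment, its probabilities, and the hyperparameters are all invented for illustration; this shows the general technique, not the method of any system mentioned above.

```python
import random

# Toy "shooting practice" environment: states 0..4 are how close you are
# to the goal; action 0 = dribble (move closer), action 1 = shoot.
# All probabilities here are invented for illustration.
N_STATES = 5
ACTIONS = [0, 1]  # 0: dribble, 1: shoot

def step(state, action):
    """Take one action; return (next_state, reward, done)."""
    if action == 0:  # dribble: get closer, but risk losing the ball
        if random.random() < 0.2:
            return 0, 0.0, True              # ball lost, play is over
        return min(state + 1, N_STATES - 1), 0.0, False
    scored = random.random() < (state + 1) / (N_STATES + 1)  # closer = better odds
    return state, (1.0 if scored else 0.0), True  # a shot ends the play

# Tabular Q-learning: estimate the value of each (state, action) pair
# purely from the reward feedback described above.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # learning rate, discount, exploration

for episode in range(20000):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:                         # explore
            action = random.choice(ACTIONS)
        else:                                                 # exploit
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

for s in range(N_STATES):
    print(f"state {s}: dribble={Q[s][0]:.2f}  shoot={Q[s][1]:.2f}")
```

After enough episodes, the learned Q-values tell the agent, at each distance, whether dribbling closer or shooting is the more promising action, purely from trial and error.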

Applications of reinforcement learning to EC

While AlphaGo and its peers provide the most remarkable use cases of reinforcement learning, it is also widely applied in e-commerce.

One example is the multi-armed bandit algorithm. It is used to improve recommendations and search results by incorporating customer responses, such as which products they clicked and how quickly they responded, as rewards. In other use cases, it is applied to personalizing advertisements and to A/B tests.

The multi-armed bandit algorithm is a method for maximizing rewards under limited resources by balancing two types of behavior: “exploitation” of previously tried means, to collect the reward (in this case, a positive response from the user) they are known to yield, and “exploration” of unknown means that may yield additional rewards. In other words, within a set number of trials you pursue the best choice by combining what you already know works because you have done it before, and what you have not yet done and want to check by trying it.
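As a minimal sketch of that trade-off, the epsilon-greedy strategy spends a small fixed fraction of trials exploring options at random, and the rest exploiting the best option found so far. The click rates below are invented for illustration.

```python
import random

# Epsilon-greedy bandit sketch. Each "arm" is an option to show
# (e.g., an ad); reward 1 = click, 0 = no click.
true_click_rates = [0.04, 0.06, 0.05]   # hidden from the algorithm

counts = [0] * 3        # how often each arm has been shown
values = [0.0] * 3      # running mean reward per arm
epsilon = 0.1           # fraction of trials spent on exploration

for t in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(3)                      # explore at random
    else:
        arm = max(range(3), key=lambda a: values[a])   # exploit the best so far
    reward = 1.0 if random.random() < true_click_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts, [round(v, 3) for v in values])
```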

Consider an A/B test in which we create two ad plans (Plan A and Plan B) and display each to 50% of users. We take the KPI to be simply the ratio of clicks to impressions. If the numbers of clicks on Plan A and Plan B are about the same, that is fine; but if the clicks on Plan B are desperately low, then showing Plan B to 50% of users on a production service is a sheer loss of opportunity, even during the A/B test period.

So, using the multi-armed bandit algorithm, if Plan B is performing poorly, you can dynamically reduce Plan B’s share of impressions and increase Plan A’s, maximizing the KPI and reducing the loss as much as possible.
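One way to implement this dynamic reallocation is Thompson sampling, one of the standard bandit techniques named below: keep a Beta posterior over each plan’s click-through rate, sample from both posteriors on every impression, and show the plan whose sample wins. The click rates here are hypothetical.

```python
import random

# Thompson sampling sketch for the Plan A / Plan B example.
# A poorly performing plan automatically receives less traffic.
true_ctr = {"A": 0.05, "B": 0.02}         # unknown to the algorithm
alpha = {"A": 1, "B": 1}                  # Beta posterior: clicks + 1
beta = {"A": 1, "B": 1}                   # Beta posterior: non-clicks + 1
shown = {"A": 0, "B": 0}

for impression in range(100000):
    # sample a plausible CTR for each plan from its posterior
    sampled = {p: random.betavariate(alpha[p], beta[p]) for p in ("A", "B")}
    plan = max(sampled, key=sampled.get)
    shown[plan] += 1
    if random.random() < true_ctr[plan]:  # user clicked
        alpha[plan] += 1
    else:
        beta[plan] += 1

print(shown)  # traffic shifts heavily toward the better plan
```

Because Plan B’s posterior quickly concentrates around its low click rate, it wins the sampling step less and less often, so traffic shifts to Plan A without ever cutting Plan B off completely.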

You can manage two ad ideas manually. But if you had to create, say, 60 ad ideas, or more diverse ideas tailored to fine-grained user segments, it would be impossible to adjust the impressions for each one by hand and raise the KPI as a whole. In such cases, there are clear advantages to applying reinforcement learning techniques to automate the process.

A bandit approach loses fewer opportunities than manual operation. However, there is a trade-off between exploiting past experience and exploring for further rewards, so there is always the challenge of how, and how far, to explore in order to minimize the lost benefit. Whatever you choose must be based on historical data and incomplete estimates.

Typical multi-armed bandit techniques are epsilon-greedy, UCB, and Thompson sampling. The UCB algorithm is widely used because it meets this challenge with a provable limit on the loss. However, it is empirically known that Thompson sampling is more powerful in many cases.
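For reference, here is a sketch of UCB1, the best-known UCB variant: on each trial, pick the option with the highest “estimated mean plus confidence bonus.” Rarely tried options get a large bonus, so exploration happens automatically and the total loss can be bounded. As before, the click rates are made up.

```python
import math
import random

# UCB1 sketch: optimism in the face of uncertainty.
true_click_rates = [0.04, 0.06, 0.05]
n_arms = len(true_click_rates)
counts = [0] * n_arms
means = [0.0] * n_arms

for t in range(1, 10001):
    if t <= n_arms:
        arm = t - 1                       # try every arm once first
    else:
        # mean + confidence bonus; the bonus shrinks as an arm is tried more
        arm = max(range(n_arms),
                  key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < true_click_rates[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]

print(counts)  # most trials go to the best arm, but every arm keeps being checked
```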

The future

Following DeepMind’s advanced work mentioned above, we will see more examples of deep-neural-network-based reinforcement learning combined with supervised learning. The application of reinforcement learning may be essential for developing the more complex and sophisticated AI systems of the future.

In this article, I summarized reinforcement learning, including some points that are not covered in other articles. If that helps, I’m very happy.


Masaya Mori 森正弥

Deloitte Digital, Partner | Visiting Professor in Tohoku University | Mercari R4D Advisor | Board Chair on AI in Japan Institute of IT | Project Advisor of APEC