Reinforcement Learning — Basic understanding

A hands-on introduction to a booming area of machine learning

Anxo Rey Masid
SecuritasDirect
6 min read · Jan 30, 2021


Context

You have probably heard of reinforcement learning [1] or, if not, you have almost certainly come into contact with one of its applications without realizing it. Well into 2021, it has gained relevance in many areas, but perhaps the ones where it is most present, and where it has meant a qualitative leap over other areas of machine learning, are games/video games and the robotics industry.

A clear example of this in the world of robotics is the progress made by Boston Dynamics [2], founded by Marc Raibert in 1992 and bought by Google in 2013. Although the company has been working on technological development for years (interactive 3D simulations at first), it is in the last five years that it has gained growing popularity, thanks to the presentation of several new products and the media impact they have had.

From left to right, Spot and Atlas (Boston Dynamics).

In the image above we see Spot [3] and Atlas [4], two robots from the company’s product range that can perform complex tasks autonomously and that find great applicability in industry thanks to their efficiency and independence. According to their creators, these automatons could take over surveillance, protection, and element-extraction tasks without receiving direct and explicit orders from a human. A true revolution.

But without a doubt, if we had to single out one milestone reached by reinforcement learning in recent years, it would be the appearance of AlphaGo [5] and, later, AlphaGo Zero [6]. Both are computer programs from DeepMind [7], a British company founded in 2010 and likewise acquired by Google, in 2014.

The first of them, AlphaGo, was presented in 2016 as a demonstration that a machine could teach itself to perfect its strategy for a game as highly complex as Go [8], and it went on to beat the South Korean champion Lee Sedol that same year. The impact of this victory was transcendental not only for the scientific community: it also revealed how far this type of application could go and opened a complex and fruitful debate.

Lee Sedol playing against AlphaGo in March 2016.

Just a year later, under the name AlphaGo Zero, DeepMind presented an improved version of AlphaGo with one peculiarity: this time it had not been trained on data from real players’ games but had built its experience from games it simulated against itself. As expected, it beat the previous version; and not only that, but after 40 days of training it had become the strongest Go-playing intelligence on the planet: the perfect player.

The history and evolution of DeepMind’s developments continue today with programs such as AlphaZero [9], which extends this learning to chess and shogi, and MuZero [10], which brings this artificial intelligence to Atari games such as Pac-Man and Space Invaders.

Understanding

At this point, the reader may be wondering whether these advances are due solely to the increase in available computing power, to the technologies developed, or to something else entirely.

The truth is that computing power has enabled a qualitative leap in performance (computation time), but perhaps the most important factor is the change of approach applied to solving these problems.

The DeepMind programs mentioned above do not need their code to spell out the commands of a strategy explicitly: they follow a reinforcement learning process similar to that of animals. In other words, these programs strike a fine balance between exploration and exploitation.

Photo by Florian Olivo on Unsplash

In the machine’s first interactions with the game, its decisions will lean heavily toward exploration, since it does not yet have a learning history to draw on. As time goes by, this balance is reversed very gradually, and the machine comes to make decisions based on its own experience, by which point it will have explored almost all of the options. Consider, for example, a child who is learning to walk and tries pushing off with each leg at different intensities.
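
To make this trade-off concrete, here is a minimal sketch in Python of an epsilon-greedy selection rule with a decaying exploration rate. This is not how the DeepMind programs above are implemented; the action names, values, and decay schedule are illustrative assumptions.

```python
import random

def epsilon_greedy(action_values, epsilon):
    """With probability epsilon, explore (random action);
    otherwise exploit (action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))
    return max(action_values, key=action_values.get)

# Hypothetical estimated values for each action so far.
action_values = {"push_one_leg": 0.2, "push_both_legs": 0.5, "lean_forward": 0.1}

epsilon = 1.0  # start fully exploratory
for step in range(1000):
    action = epsilon_greedy(action_values, epsilon)
    # ... interact with the environment and update action_values here ...
    epsilon = max(0.05, epsilon * 0.995)  # drift gradually toward exploitation
```

The decay schedule mirrors the child analogy: early steps are almost pure trial and error, while later steps rely mostly on accumulated experience.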

At this point, let us look at a more formal definition of the elements that make up this type of problem.

Formalisms

To give reinforcement learning a more rigorous treatment, we can understand it as a Markov Decision Process (MDP), given by a 4-tuple (S, A, Pₐ, Rₐ) defined as follows:

  • S represents the set of possible states (state space).
  • A represents the set of possible actions (action space).
  • Pₐ(s, s’) = P(sₜ₊₁ = s’ | sₜ = s, aₜ = a) is the probability of transitioning from state s to state s’ via the action a.
  • Rₐ(s, s’) is the reward obtained after going from s to s’ with the action a.

Standard reinforcement learning workflow.
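
As an illustration, this 4-tuple can be written down directly for a toy problem. The following Python sketch invents a tiny two-state MDP inspired by the standing-up example below; real problems such as Go have state spaces far too large to enumerate this way.

```python
# A toy MDP (S, A, P_a, R_a) with two states and two actions.
S = ["lying_down", "standing"]
A = ["push_one_leg", "push_both_legs"]

# P[(s, a)] maps each possible next state s' to its transition probability.
P = {
    ("lying_down", "push_one_leg"):   {"lying_down": 0.7, "standing": 0.3},
    ("lying_down", "push_both_legs"): {"lying_down": 0.4, "standing": 0.6},
    ("standing",   "push_one_leg"):   {"standing": 1.0},
    ("standing",   "push_both_legs"): {"standing": 1.0},
}

# R[(s, a, s')] is the reward for moving from s to s' via action a;
# any transition not listed here yields a reward of 0.
R = {
    ("lying_down", "push_one_leg",   "standing"): 1.0,
    ("lying_down", "push_both_legs", "standing"): 1.0,
}
```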

To give this formal definition an interpretation, we can understand the approach in the following way:

an agent (a child) within an environment (the physical surroundings closest to it) tries to achieve a goal (standing up) whose strategy it does not yet know (it cannot stand yet). To do this, over discrete iterations (trying again and again) it will test different actions (pushing off with one leg, with both at the same time, etc.) and collect the success-or-failure information for each one, known as the reward (whether pushing with one leg lifts it a little or not, etc.), until it reaches the end (standing up).
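
Continuing the sketch, and reusing the toy S, A, P and R defined above, the agent-environment interaction reduces to a simple loop. A purely random agent stands in for a real learning rule here, just to show the shape of the process.

```python
import random

def step(state, action):
    """Sample the next state from P and look up the reward in R."""
    next_states = P[(state, action)]
    s_next = random.choices(list(next_states),
                            weights=list(next_states.values()))[0]
    reward = R.get((state, action, s_next), 0.0)
    return s_next, reward

state = "lying_down"
for t in range(20):                    # discrete iterations
    action = random.choice(A)          # a learning agent would choose better
    state, reward = step(state, action)
    if state == "standing":            # goal reached
        print(f"Stood up after {t + 1} attempts")
        break
```

Each pass through the loop is one agent-environment exchange: choose an action, receive a new state and a reward, repeat.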

Conclusion

With all this, we have defined and interpreted a formal framework in which to solve problems through the approach provided by reinforcement learning. This is where it gets really interesting: we could now apply an approximation algorithm to obtain the solution to any of these problems and thereby verify, de facto, the advance this approach represents over more traditional ones.

To be continued…

Notes

[1] Reinforcement Learning

[2] Boston Dynamics

[3] Spot

[4] Atlas

[5] AlphaGo

[6] AlphaGo Zero

[7] DeepMind

[8] Go

[9] AlphaZero

[10] MuZero
