The Dynamic Duo — Deep Learning within Reinforcement Learning

Rodrigo Del Aguila
QMIND Technology Review
7 min read · May 15, 2023

A comprehensive guide on an AI sub-domain that is pushing the limits of Machine Learning.

Source: DALL-E 2 (prompt: robot playing chess, oil painting)

DL + RL = DRL

Let’s explore an area that is overshadowed by the current LLM/Generative AI hype: Deep Reinforcement Learning (DRL). For those with no prior knowledge of Machine Learning (ML) concepts/terminology, do not fear! I will provide definitions, visuals, and examples in my explanations.

As the name suggests, DRL is a combination of two prominent areas of ML: Deep Learning and Reinforcement Learning.

Deep Learning: ever heard of the term Neural Network? If so, you are already somewhat familiar with this topic (or at least the terminology). The formal definition of Deep Learning (DL) is a bit more boring:

“Deep Learning (DL) is essentially a multi-layer neural network used to mimic the behaviour of data processing within our brains.”

Source: https://towardsdatascience.com/everything-you-need-to-know-about-neural-networks-and-backpropagation-machine-learning-made-easy-e5285bc2be3a

The interconnected circles from the image above are called nodes. For now, think about them as neurons in the brain, connected to other surrounding neurons forming a massive network. Brain signals can be used to explain the flow of data in the above neural network diagram:

  1. Let’s say you stub your toe while running. While I do not recommend anyone try this at home, if this unfortunate event occurs, this would be a physical demonstration of a neural network (your brain) at work. Firstly, your nerves (input data) send a signal to your brain to be interpreted as pain (outcome). This is the first stage of the DL data flow — it is important to note that the desired outcome, or prediction, is just as crucial as the data being fed into the neural network.
  2. Moving on, once a signal — not yet identified as pain — has been sent to the brain, it is picked up by some neurons (input layer). As depicted by the arrows in the diagram, the initial set of neurons carry the signal on to the next set of neurons (hidden layer). At this stage, the signal starts to be decoded and interpreted. The brain is getting closer to the end result.
  3. Lastly, after millions of computations across the neurons affected by the signal, the brain returns its outcome in the form of a loud “OUCH!” (output layer). A minimal code sketch of this layered flow follows the list.
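To make the input-hidden-output flow concrete, here is a minimal sketch of a tiny feedforward network in NumPy. The layer sizes, random weights, and “nerve signal” input are all arbitrary illustrations (the weights are untrained, so the output is meaningless); the point is only the direction in which data flows.

```python
import numpy as np

def relu(x):
    # Non-linear activation: lets the network model more than straight lines
    return np.maximum(0, x)

# Arbitrary illustrative sizes: 3 input signals, 4 hidden neurons, 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input layer -> hidden layer weights
W2 = rng.normal(size=(4, 1))   # hidden layer -> output layer weights

def forward(inputs):
    hidden = relu(inputs @ W1)  # signal picked up and transformed by the hidden layer
    output = hidden @ W2        # final interpretation ("how much does this hurt?")
    return output

nerve_signal = np.array([0.9, 0.1, 0.4])  # made-up "stubbed toe" input data
print(forward(nerve_signal))
```

In a real network, training adjusts W1 and W2 (via backpropagation, which reappears in the AlphaGo discussion below) so the output matches the desired outcome.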

Note: I have included extra resources if you would like to learn more about certain algorithms, models, and even the mathematics involved in DL in the Quick links/Resources section.

Reinforcement Learning: remember when DeepMind’s AlphaGo beat three-time European Go champion Fan Hui in 2015? Me neither; I was probably trying to build piston doors in Minecraft. When I read about it a few years later, however, I was shocked: how could a machine that is not human, and incapable of understanding what the highly complex game of Go was about, beat a professional champion with a perfect 5–0 score? The answer: Reinforcement Learning (RL). But what is it?

“Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.”

Source: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292

One of the fundamental concepts of RL is feedback, more specifically an action-reward feedback loop shown in the above image (formally referred to as a Markov Decision Process). Instead of listing formal definitions for every term in the diagram, a simple example is a more effective way of explaining the elements of RL:

  1. Assume you (agent) are just starting to learn how to ride a bike (action). Obviously, the first time you attempt to go forward a few meters, you will almost certainly fall or lose balance. Let’s digest this process further: as the agent, by trying to ride the bike, you are interacting with your environment which includes the road, any obstacles ahead, and of course the bike itself.
  2. Perhaps you fall off your bike the next fifty times you attempt the feat. But after every iteration, you gradually learn how to balance yourself, for example by applying just the right amount of force while pedalling so the bike does not suddenly tip to one side. In short, you always receive a reward (a successful ride from point A to point B) or punishment (falling off or losing balance) and adjust your future actions based on the outcome of the previous one. This is how you learn the correct technique for riding a bike: by repeating the movements that kept your bike upright, and avoiding the ones that caused you to fall.
  3. After a few hundred more iterations, you are a natural! You can easily maneuver around any obstacle and are now considering trying out for the X Games.

As you can probably imagine, RL is applicable to a multitude of both human and machine processes in the world today. One thing that is common amongst all RL applications is the end goal: to maximize the total reward accumulated over the course of the learning process.
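To see the action-reward feedback loop in code, here is a minimal sketch of tabular Q-learning on a made-up one-dimensional “road”: the agent (our cyclist) starts at one end and is rewarded for reaching the other. The environment, rewards, and hyperparameters are all illustrative assumptions, not taken from any real system.

```python
import random

N_STATES, GOAL = 6, 5                  # a made-up road of 6 positions; position 5 is the goal
ACTIONS = [-1, +1]                     # action 0 moves left, action 1 moves right
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

# Q[state][action] estimates the total future reward of taking that action there
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):             # each episode is one "attempt at riding the bike"
    state = 0
    while state != GOAL:
        # Explore occasionally; otherwise exploit the best-known action
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        next_state = max(0, min(N_STATES - 1, state + ACTIONS[action]))
        reward = 1.0 if next_state == GOAL else -0.01  # reward success, mildly punish wandering
        # The feedback loop: nudge the estimate toward reward + best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)])  # 1 means "go right"
```

After enough episodes, the learned policy is simply “keep going right”: that action, repeated, is what accumulated the most reward, which is exactly the maximize-total-reward goal described above.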

Now that we have covered the basics of both DL and RL, we can properly define Deep Reinforcement Learning (DRL):

“Deep Reinforcement Learning combines artificial neural networks with a framework of reinforcement learning that helps software agents learn how to reach their goals.”

In equation form: RL + DL = DRL

We can break down the above example of AlphaGo to see how the principles of DL and RL merge to form DRL:

  1. In order for a Go player to win the game, they must evaluate board positions very well and also be able to look multiple moves ahead, essentially simulating different outcomes. For expert Go players, these skills can take years or even decades to develop over countless games. For AlphaGo, using the same methodology, this only took three weeks.
  2. AlphaGo’s model was trained on more than 30 million distinct board positions. For each simulated game, the algorithm is based on an RL policy network: a mapping from the agent’s (AlphaGo’s) state (the board position) to its actions (moves). A policy network is parameterized by a set of weights, and these weights are adjusted during self-play to make the policy more likely to win; the next move is then determined by the best game result out of the thousands of simulations run for each turn.
  3. The DL component complements the RL policy described above. In fact, it shares the same architecture but with a twist: it trains a separate set of weights using Supervised Learning and backpropagation, and is called the SL policy network. The purpose of the SL policy is to imitate expert moves, given the millions of board positions as inputs. Through backpropagation, errors from the current iteration are fed back to fine-tune the weights of each layer of nodes in the DL framework, ultimately improving the predictions for AlphaGo’s next move (a rough sketch of such a network follows this list).
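To make this more concrete, here is a minimal sketch of what an SL-style policy network could look like in PyTorch. Everything here is a placeholder assumption: the tiny fully connected layers, the random “expert” data, and the hyperparameters stand in for AlphaGo’s actual deep convolutional architecture and its millions of real positions.

```python
import torch
import torch.nn as nn

BOARD_CELLS, N_MOVES = 19 * 19, 19 * 19  # a Go board flattened into one vector

# A deliberately tiny stand-in for the SL policy network
policy = nn.Sequential(
    nn.Linear(BOARD_CELLS, 128),  # hidden layer (AlphaGo used deep convolutional layers)
    nn.ReLU(),
    nn.Linear(128, N_MOVES),      # one logit per possible move
)
loss_fn = nn.CrossEntropyLoss()   # "how far off were we from the expert's move?"
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)

# Placeholder training data: random positions and random "expert" moves
positions = torch.randn(64, BOARD_CELLS)
expert_moves = torch.randint(0, N_MOVES, (64,))

for step in range(100):
    logits = policy(positions)
    loss = loss_fn(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()  # backpropagation: errors flow back to fine-tune every layer's weights
    optimizer.step()
```

The key idea sits in the last few lines: the cross-entropy loss measures how far the network’s move distribution is from the expert’s actual choice, and backpropagation nudges every layer’s weights to shrink that error.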

In summary, the algorithm integrates the principles of DL to aid the rapid decision making of the software agent; more specifically, deep neural networks are used in conjunction with Monte Carlo Tree Search to select each move.
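As a final hedged sketch of how such a network can guide play, the toy loop below samples candidate moves in proportion to the policy’s probabilities, rolls each one out with a placeholder simulator, and picks the move with the best average result. This is a simplified rollout scheme in the spirit of Monte Carlo search, not AlphaGo’s actual MCTS; policy_probs and simulate are hypothetical stand-ins.

```python
import random

def policy_probs(position):
    # Hypothetical stand-in for the trained policy network's output:
    # pretend there are 5 legal moves, all equally likely
    return [0.2] * 5

def simulate(position, move):
    # Hypothetical stand-in for playing the game to the end:
    # return 1 for a simulated win, 0 for a loss
    return 1 if random.random() < 0.5 else 0

def choose_move(position, n_simulations=1000):
    probs = policy_probs(position)
    wins = [0] * len(probs)
    visits = [0] * len(probs)
    for _ in range(n_simulations):
        # Sample a candidate move in proportion to the policy's confidence
        move = random.choices(range(len(probs)), weights=probs)[0]
        wins[move] += simulate(position, move)
        visits[move] += 1
    # Pick the move with the best average simulated outcome
    return max(range(len(probs)), key=lambda m: wins[m] / max(visits[m], 1))

print(choose_move(position=None))
```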

So what?

DRL models and algorithms like AlphaGo have produced record-breaking results across several use cases. Here are a few:

  • Two-player/multi-player games: virtual agents have achieved superhuman-level performance in games that demand complex, diverse tasks and high-dimensional inputs
  • Robotics: Robust Adversarial RL trains a robot alongside an adversary that learns an optimal destabilization policy; coping with these destabilizing forces makes the robot’s own controller more robust
  • Self-driving cars: DRL is often used in autonomous driving, where scenarios regularly involve interaction between agents and their environment and demand accurate, dynamic decision-making (Tesla’s Autopilot is a well-known example)
  • Healthcare: for patients suffering from chronic conditions, DRL has enabled advancements in personalized medicine, where treatment is optimized within an RL framework and refined by patient feedback

Recap

A lot of new information can, unsurprisingly, be a lot to take in. So, I have summarized the article into some key points for every section/topic; these can serve either as a quick reference to brush up on high-level knowledge, or as a TL;DR.

Deep Reinforcement Learning (DRL)…

  • Is a remarkable subset of Machine Learning that has gained significant attention in recent years due to its ability to solve complex problems that were previously intractable for traditional RL methods.
  • Builds on the strengths of both DL and RL to overcome some of their limitations and achieve outstanding performance in various applications.
  • Combines the power of Deep Learning and Reinforcement Learning to help software agents learn how to reach their goals by interacting with their environment.
  • Has been used in various applications such as robotics, gaming, and autonomous driving to achieve superior performance compared to traditional RL methods.

Quick links/Resources

Deep Learning:

Reinforcement Learning:

Deep Reinforcement Learning:

This article was written for QMIND — Canada’s largest undergraduate community for leaders in the AI space.
