Reinforcement Learning and the Asynchronous Advantage Actor-Critic (A3C) Algorithm, Explained

Sciforce · Mar 25, 2021

While supervised and unsupervised machine learning are far more widespread among enterprises today, reinforcement learning (RL), as a goal-oriented ML technique, is finding its way into mundane real-world activities. Gameplay, robotics, dialogue systems, autonomous vehicles, personalization, industrial automation, predictive maintenance, and medicine are among RL's target areas. In this blog post, we provide a concrete explanation of RL, its applications, and the Asynchronous Advantage Actor-Critic (A3C) algorithm, one of the state-of-the-art algorithms developed by Google's DeepMind.

Key Terms and Concepts

Reinforcement learning is a machine learning technique that enables an agent to learn to interact with an environment (the area outside the agent's borders) by trial and error, using reward (feedback from its actions and experiences).

The agent is a learning controller that takes actions in the environment and receives feedback in the form of reward.

The environment is the space where the agent operates and from which it gets everything it needs in a given state. The environment can be static or dynamic, and its changes can be deterministic or stochastic. It is usually formulated as a Markov decision process (MDP), a mathematical framework for modeling decision-making.

The agent seeks to maximize the reward by interacting with the environment rather than by analyzing a provided dataset.


However, real-world situations often do not convey enough information to commit to a decision (some context stays outside the currently observed scene). Hence, the Partially Observable Markov Decision Process (POMDP) framework comes onto the scene. In a POMDP, the agent needs to take into account a probability distribution over states. In cases where it is impossible to know that distribution, RL researchers use a sequence of multiple observations and actions to represent the current state (e.g., a stack of image frames from a game) to better capture the situation. This makes it possible to use RL methods as if we were dealing with an MDP.
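As a minimal sketch of that trick, the snippet below stacks the last four observations into a single state representation, the same idea as stacking image frames in an Atari game. The env object with reset() and step(action) methods is a hypothetical placeholder, not a specific library API:

```python
from collections import deque

import numpy as np

FRAME_STACK = 4  # how many recent observations form one "state"

def reset_state(env):
    """Start an episode and fill the stack with copies of the first observation."""
    obs = env.reset()
    frames = deque([obs] * FRAME_STACK, maxlen=FRAME_STACK)
    return frames, np.stack(frames)

def step_state(env, frames, action):
    """Take an action, push the new observation, and rebuild the stacked state."""
    obs, reward, done = env.step(action)    # hypothetical env contract
    frames.append(obs)                       # the oldest frame is dropped automatically
    state = np.stack(frames)                 # shape: (FRAME_STACK, *obs.shape)
    return state, reward, done
```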

The reward is a scalar value the agent receives from the environment. It depends on the environment's current state (St), the action the agent performed in that state (At), and the following state of the environment (St+1): Rt+1 = R(St, At, St+1).

Policy (π) stands for an agent's strategy of behavior at a given time. It is a mapping from states to the actions to be taken in them. Speaking formally, it is a probability distribution over actions in a given state, giving the likelihood of every action in that state.

In short, policy holds an answer to the “How to act?” question for an agent.
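To make "a probability distribution over actions" concrete, here is a minimal sketch of a stochastic policy: a softmax turns per-action scores into probabilities, and the agent samples an action from them. The scores are made up for illustration; in deep RL they would come from a neural network:

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary per-action scores into a probability distribution."""
    exp = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp / exp.sum()

def sample_action(rng, action_scores):
    """Stochastic policy pi(a|s): sample an action according to its probability."""
    probs = softmax(action_scores)
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
scores_for_state = np.array([1.2, 0.3, -0.5])   # illustrative scores for three actions
action, probs = sample_action(rng, scores_for_state)
print(action, probs)   # probs is the likelihood of every action in this state
```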

The state-value function and the action-value function are ways to assess a policy, as RL aims to learn the best policy.

The state-value function V holds the answer to the question "How good is the current state?", namely the expected return starting from state (S) and following policy (π).

Sebastian Dittert defines the action-value of a state as “the expected return if the agent chooses action A according to a policy π.”

Correspondingly, it is the answer to "How good is the current action?"

Thus, the goal of an agent is to find the policy (π) that maximizes the expected return (E[R]). Over multiple iterations, the agent's strategy becomes more successful.
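Both value functions are expectations of the discounted return. The minimal sketch below (plain Python, with made-up reward lists) computes the return of one episode and estimates V(s) the Monte Carlo way, by averaging the returns observed after visiting a state:

```python
def discounted_return(rewards, gamma=0.99):
    """R = r0 + gamma*r1 + gamma^2*r2 + ... for a single episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_state_value(reward_sequences, gamma=0.99):
    """Monte Carlo estimate of V(s): average return over rollouts that started in s."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)

# three illustrative rollouts that started in the same state under the same policy
print(estimate_state_value([[1, 0, 0, 1], [0, 0, 2], [1, 1, 0]]))
```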

One of the most crucial trade-offs in RL is the balance between exploration and exploitation. In short, exploration aims at collecting experience from new, previously unseen regions. Its potential downsides are risk, the chance that there is nothing new to learn, and no guarantee of obtaining any useful further information.

Exploitation, on the contrary, updates the model parameters according to the gathered experience. In its turn, it does not provide any new data and may be inefficient when rewards are scarce. An ideal approach is to let the agent explore the environment just until it is able to commit to an optimal decision.
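The simplest way to encode this trade-off is an epsilon-greedy rule, sketched below: with probability epsilon the agent explores a random action, otherwise it exploits the action with the best current value estimate, and epsilon is annealed over time. (This is a generic illustration rather than what A3C does; A3C relies on a stochastic policy with an entropy bonus for exploration.)

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)     # exploitation

# anneal epsilon so the agent explores a lot early on and mostly exploits later
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for step in range(10_000):
    # action = epsilon_greedy(current_q_estimates, epsilon)  # current_q_estimates is hypothetical
    epsilon = max(epsilon_min, epsilon * decay)
```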

Reinforcement Learning vs. Supervised and Unsupervised Learning

Let us start with defining the critical aspects of RL. Check out the crucial facets in the infographic below:

Critical aspects of reinforcement learning

Comparing RL with AI planning: the latter covers all of these aspects except exploration. It amounts to computing the right sequence of decisions based on a model indicating their impact on the environment.

Supervised machine learning involves only optimization and generalization: the model learns from previous experience, guided by the correct labels of a given dataset. This ML technique is more task-oriented and applicable to recognition, predictive analytics, and dialogue systems. It is an excellent option for solving problems that have reference points or ground truth.

Similarly, unsupervised machine learning also involves only optimization and generalization, but with no labels describing the environment. It is data-oriented and applicable to anomaly and pattern discovery, clustering, autoencoders, association, and the hyper-personalization pattern of AI.

Asynchronous Advantage Actor-Critic (A3C) Algorithm

The A3C algorithm is one of RL's state-of-the-art algorithms and beats DQN in several domains (for example, the Atari domain; see the fifth page of the classic paper by Google DeepMind). A3C is also beneficial in experiments that optimize a global network against different environments in parallel for generalization purposes. Here is the magic behind it:

Asynchronous stands for the principal difference between this algorithm and DQN, where a single neural network interacts with a single environment. Here, on the contrary, we have a global network and multiple agents, each with its own set of parameters. Every agent interacts with its own copy of the environment and harvests a different, unique learning experience for the overall training. This also partially deals with sample correlation, a big problem for neural networks, which are optimized under the assumption that input samples are independent of each other (which is not the case in games).

In A3C, a global network is shared by multiple agents, each with its own set of parameters
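A rough sketch of that pattern is shown below. Each worker keeps a local copy of a shared global network, collects experience from its own environment, and pushes its gradients into the shared parameters. It uses plain Python threads and a dummy loss to stay short and self-contained; real A3C implementations typically use torch.multiprocessing, Hogwild-style lock-free updates, and an actual actor-critic loss:

```python
import threading

import torch
import torch.nn as nn

global_model = nn.Linear(4, 2)   # stands in for the global A3C network
optimizer = torch.optim.Adam(global_model.parameters(), lr=1e-3)
lock = threading.Lock()

def worker(worker_id, steps=100):
    local_model = nn.Linear(4, 2)                                 # this agent's own parameters
    for _ in range(steps):
        local_model.load_state_dict(global_model.state_dict())   # sync with the global network
        # ...interact with this worker's own environment and compute the A3C loss;
        # a dummy loss on random inputs keeps the sketch runnable:
        loss = local_model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        with lock:   # apply the locally computed gradients to the global network
            optimizer.zero_grad()
            for g_param, l_param in zip(global_model.parameters(), local_model.parameters()):
                g_param.grad = l_param.grad.clone()
            optimizer.step()
        local_model.zero_grad()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```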

Actor-Critic stands for two neural networks — Actor and Critic.

The Actor's goal is to optimize the policy ("How to act?"), while the Critic aims at estimating the value ("How good is the action?").

Together, they create a complementary setup that lets the agent learn quickly from its experience.
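In practice, the two networks often share their lower layers: the original A3C paper uses a common trunk with separate policy and value output heads for its Atari agents. A minimal PyTorch sketch of that layout, with layer sizes chosen purely for illustration:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy head: "How to act?"
        self.critic = nn.Linear(hidden, 1)          # value head: "How good is this state?"

    def forward(self, obs):
        h = self.trunk(obs)
        policy_logits = self.actor(h)   # a softmax over these gives pi(a|s)
        state_value = self.critic(h)    # the Critic's estimate of V(s)
        return policy_logits, state_value

model = ActorCritic(obs_dim=4, n_actions=2)
logits, value = model(torch.randn(1, 4))
action = torch.distributions.Categorical(logits=logits).sample()
```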

Advantage: imagine that the advantage is the value that answers the question "How much better is the reward the agent received than what could be expected?" It is the other factor that improves the overall situation for the agent, since this way the agent learns which of its actions were rewarding and which were penalizing. Formally, it looks like this:

A(s, a) = Q(s, a) - V(s)

where Q(s, a) stands for the expected future reward of taking action a in a particular state s, and V(s) stands for the value of being in that state.
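Since A3C does not learn Q(s, a) directly, the advantage is estimated from the rewards the agent actually collected and the Critic's value estimates, e.g. the one-step form A ≈ r + γ·V(s') - V(s). Below is a minimal sketch of the n-step variant, assuming the rewards and value estimates for one short rollout are already available:

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage A_t = (n-step return from t) - V(s_t) for one rollout.

    rewards:         rewards r_0 ... r_{T-1} collected by the agent
    values:          the Critic's estimates V(s_0) ... V(s_{T-1})
    bootstrap_value: V(s_T), the Critic's estimate for the state after the rollout
    """
    advantages = []
    ret = bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret          # discounted return bootstrapped from V(s_T)
        advantages.append(ret - v)     # how much better than the Critic expected
    return advantages[::-1]

# illustrative numbers only
print(n_step_advantages(rewards=[1.0, 0.0, 1.0], values=[0.8, 0.6, 0.9], bootstrap_value=0.5))
```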

Challenges and Opportunities

Reinforcement learning's first application areas were gameplay and robotics, which is not surprising, as the technique needs a lot of simulated data. Meanwhile, today RL is applied to mundane tasks like planning, navigation, optimization, and scenario simulation across various verticals. For instance, Amazon has used it to optimize logistics and warehouse operations and to develop autonomous drone delivery.

At the same time, RL still poses challenging questions that industries have yet to answer. Given its exploratory nature, it is not applicable in some areas yet. Check out some of the reasons in the infographic below:

Reasons why reinforcement learning is not yet applicable in some business areas

Meanwhile, RL seems to be worth the investment of time and resources, as industry players like Amazon show. Just give it some time, since investment in knowledge always requires it.

We’d love to hear from you regarding RL’s perspectives. Drop us a line in the comments!
