Reinforcement learning is one of the most active fields in artificial intelligence right now, and new algorithms are being released at an incredible rate. Asynchronous Advantage Actor-Critic (A3C) is an algorithm released by Google’s DeepMind. It proved to be faster, simpler, and more robust than many existing deep reinforcement learning algorithms: on the Atari benchmark it beat the previous state of the art while training for half the time on a single multi-core CPU instead of a GPU. What is the magic behind this algorithm? First, check out what I built:
What I Built
Here is a video of my A3C implementation playing Space Invaders in the OpenAI Gym environment.
So, how does it work? Let’s learn.
Asynchronous Advantage Actor-Critic
I’ll go over the three parts of the name (Actor-Critic, Asynchronous, and Advantage) in a different order to make them easier to understand.
The Actor-Critic model combines the best of two worlds: value-based reinforcement learning algorithms (Q-Learning, Deep Q-Learning) and policy-based algorithms (policy gradient methods such as REINFORCE). Actor-Critic consists of two neural networks, the Actor and the Critic. The actor selects actions according to a policy, a probability distribution over the actions available in the current state, and the critic estimates how valuable it is to be in a certain state. You can think of the actor as a player of a game and the critic as an observer. The actor doesn’t know how to play at first but improves over time with feedback from the critic. The actor tries to optimize the policy, while the critic tries to learn an accurate estimate of the value.
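To make the two roles concrete, here is a minimal sketch in Python with NumPy, assuming the game screen has already been turned into a feature vector. In A3C the actor and critic usually share every layer except the last one, so the sketch models them as two output heads on the same features. The sizes `N_FEATURES` and `N_ACTIONS` and the random weights are hypothetical placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8  # hypothetical size of the feature vector the shared layers produce
N_ACTIONS = 6   # hypothetical number of actions in the game

# Actor head: one score (logit) per action. Critic head: a single number.
actor_w = rng.normal(scale=0.1, size=(N_FEATURES, N_ACTIONS))
critic_w = rng.normal(scale=0.1, size=N_FEATURES)


def softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def act(features):
    """Actor: turn features into a policy (action probabilities) and sample from it."""
    probs = softmax(features @ actor_w)
    action = int(rng.choice(N_ACTIONS, p=probs))
    return action, probs


def evaluate(features):
    """Critic: estimate V(s), how valuable the current state is."""
    return float(features @ critic_w)


state_features = rng.normal(size=N_FEATURES)  # placeholder for one screen's features
action, probs = act(state_features)
value = evaluate(state_features)
```

Sampling from the probabilities, rather than always picking the most likely action, is what lets the actor keep exploring while it learns.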
The first A stands for Asynchronous, which lets the agent gather much more experience. There are multiple instances of the agent (workers), each initialized differently and running in its own separate copy of the environment. Each worker takes actions and collects its own unique experiences. It uses those experiences to compute updates for a global neural network that is shared by all of the workers, and it periodically copies the latest global weights back into its own network. This shared network drives the actions of every worker, so every new experience from any worker improves the overall network. Because the workers run in parallel and explore different parts of the game at the same time, training is faster, and the mix of uncorrelated experiences makes it more stable.
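Here is a toy sketch of the asynchronous pattern using Python threads. The "global network" is a single shared weight, and each worker repeatedly copies it, computes a gradient from its own noisy, differently seeded version of the problem, and pushes the update back. The target value, learning rate, and noise level are hypothetical stand-ins for a real environment and neural network, and the lock is a simplification that keeps the shared update safe.

```python
import random
import threading

# Hypothetical toy problem standing in for "learn good network weights":
# every worker wants to move the shared weight toward TARGET, but each one
# only sees its own noisy, differently seeded version of the problem.
TARGET = 3.0
LEARNING_RATE = 0.05

global_params = {"w": 0.0}  # the shared, global "network" (one weight here)
lock = threading.Lock()     # guards reads and writes of the shared weight


def worker(worker_id, steps):
    rng = random.Random(worker_id)  # each worker gets its own seed / environment
    for _ in range(steps):
        with lock:
            local_w = global_params["w"]  # 1. copy the latest global weights
        # 2. "Experience" from this worker's environment: a noisy gradient of
        #    the loss (local_w - TARGET)**2 / 2, computed on the local copy.
        noisy_target = TARGET + rng.gauss(0.0, 0.5)
        grad = local_w - noisy_target
        with lock:
            global_params["w"] -= LEARNING_RATE * grad  # 3. update the global weights


threads = [threading.Thread(target=worker, args=(i, 200)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the workers read the shared weight, compute their update, and write it back at different times, some updates are computed from slightly stale weights. The approach tolerates this in practice, which is part of what makes it simple to parallelize.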
The Advantage tells us how much better a certain action is than what we would expect, on average, from the state it was taken in. The advantage formula is
$$A(s, a) = Q(s, a) - V(s)$$
Q(s, a) is the Q value, the expected future reward of taking action a in state s. V(s) is the value of being in state s, the expected future reward averaged over the actions the policy might take. A positive advantage means the action turned out better than expected, so the actor makes it more likely; a negative advantage means it turned out worse, so the actor makes it less likely.
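A3C doesn’t learn Q(s, a) directly. Instead, it estimates Q(s_t, a_t) from the rewards a worker actually collected over a short rollout plus the critic’s estimate of the state where the rollout stopped, then subtracts V(s_t). The function below is a minimal sketch of that n-step estimate; the function name and arguments are my own, and `gamma=0.99` is just a common default for the discount factor.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Estimate A(s_t, a_t) = Q(s_t, a_t) - V(s_t) for each step of a short rollout.

    rewards:         rewards r_t collected by the actor, oldest first
    values:          critic estimates V(s_t) for the same steps
    bootstrap_value: critic estimate V(s) of the state after the last step
                     (use 0.0 if the episode ended there)
    """
    advantages = []
    ret = bootstrap_value
    # Walk backwards: the return from step t is r_t + gamma * (return from t + 1),
    # which serves as the estimate of Q(s_t, a_t).
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret
        advantages.append(ret - v)
    advantages.reverse()
    return advantages
```

For example, `n_step_advantages([0, 0, 1], [0.5, 0.6, 0.8], 0.0, gamma=0.9)` returns approximately `[0.31, 0.3, 0.2]`: every step ended up a little better than the critic predicted, so each of those actions would be reinforced.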
Putting It All Together
Putting the three A’s together, we get a very strong reinforcement learning model. Improvements such as adding a Long Short-Term Memory (LSTM) layer boost its performance further. Reinforcement learning algorithms are typically tested on video games because games provide challenging environments with a clear score to optimize.
The AI is only able to see what a human would be able to see. It has no access to the game’s internal data; its only data source is what is visible on the screen, including the score, health, and the state of the game. Since the input is an image, a convolutional neural network (CNN) is a natural fit. The CNN takes the screen as input and extracts features from the image, producing a compact set of numbers that describes what is on the screen.
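Below is a minimal NumPy sketch of what a single convolutional filter does to a screen: it slides over the image, scores how well each patch matches the filter, and keeps the positive responses. The random screen and filter are placeholders; a real network learns dozens of filters across several layers. The 8x8 filter with stride 4 mirrors the first layer commonly used for Atari agents, applied here to a full-size 210 x 160 frame for simplicity, so treat those settings as assumptions.

```python
import numpy as np


def conv2d_relu(image, kernel, stride):
    """One 'valid' convolution (no padding) followed by ReLU on a 2-D image.

    Like most deep learning libraries, this slides the filter without flipping it.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # how strongly this patch matches the filter
    return np.maximum(out, 0.0)  # ReLU: keep only positive responses


# A stand-in for one grayscale Atari screen (210 x 160 pixels in Gym).
screen = np.random.default_rng(0).random((210, 160))
# Hypothetical 8x8 filter; a real CNN learns many of these during training.
kernel = np.random.default_rng(1).normal(size=(8, 8))
features = conv2d_relu(screen, kernel, stride=4)
print(features.shape)  # (51, 39)
```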
These features are then passed to an LSTM layer, which feeds its output to the fully connected output layers. The LSTM layer gives the model memory, so it can base its decisions on what it saw in earlier frames and not just on the current screen. Each LSTM cell uses three gates: a forget gate that decides what to erase from the cell’s memory, an input gate that decides what new information to write to it, and an output gate that decides how much of that memory to pass on.
The combination of these gates allows the network to remember what happened earlier and use it to improve the agent’s performance.
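For readers who want to see the gates written out, here is a minimal NumPy sketch of one step of a standard LSTM cell, run over a short sequence of inputs. The weight shapes, the order in which the gates are packed inside `z`, and the random inputs are assumptions for illustration; deep learning libraries pack the gates in their own fixed order.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    x: input at this time step (e.g. CNN features), shape (n_in,)
    h_prev, c_prev: previous hidden state and cell state, shape (n_hidden,)
    W: (4 * n_hidden, n_in), U: (4 * n_hidden, n_hidden), b: (4 * n_hidden,)
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b   # all four gate pre-activations at once (assumed order f, i, g, o)
    f = sigmoid(z[0:n])          # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new information to write
    g = np.tanh(z[2 * n:3 * n])  # candidate values to write
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much memory to expose
    c = f * c_prev + i * g       # new cell state (the memory)
    h = o * np.tanh(c)           # new hidden state, passed on to the output layers
    return h, c


rng = np.random.default_rng(0)
n_in, n_hidden = 5, 3
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(4):  # feed a short sequence of (placeholder) frame features
    x = rng.normal(size=n_in)
    h, c = lstm_cell(x, h, c, W, U, b)
```

The hidden state `h` carries information from every earlier step into the next one, which is how the agent "remembers" past frames.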
The LSTM layer passes its output to the output layers: the actor’s output layer produces the action probabilities from which an action is selected, and the critic’s output layer produces the value estimate V(s) for the current state. The network’s weights are updated by calculating the policy loss for the actor and the value loss for the critic, backpropagating those losses, and applying the resulting gradients to the global network.
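The two losses can be written in a few lines. This is a minimal per-step sketch in plain Python: the policy loss pushes up the log-probability of actions with a positive advantage (and down for a negative one), and the value loss is the squared error of the critic’s estimate. The entropy bonus is an extra term the original A3C paper adds to encourage exploration; the weight `entropy_beta=0.01` is a commonly used value, but treat it as an assumption.

```python
import math


def a3c_losses(probs, action, advantage, entropy_beta=0.01):
    """Per-step losses for one actor-critic update.

    probs:     the actor's action probabilities pi(.|s_t)
    action:    index of the action the actor actually took
    advantage: n-step return R_t minus the critic's V(s_t)
    """
    # Actor: minimizing this raises log pi(a|s) when the advantage is positive
    # and lowers it when negative. The advantage is treated as a constant here,
    # so this term does not train the critic.
    policy_loss = -math.log(probs[action]) * advantage
    # Entropy bonus: subtracting it discourages the policy from collapsing onto
    # one action too early, which keeps the agent exploring.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    policy_loss -= entropy_beta * entropy
    # Critic: squared error between the observed return R_t and its estimate
    # V(s_t); that difference is exactly the advantage.
    value_loss = 0.5 * advantage ** 2
    return policy_loss, value_loss
```

In a full implementation, these per-step losses are summed over the rollout, combined (often as the policy loss plus a scaled value loss), and backpropagated through the shared network before the gradients are applied to the global weights.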
A3C has become a go-to algorithm in deep reinforcement learning. It has proven successful in many arcade game environments and has outperformed existing algorithms such as Deep Q-Learning and its modified variants. Reinforcement learning has seen big advancements, a notable achievement being AlphaGo, an AI that beat the world’s best players at the ancient board game of Go. From here on, reinforcement learning is going to grow faster than ever, and soon we may be able to teach machines to do almost anything.
If you enjoyed this article, please like, share, and comment with any cool advancements you’ve seen in the reinforcement learning field! As always, I hope you learned something new and had fun reading. Thanks!!!