Published in Nerd For Tech

In the previous post, I showed how to solve one of the PySC2 minigames using the Actor-Critic Reinforcement Learning algorithm. For a minigame with a simple structure, the agent does not need human expert data for behavior cloning. In an actual 1 versus 1 game, however, the training time increases exponentially due to the enormous number of states and actions. In this post, I will verify that the AlphaStar model can actually mimic human play using human replay data.


The simplest method for Behavior Cloning is to use the difference between the network's output action at a replay state and the replayed action as the loss value. First, let's run an experiment in OpenAI Gym's LunarLander-v2 environment, which is simple, to see whether this method actually works with Actor-Critic Reinforcement Learning.

The code and dataset for this post can be downloaded from Google Drive.

First, I change the environment and network size of TensorFlow's official Actor-Critic example code and train it until the reward reaches 200. I then use the trained model to create expert data with the code below.

Code for collecting the expert data of A2C for Behavior Cloning
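As a sketch of what this collection step does, the rollout below records (state, action) pairs from a trained policy and stores them as .npy files. The policy and environment here are toy stand-ins, not the actual LunarLander-v2 model from the notebook:

```python
# Sketch of expert-data collection (all names are illustrative; the actual
# notebook rolls out the trained TF Actor-Critic model in LunarLander-v2).
import os
import tempfile
import numpy as np

def collect_expert_episode(policy, env_step, initial_state, max_steps=200):
    """Roll out the trained policy and record (state, action) pairs."""
    states, actions = [], []
    state = initial_state
    for _ in range(max_steps):
        action = policy(state)            # expert action from the trained model
        states.append(state)
        actions.append(action)
        state, done = env_step(state, action)
        if done:
            break
    return np.array(states), np.array(actions)

# Toy stand-ins for the trained model and the environment:
toy_policy = lambda s: int(s.sum() > 0)
def toy_step(s, a):
    s2 = s + (1 if a else -1) * 0.1
    return s2, abs(s2.sum()) > 1.0

states, actions = collect_expert_episode(toy_policy, toy_step, np.zeros(4))

# The post stores the collected pairs as .npy files for later loading:
out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "expert_states.npy"), states)
np.save(os.path.join(out_dir, "expert_actions.npy"), actions)
```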

Please try to understand every part of the original code, because I reuse it to build the Supervised Learning part. Next, the loss is calculated from the difference between the expert action and the network's action.

Supervised loss for A2C
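The idea can be sketched as a sparse cross-entropy between the expert's action and the actor's action probabilities. This is a NumPy illustration, not the notebook's TensorFlow code:

```python
# Behavior-cloning loss sketch: mean negative log-likelihood of the
# expert actions under the current policy. No critic term is involved.
import numpy as np

def supervised_loss(action_logits, expert_actions):
    """Sparse categorical cross-entropy from logits to expert action indices."""
    # Numerically stable softmax over the action dimension.
    z = action_logits - action_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(expert_actions)), expert_actions]
    return -np.log(picked + 1e-8).mean()

logits = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
expert = np.array([0, 1])
loss = supervised_loss(logits, expert)  # small: policy already agrees here
```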

Unlike the Reinforcement Learning loss, there is no reward-based loss for the critic network. You can start Supervised Learning by running the cell below in the shared Jupyter Notebook file.

The main loop of A2C for Behavior Cloning

The train_supervised_step function corresponds to the train_step function of the original code. After behavior cloning finishes, you can continue training the agent with the Reinforcement Learning method.

The train_supervised_step function of A2C for behavior cloning
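A minimal stand-in for such a step, using a linear softmax policy instead of the actual TensorFlow network, might look like this (all names and hyperparameters are illustrative):

```python
# One SGD step on the behavior-cloning loss for a toy linear softmax policy.
import numpy as np

def train_supervised_step(W, states, expert_actions, lr=0.1):
    """Update W to reduce cross-entropy to the expert actions; return loss."""
    logits = states @ W                                   # (N, num_actions)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    onehot = np.eye(W.shape[1])[expert_actions]
    grad = states.T @ (probs - onehot) / len(states)      # dL/dW for softmax CE
    W -= lr * grad
    loss = -np.log(probs[np.arange(len(expert_actions)), expert_actions]).mean()
    return W, loss

rng = np.random.default_rng(0)
W = np.zeros((8, 4))
states = rng.normal(size=(32, 8))
expert = rng.integers(0, 4, size=32)
losses = []
for _ in range(50):                     # the main loop repeats this step
    W, loss = train_supervised_step(W, states, expert)
    losses.append(loss)
```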

The run_supervised_episode function corresponds to the run_episode function, which collects data from the actual environment. Instead, this function loads expert data for Supervised Learning from the .npy files we created earlier.

The run_supervised_episode function of A2C for Behavior Cloning
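A sketch of the idea, assuming illustrative file names of the form states_&lt;id&gt;.npy / actions_&lt;id&gt;.npy (the notebook's actual paths may differ):

```python
# Load one recorded expert episode from disk instead of stepping the env.
import os
import tempfile
import numpy as np

def run_supervised_episode(episode_dir, episode_id):
    """Return the (states, actions) arrays of one stored expert episode."""
    states = np.load(os.path.join(episode_dir, f"states_{episode_id}.npy"))
    actions = np.load(os.path.join(episode_dir, f"actions_{episode_id}.npy"))
    assert len(states) == len(actions)
    return states, actions

# Round-trip demo with temporary files:
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "states_0.npy"), np.zeros((5, 8)))
np.save(os.path.join(tmp, "actions_0.npy"), np.arange(5))
s, a = run_supervised_episode(tmp, 0)
```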

We can now use the newly trained Supervised Learning model to guide Reinforcement Learning. First, record the action logits of the model we are going to train and those of the supervised model together.

Run Reinforcement Episode with Supervised Learning model
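Conceptually, the rollout pairs up logits from both networks at every visited state, so their divergence can be penalized later. Names here are illustrative toy stand-ins for the real networks:

```python
# Record action logits from both the model being trained and the frozen
# supervised (behavior-cloned) model over the same states.
import numpy as np

def run_episode_with_teacher(student, teacher, states):
    """Return paired action logits from both networks over the same states."""
    student_logits = np.stack([student(s) for s in states])
    teacher_logits = np.stack([teacher(s) for s in states])
    return student_logits, teacher_logits

# Toy linear policies standing in for the real networks:
rng = np.random.default_rng(1)
W_student, W_teacher = rng.normal(size=(2, 8, 4))
states = rng.normal(size=(10, 8))
s_logits, t_logits = run_episode_with_teacher(
    lambda s: s @ W_student, lambda s: s @ W_teacher, states)
```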

KL Divergence can be used to measure the difference between the two sets of action logits, because each defines a probability distribution over actions.

Computing KL Loss
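A NumPy sketch of the KL term between the two sets of logits:

```python
# KL(teacher || student) computed from raw action logits.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(teacher_logits, student_logits):
    """Batch-mean KL divergence from the supervised model to the trained one."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return (p * (np.log(p + 1e-8) - np.log(q + 1e-8))).sum(axis=-1).mean()

same = kl_divergence(np.array([[1.0, 2.0]]), np.array([[1.0, 2.0]]))
diff = kl_divergence(np.array([[1.0, 2.0]]), np.array([[2.0, 1.0]]))
```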

The calculated KL loss is then added to the RL loss to train the model.

Training Reinforcement Learning with KL loss
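The combined objective can be sketched as follows; the coefficients here are illustrative hyperparameters, not values from the post:

```python
# Combined A2C objective plus a weighted KL term that keeps the policy
# close to the behavior-cloned model. All coefficients are illustrative.
def combined_loss(actor_loss, critic_loss, entropy, kl_loss,
                  critic_coeff=0.5, entropy_coeff=0.01, kl_coeff=0.5):
    return (actor_loss
            + critic_coeff * critic_loss
            - entropy_coeff * entropy      # entropy bonus encourages exploration
            + kl_coeff * kl_loss)          # pulls policy toward the expert model
```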

In this way, the agent learns faster than with Reinforcement Learning alone, as shown in the graph below.

Reward Graph of Supervised A2C

Supervised Learning for StarCraft 2

The LunarLander-v2 environment can be solved well enough with Reinforcement Learning alone. For a game like StarCraft 2, however, with its complicated observation and action spaces, it is common to add Supervised Learning to shorten the training time. In LunarLander-v2, expert data can be generated by a model trained with Reinforcement Learning, and the loss is calculated for only one action. In StarCraft 2, humans need to play the game directly to collect data, and both the action type and all 13 action arguments need to be considered in the loss.

  1. Replay file of Terran vs Terran on the Simple64 map using a two-Barracks Marine rush
  2. Supervised Training code for PySC2

After downloading the replay file, you need to convert it to the hkl file format; see the detailed instructions.

The code below uses One-Hot Encoding, Categorical Cross-Entropy, and L2 Regularization to calculate the Supervised Learning loss in PySC2, similar to the LunarLander-v2 case.

The loss function of PySC2 Supervised Learning
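A simplified sketch of such a loss, with toy probability tensors standing in for the network outputs; the real model has an action-type head plus up to 13 argument heads:

```python
# PySC2 supervised loss sketch: categorical cross-entropy on the one-hot
# action type, plus cross-entropy per action argument, plus L2 regularization.
import numpy as np

def categorical_ce(probs, onehot):
    """Mean categorical cross-entropy between predictions and one-hot targets."""
    return -(onehot * np.log(probs + 1e-8)).sum(axis=-1).mean()

def pysc2_supervised_loss(action_type_probs, action_type_onehot,
                          arg_probs_list, arg_onehot_list,
                          weights, l2_coeff=1e-4):
    loss = categorical_ce(action_type_probs, action_type_onehot)
    for probs, onehot in zip(arg_probs_list, arg_onehot_list):
        loss += categorical_ce(probs, onehot)          # one term per argument
    loss += l2_coeff * sum((w ** 2).sum() for w in weights)  # L2 regularization
    return loss

# Toy example: one action-type head (3 types) and one argument head (2 values).
type_probs = np.array([[0.7, 0.2, 0.1]])
type_onehot = np.array([[1.0, 0.0, 0.0]])
arg_probs = [np.array([[0.5, 0.5]])]
arg_onehot = [np.array([[1.0, 0.0]])]
loss = pysc2_supervised_loss(type_probs, type_onehot, arg_probs, arg_onehot,
                             weights=[np.ones((2, 2))])
```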

However, in StarCraft 2, even if a human issues a specific action type, it may not be executed unless it is available in the current game state. Therefore, action types are filtered via the available_actions feature. The action arguments are filtered likewise, because each action type needs only a limited set of arguments.
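One common way to implement this filtering is to add negative infinity to the logits of unavailable actions before the softmax, so the agent can neither pick them nor be penalized toward them. This is a sketch, not the post's exact code:

```python
# Mask unavailable action logits so they get zero probability after softmax.
import numpy as np

def mask_unavailable(action_logits, available_actions, num_actions):
    """Set logits of actions outside available_actions to -inf."""
    mask = np.full(num_actions, -np.inf)
    mask[available_actions] = 0.0
    return action_logits + mask

logits = np.array([1.0, 2.0, 3.0, 4.0])
masked = mask_unavailable(logits, available_actions=[0, 2], num_actions=4)
probs = np.exp(masked - masked.max())
probs /= probs.sum()                    # unavailable actions get probability 0
```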

The following graph shows the Supervised Learning loss for StarCraft 2. Due to the complexity of the action and observation spaces, training takes a very long time.

Training loss graph of Starcraft 2

The following video shows the evaluation of a model trained for around 5 days on about 1,000 expert replays. We can confirm that the agent imitates human behavior to some extent.

Evaluation of Supervised Learning of Starcraft 2


In this post, we investigated the Supervised Learning method using expert data in Actor-Critic Reinforcement Learning for the case of StarCraft 2. Considering that almost all real environments outside the laboratory are complicated, human data and loss calculation over multiple actions should be considered essential for developing high-performance agents.





Dohyeong Kim

I am a Deep Learning researcher. Currently, I am trying to make an AI agent for various situations such as MOBA, RTS, and Soccer games.
