Learning Multi-Objective Games Using Bottom-Up Reinforcement Learning

Individual Images borrowed from Google Images

Generally, in Reinforcement Learning, we get delayed rewards. For example, consider the game “Assault” from the OpenAI Gym Environment. We get rewards from the environment as follows:

When our agent shoots and the bullet hits an enemy ship, we get a reward of 21. But, the action of shooting is performed much earlier in the game, and the reward for performing the action is received much later. Due to this, it takes longer for an agent to learn when to shoot so that it’ll hit the target.

One of the ideas behind Bottom-Up Reinforcement Learning is to provide denser rewards to the agent. In the context of the previous example, the agent would receive an immediate, relatively positive reward (the reward value may not itself be positive, but its magnitude is higher) when it shoots at the correct time. To achieve this, we use a concept called Inverse Reinforcement Learning.

Also, there are many smaller objectives in Reinforcement Learning which an agent has previously learned. The other motive of Bottom-Up Reinforcement Learning is to reuse these objectives in the training process of a multi-objective game. For example, if an agent has learned how to navigate in one environment and shoot in another environment, we wish to reuse these learned objectives into a game which involves both navigating and shooting. This can be achieved by transferring the reward functions of the smaller objectives into the current game.

To assess the impact of using Bottom-Up Reinforcement Learning, we test two hypotheses:

1. Learning the reward function of one game, and transferring it to another game with similar action space.

2. Learning the reward functions of two games with potentially different action spaces and transferring them to another game which is a superset of the union of these action spaces.

Inverse Reinforcement Learning

Andrew Ng and Stuart Russell [1] define Inverse Reinforcement Learning (IRL) as follows.

Given:

  1. Measurements of an agent’s behavior over time, in a variety of circumstances;
  2. If needed, measurements of the sensory inputs to that agent;
  3. If available, a model of the environment.

Determine: the reward function being optimized.

IRL is about learning from humans. Argall et al. (2009) describe IRL as a form of Imitation Learning, i.e., learning from demonstrations. Imitation-learning methods seek to learn policies from expert demonstrations, and IRL accomplishes this by first inferring the expert’s reward function [4].

In simpler words, IRL can be used for two tasks:

  1. Inferring the optimal policy by observing the (State, Action) pair of an expert.
  2. Inferring the reward function by creating a mapping of (State, Action) pair to a reward.

How can we use IRL for obtaining denser rewards?

Assigning rewards to an agent for its actions is a difficult task, as it has to be done manually and must precisely represent the task. For example, rewards in the game “Assault” are very sparse: you receive a reward of 21 only when you destroy a ship; otherwise, you receive nothing. IRL instead learns the reward assignment directly from the given expert data. Using IRL, we can make the sparse rewards denser by assigning a reward to each (State, Action) pair.

We use Adversarial Inverse Reinforcement Learning to learn the reward function from the expert demonstrations and to test our hypotheses in this project.

Adversarial Inverse Reinforcement Learning (AIRL)

Borrowed from https://people.eecs.berkeley.edu/~justinjfu/

The keyword “Adversarial” in the name of the algorithm reflects the training methodology which is analogous to that of a Generative Adversarial Network (GAN). The algorithm is described below in steps:

  1. The first step, as in any inverse RL algorithm, is to collect the expert trajectories (demonstrations) of the game we want to learn from.
  2. Next, initialize the policy network and the reward network. The policy network acts as a pseudo-generator, while the reward network acts as a discriminator. Their roles are explained in the following steps.
  3. The policy network generates trajectories by executing the policy it has learned so far through optimization. The learned policy maps a given state to a suitable action to take in that state.
  4. The reward network, acting as a discriminator, tries to distinguish between the trajectories generated by the policy in step 3 and the expert trajectories collected in step 1. This classification task is trained using binary logistic regression.
  5. The reward value to be maximized for a given state and action is calculated using the trained discriminator.
  6. The policy is optimized with respect to the reward value obtained in step 5, which in turn depends on the weights of the reward network trained in step 4. The policy is optimized using the Trust Region Policy Optimization (TRPO) [5] algorithm.
  7. Repeat steps 3–6 in a loop until convergence.

One of the reasons for choosing AIRL over other inverse reinforcement learning algorithms is the ease of recovering the reward function learned by the network.
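This recovery is straightforward because the AIRL discriminator has the particular form D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a|s)), so the learned reward log D - log(1 - D) simplifies to f(s, a) - log π(a|s). A minimal numerical sketch of that identity (the inputs below are toy values, not project outputs):

```python
import numpy as np

def airl_reward(f_value, log_pi):
    """Recover the AIRL reward from the discriminator:
    D = exp(f) / (exp(f) + pi), and log D - log(1 - D) = f - log pi."""
    d = np.exp(f_value) / (np.exp(f_value) + np.exp(log_pi))
    return np.log(d) - np.log(1.0 - d)

# the recovered reward equals f - log pi (toy values)
f, log_pi = 1.3, -0.7
print(np.isclose(airl_reward(f, log_pi), f - log_pi))  # True
```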

Expert Data

(Left) Without expert demonstrations, the car has several options to go (Right) Car learns from expert demonstrations what to do

This is the first step in training any IRL model.

  1. The expert data for the Atari games (Assault and DemonAttack) was obtained from a pre-trained model open-sourced by Tensorpack on their GitHub page. The model was trained using the Asynchronous Advantage Actor-Critic (A3C) [6] algorithm.
  2. For the variants of the game Doom, we used an expert model trained with the Proximal Policy Optimization (PPO) [7] algorithm provided in the OpenAI baselines.

Adaptation of AIRL to Atari Games

The AIRL algorithm, as proposed in the original paper, was evaluated on continuous-control tasks such as the pendulum and ant, but the games chosen to test our hypotheses were Atari games (Assault and DemonAttack) and variants of the game Doom.

Below is a summary of the modifications made to the architecture to adapt it to a discrete action space.

Supporting stacked frames

The paper “Playing Atari with Deep Reinforcement Learning” by Mnih et al. stacks a history of 4 frames to form the input to the Deep Q-Network. The experiments we carried out to test our hypotheses involve training a Q-function using IRL, which implies that the IRL architecture must support stacked frames. We modified the RLLAB Gym Environment to include this.
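A minimal sketch of such a frame stack, assuming 84 x 84 grayscale frames and a history length of 4 (the class and names here are illustrative, not the actual RLLAB modification):

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the last k frames and stack them along the channel axis."""
    def __init__(self, k=4, shape=(84, 84)):
        # start with k blank frames so the stack always has k entries
        self.frames = deque([np.zeros(shape, np.float32)] * k, maxlen=k)

    def push(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=-1)  # shape (84, 84, k)

stack = FrameStack()
obs = stack.push(np.ones((84, 84), np.float32))
print(obs.shape)  # (84, 84, 4)
```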

Downsampling images

To train the reward network, we also included a function to downsample images.

Image preprocessing involves converting the image to grayscale and resizing it to 84 x 84 (IMG_SIZE = 84). Additionally, CycleGAN training applies a Gaussian blur while preprocessing images.
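A numpy-only sketch of this preprocessing step (the actual project code may use OpenCV; the nearest-neighbour resize below is an illustrative stand-in):

```python
import numpy as np

IMG_SIZE = 84  # target height and width, as stated above

def preprocess(frame, size=IMG_SIZE):
    """Convert an RGB frame to grayscale and resize it to size x size."""
    # luminance-weighted grayscale: (H, W, 3) -> (H, W)
    gray = frame @ np.array([0.299, 0.587, 0.114])
    # nearest-neighbour resize via index maps
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.float32)

frame = np.random.randint(0, 256, (210, 160, 3))  # Atari-sized RGB frame
print(preprocess(frame).shape)  # (84, 84)
```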

Policy Network

The paper experimented on games with continuous control and low-dimensional observation spaces, and the policy network used in the AIRL paper was a Gaussian MLP policy. To account for the high-dimensional image states of our games, we needed to modify the policy network into a Convolutional Neural Network.

We borrowed the network architecture from [9]. The details of the network are as follows:

Graphic visualization of the policy network
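As a sanity check on the dimensions, the spatial sizes produced by the convolution specs of the architecture in [9] (which we assume here: 32 filters at 8x8 stride 4, 64 at 4x4 stride 2, 64 at 3x3 stride 1) can be computed directly:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a VALID (no-padding) convolution."""
    return (size - kernel) // stride + 1

# (filters, kernel, stride) for each conv layer, per [9]
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
size = 84  # input frames are 84 x 84
for filters, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    print(f"{filters} filters, {kernel}x{kernel}, stride {stride} -> {size}x{size}")

flat = size * size * layers[-1][0]  # features entering the 512-unit FC layer
print(flat)  # 3136
```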

Reward Network

The Reward Network described in the AIRL paper was an MLP with a few fully connected layers and ReLU activations. As with the policy network, we had to modify the Reward Network into a Convolutional Neural Network. This network learns a mapping from (state, action) to reward, which is our main focus, since we want to extract the reward function of a given game using this network.

The details are as follows:

Directly concatenating the state and action as input to the network would not work in our case, because the dimensionality of the state is much higher than that of the action. To solve this, we used an architecture similar to the policy network, with a few fundamental changes:

The image/state is passed through a few [Convolution + ReLU] blocks that reduce it to a low-dimensional (latent) representation, and the one-hot-encoded action is then concatenated to this flattened latent vector. The combined vector is passed through a fully connected network with a linear output layer.

This makes it easier to understand:
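A shape-level sketch of that concatenation step (the latent size and action count below are illustrative assumptions, not values measured from the project):

```python
import numpy as np

LATENT_DIM = 3136  # flattened conv-feature size (assumed)
N_ACTIONS = 7      # size of the discrete action set (assumed)

def reward_net_input(latent, action_index, n_actions=N_ACTIONS):
    """Concatenate the flattened conv latent with a one-hot action vector
    before the fully connected layers, as described above."""
    one_hot = np.eye(n_actions)[action_index]
    return np.concatenate([latent, one_hot])

joint = reward_net_input(np.zeros(LATENT_DIM), action_index=2)
print(joint.shape)  # (3143,)
```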

Baselines and Regularization

As described in the AIRL paper, we need to train both of these networks together. The policy network is optimized using TRPO (Trust Region Policy Optimization). For better training of the policy with TRPO, baselines are used.

The authors use a linear regressor as the baseline, but it did not work in our case, so we used a convolutional regressor instead. It takes the state as input and outputs a single value, which is used along with the reward to compute the advantage. Using this advantage instead of the raw reward helps TRPO perform better. We performed experiments to test the impact of using baselines.

A general issue in the AIRL setup is that the gradient values tend to become too small while training the pipeline, and the networks stop learning; there is also a high possibility of the loss hitting NaN during training. To avoid this, we use baselines (for TRPO) along with a regularizer on the reward network, which constrains the network and helps maintain a healthy gradient flow. We used a simple L2 regularizer on the reward network.
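A simplified sketch of the two stabilizers just described, i.e. the advantage (learned reward minus the baseline's value prediction) and the L2 penalty on the reward network (the discounted-return bookkeeping of full TRPO is omitted here):

```python
import numpy as np

def advantage(rewards, baseline_values):
    """Advantage = learned reward minus the baseline's value prediction.
    (A simplification: full TRPO also uses discounted returns.)"""
    return np.asarray(rewards) - np.asarray(baseline_values)

def l2_penalty(weights, coef=0.01):
    """Simple L2 regularizer on the reward-network weights."""
    return coef * sum(float(np.sum(w ** 2)) for w in weights)

print(advantage([1.0, 2.0], [0.5, 1.5]))  # [0.5 0.5]
print(l2_penalty([np.ones((2, 2))]))      # ~0.04
```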

The First Hypothesis

Can we train a Deep Q-Network on DemonAttack, a game similar to Assault in action space, using the rewards obtained from the Reward Network trained on Assault?

Establishing Visual Similarity between Assault and DemonAttack

An important challenge in this herculean task was to map the observation space of one game (Assault) to the other (DemonAttack). To overcome this, we relied on a 2017 paper out of Berkeley, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks [8].

It was interesting because it does not require paired training data: while sets of x and y images are still required, they need not correspond directly to each other. In other words, to translate between sketches and photos, you still need to train on a set of sketches and a set of photos, but the sketches need not depict the exact scenes in your photos.

This was crucial to our use case: since it is hard to find or create a paired mapping between these games, the unsupervised capabilities of CycleGAN seemed useful.

The CycleGAN

CycleGAN uses two generators and two discriminators. In our use case, one generator tries to convert an observation/image from Assault to DemonAttack, while the other converts from DemonAttack to Assault. Each discriminator, meanwhile, classifies whether an image is a synthesized generator output or an original.

For a more in-depth look at how CycleGAN works, its architecture, and its objective function, refer to this post.
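The part of the objective that makes this unpaired training possible is the cycle-consistency loss: translating a frame Assault -> DemonAttack -> Assault should approximately reproduce the original. A toy sketch with placeholder generators (identity functions stand in for the real networks):

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle-consistency: mean |g_ba(g_ab(x)) - x| over pixels."""
    return float(np.mean(np.abs(g_ba(g_ab(x)) - x)))

identity = lambda frame: frame  # placeholder generators
x = np.random.rand(84, 84)
print(cycle_consistency_loss(x, identity, identity))  # 0.0
```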

Instead of using CycleGAN directly, we defined a custom preprocessing step that uses OpenCV to give more attention to the objects of importance.

Preprocessing images before passing to CycleGAN

For CycleGAN, we relied on the authors’ implementation, where we replaced the horse2zebra folder with our preprocessed sets of images and followed the instructions given in the [repository](Link to Code).

Converting states of one game to another using CycleGAN


1. To check which variant of AIRL gave higher returns.

To extract rewards from the policy, we compared three variants of AIRL: AIRL with no baseline, AIRL with a baseline, and AIRL with a regularized reward network plus baseline. The baseline was used under the assumption that it would reduce variance in training, so that the resulting policy network would output consistent, dense rewards. For the regularized reward network, we used L2 regularization with a coefficient of 0.01 in this comparison. One problem we faced was that training the policy network drove many gradients to zero; using baselines helped stabilize the training and reduced the number of vanishing gradients.

The first graph was in line with our premise that a highly regularized network gives consistent and high rewards compared to the others.

Policy Returns from different variants of AIRL model

As expected, the second graph showed that the training time for AIRL with baselines was higher than for AIRL with no baseline, which we attribute to the extra time the baseline and the regularized network take to reduce variance.

Time required per iteration for each variant

The use of baselines helps stabilize the training of the policy network, since the gradients are otherwise close to zero and the network does not learn. This stability, in turn, helps learn a better reward function.

2. Comparison of Naive DQN v/s our DQN using Reward Combination

In this experiment, we compare the environment rewards obtained by a naive DQN against a DQN trained on AIRL rewards extracted from expert trajectories of the trained A3C Assault policy. Before this comparison, as a sanity check, we trained DQNs on rewards from AIRL with no baseline, with a baseline, and with a regularized reward network (with baseline), to check whether the results are in line with our initial experiments.

Comparison of scores obtained from each variant per epoch

The above graph shows the environment rewards of DQNs trained on the three variants of AIRL; the results are consistent with our initial experiments: AIRL with a regularized reward network (with baseline) gives higher returns and more stable training (fewer oscillations).

Next, we compared against the naive DQN.

Comparison with Naive DQN

The above graph shows no significant improvement over the vanilla DQN. However, on multiple training runs we saw our DQN outperform the vanilla DQN, though this behavior was not consistent.

3. Impact of using CycleGAN

CycleGAN is used to map the input space of Assault to DemonAttack, since our IRL network is trained on Assault images. In this experiment, we trained one DQN on rewards obtained by passing the CycleGAN output to the reward network, and another that passes the DemonAttack image to the network directly.

Training DQN using with and without Visual Similarity component

The above graph shows little improvement from using CycleGAN, but we believe this is because the two games are already very visually similar. For dissimilar games, such a conversion could be of good use.

The Second Hypothesis

In this hypothesis, we test whether two reward functions, trained on two different games, can be used to train an agent on a game whose action set is a superset of the two games’ action sets.

Our intuition behind proposing this is simple: If a child learns to play a game which involves driving (e.g., NFS), and another game which involves shooting (e.g., Counter-Strike), then it becomes easier for the child to learn a game that involves both of them (e.g., PUBG).

Recall the method used for the first hypothesis: we used Adversarial IRL to extract a reward function from the expert demonstrations of a game, and then used that reward function in place of the environment reward to train an agent on another game. For our second hypothesis, we use a similar method. But there is a problem with applying the method directly: we are now dealing with two reward functions from two different games, meaning there is more than one source of reward for training the agent.

Reward Combination

The question now is the following:

How do we ‘combine’ the reward functions of the two games so that we can use this ‘combined’ reward function on the bigger game?

To solve this issue, we propose a simple algorithm.


  • G1 and G2 are two games with state-action representations (S1, A1) and (S2, A2), where A1 is the action set of G1 and A2 is the action set of G2.
  • R1 and R2 are reward functions extracted from expert demonstrations of G1 and G2.
  • G3 is the game on which we wish to train an agent using rewards from R1 and R2, with state-action representation (S3, A3), where A3 ⊇ A1 ∪ A2.
  • π is the policy we wish to train on G3.

The Algorithm

Using the reward we get from this algorithm, we update π.

This algorithm follows from simple intuition. If the action chosen by the policy exists in only one of the games, the reward function of that game is used. If it exists in both games, we take the average of the two reward functions: each reward function abstracts the effect of taking that action, so when an action is present in both games we give equal weight to both rewards. Similarly, if the action is present in neither game, we use the environment reward, so that the effect of that action in that state is learned from the ground up.
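The rule above can be sketched in a few lines (the names here are illustrative; R1 and R2 would be the trained reward networks):

```python
def combined_reward(state, action, env_reward, A1, A2, R1, R2):
    """Reward combination: use R1 or R2 when the action belongs to only
    one source game, average them when it belongs to both, and fall back
    to the environment reward when it belongs to neither."""
    in1, in2 = action in A1, action in A2
    if in1 and in2:
        return 0.5 * (R1(state, action) + R2(state, action))
    if in1:
        return R1(state, action)
    if in2:
        return R2(state, action)
    return env_reward

# toy check with constant reward functions
r = combined_reward("s", "shoot", 0.0, {"shoot"}, {"shoot", "move"},
                    lambda s, a: 1.0, lambda s, a: 3.0)
print(r)  # 2.0
```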

Training Procedure

  • We first trained the reward networks for two different games using Adversarial IRL on the expert demonstrations of those two games.
  • We then used these reward networks to fetch rewards for the target game, by passing it each state and applying the reward combination algorithm above to obtain a reward value.
  • Using the reward we obtained above, we then train a Deep Q-Network agent on the target game.
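Concretely, the substitution in the last step amounts to relabelling each transition’s reward before it enters the DQN replay buffer. A sketch of this relabelling (the transition-tuple layout and function name are assumptions, not the actual OpenAI baselines modification):

```python
def relabel_transitions(transitions, reward_fn):
    """Replace the environment reward in each (s, a, r, s', done)
    transition with the value from the combined reward function."""
    return [(s, a, reward_fn(s, a), s2, done)
            for (s, a, _, s2, done) in transitions]

batch = [("s0", 1, 0.0, "s1", False)]
print(relabel_transitions(batch, lambda s, a: 0.5))
# [('s0', 1, 0.5, 's1', False)]
```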


For testing our hypothesis, we used the OpenAI Gym wrapper (Vizdoomgym) for variants of the game Doom.

The two games we used were VizdoomTakeCover and VizdoomDefendLine. Our target game was VizdoomDeadlyCorridor.

1. Policy returns of VizdoomTakeCover and VizdoomDefendLine

As mentioned in the training procedure above, we first trained the two reward networks for these games. The policy returns are plotted as follows:

Policy returns for VizdoomTakeCover (left) and VizdoomDefendLine (right)

The ‘Original Task Average Return’ indicates the average score obtained from the trajectories generated by the policy. It is an indicator of how well the reward function has been trained from the expert demonstrations.

2. Comparison of Naive DQN v/s our DQN using Reward Combination

Finally, using the trained reward networks and the reward combination algorithm discussed above, we train a DQN agent on VizdoomDeadlyCorridor. We used OpenAI baselines for training a DQN, modifying it slightly so that the rewards being used for training are the ones we have calculated, according to the algorithm. For comparison purposes, we also trained a DQN agent, using OpenAI baselines again, but this time we used solely environment rewards for training. A visualization of how the total score increased throughout the training is as follows:

As can be seen, the rewards calculated by the reward combination help the agent gain a head start over the vanilla DQN that uses solely the environment rewards. The average final score on VizdoomDeadlyCorridor during inference using our method was ~240.

After training the reward function for 350 iterations and using it to train a DQN model for 300 iterations, the following episode was played by our DQN model (recorded at 10 fps) and it attained a score of 830 points:


  • For our first hypothesis, we found that the DQN trained on the transferred reward performed equally well compared to a DQN trained purely on environment rewards.
  • For our second hypothesis, we found that training a DQN with rewards obtained from the reward combination of two games led to faster convergence than using environment rewards.

Our Entire Code is committed here


This blog was made possible by the CSCI 599 Deep Learning course at USC, taught by Joseph Lim, and our teaching assistant Ayush Jain. Thanks also to my teammates Abhishek Ananthakrishnan, Sreekar Kamireddy, Tejas Dastane & Varun Rao.

References & Further Reading

  1. Ng and Russell, “Algorithms for Inverse Reinforcement Learning”
  2. Mnih et al, “Playing Atari with Deep Reinforcement Learning”
  3. Yan Duan, Xi Chen, Rein Houthooft, John Schulman, Pieter Abbeel, “Benchmarking Deep Reinforcement Learning for Continuous Control”, Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  4. Fu et al, “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning”, arXiv:1710.11248v2 [cs.LG]
  5. Schulman et al, “Trust Region Policy Optimization”, arXiv:1502.05477v5 [cs.LG]
  6. Mnih et al, “Asynchronous Methods for Deep Reinforcement Learning”, arXiv:1602.01783v2 [cs.LG]
  7. Schulman et al, “Proximal Policy Optimization Algorithms”, arXiv:1707.06347v2 [cs.LG]
  8. Zhu et al, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”, arXiv:1703.10593v6 [cs.CV]
  9. Mnih et al, “Human-level control through deep reinforcement learning”. Nature, 518(7540):529–533, 02 2015.

