BCQ with a GAN

AurelianTactics
Jan 12, 2021 · 2 min read

There’s been a lot of interesting stuff in the field of batch Reinforcement Learning (aka offline RL) since I wrote about implementing BCQ in TensorFlow. In batch RL the agent is trained on an already collected data set and does not interact with the environment. The hope with batch RL is that we can see a data-driven surge in RL like the one supervised learning has experienced. The Berkeley AI Research lab seems to come out with a new batch RL algorithm every month. DeepMind (RL Unplugged) and BAIR (D4RL) have recently released benchmarks and datasets for driving progress in the field. If you’re interested in batch RL, I’d recommend starting with this Medium post by BAIR’s Sergey Levine and browsing the BAIR blog to see some of their research.

Compared to the work others are doing, this post is a bit of a waste, but I’d like to share what I’ve been working on. The gist: I replaced the VAE in BCQ with a GAN, to unimpressive results. Partly inspired by BCQ’s lead author, Scott Fujimoto, discussing BCQ and the idea of using a GAN with it on the TalkRL podcast, I decided to give it a shot.
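For context, BCQ uses its VAE at action-selection time to propose candidate actions that stay close to the behavior policy; the candidates get perturbed by a small correction network and scored by the Q-network. Swapping the VAE for a GAN just means the candidates come from a conditional generator instead of the VAE decoder. Here’s a rough sketch of that step (the module names and signatures are placeholders, not the exact code from my repo):

```python
import torch

@torch.no_grad()
def select_action(state, generator, perturbation, q_net, n_candidates=10, noise_dim=16):
    """BCQ-style action selection with a GAN generator in place of the VAE decoder.

    generator(z, state)          -> candidate action near the behavior policy
    perturbation(state, action)  -> small correction added to the candidate
    q_net(state, action)         -> Q-value used to rank candidates
    """
    # Repeat the state so all candidate actions can be scored in one batch.
    state = state.repeat(n_candidates, 1)
    z = torch.randn(n_candidates, noise_dim)

    # Propose candidates conditioned on the state, then apply BCQ's perturbation model.
    actions = generator(z, state)
    actions = actions + perturbation(state, actions)

    # Greedy selection over the candidates.
    q_values = q_net(state, actions)
    return actions[q_values.argmax()]
```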

I used a Conditional GAN (CGAN), since the generator and discriminator were used to produce actions conditioned on a given state. I also tried Wasserstein loss with gradient penalty (WGAN-GP) and spectral normalization (SN-GAN) as ablations in my experiments. I had decent results on Pendulum and LunarLander, but my BCQ GAN struggled to get any results on BipedalWalker. If, for some reason, you want more details on the experiments, here’s a link to my code and a write-up I did for a class.
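Roughly, the generator takes (noise, state) and outputs an action, while the discriminator scores (state, action) pairs; spectral normalization can be toggled on the discriminator for the SN-GAN variant, and the gradient penalty is the standard WGAN-GP term. This is a minimal PyTorch sketch with illustrative layer sizes, not the exact networks from my repo:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def maybe_sn(layer, use_sn):
    # Wrap a layer in spectral normalization for the SN-GAN ablation.
    return spectral_norm(layer) if use_sn else layer

class Generator(nn.Module):
    def __init__(self, state_dim, action_dim, noise_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, z, state):
        # Condition the generated action on the state (CGAN).
        return self.net(torch.cat([z, state], dim=1))

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256, use_sn=False):
        super().__init__()
        self.net = nn.Sequential(
            maybe_sn(nn.Linear(state_dim + action_dim, hidden), use_sn), nn.LeakyReLU(0.2),
            maybe_sn(nn.Linear(hidden, hidden), use_sn), nn.LeakyReLU(0.2),
            maybe_sn(nn.Linear(hidden, 1), use_sn),  # raw critic score (no sigmoid for WGAN)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def gradient_penalty(disc, state, real_action, fake_action, lam=10.0):
    # WGAN-GP: penalize the critic's gradient norm on interpolated actions.
    eps = torch.rand(real_action.size(0), 1, device=real_action.device)
    interp = (eps * real_action + (1 - eps) * fake_action).requires_grad_(True)
    score = disc(state, interp)
    grads = torch.autograd.grad(score.sum(), interp, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```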

The different lines are me trying different ablations of my BCQ GAN, like combining the Actor-Critic networks into the Generator-Discriminator networks and using Random Network Distillation (RND) to help with action selection (see the sketch below). Nothing really worked that well compared to the original BCQ.
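The RND idea is to use prediction error as a rough out-of-distribution signal: a fixed, randomly initialized target network and a trained predictor are run on (state, action) pairs, and a large prediction error suggests the pair is far from the batch data. The sketch below shows one way to fold that into candidate ranking; the penalty weight and exact usage here are my illustration, not a faithful copy of the experiment.

```python
import torch
import torch.nn as nn

class RNDModule(nn.Module):
    """Random Network Distillation on (state, action) pairs."""
    def __init__(self, state_dim, action_dim, embed_dim=64, hidden=128):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, embed_dim),
            )
        self.target = mlp()      # fixed, randomly initialized network
        self.predictor = mlp()   # trained to match the target on the batch data
        for p in self.target.parameters():
            p.requires_grad_(False)

    def error(self, state, action):
        # High error = the predictor hasn't seen pairs like this, i.e. likely off-batch.
        x = torch.cat([state, action], dim=1)
        return (self.predictor(x) - self.target(x)).pow(2).mean(dim=1)

def rank_candidates(q_net, rnd, state, actions, beta=0.1):
    # state is the current state repeated to match the candidate batch.
    # Penalize candidate actions the RND predictor flags as out of distribution.
    scores = q_net(state, actions).squeeze(-1) - beta * rnd.error(state, actions)
    return actions[scores.argmax()]
```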
