UC Berkeley Reward-Free RL Beats SOTA Reward-Based RL

Synced | Published in SyncedReview | 4 min read | Sep 21, 2020

End-to-end deep reinforcement learning (DRL) is a trending training approach that has proven successful at solving a wide range of complex tasks previously regarded as out of reach. End-to-end DRL is now being applied in domains ranging from real-world and simulated robotics to sophisticated video games. However, as appealing as end-to-end DRL methods are, most rely heavily on reward functions to learn visual features. This means feature learning suffers when rewards are sparse, which is the case in most real-world scenarios.

A new paper from researchers at the University of California, Berkeley addresses this issue with Augmented Temporal Contrast (ATC), a new unsupervised learning (UL) task that learns reward-agnostic visual representations without degrading the control policy.

ATC trains a convolutional encoder to associate pairs of observations separated by a short time interval. A stochastic data augmentation, random shift, is applied to the observations within each training batch. Finally, the augmented observations are encoded into a small latent space, where a contrastive loss is applied.

The ATC architecture consists of four learned components (a code sketch follows the list):

  • A convolutional encoder, which maps each observation into a latent image.
  • A linear global compressor, which produces a small latent code vector.
  • A residual predictor MLP, which acts as an implicit forward model.
  • A contrastive transformation matrix, which matches predicted codes against the target codes of positive observations.

The researchers evaluated ATC on three visually diverse RL benchmarks — the DeepMind control suite (DMControl), Atari games in the Arcade Learning Environment, and DeepMind Lab (DMLab). They also used ATC to enhance both on-policy and off-policy RL algorithms.

In the online setting, ATC matched or outperformed state-of-the-art end-to-end RL across all DMControl and DMLab environments, and in five of the eight Atari games tested.

The team also benchmarked a variety of unsupervised objectives for feature learning, with ATC again matching or outperforming the state-of-the-art unsupervised representation learning algorithm for RL across all three benchmarks.

In the offline setting, the researchers explored ATC’s capability to learn multi-task encoders, demonstrating that features learned by ATC enable efficient RL in both training and testing environments.

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning, with many naive approaches succumbing to exponential sample complexity. While reward-free representation learning offers flexibility and insights for improving deep RL agents, as an unsupervised approach it naturally lacks the reward information that guides the training of supervised, reward-based RL. The proposed ATC marks a significant milestone: it is the first time an agent trained on unsupervised features has matched or outperformed SOTA end-to-end RL.

The paper Decoupling Representation Learning from Reinforcement Learning is on arXiv.

Analyst: Hecate He | Editor: Michael Sarazen; Yuan Yuan

Synced Report | A Survey of China’s Artificial Intelligence Solutions in Response to the COVID-19 Pandemic — 87 Case Studies from 700+ AI Vendors

This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions across 12 pandemic scenarios.

Click here to find more reports from us.

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
