Curious Agents: An Introduction

Dries Smit
InstaDeep
4 min read · May 21, 2023

Welcome to the first entry in a series of posts investigating self-supervised learning in open-ended environments. Some argue that learning in a self-supervised manner is essential if we ever want to create more intelligent agents that can solve real-world problems. We will cover topics such as curiosity, self-supervised learning, BYOL and the JEPA architecture, and investigate some recent successes in the field, such as using curiosity in hindsight. Alongside each post, the code will be open-sourced so you can play around with these ideas yourself.

Plan2Explore. Source.

The success of ChatGPT

Last year (2022), ChatGPT took the world by storm. While the technology behind ChatGPT had been brewing for quite a while, it took combining a large amount of data, compute and human feedback (and, of course, a very talented team) to finally create a chatbot that is truly useful.

ChatGPT is essentially a large transformer model trained to predict the next token in a text corpus. OpenAI’s GPTs (Generative Pre-trained Transformers) are models trained on a large amount of text data sourced from the internet. OpenAI showed that scaling up the model and dataset sizes yields impressive results.
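
To make that training objective concrete, here is a minimal sketch of the next-token prediction (cross-entropy) loss that GPT-style models minimise. The `model` below is a placeholder for any causal language model; this is an illustration of the objective, not OpenAI’s actual training code.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Cross-entropy loss for predicting token t+1 from tokens up to t.

    `model` is assumed to map (batch, seq) token ids to
    (batch, seq, vocab_size) logits -- a stand-in for any causal
    transformer, not a specific implementation.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                     # (batch, seq - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),  # flatten batch and time
        targets.reshape(-1),                   # next-token targets
    )
```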

This is not a new observation. Rich Sutton famously wrote that:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

This has been proven again and again in the AI field. ChatGPT’s training pipeline can be broken down into three steps. First, the base model is fine-tuned on human demonstrations; then a reward model is trained to rank model outputs; and lastly, reinforcement learning is used to adapt the model to output text that maximises the reward model’s score.

Reinforcement Learning from Human Feedback (RLHF). Source.
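
To give a feel for the middle step, reward models are commonly trained with a pairwise ranking loss over human preference comparisons: the preferred response should score higher than the rejected one. The sketch below assumes a `reward_model` callable that returns one scalar score per response; it is illustrative rather than OpenAI’s actual code.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, preferred, rejected):
    """Pairwise ranking loss for training the reward model.

    `preferred` and `rejected` are batches of tokenised responses to the
    same prompts, labelled by human annotators. `reward_model` is assumed
    to return a scalar score per response, i.e. a tensor of shape (batch,).
    """
    score_preferred = reward_model(preferred)
    score_rejected = reward_model(rejected)
    # Push the preferred response's score above the rejected one's.
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```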

This pipeline achieves impressive results. It builds on a broader trend of first training large foundation models (usually with unsupervised learning) and then fine-tuning them to solve particular downstream tasks.

Scaling reinforcement learning

Reinforcement learning has had many success stories over the last couple of years. Deep Q-Networks reached human-level performance on Atari games, while AlphaGo, OpenAI Five and AlphaStar defeated top human players in Go, Dota 2 and StarCraft II.

Recently, DreamerV3 even managed to collect diamonds in Minecraft purely through reward-driven exploration.

Minecraft is a large open-world environment with no singular goal. In the DreamerV3 work, the authors provided a reward structure that incentivises their agent to learn the skills necessary to mine a diamond. But in theory, any number of challenging goals could have been selected.

Minecraft. Source.

DreamerV3 still requires an external reward signal in order to learn. This works for relatively simple tasks in Minecraft but makes it much more difficult to scale to more complicated tasks, such as killing the Ender Dragon, where the reward would be extremely sparse. Furthermore, it becomes much harder to design reward functions for real-world environments such as general computer interfaces or real-world robots.

Therefore, prominent AI researchers such as Sergey Levine and Yann LeCun have argued (here, here and here) for the use of self-supervised learning to create agents that can learn in these more complicated environments. Self-supervised learning, in the context of agent-environment interactions, can be described as learning from raw observations without requiring external rewards. This might seem like a strange claim: how can one learn without a reward function? It turns out that it is possible to define optimisation objectives that encourage the agent to explore the world and model the dynamics it finds in the environment.

Curiosity

One line of research in the self-supervised learning space attempts to instil intrinsic curiosity into agents. Researchers such as Jürgen Schmidhuber have long been proponents of this approach. Similar to the next-token prediction used in the GPT series, world models that predict future observations have been proposed. A simple curiosity-based agent attempts to find observation sequences that the world model cannot accurately predict, while the world model attempts to minimise the error in its predictions. This results in agents that learn to explore their environment in search of new and interesting dynamics.
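
As a rough sketch of the idea (assuming a `world_model(obs, action)` that predicts the next observation; the interface is illustrative, not taken from a specific paper): the agent is paid an intrinsic reward equal to the world model’s prediction error, while the world model is trained to shrink that same error.

```python
import torch
import torch.nn.functional as F

def curiosity_reward(world_model, obs, action, next_obs):
    """Intrinsic reward: how poorly the world model predicts the next observation."""
    with torch.no_grad():  # the reward signal should not update the world model
        predicted = world_model(obs, action)
    # Larger prediction error => more "surprising" transition => higher reward.
    return F.mse_loss(predicted, next_obs, reduction="none").mean(dim=-1)

def world_model_loss(world_model, obs, action, next_obs):
    """The world model is trained to minimise the very error the agent seeks out."""
    predicted = world_model(obs, action)
    return F.mse_loss(predicted, next_obs)
```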

Jürgen Schmidhuber’s curiosity archive.

In the next post, we will start exploring these ideas further by training our first agent inside the MountainCar environment. Then we will move on to more state-of-the-art techniques and investigate what the future might hold for this exciting field.
