Curiosity, Reward Sign Bias, and Political Orientation in Reinforcement Learning

robin ranjit singh chauhan
5 min read · Jan 12, 2019

A common metaphor for explaining the exploit/explore tradeoff in bandit problems is that you are in a new town (say Montreal) and have tried two of its cafes so far. For your next meal, you could either:

  • Exploit: go to a cafe you know you like, with reward=0.5, or
  • Explore: try a new cafe in the hopes you like it even more, maybe reward=1.0, but at the risk you may like it less, with reward=0.1.

So we talk about things like bandits with optimistic default values, which encourage exploration and help minimize regret. Optimistic default values make sense in this scenario because the cost of choosing a bad cafe is small: the worst thing that can happen is that you don’t love your sandwich. To ensure every cafe gets explored at least once, the default value should be at least as large as the largest actual reward (and to keep exploring even after finding the best cafe, it should be strictly larger).
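
To make this concrete, here is a minimal sketch (not from any particular paper or library) of a purely greedy three-cafe bandit where only the default value changes; the reward means, noise level, and step size are illustrative assumptions.

```python
import numpy as np

# A minimal sketch: a purely greedy 3-armed bandit over cafes, where only
# the default (initial) value changes. Reward means, noise, and step size
# are illustrative assumptions.
rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.3, 0.8])          # unknown cafe quality

def run_bandit(default_value, steps=500, alpha=0.1):
    q = np.full(3, float(default_value))        # value estimates start at the default
    total = 0.0
    for _ in range(steps):
        a = int(np.argmax(q))                   # greedy: no explicit exploration
        r = rng.normal(true_means[a], 0.1)      # noisy reward from the chosen cafe
        q[a] += alpha * (r - q[a])              # constant-step-size update
        total += r
    return total

# An optimistic default (at least the best achievable reward) makes even a
# greedy agent try every cafe before settling; a default of 0 can latch onto
# the first mediocre cafe it samples and never discover a better one.
print("optimistic:", run_bandit(default_value=1.0))
print("default 0 :", run_bandit(default_value=0.0))
```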

But consider this variant: you are a traveller lost in the European countryside in Medieval times. You are extremely hungry and have a weak immune system. You have coins, and there are a few restaurants to choose from, but you know most of them do not have good hygiene and you cannot afford to get sick. One bad choice could result in a catastrophic outcome that you are unlikely to ever recover from in terms of total regret (death; reward=-100 and end of episode).

Choose carefully, these Medieval restaurants are dangerous!

Do you want optimistic default values to encourage exploration in this scenario? In fact, pessimistic default values may make more sense to avoid regret: if you find even one restaurant that does not kill you (reward=0.1) — just keep going there!
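
Here is the same greedy sketch adapted to the Medieval-restaurant variant; the reward values, and the assumption that the agent already knows one safe restaurant, are illustrative choices meant to mirror the story above.

```python
import numpy as np

# Illustrative assumptions: one known-safe restaurant (reward 0.1) and three
# unknown ones that each turn out to be lethal (reward -100).
true_rewards = np.array([0.1, -100.0, -100.0, -100.0])

def run(default_value, steps=100, alpha=0.1):
    q = np.full(4, float(default_value))
    q[0] = 0.1                         # we already found one restaurant that doesn't kill us
    total = 0.0
    for _ in range(steps):
        a = int(np.argmax(q))          # greedy choice
        r = true_rewards[a]
        q[a] += alpha * (r - q[a])
        total += r
    return total

# Optimistic defaults (1.0) lure the agent into trying every lethal
# restaurant at least once; pessimistic defaults (-1.0) keep it at the one
# known-safe option, which is exactly the low-regret behavior we want here.
print("optimistic total reward:", run(default_value=1.0))
print("pessimistic total reward:", run(default_value=-1.0))
```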

Curiosity and Reward Bias in Atari

In the wonderful paper “Large-Scale Study of Curiosity-Driven Learning” (Burda et al., 2018), presented at the NeurIPS 2018 Deep RL Workshop, the authors found that agents imbued with “curiosity” (rewards given for encountering unpredictable states) naturally did well in many Atari games.
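
As a rough sketch of the idea (in the spirit of curiosity-driven learning, not the paper’s actual implementation), the intrinsic reward can be taken as the prediction error of a learned forward dynamics model; the tiny linear model, the dimensions, and the learning rate below are all simplifying assumptions.

```python
import numpy as np

class ForwardModelCuriosity:
    """Intrinsic reward = error of a forward model predicting the next state.

    A deliberately tiny, linear stand-in for the learned dynamics models used
    in curiosity-driven RL; dimensions and learning rate are arbitrary choices.
    """
    def __init__(self, state_dim=8, action_dim=4, lr=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))
        self.lr = lr

    def reward(self, s, a_onehot, s_next):
        x = np.concatenate([s, a_onehot])
        err = s_next - self.W @ x                # how surprising was this transition?
        self.W += self.lr * np.outer(err, x)     # one gradient step on the squared error
        return float(np.mean(err ** 2))          # surprise is the only reward

# A transition the agent sees over and over (e.g. the post-death reset to a
# familiar start state) quickly becomes predictable, so its reward shrinks.
cur = ForwardModelCuriosity()
rng = np.random.default_rng(1)
s = rng.normal(size=8)
a = np.eye(4)[0]
for t in range(5):
    print(t, round(cur.reward(s, a, s), 4))      # same transition every time
```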

54 environments investigated in the paper. From https://pathak22.github.io/large-scale-curiosity/

But how did curiosity deal with death? The authors briefly touch on this; quoting from the paper:

death is just another transition to the agent, to be avoided only if it is boring…

…when the agent runs out of lives, the bricks are reset to a uniform structure again that has been seen by the agent many times before and is hence very predictable, so the agent tries to stay alive to be curious by avoiding reset by death.

— Large-Scale Study of Curiosity-Driven Learning, Burda et al 2018

Yet death in Atari is usually a quick and predictable process, but that is not always so in real life. Some terminal illnesses last for years and can involve complex treatment regimes. And who is less predictable than a dementia patient? Curious agents playing “The Game of Real Life” would be driven to explore these deep, low-reward rabbit holes, and morbid curiosity could encourage the agent to spend much of its resources exploring there.

This suggests curiosity should not always be expected to correlate with success. These curious agents succeed in Atari games because Atari games are rather optimistic, “curiosity-friendly” environments.


Contrast this with the famous Tolstoy quote:

Happy families are all alike; every unhappy family is unhappy in its own way.
— Tolstoy, Anna Karenina

Unlike in Atari, curious agents in this context would morbidly explore a lot of family relationship anti-patterns, being rewarded because unhappy states are complex and hard to predict, and would largely ignore happy family states, which are more predictable and thus “boring”.

Now, this exploration of pathology might actually be quite helpful if the goal were to gain an understanding of dysfunctional families. But if the goal is to chart a path to a healthy life, curiosity will not (directly) help us get there the way it does in Atari.

Liberals Explore, Conservatives Exploit

These two regimes correspond to political viewpoints as described so clearly by Jonathan Haidt in his TED talk (quotes below are his):

I’m saying these animals have different priors on rewards.
  • Liberal / Explore: “If you’re high in openness to experience, revolution is good, it’s change, it’s fun.”
  • Conservative / Exploit: “The great conservative insight is that order is really hard to achieve. It’s really precious, and it’s really easy to lose.”

You might say that common RL environments like Atari have a liberal bias, since there are no catastrophic states with large negative rewards that could permanently increase regret. That means exploration is not to be feared, so let’s try things; there could always be something better around the corner. A relatively high default reward reflects our optimism. There are also no deep rabbit holes with low or negative rewards, so curiosity can be helpful. This makes sense, since these games were designed to provide short sessions of fun.

But in the Medieval restaurant case, a conservative approach would make more sense: exploration is very risky, so let’s stick with what we already have found works well. A low (or negative) default reward reflects our pessimism. Curiosity is not helpful here.

Clearly the right strategy depends on the context, so understanding the reward bias in your context is key. Generally speaking, “liberal-friendly” environments sure make for more interesting policies! Ideally we could recognize which regime we were in, and act appropriately.

Reward Bias in Toy Worlds vs Real Life

RL researchers say that the ultimate goal is not solving toy problems like Atari, but rather thorny real-world problems (say, climate change).

But unlike Atari, where failure is quick and simple, real-world problems tend to have many complicated, protracted ways in which they can be made much worse, perhaps even permanently.

RL today is very focused on toy problems that are fundamentally different from real life in this respect. If researchers are serious about tackling real-world problems, perhaps this difference is worth looking at in more depth.

One response to this may be to say: RL is still in its early stages, so such toy problems are still appropriate. I don’t disagree with that, but if we want people to believe this research is aimed at meaningful problems, then perhaps we could consider toy problems that better emulate these key properties of the real problems we hope to one day solve.

If you enjoyed this article, adding a clap or two below means a lot!
