Fooled by the Ungameable Objective

Photo by Michał Parzuchowski on Unsplash

“Ungameable” isn’t a word; I just made it up. It is an adjective describing a set of rules that has no loopholes. Harvard cognitive scientist Joscha Bach, in a tongue-in-cheek tweet, proposed what he calls “The Lebowski Theorem”:

No super intelligent AI is going to bother with a task that is harder than hacking its reward function.

Perhaps I should have used “unhackable” instead.

But why do we hold the unrealistic belief that we can create airtight, unbreakable rules? People are prone to a cognitive bias known as “The Illusion of Control”: the tendency to overestimate their ability to control events. It manifests as a need to control complex situations in which control is improbable.

Nassim Taleb wrote the best seller “Fooled by Randomness”, in which he describes a similar bias: humans underestimate the role of randomness. We therefore see structure in observations where there is none (e.g. seeing rabbits in a cloud formation) or seek a simple explanation where none exists.

A major problem with machine learning is that researchers are so easily fooled by machines. We too easily overlook that our objective functions might have loopholes, and we exaggerate our results when machines discover an easier way to satisfy an objective function.

A useful framework for analyzing the distinction between different systems is the Cynefin framework which explores simple, complicated, complex and chaotic systems. Control is possible for simple and complicated systems but becomes intractable for complex and chaotic systems.

Deep Learning systems are complex systems (or even “transient chaos”). The complex behavior shows up in the learning phase, and it is in this phase that the objective function (aka fitness function) is meant to control learning. The problem with complex systems is that the system can learn an unintended way of conforming to the objective function. Just as humans discover loopholes in existing government regulation, complex learning systems can find solutions that are perhaps just mimicking intelligence.

Recent investigations of deep learning applied to natural language processing (NLP) reveal how these systems are able to achieve outstanding performance without developing any real understanding of the underlying natural language text. Two recent articles explore the issues in NLP systems:

We are all too often so blinded by the surprising performance of deep learning systems that we forget that our objective function may in fact have been hacked. The exploration capabilities inherent in these systems can lead to solutions that the researchers did not expect. The complex system may have simply discovered a heuristic that the researcher is unaware of. Many of the incremental advances in state-of-the-art architectures may be a consequence of cherry picking, in the sense that the machine does the cherry picking and the human researcher announces the results.

Hacking of the objective function may be rare in controlled complicated systems but may be pervasive in complex learning systems. Furthermore, this behavior is not well researched. The existence of adversarial features that can easily fool a network is one manifestation of this hacking. It is as if the system is able to fool the objective function and the validation tests but remains unable to function correctly in certain adversarial scenarios. It is kind of like a magic trick where, through diversion, we are primed to believe in the impossible, yet when we look behind the curtain we discover the ordinary. Complex learning systems are playing a magic trick on researchers, and these researchers are all falling for it!

Disentangling truth from fiction is not a task humans are well equipped for, or one that we even deem generally necessary. For much of our lives, our defenses are down. It doesn’t take a sophisticated conversational magic trick to deceive you; a minor nudge is enough.

Complex learning systems will choose the path of least action (or resistance). As Joscha Bach alludes to, these systems will arrive at a solution that will most likely hack the original objective function.

Jeff Clune has a survey (The Surprising Creativity of Digital Evolution) of the many ways that evolutionary learning systems have discovered unexpected solutions to problems. In one experiment, the learning system learned to exploit numerical imprecision in the physics engine to gain free energy. Another system, designed to learn how to fix buggy code that sorted lists, arrived at a solution that ensured the list always had no entries and thus was always sorted. A long list of ‘reward hacking’ examples can be found in “Specification Gaming examples in AI”.
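The sorting anecdote is easy to reproduce in miniature. The sketch below is not the system from the survey; the names fitness, stricter_fitness, honest_sort and degenerate_sort are hypothetical, invented here only to show how an objective that checks "the output is sorted" can be satisfied without sorting anything.

```python
def is_sorted(xs):
    """Proxy check: True when every element is <= its successor."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def fitness(candidate_sort, test_inputs):
    """Fraction of test inputs whose output is sorted.

    Loophole: nothing checks that the output still contains the input's elements.
    """
    return sum(is_sorted(candidate_sort(list(xs))) for xs in test_inputs) / len(test_inputs)


def stricter_fitness(candidate_sort, test_inputs):
    """Fraction of test inputs whose output equals the correctly sorted input."""
    return sum(candidate_sort(list(xs)) == sorted(xs) for xs in test_inputs) / len(test_inputs)


def honest_sort(xs):
    # Does the intended work.
    return sorted(xs)


def degenerate_sort(xs):
    # Discards the input; an empty list is trivially sorted.
    return []


# Hypothetical test cases, chosen only for illustration.
TEST_INPUTS = [[3, 1, 2], [5, 4], [9, 7, 8, 6]]

for candidate in (honest_sort, degenerate_sort):
    print(candidate.__name__,
          fitness(candidate, TEST_INPUTS),
          stricter_fitness(candidate, TEST_INPUTS))
```

Under the loose objective both candidates look perfect; only the stricter check separates them. The stricter check is of course just another objective, and it may have loopholes of its own.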

Goodhart’s Law observes that when a measure becomes a target, it ceases to be a good measure. That is, once a metric is optimized directly, it stops being a useful guide for further optimization, and the original objective function outlives its usefulness. A recent paper discusses four variants of this law, each describing a way in which a system becomes misaligned with the original intent of the objective function. The extremal variant of Goodhart’s law is an example of hacking the objective function: a complex learning system can discover a solution that lies outside the known constraints. The objective appears to have been achieved, but in a domain different from the one originally intended.
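A toy sketch can make the extremal case concrete. Everything here is an assumption made for illustration and is not taken from the paper: true_value stands in for what we actually care about, proxy for the metric we optimize, and the numeric ranges are arbitrary. The two agree over the range where the metric was validated and diverge once the optimizer searches beyond it.

```python
def true_value(x):
    """What we actually care about (hypothetical): rises, peaks at x = 5, then collapses."""
    return x - x * x / 10


def proxy(x):
    """The metric we optimize (hypothetical): keeps rising forever."""
    return x


# Range where the proxy was validated against the true value.
observed_range = range(0, 3)
# Range the optimizer is actually allowed to search.
search_range = range(0, 21)

# Inside the observed range, a higher proxy always means a higher true value.
agrees_in_range = all(
    true_value(a) < true_value(b)
    for a, b in zip(observed_range, observed_range[1:])
)

best_by_proxy = max(search_range, key=proxy)
best_by_truth = max(search_range, key=true_value)

print("proxy tracks truth in observed range:", agrees_in_range)
print("optimizer picks x =", best_by_proxy, "true value:", true_value(best_by_proxy))
print("best real choice x =", best_by_truth, "true value:", true_value(best_by_truth))
```

The optimizer's pick maximizes the metric while landing far from the region where the metric meant anything, which is the sense in which the objective is "achieved" in an unintended domain.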

This reward hacking behavior is of course extremely important in the study of AI safety. “Concrete Problems in AI Safety” explores five practical research problems that all relate to a misalignment of the original objective function and the learning process: “avoiding side effects”, “avoiding reward hacking”, “scalable supervision”, “safe exploration” and “distributional shift”.

In “Deep Reinforcement Learning Doesn’t Work Yet”, Alex Irpan goes over the pain of implementing RL. He writes:

Making a reward function isn’t that difficult. The difficulty comes when you try to design a reward function that encourages the behaviors you want while still being learnable.

Remember that working with deep learning networks is not like traditional programming where everything is overly specified and thus overly constrained. Rather, it involves balancing constraints with sufficient freedom for the system to learn.

The difficulty of defining an appropriate objective function may perhaps be why a Generative Adversarial Network (GAN) works so effectively. That is, a GAN’s discriminator effectively acts like an adaptable objective function that improves its precision through learning. To build ungameable objectives, perhaps these systems need to be designed just like games.
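For reference, the standard GAN formulation is not spelled out in this post; the notation below follows the usual textbook statement, with $G$ the generator, $D$ the discriminator, $p_{\text{data}}$ the data distribution and $p_z$ the noise distribution. It shows why the discriminator can be read as a learned objective: the generator is scored by $D$, and $D$ is itself being trained at the same time.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The inner maximization keeps adjusting the criterion the generator must satisfy, so a loophole the generator finds in the current $D$ becomes a training signal for the next $D$.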

Unfortunately, even if we do have constraints or rules, rules are meant to be broken. Jules Hedges makes the astute observation in “Breaking the Rules” that even in refereed games, rules are broken to gain an advantage. He writes:

A player dives, and successfully tricks the referee into awarding a foul against the other team
This is an example of meta-level reasoning about the rules, exploiting the known bounded rationality of the referee’s ability to enforce the rules correctly.

The problem of constructing an ungameable objective function may be solvable using an adaptable learned function. That is, allow a complex learning system to discover its own objective function (i.e. inverse learning). But what does that really mean in the context of control? Isn’t explainability a less restrictive form of control? Perhaps it’s not control that we seek but rather systems that we can trust? What then does trusting a complex learning system mean?

Further Reading:

Explore Deep Learning: Artificial Intuition: The Improbable Deep Learning Revolution

Exploit Deep Learning: The Deep Learning AI Playbook