Reward is NOT Enough, and Neither is (Machine) Learning

Walid Saba, PhD
Published in ONTOLOGIK
Jun 29, 2021 · 7 min read

This is a short, critical commentary on a recently published paper entitled “Reward is Enough”, whose main thesis is that most, if not all, intelligent behavior is the result of a generic objective of maximizing reward. We believe the paper’s thesis and the claims that follow from it are false, precisely because they rest on false assumptions. We will argue instead that reward is not enough, by making the stronger claim that learning itself (in all its paradigms) is not enough. Finally, we will argue that it is time to put an end to the zealous pursuit of empirical methods as the only route to artificial general intelligence (AGI).

Introduction

A paper entitled “Reward is Enough” has recently appeared in one of the oldest and most prestigious artificial intelligence (AI) journals. The paper’s main thesis is that most, if not all, intelligent behavior by natural or artificial agents can be explained by the generic objective of maximizing reward. In many respects this paper is a manifestation of how extreme empiricism has become since statistical and empirical methods took AI by storm in the mid-1990s.

While the “Reward is Enough” paper is one example of the many misguided claims made by proponents of empirical methods, it is certainly not unique. The Turing Award winner Geoffrey Hinton has recently made statements such as “deep learning is going to be able to do everything”. The reason I singled out this paper is that it was accepted for publication in a journal that I used to consider “the reference” for quality research in AI.

We will not be concerned here with the larger question of the pros and cons of the empirical revolution in AI, at least not directly and certainly not in any depth. Instead, our focus will be on highlighting the dangers of perpetuating this zealous trend in our scientific quest for building intelligent machines. The “Reward is Enough” paper (henceforth, RiE): (i) mischaracterizes the type of knowledge an agent needs to exhibit intelligent behavior; (ii) ignores the significant difference between learned and acquired knowledge; and (iii) assumes that intelligent behavior can be cast almost exclusively as the learning of survival skills, a characterization that may hold for animal species but that ignores the more advanced cognitive abilities of humans. Together, these false assumptions lead, naturally, to false conclusions. In what follows I will therefore first argue that these assumptions are false. I will then briefly discuss why this leads us to conclude that not only is reward not enough, but that all of what we call (machine) “learning” is not enough in our quest for artificial general intelligence (AGI).

A ‘Reward’ is not always Computable

Before getting into where the RiE paper’s assumptions go wrong, it will be helpful to first show, by example, why the most important assumption it makes is false. Let us do this by trying to apply the “Reward is Enough” method to learning a fact that is well established in computer science. Computer scientists all know that we can sort a collection of n objects in O(n log n) steps. Casting this problem in the “Reward is Enough” (reinforcement learning) paradigm yields the situation described in Figure 1 below:

Figure 1. A “Reward is Enough” casting of an agent that is trying to learn the optimal sorting algorithm.

In Figure 1 the environment (or at least the part relevant to our current learning task) is a graph of sorting algorithms. An edge from algorithm r to algorithm m corresponds to going from algorithm r to algorithm m by performing a semantics-preserving transformation (using identities, equivalences, etc.). Note that a cycle in this scenario corresponds to a trivial transformation, namely one that applies the inverse of the operation that got us to the current node. The algorithm shown in green indicates that the current state of the environment marks algorithm i as the best sorting algorithm found so far. The agent’s action, in this case, could be something like “apply the commutativity-of-addition transformation”. Such an action might change the state of the environment by, perhaps, changing the currently chosen algorithm from algorithm i to algorithm k, returning the new state to the agent. All of this, so far, sounds good. But what is the reward? How do we inform the agent whether its action improved the situation or not? Here, we have two (unsolvable) problems.
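To make the casting concrete, here is a minimal sketch in Python of how the setup in Figure 1 might look as a reinforcement-learning environment. All names (SortingSearchEnv, apply_transformation, and the example start state) are hypothetical illustrations of the scenario above, not code from the RiE paper; the point is that every ingredient can be written down except one.

```python
# Hypothetical casting of the Figure 1 scenario as an RL environment.
# States are sorting algorithms; actions are semantics-preserving edits.

def apply_transformation(algorithm: str, transformation: str) -> str:
    # Placeholder: returns a semantically equivalent variant of the algorithm
    # (e.g., "append an empty list", "add 0", "commute an addition").
    return f"{algorithm};{transformation}"

class SortingSearchEnv:
    def __init__(self, start_algorithm: str = "algorithm_i"):
        self.current = start_algorithm  # the node currently marked best (green)

    def step(self, action: str):
        # The action rewrites the current algorithm into an equivalent one,
        # e.g., moving from algorithm i to algorithm k in the graph.
        self.current = apply_transformation(self.current, action)
        reward = self.reward(self.current)  # <-- the missing ingredient
        return self.current, reward

    def reward(self, algorithm: str) -> float:
        # This is where the recipe breaks down: there is no general,
        # computable function that scores how close an arbitrary algorithm
        # is to being the optimal one (see the discussion below).
        raise NotImplementedError("no well-defined reward for this task")
```

The environment, the states, and the actions are all unproblematic; it is the reward function that has no principled definition.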

In typical reinforcement learning settings the computation of the reward is well defined; in fact, that is what the agent uses to adjust its policy and take a new action that improves the situation. This works well for knowledge that we actually do learn incrementally, which is mostly animal-like skill: seeking food when hungry, or, in more advanced settings such as games, reaching a goal that is also well defined, namely winning the game. When the goal is well defined, a “reward” can be computed. In games like Go (or Chess), for example, one naïve way to compute the reward would be to apply some function to the current board that returns a numeric value of how “close” the current board is to a win. But how would we compute the reward in our problem above? The answer is that we cannot. First, there are infinitely many semantically equivalent sorting algorithms: just consider that we can always append an empty list to a list, or add 0 to some number, without changing the semantics of a program. Worse yet, how do we score the algorithm we have reached, to determine whether the agent’s action brought it closer to the optimal one? The only way to do this is to execute the program we have arrived at, but that immediately puts us face to face with the halting problem (what can we conclude if the program never terminates while we are evaluating it?).
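To make the contrast explicit, here is a hedged illustration (mine, not the paper’s) of why a game reward is computable while a reward based on executing candidate programs is not. The helper names (game_reward, candidate_score) are hypothetical; the key point is in the comments: a timeout cannot distinguish a slow program from one that never halts.

```python
import multiprocessing

def game_reward(board_evaluation: float) -> float:
    # In Go or Chess, some evaluation of the current position always
    # returns a number, however naive the evaluation function is.
    return board_evaluation

def candidate_score(program, test_input, timeout_s: float = 1.0):
    """Attempt to score a candidate sorting program by executing it."""
    worker = multiprocessing.Process(target=program, args=(test_input,))
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        worker.terminate()
        # Inconclusive: the candidate may just be slow, or it may never halt.
        # The halting problem tells us that no procedure can decide which.
        return None
    # Termination on one input says nothing about optimality across all inputs.
    return 1.0
```

The first function is total and computable; the second is not a reward function at all, since its most informative branch can only return “we cannot tell”.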

In short, the reward in this problem is not defined — and thus

we cannot discover the best sorting algorithm within the reinforcement learning, “Reward is Enough” paradigm

In fact, the best sorting algorithm, like much of our important knowledge, is not discovered by incremental learning, but by reasoning and deduction.

Neither Reward nor Learning is Enough

I went through the above example to highlight the fact that not all of our knowledge is learnable, precisely because acquiring many types of knowledge is not an incremental process in which a well-defined reward can be computed at every step.

This brings me back to the RiE paper and the most important (and flawed) assumption it makes, namely that “… knowledge may be innate (prior knowledge), while some knowledge may be acquired through learning.” This is a flawed assumption, and from false assumptions, as logic teaches us, one can reach any conclusion one desires. But why is this assumption false? In a previous post on this platform I discussed in detail why learning is overrated, and in particular the difference between knowing-how and knowing-that. While knowing-how refers to skills and abilities that are surely learned (in the machine learning sense of learning), knowing-that cannot be learned, since it is not the type of knowledge that is sensitive to individual experiences and observations. The fact that the BiggerThan relation is transitive is not something we learn, since no experience could have taught us otherwise. In fact, most commonsense naïve physics, which is knowledge about how the world works, is knowledge of this sort. Such knowledge cannot be subject to the constraints of the PAC (probably approximately correct) learning paradigm. In fact, results from computational learning theory make this argument formal. The number of examples m needed to learn with error less than ε and with confidence 1 − δ is given by the following bound.
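In one standard form of the bound, for a concept learned from a finite hypothesis class H:

m ≥ (1/ε)(ln|H| + ln(1/δ))

As ε approaches 0 (zero error) or δ approaches 0 (absolute certainty), the right-hand side grows without bound.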

Thus, to learn with error 0 and/or with absolute certainty (which is the situation for most of our important knowledge), we would need an unbounded number of examples, which of course means that learnability is not an option in this case. This applies to most of our factual knowledge, which we actually “acquire” by discovery, by deduction, or by instruction (i.e., by being told).

Concluding Remark

There are many kinds of learning: learning by observation, learning by experience (trial and error), learning by analogy, learning by instruction (being told/taught), etc. Most of the consequential factual knowledge that we use to function and to perform commonsense reasoning is not incrementally learned, and thus reinforcement learning is not even applicable, since a “reward” cannot be defined.

It is time to stop making grandiose statements (like “Reward is Enough”), and time for the media to stop perpetuating false narratives. Certainly, respectable journals that we have always consulted for quality research should not just ride the runaway train by publishing work that falls short of being scientifically valid.
