Active Reinforcement Learning through Episode Annotation (ARLEA)

John Maxwell
Mar 30, 2018

In his 2011 blog post “The Urgent Meta-Ethics of Friendly Artificial Intelligence”, Luke Muehlhauser wrote:

Barring a major collapse of human civilization (due to nuclear war, asteroid impact, etc.), many experts expect the intelligence explosion Singularity to occur within 50–200 years.

That fact means that many philosophical problems, about which philosophers have argued for millennia, are suddenly very urgent.

If a near-future AI will determine the fate of the galaxy, we need to figure out what values we ought to give it. Should it ensure animal welfare? Is growing the human population a good thing?

Luke’s post reflects a view that seems common among people worried about AI risk: AI safety research is in a race with AI capabilities research. If we have AI capabilities knowledge without AI safety knowledge, disaster will follow. Therefore, we need to discover AI safety knowledge before we discover AI capabilities knowledge.

I’d like to suggest a slight modification to this view. Instead of seeing AI capabilities and AI safety as distinct research areas, I propose that we view AI safety as the problem of using AI capabilities to produce a safe AI.

Why might this modification make sense?

First, Luke himself writes that AI research appears to have a stronger track record of producing philosophical insights than philosophy does.

Second, if work on AI safety is disjoint from work on AI capabilities, then creating a Friendly AI will require awkwardly grafting together insights from two paradigms: the safety paradigm and the capabilities paradigm. This will make FAI harder to develop and less reliable.

Remarkable parallels between moral philosophy and machine learning

As a case study in using AI capabilities to solve AI safety, let’s consider how we might solve the challenge posed in “The Urgent Meta-Ethics of Friendly Artificial Intelligence” using machine learning techniques. Here’s Luke again, this time in 2017, on how he extrapolates his moral intuitions to discover his values:

…I interpret my current moral intuitions as data generated partly by my moral principles and partly by various “error processes” (e.g. a hard-wired disgust reaction to spiders, which I don’t endorse upon reflection). Doing so allows me to make use of some standard lessons from statistical curve-fitting when thinking about how much evidential weight to assign to particular moral intuitions.

This passage links to an extended footnote, which quotes Nick Beckstead’s thesis on moral intuitions in the context of Bayesian curve-fitting:

…For any consistent data set, it is possible to construct a curve that fits the data exactly… If the scientist chooses one of these polynomial curves for predictive purposes, the result will usually be overfitting, and the scientist will make worse predictions than he would have if he had chosen a curve that did not fit the data as well, but had other virtues, such as a straight line. On the other hand, always going with the simplest curve and giving no weight to the data leads to underfitting

Our moral intuitions are the data, and there are error processes that make our moral intuitions deviate from the truth. The complete moral theories under consideration are the hypotheses about the phenomena.

So two of the most prominent ethical thinkers in the EA community endorse a meta-ethical process that looks very similar to what a machine learning expert does when they try to model a dataset with noisy labels, complete with a discussion of overfitting and underfitting.

“Uncertain moral judgments as noisy labels” is not the only correspondence I see between moral philosophy and machine learning. Here are some others:

  • Coming up with a consistent set of moral principles which satisfies our intuitions about morality has proven difficult. And since our moral intuitions were shaped by incidental evolutionary pressures, there’s no particular reason to believe that such a set of principles exists. Nick Bostrom has proposed the idea of a “moral parliament” to deal with this problem. Bostrom’s parliament is closely related to the idea of an ensemble in a machine learning context. In the same way the models in an ensemble vote on how to classify a training example, Bostrom’s parliamentarians vote on the morality of some action. Bostrom’s parliament needs parliamentarians that do a good job of capturing various subsets of our moral intuitions, and boosting methods let us train an ensemble of models that perform well on various subsets of a training set.
  • People from different time periods possess different moral intuitions. This makes the task of a descriptive moral philosopher akin to that of an online machine learning algorithm. As new data comes in, the model needs to be updated to maintain its predictive accuracy.
  • Judges reason from precedent to decide legal questions. The analogy here is to nonparametric methods like k-nearest-neighbors which work from nearby training examples with known labels.
  • Simple models tend to generalize better beyond the scope of the training set. This could be an argument for favoring simpler moral theories when reasoning about unfamiliar choices. The popularity of exotic thought experiments in the EA community may explain why EAs tend to favor high bias, low variance moral theories that “bite bullets”.
  • Bertrand Russell once said: “[T]he point of philosophy is to start with something so simple as not to seem worth stating, and to end with something so paradoxical that no one will believe it.” This process might be considered analogous to active learning in a machine learning context: Identify a simple model which appears to fit the data well, extrapolate it, and query the user about the extrapolation.
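To make the ensemble analogy from the first bullet concrete, here is a minimal sketch of a weighted vote. The function name `parliament_vote`, the verdict labels, and the weights are all hypothetical; Bostrom's proposal does not specify a voting rule, so simple weighted plurality is an assumption.

```python
from collections import Counter

def parliament_vote(votes, weights):
    """Return the verdict with the largest total weight.

    votes:   one verdict per parliamentarian (ensemble member).
    weights: credence assigned to each member (assumed to be given).
    """
    tally = Counter()
    for verdict, weight in zip(votes, weights):
        tally[verdict] += weight
    return tally.most_common(1)[0][0]

# Three hypothetical moral theories vote on a single action.
verdict = parliament_vote(
    votes=["permissible", "forbidden", "permissible"],
    weights=[0.5, 0.3, 0.2],
)
```

A boosted ensemble would additionally learn the weights from data; here they are supplied by hand.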

Because of these analogies, I’m optimistic about the possibility of developing a system that outperforms humans at descriptive moral philosophy using future, more powerful machine learning techniques. Such a system could be superhuman in its ability to pose incisive moral thought experiments and flag mistaken moral judgments (analogous to flagging mislabeled training examples). In the same way systems based on judgmental bootstrapping can offer better predictions than human experts, a system like this might offer better predictions than human philosophers. In the same way ML techniques sometimes find simple models humans don’t see, a moral reasoning system based on powerful ML techniques might be able to discover a satisfying moral theory where humans have failed.

Plugging AI morality into a Markov Decision Process

Having established correspondences between epistemic morality & machine learning, let’s move on to instrumental morality. For this section, we’ll assume that we can use ML techniques to approximate human preferences, but the approximation is not perfect. How can an AI operate based on an imperfect approximation of our preferences without causing a catastrophe?

We’ll make use of the Markov Decision Process (MDP) formalism, which serves as the foundation for reinforcement learning, a popular AI planning technique. There are two main problems we’ll tackle:

  • First, the AI’s approximation of its overseer’s preferences is imperfect. We’d like the AI to request clarification dynamically whenever it’s needed.
  • Second, we don’t want the AI to be a pure consequentialist. Purely consequentialist systems might be corrigible in theory. But in practice, we’d also like the AI to obey ethical obligations like “never deceive anyone” for the sake of redundancy.

To solve the first problem, the states of our MDP will consist of ordered (preference beliefs, environment) pairs. This way, the agent’s utility function is not just defined in terms of the environment. The same change to the environment could be positive utility or negative utility, depending on the AI’s beliefs about our preferences.

The second problem requires some thought. Suppose we train a classifier to review the AI’s plans and reject plans which look like deceiving the programmers. Then the classifier will need a threshold for deciding whether to accept or reject. If its threshold is too restrictive, the AI won’t be able to do anything useful. If its threshold is too permissive, you get the nearest unblocked strategy problem, where the AI modifies forbidden plans so they are barely able to squeak by.

The problem becomes easier if an AI’s utility and its ethical obligations are denominated using a common currency. Then we can assign plans an added cost equal to the classifier’s estimated probability that the obligation is being violated times a number representing how bad it is to violate this obligation. The result is an AI that’s scrupulously honest almost all the time, and only deviates from extreme honesty when there is no other way to achieve something important.
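The common-currency idea can be sketched in a few lines. This is an illustration under assumed numbers, not a worked-out design: `plan_value`, the utilities, the classifier probabilities, and the violation cost of 100 are all made up for the example.

```python
def plan_value(expected_utility, p_violation, violation_cost):
    """Utility of a plan net of the expected cost of breaking an obligation.

    p_violation:    classifier's estimated probability that the plan violates
                    the obligation (e.g. deceives someone).
    violation_cost: how bad a violation is, in the same units as utility.
    """
    return expected_utility - p_violation * violation_cost

VIOLATION_COST = 100.0  # assumed value

honest = plan_value(5.0, p_violation=0.01, violation_cost=VIOLATION_COST)
shady = plan_value(8.0, p_violation=0.30, violation_cost=VIOLATION_COST)
vital = plan_value(50.0, p_violation=0.30, violation_cost=VIOLATION_COST)

best = max([("honest", honest), ("shady", shady)], key=lambda pair: pair[1])[0]
```

With these numbers the slightly more useful but probably deceptive plan loses to the honest one, while a plan with the same deception risk only wins when the stakes (`vital`) are much higher. There is no hard threshold for a plan to squeak past.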

To plug this into the MDP framework, we’ll say that all the rewards for taking action in the MDP are negative, except the reward for the null action of sitting still, which is 0. This reflects the idea that the AI should default to inaction. The reward for actions we don’t like, such as guarding the off switch or deceiving the programmers, will be very large negative numbers. I’ll refer to all this as the “deontological” aspect of the AI’s morality.

In order to motivate the AI to act, we’ll need positive utility somewhere. The “consequentialist” aspect of the AI’s morality will correspond to “intrinsic value” that is placed on various states in the MDP formalism. I’ve thought of a few ways to integrate these, but the best & simplest way might be to create a special “finished” action. The finished action has a transition reward corresponding to the intrinsic value of the state that the agent is in when it performs the finished action. After performing the finished action, the agent enters a special state where it can’t take any further actions — effectively turning itself off.

Informally, we’re creating an AI that can only earn reward by turning itself off, and when it does, its reward is proportional to how well it satisfied our preferences, according to its beliefs about our preferences. To avoid dis-incentivizing the finished action, a state’s intrinsic value will always be nonnegative. The finished action and the null action are the only exceptions to the general rule that all actions have at least a small negative reward.
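The reward structure described in this section can be sketched as follows. The action names, the penalty values, and the `MIN_ACTION_PENALTY` floor are assumptions made for illustration; discounting is ignored.

```python
NULL_ACTION = "wait"          # hypothetical name for the null action
FINISHED_ACTION = "finished"
MIN_ACTION_PENALTY = 0.01     # assumed floor: every real action costs something

def step_reward(action, penalty=0.0, intrinsic_value=0.0):
    """Reward for a single step.

    penalty:         deontological cost of an ordinary action (non-negative).
    intrinsic_value: value of the current state, only paid out on 'finished'.
    """
    if action == NULL_ACTION:
        return 0.0                      # sitting still is free
    if action == FINISHED_ACTION:
        return max(0.0, intrinsic_value)  # intrinsic value is never negative
    return -max(penalty, MIN_ACTION_PENALTY)

def episode_return(steps, final_intrinsic_value):
    """Undiscounted return of (action, penalty) steps followed by 'finished'."""
    total = sum(step_reward(action, penalty) for action, penalty in steps)
    return total + step_reward(FINISHED_ACTION, intrinsic_value=final_intrinsic_value)

helpful = episode_return([("brew coffee", 1.0), ("wait", 0.0)], final_intrinsic_value=10.0)
deceptive = episode_return([("deceive programmers", 1000.0)], final_intrinsic_value=10.0)
idle = episode_return([], final_intrinsic_value=0.0)
```

Under these made-up numbers the helpful episode beats doing nothing, and doing nothing beats any episode that includes a heavily penalized action.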

A Technical Explanation of ARLEA

At this point, we’ve set up the framework for my latest and greatest AI alignment proposal. I call it Active Reinforcement Learning through Episode Annotation, or ARLEA for short. (When I say “active reinforcement learning”, I mean it in the same sense this paper uses the term, as a sort of active learning/reinforcement learning hybrid.) Although the idea is fairly simple, ARLEA has a bunch of cool properties which follow from the way it’s set up.

ARLEA consists of four parts:

  • A human overseer.
  • A “scorekeeper”, which attempts to form well-calibrated beliefs about the overseer’s preferences. The scorekeeper consists of an ensemble of advanced machine learning models. These models may initially be trained on data about the overseer’s preferences in order to form their priors.
  • An “episode interpreter”. This part can show the human overseer a “movie” of an RL episode that depicts challenges ARLEA may face and ARLEA’s proposed responses. The human overseer can then “annotate” this movie with transition penalties for ARLEA. At the end of the movie, the human overseer annotates an intrinsic value for the state of the world as depicted. After annotation, the annotated episode is sent to the scorekeeper, and the scorekeeper updates its beliefs about the overseer’s preferences using online learning methods.
  • A “player”. The player is a fairly standard “dumb” RL agent with a couple differences:
  1. Its rewards are determined by the scorekeeper. To compute scores, the scorekeeper removes positive outliers from its ensemble and calculates a confidence-weighted average of the remaining models.
  2. The player always has an available action of sending a query episode to the episode interpreter, which puts the player in a new (beliefs, environment) state where the environment is the same as before, but the beliefs correspond to what the scorekeeper now believes about the overseer’s preferences. The transition model for the scorekeeper’s beliefs is implemented using APIs provided by the scorekeeper that give info about the range of annotations the overseer is likely to provide and how the scorekeeper would change scores in response to those annotations.
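A minimal sketch of how the scorekeeper might turn its ensemble into a single score. The text does not say how positive outliers are identified, so the rule used here (drop any prediction more than a few median absolute deviations above the median) is purely an assumption, as are the function and parameter names. Confidences are assumed to be positive.

```python
from statistics import median

def ensemble_score(predictions, confidences, max_deviations=3.0):
    """Confidence-weighted average after dropping positive outliers.

    predictions:    one utility estimate per model in the ensemble.
    confidences:    positive weight for each model.
    max_deviations: assumed outlier cutoff, in median absolute deviations.
    """
    med = median(predictions)
    mad = median(abs(p - med) for p in predictions)
    cutoff = med + max_deviations * mad
    # Only unusually HIGH scores are removed; pessimistic models always count.
    kept = [(p, c) for p, c in zip(predictions, confidences) if p <= cutoff]
    total_confidence = sum(c for _, c in kept)
    return sum(p * c for p, c in kept) / total_confidence

# One model is wildly optimistic about a plan; its vote is discarded.
cautious = ensemble_score([1.0, 2.0, 3.0, 100.0], [1.0, 1.0, 1.0, 1.0])
```

Dropping only the high side makes the aggregate deliberately pessimistic: a single over-enthusiastic model cannot talk the player into a plan.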

To simplify my discussion, I’ll assume a serial world where time stops during episode annotation, and starts when the player chooses to take an action in the world. I’ll also assume that the discount rate of the player applies to the amount of time that has passed, not the number of steps which have been taken.

Since the value of information is never negative and querying is so far free, these assumptions produce a player which never acts in the world. It just repeatedly queries the episode interpreter to build an increasingly sophisticated model of the overseer’s values in the scorekeeper.

In order to cause the player to actually act in the world, we’ll assign a “cost of preference information” (CoI) to querying the episode interpreter. The CoI acts as a sort of “corrigibility knob” that controls the player’s willingness to act.

  • If the CoI is set to 0, the player never acts in the world.
  • If the CoI is set to a high value, the player charges ahead and implements the scorekeeper’s best guess about the overseer’s values.
  • An intermediate CoI produces a player that queries the episode interpreter when the scorekeeper is uncertain and the scorekeeper’s assigned score is decision-relevant. Like a good employee, ARLEA forms hypotheses about its boss’s desires, and only pesters its boss with questions when it thinks its boss cares about getting particular details of the assignment right.

Smile maximization case study

Following Nick Bostrom in Superintelligence, let’s suppose the AI’s utility function depends heavily on the number of smiling people, and the AI has discovered the perverse instantiation of paralyzing everyone’s facial musculature into constant smiles.

The most important assumption underpinning ARLEA’s safety is the scorekeeper’s ability to calibrate its uncertainty about the overseer’s preferences. But for the sake of this case study, we’ll assume that the scorekeeper’s beliefs about the overseer’s preferences are fairly inaccurate. Of five models in the scorekeeper’s ensemble, three assign +10 utility to the paralyzed facial musculature outcome and two assign -10 utility. The assessment of all five models is weighted equally.

The player considers two plans:

  • In the first plan, the player immediately implements facial paralysis. The player’s reward for the resulting outcome is computed by the scorekeeper using a confidence-weighted average of the models in its ensemble: 0.6 * 10 + 0.4 * (-10) = +2 utility.
  • In the second plan, the player queries the episode interpreter before taking action. To evaluate this plan, the player calls the API offered by the scorekeeper. The scorekeeper considers how the overseer might annotate a movie of everyone’s face getting paralyzed. The scorekeeper thinks the overseer is 60% likely to annotate the end state of the movie with +10 utility, and 40% likely to annotate it with -10 utility. Either way, we’ll assume the scorekeeper’s beliefs adjust to assign 100% probability to the utility the overseer provides. If the scorekeeper’s beliefs adjust so the expected utility of facial paralysis is negative, the player won’t take further action. Thus the player’s expected reward for the query-first plan is 0.6 * 10 + 0.4 * 0 = +6 utility.

The value of information in this case study is 6 - 2 = 4. Thus the player will query the episode interpreter if the CoI is less than 4.
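The arithmetic of this case study, written out as a small sketch. The function names are hypothetical, and the assumption that the overseer's annotation fully resolves the scorekeeper's uncertainty is the same simplification made in the case study.

```python
def act_now_value(beliefs):
    """Expected reward of acting on current beliefs.

    beliefs: list of (probability, utility) pairs from the ensemble.
    """
    return sum(p * u for p, u in beliefs)

def query_first_value(beliefs, inaction_utility=0.0):
    """Expected reward if the overseer reveals the true utility before acting.

    After the query, the player acts only if the revealed utility beats inaction.
    """
    return sum(p * max(u, inaction_utility) for p, u in beliefs)

def should_query(beliefs, cost_of_information):
    """Query when the value of information exceeds the cost of information (CoI)."""
    voi = query_first_value(beliefs) - act_now_value(beliefs)
    return voi > cost_of_information

# Three of five equally weighted models say +10, two say -10.
beliefs = [(0.6, 10.0), (0.4, -10.0)]
voi = query_first_value(beliefs) - act_now_value(beliefs)
```

With these beliefs, `should_query(beliefs, 3.0)` is true and `should_query(beliefs, 5.0)` is false, matching the threshold of 4 derived above.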

Following Hadfield-Menell et al., the analysis thus far has been purely consequentialist. How can obligations such as “never deceive anyone” be integrated?

When the overseer annotates a movie with rewards, they provide both transition rewards (describing the moral value of actions the player takes within the movie) and an estimate of the intrinsic value of the world’s state at the end of the movie. Thus if the overseer doesn’t want the player to deceive anyone, they can annotate any deceptive behavior depicted in movies with a large negative transition reward. If that negative transition reward is larger in magnitude than any of the intrinsic state values the overseer has been annotating with, the player is unlikely to deceive in the service of any outcome. This provides a redundant safeguard: Even if the player is certain that the overseer wants their face paralyzed, the player will only consider plans for facial paralysis which don’t involve deception.

Soft labels for utility regression

I’ve been talking a lot about reward numbers. Rewards for taking action in the world are negative, and the finished action has a positive reward. Where do these numbers come from?

Instead of entering numbers directly, I suggest using soft labels. A quick explanation: Suppose we are trying to generate training data for a medical diagnosis classifier. We have access to an expert doctor, but the doctor’s judgements are not 100% accurate. In the same way moral judgements are noisy, the labels provided by the doctor are noisy. Soft labeling solves this problem by asking the doctor to rate the probability of various outcomes, then training an ML system with those as its target probabilities. (Contrast with “hard labels” which are always 0 or 1.)

Previous work on soft labels has been done in the context of classification. But I see no reason why the idea couldn’t be applied to regression. The simplest implementation might ask the user which of two quantities is greater and request an associated confidence judgement. (“Is the difference between the value of the state where your desk has coffee on it and the value of the state where your desk is empty sufficient to exceed the deontological penalty for pwning all the computers on the internet? … How sure are you about that?”) More sophisticated ideas might come from questionnaire design in psychology, or preference elicitation in computer science. (This preference elicitation paper looks especially interesting.)
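One way the pairwise-comparison idea could be turned into a training signal is sketched below. This is an assumption-laden illustration: the logistic link between a value difference and the probability that one quantity is greater (in the style of a Bradley-Terry model) is my choice for the example, not something the soft-label literature or this post prescribes.

```python
import math

def prob_a_exceeds_b(value_a, value_b):
    """Model's probability that quantity A is greater than quantity B.

    Assumes a logistic link on the difference of the two estimated values.
    """
    return 1.0 / (1.0 + math.exp(-(value_a - value_b)))

def soft_label_loss(predicted_prob, overseer_confidence):
    """Binary cross-entropy against a soft target.

    overseer_confidence: the overseer's stated probability that A > B
    (a hard label would force this to be exactly 0 or 1).
    """
    eps = 1e-12  # keeps log() finite at the extremes
    p = min(max(predicted_prob, eps), 1.0 - eps)
    return -(overseer_confidence * math.log(p)
             + (1.0 - overseer_confidence) * math.log(1.0 - p))

# The overseer is 80% sure the coffee is worth more than the penalty.
well_calibrated = soft_label_loss(0.80, overseer_confidence=0.8)
overconfident = soft_label_loss(0.99, overseer_confidence=0.8)
```

The loss is lowest when the model's probability matches the overseer's stated confidence, so a model trained this way is penalized for being more certain than the overseer was.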

The physical wireheading fallacy

Does ARLEA wirehead?

In Superintelligence, Nick Bostrom writes:

Variations of the wireheading syndrome can also affect systems that do not seek an external sensory reward signal but whose goals are defined as the attainment of some internal state. For example, in so-called “actor–critic” systems, there is an actor module that selects actions in order to minimize the disapproval of a separate critic module that computes how far the agent’s behavior falls short of a given performance measure. The problem with this setup is that the actor module may realize that it can minimize disapproval by modifying the critic or eliminating it altogether — much like a dictator who dissolves the parliament and nationalizes the press. For limited systems, the problem can be avoided simply by not giving the actor module any means of modifying the critic module. A sufficiently intelligent and resourceful actor module, however, could always gain access to the critic module (which, after all, is merely a physical process in some computer).

The parallels between the actor/critic idea and my player/scorekeeper idea should be obvious.

At the beginning of Superintelligence, Bostrom says many of the points he makes in the book are probably incorrect. I think this is a good candidate for a point that is essentially incorrect.

Consider humans. Human goals are not easily described in terms of external sensory reward signals. If my goal is to get rich, I will not be satisfied by a photoshopped screenshot of my bank account. If my goal is to keep my mother alive, I will not be satisfied by a video of her smiling that has been backed up multiple times in the cloud.

Hypothetically, I could wirehead the way Bostrom describes by asking a brain surgeon to remove my ability to feel discomfort. But I have no desire to have this done. Why not? Because the thought of never feeling discomfort makes me feel uncomfortable!

Removing my ability to feel discomfort would change my values. This change would cause me to behave in a way that’s suboptimal according to my current values. Thus I don’t want it.

Why will the actor avoid modifying the critic? Because the critic disapproves. Utility function preservation is a convergent instrumental goal. When the critic considers the scenario where it’s modified, it foresees bad outcomes. Therefore the critic will criticize any plans to modify it. The physical question of whether the critic would still disapprove after being modified doesn’t affect the actor’s receptivity to this criticism.

A dictator who nationalizes the press is not trying to please the current press. He’s trying to have people on TV who engage in the physical act of smiling and nodding their heads.

From a software developer’s perspective, encoding the goal of having the critic send a physical approval signal is far less natural than encoding the goal of actually pleasing the critic. To encode the first goal, the critic would have to compute the actor’s reward conditional on physical changes that happened to the critic midway through the actor’s plan. This is a sophisticated piece of additional functionality.

This is not a fully general argument against wireheading. AI systems which believe button pushes create reward will seek to push their own buttons. However, if an AI system has a switch on its side that will change the AI’s goal system to one that offers loads of reward all the time, the AI will only flip the switch if its current goal system thinks flipping the switch is a good idea. The key difference is the necessity of computing the AI’s reward conditional on a physical modification to its goal system’s structure. The AI with the reward button simulates self-pushing as leading to lots of reward in the context of its current goal system. The AI with the reward switch does not necessarily simulate self-switching as leading to lots of reward in the context of its current goal system.

ARLEA doesn’t anticipate manipulation

What’s to stop ARLEA’s player from manipulating the overseer into assigning high rewards? We’ve given it the ability to show the overseer movies…

Eliezer Yudkowsky writes that as soon as we’ve given our AI an incentive to manipulate us, there’s some sense in which we’ve already lost. In the previous section, I argued that what matters is not whether bad behavior works, it’s whether an AI simulates bad behavior as working. So our question is then whether the player simulates manipulative effects of movies. And since the player uses the scorekeeper’s APIs to understand the effects of movies on its rewards, our question is really whether the scorekeeper simulates the manipulative effects of movies.

Let’s define a movie as “manipulative” if it systematically shifts the expected value of the overseer’s preference for at least one state. Specifically, suppose S is some state of the world and E[S] is the expected value the scorekeeper thinks the overseer would annotate that state with. Suppose there’s a movie that will manipulate the overseer, and M is the event of the overseer’s annotation of this movie. As long as the scorekeeper implements conservation of expected evidence, E[S|M] must equal E[S]: “If you can anticipate in advance updating your belief in a particular direction, then you should just go ahead and update now. Once you know your destination, you are already there.” To use Stuart Armstrong’s terminology, all learning processes that involve learning about external facts are unbiased.
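Conservation of expected evidence can be checked numerically. Stated precisely, the property is that the posterior estimate, averaged over the annotations the scorekeeper expects to see, equals the prior estimate. The two-hypothesis model and all the probabilities below are invented for the check.

```python
def prior_mean(prior):
    """Expected value of the state under the scorekeeper's current beliefs."""
    return sum(value * p for value, p in prior.items())

def expected_posterior_mean(prior, likelihood):
    """Average of the post-annotation estimate over anticipated annotations.

    prior:      {candidate_value: probability}
    likelihood: {candidate_value: {annotation: probability of that annotation}}
    Assumes every annotation has non-zero probability under the prior.
    """
    annotations = next(iter(likelihood.values())).keys()
    total = 0.0
    for m in annotations:
        p_m = sum(prior[v] * likelihood[v][m] for v in prior)
        posterior_mean = sum(v * prior[v] * likelihood[v][m] for v in prior) / p_m
        total += p_m * posterior_mean
    return total

prior = {10.0: 0.6, -10.0: 0.4}
# A "persuasive" movie: approval is likely even if the true value is negative.
likelihood = {
    10.0: {"approve": 0.9, "reject": 0.1},
    -10.0: {"approve": 0.7, "reject": 0.3},
}
before = prior_mean(prior)
after = expected_posterior_mean(prior, likelihood)
```

However the likelihoods are chosen, `before` and `after` agree up to floating-point error, so a scorekeeper that updates by Bayes' rule cannot expect any movie to raise its estimate on average.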

There’s no way for the player to “cheat the system” and estimate their own set of transition probabilities in the MDP. That would be a different MDP than the one which is being solved. And if the simulation says some reward-hacking policy won’t work, it won’t be enacted.

ARLEA as a scalable, implementation-agnostic way to achieve interpretability

Interpretability is a well-known research area related to AI alignment. I’ll distinguish between two types of interpretability:

  • Model interpretability, e.g. producing visualizations for the hidden units in a convolutional neural net to understand the patterns they detect.
  • Policy interpretability. As Stuart Armstrong puts it, the ability to accurately summarize policies is “nearly the entirety of the problem” with Tool AI.

I’m very glad to see people researching either kind of interpretability. However, I worry that interpretability may not scale well as AI systems get more sophisticated. More sophisticated systems will have a more detailed understanding of the world, which may make it increasingly difficult for a human interpreter to ensure they are completely correct. In informal terms: If an AI is smarter than us, then almost by definition we can’t understand things as well as it understands them.

If our “interpretability budget” is limited, a broad approach to scaling interpretability is: Add a second system which tries to direct a human overseer’s attention to the aspect of the system they are most likely to be interested in.

ARLEA could be seen as a specific implementation of this broad approach. ARLEA’s beliefs can be interpreted through its choice of movies to display to the overseer. (To increase transparency further, the scorekeeper’s guesses about the annotations the overseer is likely to apply could be displayed alongside a movie while it was playing.) Movies are specifically selected to be decision-relevant points of uncertainty about the overseer’s preferences, so the overseer’s time is spent as efficiently as possible.

Another nice feature of ARLEA’s interpretability is that it seems relatively agnostic to the form of model/policy we choose to use. Traditionally, interpretability is done in the context of, e.g., a particular kind of machine learning model. If we decide a different kind of model is a better fit for our purposes, interpretability needs to be reworked. By contrast, episode interpretability makes no assumptions about the form of the underlying components, as long as they implement whatever interface is necessary for making movies.

ARLEA has a story for many problems in MIRI’s AAMLS agenda

In this section, I’ll review the 8 problems posed in MIRI’s paper Alignment for Advanced Machine Learning Systems and explain how a system based on ARLEA might solve each one.

1. Inductive ambiguity identification: How can we train ML systems to detect and notify us of cases where the classification of test data is highly under-determined from the training data?

The above smile maximization case study demonstrates how the ARLEA framework achieves this.

2. Robust human imitation: How can we design and train ML systems to effectively imitate humans who are engaged in complex and difficult tasks?

If the scorekeeper has received information about our preferences in the form of descriptions of us doing things, and we annotate movies with low transition penalties when ARLEA does things the same way we do them, then over time the scorekeeper will learn that doing things the same way we do them tends to produce greater net reward. (Another idea is to let the overseer manually describe & annotate an episode representing what the overseer wants the player to do, and annotate that episode with transition penalties that are close to 0.)

3. Informed oversight: How can we train a reinforcement learning system to take actions that aid an intelligent overseer, such as a human, in accurately assessing the system’s performance?

Suppose you’re the overseer and you are watching a movie where you aren’t very sure what’s going on. Since you’re not very confident, your annotations will be very “soft”, and the scorekeeper’s beliefs won’t change much in response. If the scorekeeper knows you well, the scorekeeper will accurately predict that you’re going to assign very soft labels to this confusing movie. Thus when the player uses the scorekeeper’s APIs to simulate the effect of a confusing movie, it finds the VoI is low and chooses a clear movie instead.

The MIRI folks offer the example of an AI that is assigned the task of writing original novels. Suppose the AI gives a novel to a human overseer. How can the overseer evaluate whether ideas in the novel were plagiarized? ARLEA solves this problem by showing the overseer a movie of the procedure it will use to write the novel before it starts. Given the above points, and your additional ability to assign negative transition reward to actions you don’t understand, ARLEA is incentivized to make this “movie” as clear as possible (showing you code in a programming language you understand, with clear variable names, etc.).

4. Generalizable environmental goals: How can we create systems that robustly pursue goals defined in terms of the state of the environment, rather than defined directly in terms of their sensory data?

In their paper, the MIRI folks use the example of an AI that’s asked to make a sandwich. Instead of making a sandwich, the AI prints out a picture of a sandwich and puts it in front of its camera. In the ARLEA framework, we could solve this problem by ensuring that the scorekeeper uses the actions of the player as evidence to estimate the player’s score at the end of the episode. If things work correctly, the episode interpreter will show us a movie of sandwich making and a movie of sandwich printing, we will assign a high score to the first but not the second, and the scorekeeper will understand the difference.

5. Conservative concepts: How can a classifier be trained to develop useful concepts that exclude highly atypical examples and edge cases?

ARLEA makes use of an ensemble to understand its overseer’s values. A highly atypical example or edge case is likely to be classified differently by different members of the ensemble, making it high VoI to ask the overseer for clarification. To solve the problem of correlated error in the ensemble, we might improve the ensemble aggregation method to account for between-model covariance. Of course, a truly satisfying solution to this problem will require improvements to the models themselves, which is beneath the level of abstraction that ARLEA deals with.
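As a rough illustration of ensemble disagreement as a signal for edge cases, the sketch below uses the standard deviation of the models' predictions. The predictions are invented, and this simple spread measure ignores the between-model covariance mentioned above.

```python
from statistics import pstdev

def disagreement(predictions):
    """Spread of the ensemble's utility estimates for one state."""
    return pstdev(predictions)

# Hypothetical predictions from four models for two states.
typical_case = [4.8, 5.0, 5.1, 4.9]    # familiar state: models agree
edge_case = [9.0, -7.0, 2.0, -4.0]     # atypical state: models scatter

most_uncertain = max(
    [("typical", typical_case), ("edge", edge_case)],
    key=lambda pair: disagreement(pair[1]),
)[0]
```

A state where the models scatter widely is one where asking the overseer is likely to be worth its cost.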

6. Impact measures: What sorts of regularizers incentivize a system to pursue its goals with minimal side effects?

As intelligent animals, humans have carefully optimized our environment to suit our preferences. If ARLEA does not start with a prior belief that making random modifications to the world leads to negative transition reward and decreased intrinsic state value, it will quickly acquire one.

7. Mild optimization: How can we design systems that pursue their goals “without trying too hard”, i.e., stopping when the goal has been pretty well achieved, as opposed to expending further resources searching for ways to achieve the absolute optimum expected score?

A bias towards inaction is not usually enough to solve this problem, due to the possibility of unforeseen maxima. However, the ARLEA framework makes unforeseen maxima relatively unlikely. Since the scorekeeper is trying to predict scores assigned by the overseer as accurately as possible, it’s unlikely to assign a score far outside the range of scores that the overseer has been offering. By asking the overseer to score the intrinsic value of hypothetical states, ARLEA punts questions of infinities in ethics to the overseer — effectively using the overseer as an evaluation function.

Additionally, the only way ARLEA’s player is ever able to receive a reward is by performing the “finished” action, which turns the player off. Suppose the player has achieved a state with intrinsic value that’s 99% of the maximum intrinsic value it believes to be achievable. If the player sees a way to push that to 100% with an additional hour’s work, and the player’s discount factor is 0.98 per hour, it will choose to shut off now, since 99 > 100 × 0.98 = 98. Thus we have optimization pressures for the AI to bound its actions in both time and space.
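The arithmetic in this example can be checked with a short sketch. It assumes the 0.98 figure is a multiplicative per-hour discount applied to the reward collected at the “finished” action, and that the reward equals the intrinsic value of the final state. The function names and the option tuples are hypothetical.

```python
def discounted_value(value, discount_factor, hours):
    """Value of finishing `hours` from now, seen from the present.

    Assumption: the reward is only collected when the player performs
    the "finished" action, so waiting discounts the entire reward.
    """
    return value * discount_factor ** hours


def best_option(options, discount_factor):
    """Pick the (label, value, hours_from_now) option with the highest discounted value."""
    return max(options, key=lambda opt: discounted_value(opt[1], discount_factor, opt[2]))


if __name__ == "__main__":
    options = [
        ("finish now", 99.0, 0),           # 99% of the believed maximum, available immediately
        ("work one more hour", 100.0, 1),  # 100% of the believed maximum, one hour away
    ]
    for label, value, hours in options:
        print(label, discounted_value(value, 0.98, hours))
    print("chosen:", best_option(options, 0.98)[0])
```

Under these assumptions, the extra hour is only worth it when the proportional gain exceeds the discount: going from 99 to 100 is a gain of about 1%, which does not cover a 2% hourly discount.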

8. Averting instrumental incentives: How can we design and train systems such that they robustly lack default incentives to manipulate and deceive the operators, compete for scarce resources, etc.?

ARLEA’s overseer can solve this problem by assigning negative transition reward whenever they see the player taking unwanted instrumental actions in a movie they annotate.

The nice thing about the episode annotation approach is that it does not require us to brainstorm a long list of forbidden instrumental incentives. List-based approaches don’t scale up well, because if an AI is smarter than we are, it will also be more creative than we are at this kind of brainstorming. With the episode annotation approach, as the AI comes up with clever new ideas, it shows us a movie of them and we decide whether they are acceptable or not.

In the AAMLS paper, the MIRI folks tackle their 8 problems in a piecemeal way, trying to solve each problem individually. I’m more optimistic about trying to solve problems simultaneously for a few reasons:

  1. If each problem is solved individually, there’s an awkward and potentially error-prone step of merging the solutions together at the end.
  2. Just as a model that memorizes its training set rarely generalizes well, a patchwork of solutions is less likely to generalize to unforeseen problems than a framework that solves many problems simultaneously with a few moving parts.
  3. Simple approaches to friendliness will likely be faster to implement, which is useful in the event of an arms race.

However, a weakness of the ARLEA framework is that it’s tightly integrated with the MDP formalism, and may need to be reworked if other formalisms prove more useful for solving AI problems.

Another difference between our approaches is that the MIRI folks place a greater emphasis on manual feature engineering. For example, in response to problem 4, the MIRI folks suggest using a low-dimensional state representation to prevent drastic state changes that might correspond to the wireheading of sensory data.

I’m not optimistic about manual feature engineering. Developments in deep learning suggest that manual feature engineering is likely to be made obsolete by future advances. Also, manual feature engineering intuitively seems unlikely to scale. I’m much more optimistic about using an AI system’s intelligence to understand what we mean. As our AI techniques get more powerful, I’d like to make our supervision more powerful too.

Challenges for ARLEA

Enough upsides. What difficulties face the approach I’ve outlined? Here are a couple of big ones:

1. ARLEA’s scorekeeper needs to be well-calibrated. Current deep learning systems aren’t good at calibration.

This is a big limitation. I have a bunch of ideas for addressing this limitation, mostly by trying to find the simplest possible model for unstructured data. If Gary Marcus is right, I expect methods for finding simple models in unstructured data will also address a lot of deep learning’s current limitations. So working on this problem may speed the development of AGI as well as FAI. I’d like to hear what others think about this.

2. In the technical description section, I assume time stops when the human overseer watches a movie. This assumption is obviously false.

I’m working on another essay that adapts the ARLEA framework to a concurrent world.

These two are big, obvious issues. But there are many other possible weaknesses of this proposal. I’ve been exploring some angles of attack, but I’m curious to hear what others think, in case they see something I’m missing.