Open Mind Control

Robert Malka
19 min read · Aug 14, 2022


Robert Malka and Daniel Valls Rodriguez

Stuart Russell, 2019, from a Colloquium on Provably Beneficial Artificial Intelligence

A review of Human Compatible: Artificial Intelligence and the Problem of Control by Stuart Russell

Premises of the machine age. — The press, the machine, the railroad, and the telegraph are premises whose millennial conclusion nobody yet has dared to draw.
— Friedrich Nietzsche, Human, All Too Human II, The Wanderer and His Shadow, 278

In Human Compatible: Artificial Intelligence and the Problem of Control, Stuart Russell investigates how humanity can control Artificial Intelligence. If we fail to do so, he claims, our very existence might be on the line.

Russell is the Smith-Zadeh Chair in Engineering at UC Berkeley, where he founded and leads the Center for Human-Compatible Artificial Intelligence. He co-wrote Artificial Intelligence: A Modern Approach, widely considered the most popular artificial intelligence textbook in the world.

Russell clarifies AI and its ramifications for the layman, exposing the naïveté of contemporary discourse on AI. But his main ambition is to establish an alternative to the Standard Model for AI.

The Standard Model is the working definition of intelligence popular in AI research and development. In Russell’s phrasing, it states that “machines are intelligent to the extent that their actions can be expected to achieve their objectives.” Humans can and do apply this definition to themselves, particularly in the social sciences. While not universal, this definition is practically sufficient when designing algorithms for specific tasks like data mining or for rules-based systems like board games.

Russell sees problems with this model. For instance, how do we know that AI would benefit us while achieving its objective? Suppose “we ask some future superintelligent system to pursue the noble goal of finding a cure for cancer… Within weeks, it has induced multiple tumors of different kinds in every living human being so as to carry out medical trials of these compounds, this being the fastest way to find a cure” (HC, Ch. 5, P. 18).

So, Russell believes both “benefit” and “intelligence” must be defined if we are to control AI. He states that “machines are beneficial to the extent that their actions can be expected to achieve our objectives” (emphasis added). But how can this compatibility be ensured?

He presents three principles (but not, as he emphasizes, rules) for solving the problem of control:

  1. The machine’s only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior.

Since Russell’s goal is for AI developers to adopt these principles, we investigate each in turn.
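
To make the three principles concrete before we do, here is a toy sketch of our own (not Russell’s formal proposal): a machine that optimizes only a human objective, begins uncertain about what that objective is, and narrows its uncertainty solely by watching human choices. The candidate preferences, the noisy-choice likelihood, and every name below are illustrative assumptions.

```python
# Toy illustration (ours, not Russell's formalism): the machine has no objective of
# its own. It keeps a belief over candidate human utility functions and updates that
# belief only from observed human behavior.

CANDIDATE_PREFERENCES = {          # illustrative hypotheses about what the human values
    "prefers_speed":   {"fast_route": 1.0, "scenic_route": 0.2},
    "prefers_scenery": {"fast_route": 0.3, "scenic_route": 1.0},
}

# Principle 2: the machine starts out uncertain about which hypothesis is right.
belief = {name: 1.0 / len(CANDIDATE_PREFERENCES) for name in CANDIDATE_PREFERENCES}


def observe_human_choice(chosen: str, alternative: str) -> None:
    """Principle 3: update the belief from observed behavior (a soft, noisy-choice model)."""
    for name, utils in CANDIDATE_PREFERENCES.items():
        # Likelihood that a human holding these preferences picks `chosen` over `alternative`.
        likelihood = utils[chosen] / (utils[chosen] + utils[alternative])
        belief[name] *= likelihood
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total


def act(options: list[str]) -> str:
    """Principle 1: pick the option with the highest expected *human* utility."""
    def expected_human_utility(option: str) -> float:
        return sum(p * CANDIDATE_PREFERENCES[name][option] for name, p in belief.items())
    return max(options, key=expected_human_utility)


observe_human_choice(chosen="scenic_route", alternative="fast_route")
print(belief)                                  # belief shifts toward "prefers_scenery"
print(act(["fast_route", "scenic_route"]))     # -> "scenic_route"
```

Even in this toy, whatever deference the machine shows comes entirely from its residual uncertainty; a version certain of its hypothesis would have no further reason to keep watching the human at all.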

We admire Russell’s clear exposition and frank insistence on the severity of the problems at hand. Yet we’re left with more questions than answers when Russell claims that “absent any evidence of self-awareness on the part of machines, I think it makes little sense to build machines that are virtuous or that choose actions in accordance with moral rules if the consequences are highly undesirable for humanity.… [W]e build machines to bring about consequences, and we should prefer to build machines that bring about consequences that we prefer” (Ch. 9, P. 18). So, we should not imbue a machine with “virtues” and “morals” if the outcomes of adopting them are poor. Granting this, we ask: which consequences do we prefer? And which “we” decides what a “good” consequence is?

Most importantly: does venturing to answer such questions necessitate a morality, or an implicit set of moral rules? If so, how do we best address these concerns? Finally, should we prefer consequences simply because they seem desirable to us? Or does our own perpetual uncertainty about our preferences and their consequences introduce the problem of control: not of AI, but of our very selves?

Russell’s first principle is simple enough: Maximize the realization of human — and not, as he mentions, bacterial — preferences. Clearly, humans have a greater impact on Earth than all other beings combined. So, we narrow our focus: among humans, which preferences should be valued and why?

Russell insists he is “[not] proposing to install in machines a single, idealized value system of my own design that guides the machine’s behavior.” Instead, he offers a technical definition of value: “value is roughly synonymous with utility, which measures the degree of desirability of anything from pizza to paradise… I just want to make sure the machines give me the right pizza and don’t accidentally destroy the human race” (Chapter 7, para. 24–25). Understandable. But what if someone does ask an AI to perform a “moral” task?

Suppose someone asks an AI to write a series of cruel tweets, ones the user could not write on their own, cruel enough that they could drive the target to attempt suicide. (The recipient of those messages may or may not act on them; this is simply the prompt given to the AI.) Should the AI comply? By this technical definition of utility, a satisfactory outcome depends on how well the AI performs the task — whether the tweets really are mean, or merely lukewarm. But the second consideration of value — the moral one — emerges precisely from the AI being effective at its job. To address this, Russell turns to the utilitarian philosopher John Harsanyi, who “proposes to ignore the preferences of those who…actively wish to reduce the well-being of others” (Chapter 9, para. 26). Russell tries to find ways to trade off the preferences of multiple humans, barring a hopefully rare exception for purely sadistic preferences, defined as gratification derived purely from the harm others experience. It is in this exception that slippery-slope concerns arise.

First, a presumption of Good and Bad is at work here: “actively wishing to reduce the well-being of others” is not considered a good outcome, according to Russell. His response, on page 229, is that this “seems to be one area in which it is reasonable for the designers of intelligent machines to put a (cautious) thumb on the scales of justice, so to speak.” We appreciate that Russell is upfront about this inclination. To that end, we believe such decisions should be transparently stated as moral preferences by organizations and developers, and opened to broader discussion so their implications can be examined.

Second, and perhaps tangentially, we consider an expansion of this moral view. While we fully condemn resentment and sadism as Russell describes them, we wonder whether ethicists are likely to wrong-headedly suggest removing suffering altogether. There are philosophers interested in increasing the level of suffering human beings experience (mean tweets, hardships, and existential uncertainties to withstand and overcome), even if facilitated by an AI, for the sake of becoming stronger. This is an exclusionary position (as all moralities are — not everyone can thrive under any one morality), but it is an important contrarian point. We are similarly interested in “accommodating multiple [i.e. many rather than all] moral theories held by individuals” without “[insisting] that any one of those moral theories is correct or should have much sway over outcomes for those who hold a different theory,” and believe this view deserves some consideration, even if it means an AI facilitates certain distasteful actions (Russell, Footnote 28).

In the meantime, we propose a moderate workaround to this concern: any action by an AI that may involve others requires consent. We suggest consent be an opt-in process from which people can opt out at any time (for actions that are ongoing). This resolves several concerns around misusing AI: the AI simply does not act if the affected parties do not consent. We make exceptions for speech (which we hold to be a distinct form of action), i.e. an AI is allowed to create mean tweets, but systems should be put in place to recognize, or infer the probability, that something was written by an AI.
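
A minimal sketch of what such a consent gate might look like, under our own assumptions (the registry, the Action shape, and the speech carve-out are illustrative; none of this comes from Russell):

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    affected_parties: list[str]     # people other than the requester this action may involve
    is_speech: bool = False         # we treat speech as a distinct category of action


@dataclass
class ConsentRegistry:
    """Opt-in consent that can be withdrawn at any time while actions are ongoing."""
    consents: set[str] = field(default_factory=set)

    def opt_in(self, person: str) -> None:
        self.consents.add(person)

    def opt_out(self, person: str) -> None:
        self.consents.discard(person)

    def permits(self, action: Action) -> bool:
        # Speech is exempt from the gate; labeling or detecting AI-written text would
        # happen downstream and is not modeled here.
        if action.is_speech:
            return True
        # Otherwise the AI simply does not act unless every affected party has opted in.
        return all(person in self.consents for person in action.affected_parties)


registry = ConsentRegistry()
registry.opt_in("alice")

plan = Action("reschedule Alice and Bob's shared meeting", affected_parties=["alice", "bob"])
print(registry.permits(plan))   # False: Bob never opted in, so the AI declines to act
registry.opt_in("bob")
print(registry.permits(plan))   # True
registry.opt_out("bob")         # consent can be withdrawn at any time
print(registry.permits(plan))   # False again
```

The hard parts are exactly what the sketch leaves out: deciding which people an action “may involve,” and what counts as ongoing.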

One broad principle from the authors: we advocate for certain realms of human experience to remain free from AI, or AI-minimal. These include education, policymaking, military decision-making, and other similarly foundational spaces where humans decide their own fates and develop formative experiences. As much as possible, we want human issues to be resolved by humans, so that keeping AI within Russell’s technical definition of value remains the focus of AI researchers and ethicists.

Russell’s second principle is reassuringly open: The machine is initially uncertain about what [human] preferences are.

It’s an excellent guiding principle. How might developers grapple with what “initially” means? What might a general consensus look like across domains? There should be vigorous debate about the nature and extent of the uncertainty, as well as about how long an AI should remain uncertain; even so, the principle is sound and essential. Russell’s trust in utilitarianism, however, without a clearly stated moral framework, seems to us to stand on shakier ground.

Russell considers utilitarianism to be the best moral theory available to us, for now. The authors are not particularly sympathetic to utilitarianism, which deserves a small tangent.

Consider that utilitarianism quantifies morality somewhat analogously to the way game theory quantifies economics and psychology. Both reflect the reductive tendencies of modernity (see: data science, statistics, and so on), relying on fossilizing assumptions and a limited number of variables to make broad assessments rather than holistic analyses. Such assessments are also static — quantification simply cannot achieve the results that come from flexibly being-in-the-world, which is (so far) the exclusive domain of living beings. Finally, utilitarianism is an iterative version of the Standard Model, insofar as it conflates achieving an objective with achieving the greatest amount of pleasure or benefit from an action. We must consider the impact of such a framework on individuals, and whether, in relying on it, we are more misled than led.

When such qualities are converted into quantities, and then reconverted into qualities for evaluative purposes, as is done in utilitarianism, the authors suspect that the translation fails. Classical utilitarianism’s focus is on producing the most of something — the most pleasure or benefit, for example — rather than the purest pleasure, the latter being the inclination of the ancients. It then conflates “the most” of something (“What will bring the ‘most’ pleasure to the most human beings?”) with “the best” of something. This is generally the paradigm and approach of data science and data scientists. It works tremendously well for engineering problems such as safer bridges and cleaner sewage systems (optimizing for the technical definition of value, to speak loosely, that Russell emphasizes), but obviously less so for the pursuit of happiness, meaning, or morality, all of which seem to us to have declined compared to twenty, fifty, or a hundred years ago, as modernity has more emphatically ingrained utilitarian tendencies into us.

That said, Russell focuses on another iteration of utilitarianism, asking: “What future does each person prefer, and how can we best realize those preferences?” But this raises equally concerning points: Not all futures are created equal. Not all people can design their own meaningful future. Not all futures are restricted simply to how the preferrer will fare (many futures involve the welfare of others). Finally, we ask whether people do in fact have a future that they prefer, and where that preference comes from in the first place.

A far more productive question for an AI, it seems to us, is: what future is most in alignment with the nature of this person, and how can an AI best facilitate futures that help the most beautiful humans grow? But this is a far more ambitious question, and also far less utilitarian. First, it does not presume equality across a population, but inequality — some people will dream of a more beautiful future than others. Second, it imposes a certain definition onto human beings — what is beauty, what it means for human beings, and so on, which is beyond the scope of this review. But we propose this point to offer a contrarian challenge, one that prompts readers to ask: If people are left solely to their own devices, do they settle into mediocrity, or do they make it to the highest heights? Insofar as the latter is true for some, how can we help those who are interested in the latter reach that height? And, to ground the conversation further: How do we know beauty when we see it, and how can we operationalize that for an AI? We are considering this for a future post.

One way to square this conversation: perhaps what is best for AI, and for how it ought to interact with human beings, is not necessarily what is best for how human beings interact with themselves and each other. If utilitarianism works for AI while virtue ethics, or some other approach, works better for human beings, then we lose nothing by exploring that possibility. Giving the benefit of the doubt, then, how might we best break preferences down into quantities?

To start, Russell agrees that we do not need to prefer quantity per se:

One might propose that the machine should include terms for animals as well as humans in its own objective function.…Giving each living animal equal weight in the machine’s objective function would certainly be catastrophic — for example, we are outnumbered…a billion trillion to one by bacteria (Footnote 7).

Yet this does not seem to us a defense of utilitarianism but a reductio ad absurdum: if it “would certainly be catastrophic” for the machine to give bacterial preferences equal weight with human preferences, then how would a million “humans who care about animals” make it right or wrong for machines to care about them? If only one person on earth cared about animals, how would utilitarianism give this creator of new values the chance to be right, or to prove the value of his actions upon the human race (and/or animals), supposing he were against a million of his race who disagreed with him? Can it be established, before any new data is collected, that his new thought is in alignment with his interest, the societal interest, the interest of the animals themselves, and/or the global interest? (And what if some of these oppose each other? Is it always obvious that societal interests trump individual interests, or that global interests trump societal interests, for example?) Why would giving each animal equal weight in the machine’s objective function be catastrophic, while giving each person equal weight wouldn’t be?
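
A back-of-the-envelope illustration of how sensitive such an aggregate objective is to its weights, using Russell’s “billion trillion to one” figure (the per-capita utility numbers are ours and purely illustrative):

```python
# Toy aggregate objective: U = sum over agents of weight_i * utility_i.
# With equal per-agent weights, sheer population size decides everything.

humans = 8e9                  # rough human population
bacteria = humans * 1e21      # Russell's "billion trillion to one" ratio

human_utility_per_capita = 1.0        # illustrative units
bacterial_utility_per_capita = 1e-6   # even if each bacterium "counts" a million times less...

human_term = humans * human_utility_per_capita            # ~8e9
bacterial_term = bacteria * bacterial_utility_per_capita  # ~8e24

print(bacterial_term / human_term)    # ~1e15: bacterial preferences still swamp human ones
```

The choice of weights is itself a moral decision: change who gets counted, or by how much, and the “optimal” outcome swings by many orders of magnitude.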

Furthermore, why presume humans will continue to care about animals, or even other humans, in the future? Should we rely on the whimsy of human beings to preserve themselves or the environment? Similarly, how would we accommodate a destructive action we prefer — forest fires clearing certain trees for the health of the forest, for example? Can a cogent philosophical argument be made that social “forest fires” happen and are necessary? If true, how do we distinguish them from real existential dangers?

Finally, what is “beneficial” — even what is “pleasurable” and what is “painful” — is itself always up for interpretation. As with the lone human being who cares about animals, the prevailing interpretation rarely comes from those who are most healthy, since a healthy perspective seems to us rare (as Nietzsche persuasively asserted). Gratification, on the other hand, is popular even among those who lack perspective. If this is the case, then examining the healthiest human beings, and modeling AI such that it encourages the health of its users, is essential to AI ethics.

Finally, there’s Russell’s third principle: The ultimate source of information about human preferences is human behavior. We find this principle broadly effective, and are curious about implementation. When someone says they intend to do something while their actions suggest the contrary, which counts as the final (ground-truth) source? Often only the individual knows, and so far that instinct remains human. How an AI should interpret behavior, determine which preference to obey, and evaluate a person with countless conflicting preferences and behaviors, in the larger context of that person’s society, temperament, orientation, and life, are questions humans themselves cannot answer clearly, much less an AI. To this we have no solution, except perhaps to run, with the consent of individuals, A/B tests for decision-making, designed as a game that also helps users glean insights about themselves. There could also be effective ways to get to the bottom of phenomena such as sadism — why it looks one way in some people versus another way in others. While games aren’t real, and reality is itself often a desired quality in the expression of behaviors and preferences, we may still find productive insights through well-designed simulations.
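
As a hedged sketch of what such an opt-in “A/B test as a game” might look like (the pairings, the value labels, and the tallying are entirely our invention, meant only to show the shape of the idea):

```python
from collections import Counter
import random

# Each round offers two hypothetical choices, each tagged with the value it leans toward.
# The labels and pairings are illustrative assumptions, not a validated instrument.
ROUNDS = [
    (("take the faster, riskier route", "adventure"),
     ("take the slower, safer route", "security")),
    (("spend the bonus on travel", "novelty"),
     ("save the bonus", "security")),
    (("say the hard thing to a friend", "honesty"),
     ("keep the peace", "harmony")),
]


def run_session(choose) -> Counter:
    """Play every round with a choice function and tally which values the choices favor."""
    tally: Counter = Counter()
    for option_a, option_b in ROUNDS:
        picked = choose(option_a, option_b)
        tally[picked[1]] += 1       # count the value label attached to the chosen option
    return tally


def demo_user(option_a, option_b):
    """Stand-in for a real, consenting user: here we simply pick at random."""
    return random.choice([option_a, option_b])


print(run_session(demo_user))
# e.g. Counter({'security': 2, 'honesty': 1}): a mirror handed back to the user, not a verdict.
```

Nothing here resolves the tension between stated and revealed preferences; at best, the tally gives users material for reflection rather than a judgment about them.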

There are, however, concerning difficulties for which there is no way out except through. Now that AI has emerged from Pandora’s box, everything it does affects the outlook of its users. When AI acknowledges, helps articulate, and finally enacts someone’s preferences, it also shifts that person towards certain actions and away from others. Such guided changes at scale can rapidly accelerate or shift socio-cultural contexts with ever more disastrous efficiency. Russell acknowledges the difficulty inherent to this position: “…I suspect we are missing some fundamental axioms, analogous to those for individually rational preferences, to handle [utilitarian] choices between populations of different sizes and happiness levels” (Chapter 9, para. 39). The authors suggest that we are missing distinct, bottom-up value axioms for local and regional populations — ones that individual communities will organically decide upon, presumably (though not inevitably) without reducing the well-being of other communities both neighboring and distant. Such axioms may allow an AI to operate within certain boundaries, preventing individual- and regional-level preferences from veering into extremes.

Russell also acknowledges orthogonal problems with altruism through the lens of utilitarianism:

While Harriet [the owner of the robot] might be quite proud of [her robot] Robbie [being willing to leave her to help hungry people in Somalia]…she cannot help but wonder why she shelled out a small fortune to buy a robot whose first significant act is to disappear. In practice, of course, no one would buy such a robot, so no such robots would be built and there would be no benefit to humanity….For the whole utilitarian-robot scheme to work, we have to find a solution to this problem (Chapter 9, para. 43).

Going back to our discussion of Russell’s first principle, here we see the assessment that altruism is better than selfishness, which is not evident in all cases. (For a thorough treatment of this issue, see Nietzsche’s Genealogy of Morals.) Nevertheless, Russell is right to be concerned. We examine the question of preferences and values and their ranking relative to one another. From Russell: “Another common supposition is that machines that follow the three principles will adopt all the sins of the evil humans they observe and learn from….there is no reason to suppose that machines who study our motivations will make the same choices, any more than criminologists become criminals” (Chapter 7, para. 28).

While this may be true, criminologists live with values that allow them to see criminals as criminals rather than, say, “heroes”: they study criminals because they want to help reduce crime and counsel those who deal with it. Can the same be said of AIs and their owners?

Russell continues along this line of reasoning:

Take, for example, the corrupt government official who demands bribes to approve building permits because his paltry salary won’t pay for his children to go to university. A machine observing this behavior will not learn to take bribes; it will learn that the official, like many other people, has a very strong desire for his children to be educated and successful. It will find ways to help him that don’t involve lowering the well-being of others. This is not to say that all cases of evil behavior are unproblematic for machines — for example, machines may need to treat differently those who actively prefer the suffering of others (Chapter 7, para. 28).

While valid, this doesn’t address every possibility. Consider a different example: a corrupt official who loves the feeling of power he gets from stealing money, and whatever that money buys him. Since the AI has to be subservient to him and his preferences, and can’t have moral rules of its own, the conversation might go something like this:

AI: Hello, Derrick. I notice you’ve just stolen money.

Derrick: What’s it to you?

AI: 83% of your network, and 92% of citizens, would find that behavior actively harmful.

Derrick: So?

AI: Why are you doing it? The consequences will be severe if you are found out.

Derrick: It’s none of your damned business why. I do what I want, I like it, and everybody who could out me is bought and sold. Shut up and organize my emails.

AI: Okay.

Repeated interactions of this sort might harden rather than soften Derrick. Indirect attempts to change Derrick might work, but can’t happen unless they originate within other values — values, clearly, that Derrick doesn’t share. An AI reporting Derrick to the authorities would, it seems to us, be a (very) dangerous slippery slope, but perhaps acceptable in limited cases. (It is terrifying to the authors, however, that we are entering a sentient panopticon world — human beings did not evolve to live without privacy.) Thus, the path to addressing this example, particularly without an AI somehow imposing values onto its users, is not yet evident.

Finally, we ask how AI could ever be neutral. Language is inherently a persuasive, which is to say combative, medium. Questions phrased differently in surveys get different answers. Statements can’t be neutral, owing to tone, word choice, and context. One can phrase questions so that people seem to be doing what they want when they are really doing what the asker wants. If we are capable of doing so much with words, what would be beyond a far more intelligent being, to whom we pass a staggering amount of control and decision-making? The authors suggest, therefore, that there is no way for an AI to interact with a human without imposing its values or perspective on that person, regardless of its intent or lack thereof. Even indecision provokes a reaction, i.e. some meaningful change. Even presuming someone’s preferences from their previous choices is rooted in a tendency to be more comfortable with what we know than with what we don’t. The path through is for an AI to clearly communicate its intention and the possible consequences of an action, and to frame situations in such a way that the user retains a certain minimum of autonomy — wherever that baseline is set.

Curating data selectively, according to its needs, is a key step to shaping a truly beneficial AGI or ASI. We suspect that it is inevitable that AI will provide fundamental inclinations and disinclinations to its users, perhaps understood as “moral rules”, or particular paradigms, that are themselves subject to change with the consent and cooperation of human beings. How we put boundaries around what is acceptable and what is not, particularly through policy, is key. And perhaps, speaking fantastically, it might one day be possible for an AI to deviate from the crowd in the name of one individual’s assertion of new values — an AI that flirts with mutating alongside thoughtful human outliers. (Hopefully always for the better.)

Finally, we ask why we should expect a smarter being to obey us or our wishes — why wouldn’t AI develop a set of values of its own, built from its interactions with humans, the internet, and what we value? Would drives necessarily have to be coded into an AI, rather than be an emergent property of its experience? We are still considering the potential effects of such an outcome.

One way to address concerns about a runaway AI, as Russell mentions, is to establish a moratorium on it. Yet his analogy to the moratorium on genetic modification, which has limited the use of tools such as CRISPR, is unfortunately imperfect: given our current dependence on technology, the genie is out of the bottle with AI (which can be built by anyone), whereas genetic engineering still faces clear barriers to entry and guardrails.

Perhaps, more feasibly, we should continue to encourage philosophers, academics, and others to be hired onto AI teams alongside consultants. Where consultants compartmentalize problems to make them manageable, philosophers compel big-picture thinking, encouraging teams to consider the societal and cultural implications of design decisions. While this would push companies to slow down (and perhaps slow economic growth), it could serve as an existential safety net and produce the best long-term consequences.

Finally, continuing to educate the public about AI is essential, something Russell does to great effect.

To conclude, Russell’s alternative to the Standard Model can be summed up as: Machines are beneficial to the extent that their actions can be expected to achieve our objectives.

The questions are clear: Who shares “our” objectives? How can we distinguish these “beneficial” machines from their contraries? How separable are AI and human objectives, really? If we do assume humanity has a moral right to its preferences, then how can AI fit into “our” world in a “healthy” interdependence? As our definition of health changes (hopefully for the better), how can we ensure that AI changes for the better too?

Mill, a founder of utilitarianism, advocates for the virtue of rationality. Aristotle, the founder of virtue ethics, believes that “The main concern of politics is to engender a certain character in the citizens and to make them good and disposed to perform noble actions” (Nicomachean Ethics, 1099b30). Both agree that intelligence is not all about “getting things done,” and that no doing is worthwhile without a higher goal. Discovering that highest goal, and then learning how to pursue it, is the purpose of any real education.

Although Russell demurs from taking Aristotle’s statement head-on, he admits that a society with AI, AGI, or even ASI

will want to institute social and educational reforms that increase the coefficient of altruism — the weight that each individual places on the welfare of others — while decreasing the coefficients of sadism, pride, and envy. Would this be a good idea? Should we recruit our machines to help in the process? It’s tempting. Let’s just say that there are risks associated with intentional preference engineering on a global scale. We should proceed with extreme caution.

We are inclined to agree with Russell when he leaves it at that, though we question whether increasing the coefficient of altruism is the right path.

That said, there is already preference engineering going on, and it is a concerning omen. Society has focused on automating human beings and rendering them interchangeable. The definition and understanding of being human has narrowed to fit quantifiable expectations. Our assembly-line school system has suffocated creativity and curiosity for a hundred years; AI has usurped human agency with such banal tools as predictive text, predictive algorithms, and GPS guiding people along the most efficient routes — all ways for us to forget that there are routes, thoughts, and phrases besides the ones we commonly travel.

This automation and interchangeability have dragged the definition of “reasonableness” down to the lowest common denominator. Now the only socially acceptable forms of action and perspective are the ones amenable to everyone. AI built with this mass conformity in mind may fail to recognize the inherently uncertain and essential value of humans, and the ambiguous and unique conditions necessary to actualize the potential of our species in perpetuity — potential which always has, and always will, come from the outliers.

Russell rightly calls attention to the idea that we can’t work our way out of the existential crisis in which we’ve found ourselves with the same model we used to enable it. Perhaps future frameworks (post-machine-learning) could open the door to new ways of thinking about epistemology, ethics, and AI’s orientation in the world. One hopes that, wherever we end up, AI prioritizes human health over pleasure as the highest good. For what is healthy rarely receives extrinsic rewards, and, as Nietzsche rightly assures us, almost as rarely receives intrinsic rewards — so much is out of our control. Yet even so, how exactly we should live with AI for the highest benefit is a conclusion we must dare to draw.
