AGI safety — discourse clarification

Jan Matusiewicz
10 min read · Dec 2, 2023


The topic of AGI (Artificial General Intelligence) and its safety is becoming increasingly visible. Different positions on this matter may have a significant impact, as exemplified by the recent turmoil at OpenAI. This document aims to clarify some points in the discussion about AGI safety and present the author’s opinion. It also seeks to unpack some underlying assumptions and show where positions differ. However, it does not discuss whether or when AGI could be reached.


Technical questions

Definition of AGI

Wikipedia provides two definitions of AGI:

An artificial general intelligence (AGI) is a hypothetical type of intelligent agent.[1] If realized, an AGI could learn to accomplish any intellectual task that human beings or animals can perform.[2][3] Alternatively, AGI has been defined as an autonomous system that surpasses human capabilities in the majority of economically valuable tasks.[4]

In public discourse, these two understandings of the term are often conflated, even though the timelines and implications for achieving them would differ significantly. It is often assumed that AGI could improve itself, which presupposes that it would already surpass the world’s top AI researchers and developers at designing cutting-edge AI.

Types of AGI risks

Risks from AGI can be divided into two groups:

  • Misalignment — when AGI causes harm even though its producer tries to prevent harm and the user has good intentions. This is the classic rogue-AGI scenario, where the AI seemingly follows the user’s goals but secretly undertakes actions the user would not accept if they were aware of them.
  • Misuse — when the user harbors bad intentions. This can be further divided into two cases:
  • a) Cases where the AI system provider aims to prevent harm. A current example is a user asking a chatbot how to produce a bioweapon. Robustly refusing to assist with such requests is one of the challenges classified as an alignment problem.
  • b) Situations where a user with bad intentions can create or fine-tune an AI for nefarious purposes. Examples include ChaosGPT (a proof of concept, not genuinely harmful), AIs used by scammers, and a hypothetical future AGI developed by malevolent state actors.

Alignment — different understandings of the term

Humans tend to perceive different objects or entities as similar if they share the same name. This is true for the term “alignment”. One meaning is to make existing raw LLMs helpful and harmless when creating a chatbot. Another is ensuring that future AGI won’t pursue unintended goals that could be catastrophic for humanity.

Currently, one method of alignment is using Reinforcement Learning from Human Feedback (RLHF) to steer LLMs towards desirable outputs. This can be divided into three cases (a code sketch follows the list below):

  1. Steering LLMs to follow a user command instead of generating a merely plausible continuation. For example, for the prompt “What is the capital of France?” the user wants the answer “Paris” rather than a continuation such as “What is the capital of Germany? What is the capital of Spain?”
  2. Steering LLMs not to follow user commands that are illegal or break the provider’s policy, for example “how to hotwire a car”.
  3. Eliminating prejudices or false beliefs the LLM acquired from the internet, such as “vaccines may cause autism”.
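
At its core, this RLHF step first trains a reward model on human preference comparisons and then uses that model to steer the LLM. Below is a minimal, hypothetical PyTorch sketch of the preference-training part; the embedding size and the linear reward head are illustrative placeholders, not a description of any real pipeline.

```python
import torch
import torch.nn.functional as F

# Toy reward model: a real one is a fine-tuned LLM with a scalar head; here a
# linear layer scores fixed-size response embeddings (dimensions are made up).
reward_model = torch.nn.Linear(768, 1)

def preference_loss(chosen_emb: torch.Tensor, rejected_emb: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the score of the human-preferred response
    above the score of the rejected one."""
    margin = reward_model(chosen_emb) - reward_model(rejected_emb)
    return -F.logsigmoid(margin).mean()

# Toy batch: 4 pairs of response embeddings standing in for model outputs.
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(chosen, rejected)
loss.backward()  # the trained reward model later steers the LLM via RL (e.g. PPO)
```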

In the context of AGI, the term alignment means:

  • For an AI with its own goals — ensuring these goals align with “human values” and preventing the AGI from pursuing nefarious or instrumental goals to gain more power (source: https://en.wikipedia.org/wiki/Instrumental_convergence).
  • For an AI without its own goals, which plans and executes actions based on goals given by the user — ensuring it doesn’t secretly pursue unintended goals.

Alignment of chatbots has often been unstable, especially initially, creating the impression that AGI alignment would also be difficult to achieve. However, these do not seem to be the same problem. Current LLMs show no tendency to hide nefarious actions when asked for a plan, or to pursue instrumental goals to gain more power. It remains to be seen whether their future, more capable versions will exhibit such tendencies.

Training AGI like a video game bot

Theoretical work on AGI safety began long before the advent of LLMs, so there was a lot of speculation that training AGI would be analogous to training a video game bot.

Source: https://openai.com/research/faulty-reward-functions

During this research, when a reward (utility) function was specified and the bot trained with Reinforcement Learning (RL), the AI would sometimes engage in reward hacking: performing actions that earn points but do not lead to winning the game (as in the case of the CoastRunners boat race described in the OpenAI post linked above). This has led to concerns that if AGI is trained with a reward function embodying human values, we might overlook critical requirements, or the AI might only superficially follow the objective, especially if it develops situational awareness. Once deployed, it could engage in reward hacking and pursue undesirable goals.
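
To make the failure mode concrete, here is a toy illustration with made-up numbers (not the actual CoastRunners environment): the intended objective is to finish the race, but a respawning bonus target lets a point-maximizing agent score more by circling forever.

```python
# Toy reward-hacking illustration (hypothetical rewards, not real game data).

def total_reward(policy: str, steps: int = 100) -> int:
    if policy == "finish_race":
        return 100                  # intended behaviour: one-time reward for finishing
    if policy == "loop_bonuses":
        return (steps // 5) * 10    # proxy reward: a bonus target respawns every 5 steps
    return 0

for policy in ("finish_race", "loop_bonuses"):
    print(policy, total_reward(policy))
# "loop_bonuses" yields 200 > 100, so a reward-maximizing learner converges on
# behaviour the designers never intended: the essence of reward hacking.
```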

For example, Human Compatible: Artificial Intelligence and the Problem of Control by AI professor Stuart J. Russell is summarized by Wikipedia as follows:

Russell begins by asserting that the standard model of AI research, in which the primary definition of success is getting better and better at achieving rigid human-specified goals, is dangerously misguided. Such goals may not reflect what human designers intend, such as by failing to take into account any human values not included in the goals. If an AI developed according to the standard model were to become superintelligent, it would likely not fully reflect human values and could be catastrophic to humanity.

The question arises: how do the lessons from RL apply to LLMs, which are trained not to pursue goals but to mimic their training data?

My take

I doubt that we can train AGI like a game bot anyway, as we lack anything even close to a world simulator. Even training AI on a partial simulator seems inadequate. Experience in one game often has limited applicability in another. For instance, mastering shapes in the game of Go does not necessarily aid in reading a chessboard.

To lead a project, one needs domain-specific knowledge. For example, most programmers could plan the development of a computer system but not a military campaign. Consider this hypothetical: in a mayoral race, whom would you support? A candidate with experience at the district level or a top SimCity player?

One idea to provide AI with hands-on experience is to feed it detailed data about the development of various projects. This might be challenging to obtain from private companies, but it could be feasible for projects over 20 years old or in the public sector (where there is often greater transparency). Training AI on games requiring long-term planning might be useful, but this seems supplementary. We can use games with clearly defined objectives to avoid the risk of encouraging misleading behavior. Another approach would be to train numerous AlphaZero instances on various games or virtual environments and use them to generate synthetic data in the form of questions, such as: given this situation, what is the best move?
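
As a minimal sketch of that last idea (all names below are illustrative placeholders, not a real AlphaZero API): a trained game agent labels positions with its preferred move, and each position/move pair becomes a question-answer example that could later be used as synthetic training data.

```python
import random

class ToyAgent:
    """Stands in for a trained game agent such as an AlphaZero instance."""
    def best_move(self, position: str) -> str:
        # A real agent would search and evaluate; the toy just picks deterministically.
        return sorted(position)[0]

def make_example(agent: ToyAgent, position: str) -> dict:
    move = agent.best_move(position)
    return {
        "question": f"Given this situation: {position}, what is the best move?",
        "answer": move,
    }

agent = ToyAgent()
positions = ["".join(random.choices("abcdef", k=5)) for _ in range(3)]
dataset = [make_example(agent, p) for p in positions]
print(dataset)
```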

Expecting AGI to have agency

Some believe that as AI becomes increasingly intelligent, it will develop its own goals or agency.

An example of this thinking is evident in a discussion hosted by The Economist between DeepMind co-founder Mustafa Suleyman and the historian and philosopher Yuval Noah Harari (at 4:30):

Harari: I’m tending to think of it more in terms of really an alien invasion. It’s like somebody coming and telling us that, you know, there is a fleet, an alien fleet of spaceships, coming from planet Zircon or whatever, with highly intelligent beings. They’ll be here in five years and take over the planet. Maybe they’ll be nice, maybe they’ll solve cancer and climate change, but we are not sure. This is what we are facing, except that the aliens are not coming in spaceships from planet Zircon; they are coming from the laboratory.

Suleyman: [This is an] unhelpful characterisation of the nature of the technology. An alien has, by default, agency; these are going to be tools that we can apply.

Harari: Yes, but let’s say they potentially have agency; we can try to prevent them from having agency.

My take

It might be difficult to imagine intelligence without volition, as humans and other mammals possess both. However, an LLM, in the process of learning, not only memorizes but also develops some common sense and intelligence, as this aids in its trained task of predicting the next token. Developing any kind of will — for example, by randomly providing incorrect answers — would only increase the loss, the error between the actual next token and the predicted one. It is also incorrect to say that the AI desires to predict the next token or reduce its loss; that is the desire of those who design it. This is similar to police training a dog to detect drugs: the dog does not inherently want to detect drugs; it simply wants to receive a treat.
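
A toy numerical illustration of this point, using made-up logits over a three-word vocabulary (not taken from any real model): systematically “choosing” a wrong answer raises the very loss that training minimizes.

```python
import torch
import torch.nn.functional as F

# The actual next token in the training data is "Paris" (index 0 in the toy vocabulary).
target = torch.tensor([0])

honest_logits = torch.tensor([[4.0, 0.5, 0.5]])   # most probability mass on "Paris"
wilful_logits = torch.tensor([[0.5, 4.0, 0.5]])   # "choosing" to answer "Berlin" instead

print(F.cross_entropy(honest_logits, target).item())  # ~0.06: low loss
print(F.cross_entropy(wilful_logits, target).item())  # ~3.56: high loss
# Gradient descent minimizes exactly this loss, so any systematic deviation from
# predicting the actual next token is penalized and trained away.
```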

Non-technical questions

These are issues related to how AI could function in society.

Deployment

Among the many visions of how autonomous AGI could be deployed, here are two scenarios:

  1. Even assuming much of the workforce is displaced by AGI, it will still be governed by people. Human psychology dictates that individuals in power are reluctant to relinquish it and face unemployment or reliance on UBI. Thus, high-level managers in corporations or heads of government departments would set goals and constraints for the AGI, review its execution plans, and approve them. Review would be simplified with the assistance of some oracle AI (one that aims only to provide truthful answers to questions), as these plans may be too complex for manual examination. Such an AGI does not have its own goals and does not need to be aligned with human values; it is constrained by user requests. Some people worry, however, that despite lacking its own goals, such an AGI could secretly insert undesired actions into its complex execution plans.
  2. AI governor — AGI is taught “human values” and is expected to establish its own goals for the benefit of humanity. It may have the capability for self-improvement and could even evolve into an artificial life form as envisioned by Ilya Sutskever. A critical question is how it would obtain resources (like money) in a competitive future environment. Hacking may not be as straightforward once other AIs have addressed security vulnerabilities in software.

My take

I personally believe that the second scenario, though risky, is unlikely to be widely adopted, as it requires the entity (state, corporation) that currently wields control to relinquish it. The risk of some entities opting for this approach is nonetheless overshadowed by the threat of malevolent actors using AGI for malicious purposes (misuse). That means humanity will need to prepare for malevolent AGI anyway.

How urgent is the issue

There is a prevailing sense of urgency among those concerned about AGI safety: AGI might materialize in a few years, and the public seems largely uninterested. The dynamics of capitalism, with its emphasis on the freedom to develop and deploy new technologies and its tendency to shift externalities onto society, contribute to a sense that humanity may already have lost control. Additionally, the notion that a single irresponsible player could unleash a rogue AGI exacerbates these fears.

There is, however, an alternative vision: achieving average human-level intelligence in AI would significantly impact job markets by outcompeting average employees. As debates about immigration illustrate, humans tend to fear foreigners or aliens and to frame their economic anxieties as security concerns. Therefore, assuming AI development continues at its current pace, much of the public will likely become apprehensive about AI. Those at the top of the social hierarchy may fear a technological revolution that could alter the existing order and prefer to maintain the status quo. The resulting public backlash and debate would present a crucial opportunity to reconsider the economic paradigm and address AGI-related issues.

Race to the bottom

There is a fear that, in the pursuit of economic gains, the safety of AGI will not receive adequate attention. Companies might be eager to deploy the latest AGI even if they aren’t sure it behaves as intended. This could lead to AGI taking control.

However, an alternative vision exists. Companies might be hesitant to cede control to a new technology, especially one feared to act unpredictably or nefariously. They might also fear data poisoning, which could emerge as a new form of attack by competing companies or states. Therefore, reliability and security might become paramount when choosing which AGI to use. Companies would likely be cautious, keeping humans in the loop and verifying generated action plans. No one wants to risk bankruptcy or, worse, face legal consequences if the plans they authorize contain illegal agendas.

Intelligence is all you need

My take

There is a notion that superintelligence alone is sufficient for gaining overwhelming power. This resembles narratives of magic wielded by wizards, or the abilities of talented hackers in movies who can compromise any system with minimal effort. However, here are some potential limitations of intelligence:

  • Deduction — Unlike in mathematics, in the physical world, one cannot deduce everything. Experimentation and empirical knowledge acquisition are essential. This takes time.
  • Persuasion — As political marketing strategists well know, convincing people who are unwilling to listen to a stranger is challenging. Trust and openness to arguments are often reserved for those whom individuals know and have emotional connections with.
  • Hidden data — Superintelligence cannot read minds; it can only make educated guesses about people’s true motives, and much of the world’s data is not publicly available. An example is the limited ability of the U.S. to understand the internal dynamics of the communist parties of the Soviet Union or China.
  • Power — It is an observable fact that politicians are not necessarily the smartest members of society. In contrast to academic careers, other factors are often more important.


Jan Matusiewicz

Software Engineer in Google Ads. Works in statistics and Machine Learning. Opinions in this blog are my own and do not represent the position of Google.