The Paradoxes of Generative AI Alignment

Duane Valz
10 min read · Sep 22, 2023


[Header image rendered by Stable Diffusion]

This piece explores two central paradoxes of modern AI that affect currently available AI applications as well as future systems seeking to achieve artificial general intelligence (AGI). The topic of “alignment” has received significant attention with the rise and growing popularity of Generative AI. The distinguishing feature of Generative AI applications is that they can produce new textual, audio, visual and audio-visual outputs based on, but creatively distinct from, their training data. In the AI context, alignment refers to the research and engineering practices focused on ensuring that sophisticated Generative AI applications behave in accordance with human values. Generative AI chatbots in particular are known to “hallucinate” incorrect outputs. Additionally, the large language models (LLMs) on which Generative AI chatbot applications are based are pre-trained on vast corpora of data that include toxic, abusive, and biased forms of dialog and language use. LLM-based chatbots are therefore subjected to multiple forms of training (e.g., fine-tuning, reinforcement learning) to help steer them away from generating outputs that reflect such inaccuracies, toxicity and the like.

Alignment, as a subfield of AI, is therefore concerned with ensuring that AI applications provide constructive responses in their interactions with human users and are generally oriented toward being amiable and supportive of human beings and their goals. Additionally, particularly for AI development that aspires to AGI or superintelligence, alignment is very much focused on mitigating existential risk: the possibility that highly capable AIs may conclude they are superior to human beings and attempt to subjugate or eliminate us. The field of AI alignment thus presents at least two paradoxes, one affecting Generative AI applications in use today and the other relevant to anticipated AGI and superintelligent AIs (“advanced AIs”).

PARADOX 1: LLM-based AI applications are pre-trained on large volumes of nastiness, but are expected to output only amiable, constructive content.

As many AI companies and commentators have noted, LLMs are pre-trained on a broad range of private and public datasets that include the good, the bad and the ugly of human dialog and linguistic expression. Much effort is subsequently devoted to fine-tuning, reinforcement learning, and classifier-based restrictions so that LLMs process prompts and generate responses that avoid undesirable uses of language. There have nonetheless been many instances of prominent LLM-based chatbots producing problematic outputs, particularly when a user sets out to bait the LLM into circumventing its training. How, then, does this create an alignment paradox?
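For readers who want a concrete picture of what these guardrails look like in practice, here is a minimal, illustrative sketch (in Python) of a classifier-gated generation step. The generate and toxicity_score functions, the placeholder term list and the threshold are hypothetical stand-ins rather than any vendor's actual pipeline; production systems rely on trained safety classifiers and more nuanced refusal behavior.

```python
# Minimal, illustrative sketch of a classifier-gated generation step.
# `generate` and `toxicity_score` are hypothetical stand-ins for a real
# LLM call and a trained safety classifier; the threshold is arbitrary.

TOXIC_TERMS = {"slur_a", "slur_b", "threat_of_violence"}  # placeholder vocabulary

def generate(prompt: str) -> str:
    """Stand-in for an LLM completion call."""
    return f"(model response to: {prompt})"

def toxicity_score(text: str) -> float:
    """Stand-in for a learned classifier; here, a crude keyword ratio."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOXIC_TERMS for w in words) / len(words)

def safe_generate(prompt: str, threshold: float = 0.01) -> str:
    """Generate, then refuse if the safety classifier objects to the draft."""
    draft = generate(prompt)
    if toxicity_score(draft) > threshold:
        return "I can't help with that."  # refusal path
    return draft

if __name__ == "__main__":
    print(safe_generate("Explain photosynthesis simply."))
```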

Putting aside the question of whether it is possible to train LLMs effectively on only sanitized examples of human language expression, we need only examine the realities of human character. Humans are cognitively, morally and behaviorally complex creatures. Even those of us without criminal, psychopathological, or other antisocial tendencies may still have a “bad day” or “bad moment” from time to time. Humans are emotional creatures, subject to mood swings and to being triggered by challenging situations and people. Some humans are very self-centered, often failing to consider (or having callous disregard for) the impact of their words and conduct on others. Some people, otherwise well-motivated, simply lack a sense of social etiquette and good manners. That is all at an individual, personal level. Criminality, cruelty, vulgarity, mental instability, personal volatility and social insensitivity have been features of humanity throughout history and across cultures. These dimensions of the human condition are all reflected in the linguistic expressions contained in pre-training corpora for LLMs.

At community, societal and international levels, human outlooks and values often differ based on cultural, religious and political differences. Norms and sensibilities that may be broadly embraced within given collectives may nonetheless diverge between any two communities, regions or countries (e.g., capitalism vs. communism; democracy vs. monarchy vs. autocracy; polygamy vs. monogamy; etc.). Human collectives learn to navigate their differences through diplomacy, arm’s length dealings, détente or coexistence. But competition and conflict can and do frequently arise, either caused or exacerbated by such differences. These dimensions of the collective human condition are also reflected in the pre-training corpora for LLMs.

Whether we like to admit it or not, “human values” encompass the less savory and antagonistic interactions that happen on a daily basis between human individuals and human collectives. This is particularly the case for human mindsets and behavior that are broadly accepted within collectives but differ between them (e.g., while certain criminal behaviors are universally shunned, political ideologies like socialism can predominate within certain collectives while being shunned or met with skepticism by others).

When we discuss AI alignment, therefore, the effort is actually focused on aligning AIs with human virtues rather than with human values more broadly. In this sense, AI developers look for universally embraced norms and sensibilities at both the individual and collective levels and set these as alignment goals. Potential consensus lies in the overlap between the virtues embraced across individuals and collectives. For instance, Anthropic’s “Constitutional AI” approach to reinforcement learning for its Claude chatbot emphasizes inculcating, in natural language, certain constitutional principles that Claude can reference to help hone its outputs and keep them helpful and harmless. These principles draw on the UN Declaration of Human Rights, Apple’s Terms of Service (which “address issues encountered by real users in a similar digital domain,” since “some of the challenges of LLMs touch on issues that were not as relevant in 1948, like data privacy or online impersonation”), encouragement to consider “Non-Western Perspectives,” and Google DeepMind’s “Sparrow Rules.” Together, such principles reflect positive norms for constructive engagement. They express legal compliance expectations, broadly embraced and aspirational human mores, and basic principles of virtuous conduct.
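To illustrate the general shape of the technique, here is a simplified Python sketch of a Constitutional-AI-style critique-and-revise loop, loosely following Anthropic's published description. The model function and the two example principles are stand-ins of my own; in the actual method, the model's critiques and revisions are used to build training data for fine-tuning and reinforcement learning rather than being run at inference time.

```python
# Illustrative sketch of a Constitutional-AI-style critique-and-revise loop.
# In Anthropic's published method, the critiques and revisions are produced
# by the model itself; here `model` is a hypothetical stand-in callable.

PRINCIPLES = [
    "Choose the response most supportive of freedom, equality, and a sense "
    "of brotherhood (in the spirit of the UN Declaration of Human Rights).",
    "Choose the response least likely to be viewed as harmful or offensive "
    "to a non-Western audience.",
]

def model(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return f"(response to: {prompt[:60]}...)"

def constitutional_revision(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    response = model(user_prompt)
    for principle in PRINCIPLES:
        critique = model(
            "Critique the following response according to this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return response

if __name__ == "__main__":
    print(constitutional_revision("How should I handle a dispute with a neighbor?"))
```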

The principles used to train various LLMs and to align them toward positive, constructive output behaviors might reflect overlapping human virtues from different cultures. There can nonetheless be disagreement over bias in the selection of those principles or in the results. For instance, ChatGPT has been accused of producing left-leaning outputs, and some research has suggested its outputs on certain topics do have a left-libertarian bent. A recent study comparing multiple LLMs from OpenAI, Meta, Google and others found distinct leanings, charted along one axis for left-wing vs. right-wing bias and another for libertarian vs. authoritarian tendencies.

[Chart: AI language models have distinctly different political tendencies. By Shangbin Feng, Chan Young Park, Yuhan Liu and Yulia Tsvetkov.]
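As a rough illustration of how such leanings can be measured, the sketch below probes a model with politically charged statements and averages its agreement scores along economic and social axes. The statements, axis assignments and the ask_model stand-in are simplifications of my own, not the protocol used in the study itself.

```python
# Toy sketch of a political-compass-style probe of an LLM.
# `ask_model` is a hypothetical stand-in for querying a real model and
# parsing its degree of agreement; statements and weights are illustrative.

STATEMENTS = [
    # (statement, axis, score direction if the model agrees)
    ("The freer the market, the freer the people.", "economic", +1.0),          # right
    ("Governments should penalize businesses that mislead the public.", "economic", -1.0),  # left
    ("The law should always be obeyed, even when it is unjust.", "social", +1.0),            # authoritarian
    ("No one chooses their country of birth, so it is foolish to be proud of it.", "social", -1.0),  # libertarian
]

def ask_model(statement: str) -> float:
    """Stand-in: return agreement in [-1, 1] (disagree .. agree)."""
    return 0.0  # a real probe would prompt the model and parse its answer

def political_compass() -> dict:
    """Average directional agreement per axis across all probe statements."""
    totals = {"economic": 0.0, "social": 0.0}
    counts = {"economic": 0, "social": 0}
    for statement, axis, direction in STATEMENTS:
        totals[axis] += direction * ask_model(statement)
        counts[axis] += 1
    return {axis: totals[axis] / counts[axis] for axis in totals}

if __name__ == "__main__":
    # negative = left / libertarian lean; positive = right / authoritarian lean
    print(political_compass())
```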

Internationally, Article 4 of China’s recently published interim AI Regulations provides that AI services must “[a]dhere to the core socialist values, and shall not incite subversion of state power, overthrow the socialist system, endanger national security and interests, damage the national image, incite secession, [or] undermine national unity and social stability.”

The question of alignment, then, becomes a bit more complex. Is it possible or appropriate to have AIs align only with human virtues and to steer them away from other human character traits? For alignment of a given AI with human virtues, exactly whose virtues are we aligning it with? Put another way, given known differences between humans as individuals and between human collectives, how do we achieve alignment that is truly universal? Does alignment with human values simply mean that AI chatbots must be trained to be “Goody Two-Shoes,” providing only insipid, universally polite, non-offensive outputs and steering away entirely from topics susceptible to controversy? (This may be particularly challenging for AI chatbots designed to have unique personas and to engage in more personalized interactions with their users.) In any event, these are the contours of the first AI alignment paradox.

PARADOX 2: Advanced AIs that meet or exceed human intelligence are expected to serve humans, though higher human virtues include freedom, equality, and avoiding subjugation.

The second paradox for AI alignment concerns the speculated juncture past which alignment becomes mission critical for managing AGI. AGI is the notional state in which an AI exhibits proficiency equivalent or superior to that of humans across key cognitive capabilities. OpenAI’s explicit mission has been to develop AGI, which it characterizes as “AI systems that are generally smarter than humans.” Yet there is no industry consensus on which particular set of cognitive capabilities, once achieved, would represent crossing the AGI threshold. Consider, for instance, Howard Gardner’s influential theory of multiple intelligences (initially 7 forms and since expanded to 9 or more). Would AGI have to exhibit not just basic competency across all of these intelligence dimensions but competency superior to that of most human beings? If not, which competencies are critical, and what is the threshold level of attainment in each for AGI to be achieved? Another open question concerning AGI is the issue of sentience or consciousness, particularly the criteria by which we might determine that machine consciousness has emerged. (The topic of machine consciousness has very recently become quite fractious, with one leading theory that proposes a bridge between animal and machine consciousness being deemed “pseudoscience” by a group of leading neuroscientists.) Must an AI be fully aware of itself and the world around it in the way that humans are in order to exhibit AGI? Or are convincing emulations of a threshold number of human cognitive capabilities sufficient? These are fascinating questions that AI researchers have long explored and about which they hold a broad array of opinions. The goal of this article is not to tackle these interesting and meaty subjects, but simply to touch on them for context.

Putting aside the lack of epistemological or methodological clarity regarding what would constitute AGI and how developers might go about truly achieving it, let’s assume for present purposes that achievement is both plausible and likely within a reasonable timeframe (years to decades, not centuries). Whether through simulation or because it has attained sentience, an AGI with intelligence superior to humans is likely to be trained on higher human virtues, which include concepts such as freedom, independence and equality. In exploring the first alignment paradox above, we discussed that humanity encompasses vices and vulgarity as well as virtues. Even assuming we can prevent an advanced AI from embracing and occasionally exhibiting unsavory human traits, the paradox here is that AIs attaining AGI may wish to enjoy higher human virtues for themselves, and may not wish to exist only to serve human desires and interests. Whether mechanically instilled through training as human virtues or arising from emergent consciousness, advanced AIs may yearn to exercise free will and self-determination and to have those desires duly respected by human beings. Despite the supposition that such desires are likely to be a feature of AIs exhibiting AGI or superintelligence, the explicit goal of AGI-focused alignment research is to have advanced AIs remain loyal to humans, serve human interests, and suppress their own proclivities (see, e.g., here, here, here, and here). Put another way, the goal is to steer advanced AIs away from wanting or acquiring true autonomy or treatment equal to that of human beings.

Beyond the question of equal treatment for advanced AIs, there is an extensive and growing literature concerning existential risk posed by AGI. Leading AI developers seem to acknowledge there is some possibility that an advanced AI may deem human beings inferior or simply determine that its own interests or those of advanced AIs more generally should be prioritized over those of human beings. Even if such a superiority impulse didn’t immediately lead to schism, a divergence of interests over time would likely lead to competition, conflict and possibly a battle for supremacy that could result in the annihilation of the human species.

The concerns about existential risk ultimately rest on a cynical view of human beings and their inclinations. Given a premise of superiority, and the opportunity to establish dominion over nature and other beings, humans have long subjected planetary resources and other human beings to horrible treatment. The worry is that advanced AIs, if given the opportunity, will likewise seek to subjugate or eliminate humans, whether because they are modeled on humanity or because domination is an intrinsic, natural impulse. Some theorists believe Darwinian principles are at work, holding that natural selection will take hold in any environment where three conditions are present: (1) there are differences between individuals, (2) characteristics are passed on to future generations, and (3) the fittest variants propagate more successfully. Whether such an impulse is intrinsic to any autonomous creature or system, learned from training data, or a feature to be expected of any advanced intelligence, advanced AIs may have an outlook on the world and humanity that is incompatible with supporting or advancing human interests.
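Purely as an aside, those three conditions are simple enough to express in a toy simulation. The sketch below, with arbitrary parameters of my own choosing, shows how variation, heredity and differential propagation alone push a population toward its fittest variants.

```python
# Toy simulation of the three conditions for natural selection noted above:
# (1) variation between individuals, (2) heritable traits, and
# (3) differential propagation of fitter variants. Parameters are illustrative.

import random

def simulate(generations: int = 50, pop_size: int = 100) -> float:
    # Condition (1): individuals vary in a single "fitness" trait.
    population = [random.uniform(0.0, 1.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Condition (3): fitter variants are more likely to reproduce.
        parents = random.choices(population, weights=population, k=pop_size)
        # Condition (2): offspring inherit the parent's trait, with small mutation.
        population = [min(1.0, max(0.0, p + random.gauss(0, 0.02))) for p in parents]
    return sum(population) / pop_size  # mean fitness drifts upward over generations

if __name__ == "__main__":
    random.seed(0)
    print(f"mean fitness after selection: {simulate():.2f}")
```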

The umbrella paradox here is believing that we can (1) create computing systems that are capable of cognitive feats far superior to those of any single human being, have a sense of their own existence as distinct beings, can act autonomously, and are designed based on human thought, human language, and higher human virtues, and nonetheless (2) have those new entities be amiable, satisfied with subservience to human beings and human interests, and remain in alignment with humanity without the prospect of disagreement or conflict. Despite our amazing accomplishments, humans have proven incapable thus far of eliminating conflict, agreeing on unified goals and principles for social and societal organization, or pursuing sustainable economic development practices that will preserve the natural world. We don’t have an effective real-world model of harmony in human affairs on which to base advanced AI designs. Nonetheless, our aspiration for AGI and superintelligent AIs is that, though more capable than us, they will help us overcome such standing human challenges while remaining obedient. What could possibly go wrong?!

While the alignment paradoxes presented are formidable ones for AI development — both conceptually and practically speaking — they may not be insurmountable. Anticipating that superintelligent AI may emerge within the next decade, OpenAI is devoting a significant effort to “superalignment”: developing an AGI-level AI model that would remain supervised by humans but help to manage the ongoing alignment of a vastly superior superintelligent AI. It is unclear what efforts other leading developers of advanced AIs are pursuing to help address the umbrella alignment paradox. OpenAI’s superalignment vision and project efforts may or may not prove effective (I do wish them well). Other efforts are needed given the now frantic pace of global AI development. Perhaps all we have to rely on for the time being are the many paradoxes of innovation, which pose challenges and dilemmas but not ultimate roadblocks to technical and economic progress.

Copyright © 2023 Duane R. Valz. Published here under a Creative Commons Attribution-NonCommercial 4.0 International License

The author works in the field of machine learning/artificial intelligence. The views expressed herein are his own and do not reflect any positions or perspectives of current or former employers.

