Recap: IEAI Hosts On the Dangers of Stochastic Parrots with Emily M. Bender

Institute for Experiential AI
Jan 3, 2022

by Tyler Wells Lynch

The Institute for Experiential AI welcomed Emily M. Bender, the Howard and Frances Nostrand Endowed Professor of Linguistics at the University of Washington, to speak about the risks associated with large language models in the field of natural language processing. The lecture is part of IEAI’s Distinguished Lecturer Series. Watch the full replay or read on for an event summary.


Language Models (LMs) represent a bridge between the hard world of data and the multi-layered world of human language. They’re used in a variety of contexts, including machine translation, speech recognition, handwriting recognition, and writing assistants. For some experts, LMs reflect the next frontier in Artificial Intelligence, with exciting new Large Language Models (LLMs) capable of composing nuanced essays and linguistic spectacles without any human oversight.

However, as Emily M. Bender points out, these cutting-edge systems also come with some highly consequential flaws. Those flaws were detailed in a 2020 paper (“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”), which her lecture summarized.

The paper was co-authored by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell, with contributing work from Vinodkumar Prabhakaran, Mark Diaz, and Ben Hutchinson. Aside from Bender and McMillan-Major, all researchers were employed at Google when the work was done, but the findings were apparently so inconvenient to Google’s business interests that the company requested the paper be withdrawn or that the names of its employees be removed. After objecting to the request, Timnit Gebru was shortly forced out of Google, stirring a public controversy that helped to elevate the issues raised in the study.

A Brief History of Language Models (LMs)

What is a language model, really? At its core, it’s a system trained to do string prediction. In this context, you can think of a string as a sequence of symbols, such as letters or words, drawn from a fixed inventory like an alphabet or a vocabulary. Given a sequence of words that forms the prefix of a string, the LM’s task may be to predict what comes in the next slot. So the data in this system is just a bunch of text in the language being modeled, and the training objective is to correctly fill in the missing words.
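
To make string prediction concrete, here is a minimal sketch of a toy bigram predictor: it counts which word follows which in a small text and then guesses the most frequent follower of the last word in a prefix. The function names (`train_bigram`, `predict_next`) and the sample corpus are illustrative assumptions, not anything from the lecture, and real LMs use far richer context than a single preceding word.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(model, prefix):
    """Return the most frequent follower of the prefix's last word, or None."""
    words = prefix.lower().split()
    if not words:
        return None
    followers = model.get(words[-1])  # .get avoids adding unseen words
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text; any plain text in the target language would do.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(predict_next(model, "on the"))   # "cat" follows "the" most often here
print(predict_next(model, "a zebra"))  # None: "zebra" never appeared
```

The predictor has no notion of what a cat or a mat is. It only tracks which forms tend to follow which, a point the lecture returns to later.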

While the concept dates back to the early days of computer science, it took decades to implement the most rudimentary forms of language modeling, the earliest successes being automatic speech recognition (ASR) and machine translation (MT) in the early 1980s.

Over time, the size of the training data exploded and the architectures used to represent patterns changed. In the 2010s, neural networks became the dominant architecture. Only within the last couple of years have transformers taken the mantle, preferred for their ability to better contextualize input sequences.

Throughout this period, LM researchers discovered a fundamental ceiling to the scoring metrics used in assessing language models. Generally, more data and bigger models lead to more sophisticated capabilities and, thus, higher scores — but only to a point. When a model hits that ceiling, researchers move on to new architectures, opening up new opportunities for improvement with more data, exploiting those systems until the next ceiling is reached.

As models get bigger, the range of applications expands with them. This is represented most strikingly by parameter counts, which can be understood as the “units” of complexity in an LM. More accurately, parameters are the numeric values (weights) that a model adjusts during training to fit its training data. Parameter counts correlate tightly with the sophistication of the language model. (For example, in a model that tries to predict what TV show you’ll watch tonight, the parameters would be the learned weights that capture how strongly each show you’ve watched in the past points toward each candidate show.)
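
One way to get a feel for parameter counts is to tally the learned weights in a very small network. The sketch below assumes a hypothetical three-part layout (an embedding table, one hidden layer, and an output layer over the vocabulary); the layout, the function name `count_parameters`, and the sizes are illustrative assumptions and do not describe GPT-3 or any other real model.

```python
def count_parameters(vocab_size, embed_dim, hidden_dim):
    """Tally the learned weights in a toy next-word model (hypothetical layout)."""
    embedding = vocab_size * embed_dim             # one vector per vocabulary item
    hidden = embed_dim * hidden_dim + hidden_dim   # weight matrix plus biases
    output = hidden_dim * vocab_size + vocab_size  # weight matrix plus biases
    return embedding + hidden + output


toy_total = count_parameters(vocab_size=1_000, embed_dim=16, hidden_dim=32)
print(f"toy model parameters: {toy_total:,}")
```

Even this toy layout reaches tens of thousands of adjustable values, which gives a sense of how quickly counts grow once vocabularies and layer sizes scale up.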

In short, LMs grew from rather specific applications, based on relatively small training sets and only a few hundred million parameters, into enormous models trained on enormous datasets and used in many different language technologies. For reference, an LM created by OpenAI called GPT-3 boasts 175 billion parameters, and DeepMind recently unveiled another transformer language model with 280 billion parameters.

Risks: Environmental/Financial

Is bigger better? Does size necessarily contextualize data enough to make recommendations and predictions that are ethically sound? To answer that, let’s consider the risks associated with LLMs.

The first is environmental. The average global human is responsible for five tons of CO2 emissions per year. Compare that with the emissions footprint for a typical LM system. A life cycle assessment conducted by the University of Massachusetts determined that a large AI model will emit nearly five times the lifetime emissions of a typical American car.

There are also financial risks. According to Bender’s paper, a 0.1-point increase in BLEU score, a metric used to assess machine translation, costs roughly $150,000 in computing power.

Such huge barriers raise questions about who gets to “play” within this space. Who is getting left out if the main way of doing research is prohibitively expensive and data-intensive?

To the extent that this technology is beneficial, it redounds to those who already have the most in society. In the context of language, that means English and a few other high-resource languages. Meanwhile, marginalized communities around the world are the first to feel the impacts of climate change, and they are also unlikely to enjoy the benefits of LLMs because these systems are not built for, for example, Dhivehi or Sudanese Arabic.

To mitigate these impacts, we can turn to renewable energy, but even sustainable sources can incur environmental costs and their use is far from unlimited. Some researchers have pushed for more computationally efficient hardware and improved documentation of carbon metrics. Others have pointed out that using green energy for computational purposes risks displacing other, potentially more important uses onto non-green sources.

Risks: Unmanageable Training Data

Large datasets are not necessarily diverse, but they are often assumed to be because they are so large. The internet, for example, is a large and diverse place, so it’s easy to think of it as broadly representative of the way people view the world. But there are many factors that narrow online participation and discussion, which by extension narrows the text likely to be contained in a web crawl.

In these cases, we find that it’s the voices of people most likely to hew to hegemonic viewpoints that are the most likely to be retained. To understand why, consider who has access to and contributes to the internet. It’s mostly young people from developed countries — already a narrowed sample set. Content moderation has further been shown to disproportionately silence the voices of marginalized people, especially black women.

The content that gets scraped from the web via crawling methods also favors certain voices. For example, Reddit users are mostly young men. Wikipedia editors are only 8.8–15% women or girls. And websites with fewer incoming and outgoing links, such as blogs, are less likely to be included at all.

The reasoning behind some exclusion procedures makes sense on the surface. For example, many filtering lists target for exclusion those sites that are pornographic or hateful in nature. It’s not necessarily a bad idea to remove toxic content from training data, but those same filters also tend to exclude LGBTQ online spaces where people are talking about lived experiences in positive ways.
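
The over-exclusion problem can be seen in a minimal sketch of keyword-based filtering, assuming a hypothetical blocklist and made-up documents. The placeholder `blockedterm` stands in for any word that appears both in hostile content and in a community's own positive self-description; the names `BLOCKLIST` and `passes_filter` are illustrative and not taken from any real crawl pipeline.

```python
# Hypothetical blocklist with a single placeholder entry.
BLOCKLIST = {"blockedterm"}


def passes_filter(document, blocklist=BLOCKLIST):
    """Keep a document only if none of its words appear on the blocklist."""
    tokens = {tok.strip(".,!?;:\"'()").lower() for tok in document.split()}
    return tokens.isdisjoint(blocklist)


# Made-up documents for illustration.
docs = [
    "A forum post attacking people with the word blockedterm.",
    "A support-group post where members reclaim blockedterm with pride.",
    "A recipe for lentil soup.",
]

kept = [d for d in docs if passes_filter(d)]
print(kept)  # only the recipe survives; both other posts are dropped
```

Because the filter looks only at word forms, it cannot tell the hostile post from the affirming one, so both disappear from the training data.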

We again see that we are losing access to positive portrayals of marginalized identities within these datasets. The result is an overrepresentation of hegemonic viewpoints in LM training data — a trend that gets fortified by the audiences who are included. People who hold hegemonic views are more likely, consciously or not, to use language in ways that are consistent with systems of oppression, and therefore more likely to be included by web crawlers and datasets.

The challenge of overcoming these biases is baked into the design of language models. The social world is by nature fluid and adaptive, responsive to global inputs through the common use of language. But LMs are only ever trained on a snapshot of language up until a particular point. The training data is historical, not contextual. Therefore, LMs run the risk of creating a “value lock,” in which older, less-inclusive understandings are reified.

Adding to the problem, automated de-biasing procedures may themselves be unreliable in predictable ways. For example, automated toxicity detection systems have been shown to mislabel nonstandard varieties of English, African American English in particular, as toxic when they are not. Probing LMs to uncover biases requires knowing the relevant social categories that the LM may be biased against for all the above reasons. Could it be that these technologies simply aren’t up to the task of a global, one-size-fits-all deployment? Can someone working in the U.S., for example, really understand what the relevant social categories are in Myanmar?

Bender argues for greater local input before the deployment of certain LMs. When starting a project, she recommends budgeting for documentation and only collecting as much data as can be documented. The context of data collection allows reviewers to understand sources of bias and to develop potential mitigating strategies.

Risks: Research Trajectories

Benchmarks are important for assessing progress in language modeling, but the culture around them is too often narrow and gamified. Particularly with regard to Natural Language Understanding (NLU), there’s a competitive focus on topping the latest benchmark. But LMs have been shown to excel by exploiting spurious cues in datasets. And what is the value in training enormous models on enormous datasets just to show that they can bulldoze the benchmarks?

Deeper questions reside in the nature of understanding. Because LMs are trained only on linguistic form without access to meaning (as we understand it), they cannot be said to perform natural language understanding, regardless of what their leaderboard scores show. There’s just no ghost in the machine.

Stochastic Parrots

We talk about language as if it inherently contains meaning that can be conveyed from one person to another. But modern linguistic theory posits that language hinges on what Herbert H. Clark calls “joint activity.” This is where interlocutors, each with their own communicative intent, work together to achieve a shared understanding. Listeners try to work out what that understanding is, and language is just one cue among many that communicate intent.

Now contrast that with a Language Model, which is a system for haphazardly stitching together linguistic forms from its vast training data, without any reference to context or meaning. That’s where the term “stochastic parrots” comes from. Parrots mimic sounds, but they don’t understand what they mean. LMs may haphazardly output form, and they’ve gotten pretty good at making forms that look plausible. But it’s still the human being, encountering synthetic text, who makes sense of it. The computer is merely making a pattern that the human then applies meaning to.
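
The “stochastic” part can be illustrated with a minimal sketch: a toy generator that records which word followed which in a small text and then produces output by randomly picking among the followers it has seen. The corpus, the function names (`build_followers`, `parrot`), and the seed are illustrative assumptions; large LMs are vastly more sophisticated, but the sketch shows how plausible-looking form can be produced with no representation of meaning at all.

```python
import random
from collections import defaultdict


def build_followers(text):
    """Record every word that was observed directly after each word."""
    followers = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return followers


def parrot(followers, start, length, seed):
    """Emit up to `length` words by repeatedly sampling a seen follower."""
    rng = random.Random(seed)  # seeded so a given run is reproducible
    out = [start]
    for _ in range(length - 1):
        options = followers.get(out[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(options))
    return " ".join(out)


# Hypothetical training text for illustration.
corpus = "the parrot repeats the phrase and the listener hears the phrase as meaningful"
followers = build_followers(corpus)
sample = parrot(followers, start="the", length=8, seed=0)
print(sample)
```

Every adjacent pair of words in the output was seen in the training text, so the result can look fluent, yet nothing in the program encodes what a parrot or a listener is. Any sense the output makes is supplied by the reader.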

It’s not hard to see how this kind of synthetic speech, when encoded with stereotypes or subtle forms of denigration, can do harm to readers. Bystanders, too, may be unknowingly harmed as their own stereotypes are reinforced through interaction with LMs.

Kris McGuffie and Alex Newhouse have shown how GPT-3 can be used to create synthetic text for extremist message boards and recruiting sites, the effect being to make visitors feel like there are a lot more like-minded people in the community than there really are. Similarly, incorrect translations or biased text can lend the impression of authenticity when filtered through fluent, grammatically sound language.

A host of other harms have been documented, including LMs that can be probed to replicate training data for personally identifiable information (PII). In these scenarios, LMs may not recognize data as PII, but they can nonetheless be manipulated to output it. In all cases, data and queries are rendered less reliable, as specific collections of data are returned with specific incentives that are independent of user or stakeholder interest.

Risk Management Strategies

Bender offers a number of avenues for mitigating these harms. One is to be more cognizant of the way research time is allocated. When planning and evaluating models, organizations should consider energy and compute efficiency while selecting datasets more intentionally. To quote Vinay Uday Prabhu and Abeba Birhane, “Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.”

Organizations also need to document their processes and their motivations. Who are the potential users and stakeholders of an LM, and can the system be designed to support the values of downstream users? A sort of “pre-mortem analysis,” in which developers consider worst-case scenarios and unanticipated causes, would further allow them to work backward in figuring out how problems come about.

Of course, there are also risks to abandoning LLMs. As Bender asks, “What happens if we actually have the effect we wanted in writing this paper?” What if the field backs off and takes a broader view of the kinds of systems we might build, as opposed to rushing headlong into an uncertain, unethical domain?

One area of impact would be speech recognition and auto-captioning. But, as Bender asks, are LLMs the only way to deploy these benefits? Many low-resource languages — those without terabytes of text data — have seen progress in deploying these technologies in leaner ways. That’s thanks to the same energy and time constraints that tend to exclude low-resource languages from LLMs. As Bender says, “I’m optimistic that large language models are not the only way to get these benefits.”

Following her lecture, Bender went on to answer questions about the next steps for language modeling and the alternative research paths that are available. Read the Q&A.

About Emily M. Bender

Emily M. Bender is an American linguist who works on multilingual grammar engineering, technology for endangered language documentation, computational semantics, and methodologies for supporting consideration of the impacts of language technology in NLP research, development, and education. She is the Howard and Frances Nostrand Endowed Professor of Linguistics at the University of Washington. Her work includes the LinGO Grammar Matrix, an open-source starter kit for the development of broad-coverage precision HPSG grammars; data statements for natural language processing, a set of practices for documenting essential information about the characteristics of datasets; and two books which make key linguistic principles accessible to NLP practitioners: Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax (2013) and Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics (2019, with Alex Lascarides).