Beyond Stochastic Parrots 🦜? Understanding Large Language Models
This article introduces the debate emerging from two papers: Emily M. Bender et al.’s ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’ and Steven T. Piantadosi and Felix Hill’s ‘Meaning without reference in large language models’.
There are two opposing papers on ‘meaning’ in large language models. The first, published in 2021, is Emily M. Bender et al.’s ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’. The inclusion of the parrot emoji is a nice touch, visually encoding a piece of ‘redundant’ information (which in turn ‘uselessly’ enters the worldwide database). The parrot thus performs the problem at the heart of Bender et al.’s argument: a deep concern about the unsustainable scale of large language models, and in particular about the computational energy required to achieve current results. They seek to provide ‘recommendations including weighing the environmental and financial costs’. Of specific pertinence here is their underlying criticism of the value of large language models, which they argue are merely ‘stochastic parrots’.
Stochastic gradient descent is a common optimisation algorithm used to train machine learning models: rather than evaluating every training example at once, it iteratively adjusts the model’s parameters using small random samples of the data, stepping gradually towards a probable best fit. In high-dimensional problems this sampling keeps the computational demands manageable. Nonetheless, the training of these models requires massive amounts of computational power. GPT-3, for example, is housed in a complex in Iowa, where 285,000 CPU cores are linked together to form a supercomputer, powered by solar arrays and cooled by industrial fans. These machines never stop making calculations. Despite the power of this technology, Bender et al. consider large language models merely to be ‘parroting’ language; simply repeating all of the things already said, much as a parrot copies what it hears.
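To make the idea concrete, here is a minimal sketch of stochastic gradient descent on a toy problem (the data, learning rate, and single-weight model are invented purely for illustration; real language-model training applies the same principle to billions of parameters):

```python
import random

# A minimal sketch of stochastic gradient descent (illustrative values only):
# fit a single weight w so that y ≈ w * x, updating on one randomly chosen
# example per step rather than on the whole dataset at once.
data = [(x, 3.0 * x) for x in range(1, 11)]   # toy data; the "true" weight is 3.0
w = 0.0        # initial guess
lr = 1e-3      # learning rate

for step in range(10_000):
    x, y = random.choice(data)    # a random (stochastic) sample
    error = w * x - y             # prediction error on this sample
    gradient = 2 * error * x      # derivative of the squared error w.r.t. w
    w -= lr * gradient            # step "downhill" against the gradient

print(round(w, 3))                # approaches 3.0
```

Each step looks at only one sample, yet over many iterations the weight converges towards the best fit; scaled up, this is what consumes the computational energy Bender et al. are concerned about.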
By contrast, Steven T. Piantadosi and Felix Hill’s ‘Meaning without reference in large language models’, published just over a year later, responds directly to skepticism that large language models possess human-like concepts or meanings. They argue that models ‘likely capture important aspects of meaning, and moreover work in a way that approximates a compelling account of human cognition in which meaning arises from conceptual role’. Taking their cue from Wittgenstein, Piantadosi and Hill consider how conceptual meaning in large language models, while not derived from direct reference, can emerge through internal reasoning, due to the way concepts in language usage ‘relate to each other’. Their point is that conceptual meaning is defined by the relationships between internal representational states. As such, they note, ‘meaning cannot be determined from a model’s architecture, training data, or objective function, but only by examination of how its internal states relate to each other’. It is for this reason, they contend, that large language models have proven successful. NB. They published their paper before the release of ChatGPT, which layers the large language model GPT-3 with new contextual learning (based on Reinforcement Learning from Human Feedback), and which would seem to further underline Piantadosi and Hill’s argument.
The debate that emerges from reading these two papers is neatly captured in a couple of TikTok videos: first from @professorcasey, and then, in response, from @syntheticzero:
In Words and Rules, Steven Pinker describes the ‘staggering power of a combinatorial system’ of language. Despite the billions or even trillions of parameters applied in any given large language model, it might be argued these will never be enough to chase after the endless, generative property of language and discourse. Nonetheless, it is worth noting the finite mathematics involved. Pinker evokes Jorge Luis Borges’s story ‘The Library of Babel’ to flesh out the problem. As the story goes, ‘somewhere in the library is a book that contains the true history of the future (including the story of your death), a book of prophecy that vindicates the acts of every man in the universe, and a book containing the clarification of the mysteries of humanity’. Of course, even after the human species is made extinct, the library (and its combinatorial possibilities) remains. Yet, technically, Pinker explains:
Borges needn’t have described the library as “infinite.” At eighty characters a line, forty lines a page, and 410 pages a book, the number of books is around 10^1,800,000, or 1 followed by 1.8 million zeroes. That is, to be sure, a very large number — there are only 10^70 particles in the visible universe — but it is a finite number. (Steven Pinker, Words and Rules)
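As a rough check on Pinker’s arithmetic, the following sketch recomputes the figure, assuming Borges’s twenty-five orthographic symbols (an assumption drawn from the story itself rather than from Pinker’s passage):

```python
import math

# Rough check of Pinker's figure, assuming Borges's twenty-five orthographic
# symbols (twenty-two letters, the comma, the period, and the space).
chars_per_book = 80 * 40 * 410               # characters/line * lines/page * pages
symbols = 25

# Number of distinct books = symbols ** chars_per_book; its order of magnitude:
zeroes = chars_per_book * math.log10(symbols)
print(f"{chars_per_book:,} characters per book")     # 1,312,000
print(f"roughly 10^{zeroes:,.0f} possible books")    # about 10^1,834,097
```

The result, a 1 followed by roughly 1.8 million zeroes, agrees with Pinker’s figure: unimaginably vast, yet finite.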
The debate that emerges from these numbers in respect of large language models is whether the statistical analysis of language’s vast archive in any way constitutes ‘intelligence’; whether the latest, highly plausible and competent text-generation software (such as GPT-3) is offering new utterances, even potentially new ideas (see Steven Johnson’s ‘A.I. Is Mastering Language. Should We Trust What It Says?’). Given the numbers involved, what can be argued for, in the context of big data, is a kind of massive structuralist method (albeit with a distinction: the analytics of twentieth-century structuralism were pursued as an end in themselves, while the same analytics are more the means for AI-based fabrications). At stake is the potential meeting point of the finitude (albeit of a massive scale) outlined by Pinker and an unprecedented scaling of computing to match.
Structuralism in its broad sense (prominent as an intellectual paradigm in the 1950s–1970s, revealed through its interests in cybernetics, mathematics, and the biological program) can be argued to prefigure the contemporary conditions and methods of AI. Such an account is further presaged in much post-structuralist writing. In a well-known essay, ‘Death of the Author’, Roland Barthes describes a text not as ‘a line of words releasing single “theological” meaning (the “message” of the Author-God)’, but rather, as previously cited, as ‘a multi-dimensional space’. He goes on to say:
The text is a tissue of quotations drawn from the innumerable centres of culture. Similar to Bouvard and Pécuchet, those eternal copyists, at once sublime and comic and whose profound ridiculousness indicates precisely the truth of writing, the writer can only imitate a gesture that is always anterior, never original. His only power is to mix writings, to counter the ones with the others, in such a way as never to rest on any one of them. Did he wish to express himself, he ought at least to know that the inner ‘thing’ he thinks to ‘translate’ is itself only a ready-formed dictionary, its words only explainable through other words, and so on indefinitely… (Barthes, ‘Death of the Author’)
There is a structuralist argument to be made that all language is stochastic parroting; a constant process of simulating and dissimulating, of imitating or approximating as a means to carry meaning. Take one example in which GPT-3 was asked to compose a surrealist fiction (specifically, to ‘write a story about a poodle that becomes an expert billiards player’). The result contains a subtle, yet striking detail:
One day, Lulu [the poodle] overheard her owners talking about how they were going to have to get rid of their pool table because they never used it. Lulu knew this was her chance to prove herself. She jumped onto the table and started playing. She was terrible at first, but she kept practicing and soon became an expert player. (cited in ‘A.I. Is Mastering Language. Should We Trust What It Says?’)
The phrase ‘jumped onto’ is easy to miss, not least because it is highly plausible. It appears to recognise practical details (i.e. that a poodle, being a small dog, would need to be on the billiards table rather than standing against it). This is suggestive of an intelligent reading of the situation. The counter-argument is that if a computer tracks through enough examples (many more than would be humanly possible to read), its statistical training would be sufficient to reproduce, with high probability, this plausible line of the dog jumping onto the table (even though it contravenes the rules of pool!). It is an example of how we can read ‘intelligence’ into AI: a seemingly innocuous line is picked up as if it were evidence of human understanding. Yet, as Bender et al. would argue, it is only a form of confirmation bias. Equally, however, the example of the dog jumping onto the table could be taken as a positive example of what Piantadosi and Hill argue in their paper, ‘Meaning without reference in large language models’: that conceptual meaning in large language models need not be derived from direct reference but can instead arise from internal reasoning, i.e. that concepts in language can be formed relationally, by internal representational states.
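The ‘plausibility’ at issue can at least be made concrete. The sketch below (a hedged example using the small, openly available GPT-2 model via the Hugging Face transformers library as a stand-in for GPT-3, with an invented prompt rather than the article’s actual one) compares the average log-probability a language model assigns to rival continuations:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Compare the average log-probability a small open model assigns to two
# phrasings of the same event. GPT-2 and the prompt below are stand-ins,
# not the model or prompt used in the article's example.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)      # loss = mean negative log-likelihood
    return -out.loss.item()

prompt = "Lulu the poodle saw the pool table. She"
print(avg_logprob(prompt + " jumped onto the table and started playing."))
print(avg_logprob(prompt + " stood against the table and started playing."))
```

A higher score for ‘jumped onto’ would show only that the phrasing is statistically more expected, which is precisely the point of contention: plausibility under a model is not, by itself, understanding.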
John Searle’s (1980) much-debated case of the ‘Chinese Room’ presents the idea of an ‘operator’ (human or machine) working solely on syntax without semantics. His argument, as a way of defining artificial ‘intelligence’, is that it is possible to process symbols and produce seemingly meaningful output (if in possession of a rule-book or algorithm) without any mental understanding. The argument can be carried forward to contemporary large language models. As noted, Bender et al. describe these models as ‘stochastic parrots’, tirelessly randomising and refashioning ‘all’ of history’s human-authored texts. Gary Marcus and Ernest Davis, in their book Rebooting AI, are equally critical, showing how current technologies reveal a deep mismatch between what machines are good at (i.e. classifying, sorting) and real-world human reasoning and understanding. In short, they note, after six decades of AI research, ‘computers are still functionally illiterate’.
Crucially, Searle’s account marks an important difference between ‘intelligence’ as a form of computation and cognition as understanding (a matter of ‘mind’). Gary Marcus, on the subject of GPT-3, suggests the model masks an underlying lack of understanding: ‘There’s fundamentally no “there” there,’ he remarks, describing GPT-3 as only ‘an amazing version of pastiche generation’ (cited in ‘A.I. Is Mastering Language. Should We Trust What It Says?’). Playing AI Dungeon, for example, can certainly give a sense of ‘no “there” there’; none of the teleology that might be expected of classically defined narrative. The fact that current language models really only parse a single paragraph of text at a time means that, as a ‘player’ in AI Dungeon, you can be forgiven for experiencing the ‘narrative’ as a series of seemingly unending oneiric episodes. Again, though, the recent launch of ChatGPT has added notable contextual layers which seem to offer a means to mitigate there being ‘no “there” there’.
Of course, a deeper, existential question is whether humans are equally stochastic; a troubling thought harboured in the earlier structuralist perspective. Again, Barthes can be quoted, in this case from his essay ‘From Work to Text’, which in many ways prefigures the emergence of the World Wide Web:
The intertextual in which every text is held, it itself being the text-between of another text, is not to be confused with some origin of the text: to try to find the ‘sources’, the ‘influences’ of a work, is to fall in with the myth of filiation; the citations which go to make up a text are anonymous, untraceable, and yet already read: they are quotations without inverted commas. (Roland Barthes, ‘From Work to Text’)
The idea of the ‘already read’ (another version of ‘no “there” there’) is embedded in the structuralist pursuit of invariance. Of course, Barthes deploys the notion of the Text as a response to the spectre of a lack of agency within the system of language (a charge that was levelled at structuralism more broadly), giving rise to the ‘role of the reader’. Yet, at a more profound level, when Barthes refers to the Text as plural he is clear that it is not simply made up of several meanings or interpretations, but that:
…it accomplishes the very plural of meanings: an irreducible (and not merely an acceptable) plural. The Text is not a co-existence of meanings but a passage, an overcrossing; thus it answers not to an interpretation, even a liberal one, but to an explosion, a dissemination. The plural of the text depends, that is, not on the ambiguity of its contents but on what might be called the stereographic plurality of its weave of signifiers (etymologically, the text is a tissue, a woven fabric). The reader of the text may be compared to someone at a loose end (someone slackened off from any imaginary)… (Roland Barthes, ‘From Work to Text’)
Much of what Barthes captures in this passage can be heard in the murmurings of contemporary AI text generators such as GPT-3, which effortlessly weave signifiers and throw out the results on command, while humans sit, potentially, at a loose end. Yet the combinatorial nature of language equally remains a measure by which we test these offerings. Arguably, despite the evident gains and advancements we now witness with large language models, there is an urgent need for those involved in AI development to take note of the broader combinatory powers of language (giving many orders of meaning at once); to take note, that is, of what Barthes calls here language’s ‘stereographic plurality’.
This article also appears as ‘Beyond Stochastic Parrots 🦜?’ in Notes on Structuralism (structuralism.ai).
References
Roland Barthes (1982) ‘Death of the Author’, in Image-Music-Text, trans. by Stephen Heath. London: Flamingo, pp. 142–148.
Roland Barthes (1982) ‘From Work to Text’, in Image-Music-Text, trans. by Stephen Heath. London: Flamingo, pp. 155–164.
Emily M. Bender et al. (2021) ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’, FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
Steven Johnson (2022) ‘A.I. Is Mastering Language. Should We Trust What It Says?’, The New York Times Magazine, April 15 2022.
Gary Marcus and Ernest Davis (2019) Rebooting AI: Building Artificial Intelligence We Can Trust. New York: Pantheon Books.
Steven T. Piantadosi and Felix Hill (2022) ‘Meaning without reference in large language models’, arXiv preprint.
Steven Pinker (2015) Words and Rules: The Ingredients of Language. New York: Basic Books.
John Searle (1980) ‘Minds, Brains and Programs’, Behavioral and Brain Sciences, 3: 417–57.