Why Human Input Matters to Generative AI

Wikipedia is a case study for human/machine interaction. Will that change with AI?

MIT IDE
MIT Initiative on the Digital Economy
Oct 4, 2023



By Irving Wladawsky-Berger

So far, Wikipedia has had a pretty good run. Despite some ups and downs over the past two decades, the English version of the online encyclopedia, the largest and most widely used, contains more than 6.7 million articles and has a base of more than 118,000 active editors. It is the de facto source of information for humans as well as for search engines and digital assistants.

Will that all change with the rise of generative AI? I don’t think so.

“In early 2021, a Wikipedia editor peered into the future and saw what looked like a funnel cloud on the horizon: the rise of GPT-3, a precursor to the new chatbots from OpenAI,” wrote author and journalist Jon Gertner in “Wikipedia’s Moment of Truth,” a recent NY Times Magazine article. “When this editor — a prolific Wikipedian who goes by the handle Barkeep49 on the site — gave the new technology a try, he could see that it was untrustworthy. The bot would readily mix fictional elements (a false name, a false academic citation) into otherwise factual and coherent answers. But he had no doubts about its potential.”

I’ve long been relying on Wikipedia as my go-to website for looking up topics I want to learn more about, as well as using links to Wikipedia articles as references in the weekly blogs I’ve been posting since 2005.

More Than an Encyclopedia

Over the past decade, Wikipedia has become much more than an encyclopedia — “a kind of factual netting that holds the whole digital world together.” Google, Bing, Siri, Alexa, and other search engines and digital assistants often rely on Wikipedia for the information needed to answer users’ questions.

More recently, Wikipedia has been one of the largest sources of the data — currently estimated at around 3% to 5% — for training Large Language Models (LLMs) and related chatbots. Wikipedia has played a major role in the digital world because its large amounts of data are free, easily accessible, high quality, and well-curated.

After trying GPT-3 in early 2021, Barkeep49 wrote an essay with the ominous title “Death of Wikipedia,” in which he speculated about how Wikipedia might lose its place as the preeminent English-language encyclopedia. “For some other encyclopedia to usurp Wikipedia it would have to be able to match some of the advantages we have built up over years. Namely that we have millions of articles and that the articles that people care most about get updated pretty quickly.”

He added that it was unlikely that a successor encyclopedia would have Wikipedia’s commitments to transparency, non-commercial objectives, and the free reuse of its content through its highly permissive license. “Those values have helped our reputation and created tremendous value for readers around the world and I think the world would be a worse place with an encyclopedia which didn’t share those values.”

Is the End Near?

“I don’t think it’s likely that humans could catch up to us. But artificial intelligence could. AI is currently improving exponentially and in January 2021 as I write this, it is already capable of writing some pretty competent text.” While optimistic in the short term, Barkeep49 said that he feared that, longer term, AI could displace Wikipedia and its human editors, just as Wikipedia had supplanted Britannica.

“Within the Wikipedia community, there is a cautious sense of hope that AI, if managed right, will help the organization improve rather than crash,” said the NYT Magazine article. “But even if the editors won in the short term, you had to wonder: Wouldn’t the machines win in the end?”

There have been predictions of the end of Wikipedia since it was founded in January of 2001. In anticipation of Wikipedia’s 20th anniversary, Northeastern University professor Joseph Reagle wrote “The many (reported) deaths of Wikipedia,” a historical essay that explored how its death had been repeatedly predicted over the past two decades, and how Wikipedia found ways to adapt and endure.

Reagle noted that in its early years, Wikipedia’s critics as well as its founders exemplified three ways of thinking about the future. First, they looked to similar encyclopedia projects to get a sense of what was feasible, and learned that even well-funded, established projects, like Microsoft’s Encarta, had failed to create a sustainable online encyclopedia. In addition, they assumed that the very tough first six months of Wikipedia would be the norm for the next seven years.

“The only model people didn’t make use of was exponential growth, which characterized Wikipedia article creation until about 2007,” wrote Reagle. In its first year, Wikipedians hoped to one day have 100,000 articles, which would have been a bit larger than most print encyclopedias. They estimated that if they produced 1000 articles a month, they would get close to the target in seven years.

As it turned out, in September of 2007 English Wikipedia reached two million articles — 20 times the original estimate.

By 2009, it became clear that new article growth in English Wikipedia had slowed or plateaued, and that the balance of activity was increasingly tilting toward experienced editors rather than continuing to attract new ones. The number of active editors fell from a peak of 53,000 in 2007 to around 30,000 in 2014. “Can Wikipedia survive?” asked a 2015 NYT opinion piece, wondering whether Wikipedia’s problems were being compounded by the explosive growth of smartphones, on which editing Wikipedia’s articles is more difficult than on laptops.

“Nonetheless, it appears that the number of active editors has been stable since 2014, never dropping below 29,000, and that this pattern of fast growth and plateau is not unusual for wikis,” wrote Reagle.


“The only prediction that I’d hazard for the next ten years is that Wikipedia will still exist,” he added. “The platform and community have momentum which no alternative will supplant. And by then, the Wikimedia Endowment, started in 2016, should have reached its goal of $100 million toward maintaining its projects ‘in perpetuity.’ The English Wikipedia community will no doubt face challenges and crises as it always has, but I don’t foresee anything so profound that only a husk of unchanging articles remains.”

As of September 2023, English Wikipedia has over 6.7 million articles and over 118,000 active editors — that is, editors who have made one or more edits in the past 30 days.

The Role of Humans

And, according to Gertner’s NYT Magazine article, “Wikipedia now has versions in 334 languages and a total of more than 61 million articles. It consistently ranks among the world’s 10 most-visited websites yet is alone among that select group (whose usual leaders are Google, YouTube and Facebook) in eschewing the profit motive.”

But the most critical value of Wikipedia to generative AI is the fact that its knowledge is created by humans.

“The new AI chatbots have typically swallowed Wikipedia’s corpus,” wrote Gertner. “Embedded deep within their responses to queries is Wikipedia data and Wikipedia text, knowledge that has been compiled over years of painstaking work by human contributors.” After a conference call with a number of Wikipedia editors, he added: “One conclusion from the conference call was clear enough: We want a world in which knowledge is created by humans. But is it already too late for that?”

Ensuring that generative AI systems are trained with content painstakingly created by humans is far more than a case of anti-AI human idealism.

It turns out that without human-generated training data, AI systems will inevitably malfunction.

A research paper published in May 2023 defined this phenomenon in detail and named it model collapse.

I found a simpler explanation of model collapse in a recent TechTarget article “Model collapse explained: How synthetic training data breaks AI.”

“Model collapse occurs when new generative models train on AI-generated content and gradually degenerate as a result. In this scenario, models start to forget the true underlying data distribution, even if the distribution does not change. This means that the models begin to lose information about the less common — but still important — aspects of the data. As generations of AI models progress, models start producing increasingly similar and less diverse outputs.”

“Generative AI models need to train on human-produced data to function. When trained on model-generated content, new models exhibit irreversible defects. Their outputs become increasingly ‘wrong’ and homogenous. Researchers found that even in the best learning conditions, model collapse was inevitable.”

“Model collapse is important because generative AI is poised to bring about significant change in digital content. More and more online communications are being partially or completely generated using AI tools. Generally, this phenomenon has the potential to create data pollution on a large scale. Although creating large quantities of text is more efficient than ever, model collapse states that none of this data will be valuable to train the next generation of AI models.”
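The dynamic described above can be illustrated with a toy simulation — my own sketch, not the experimental setup of the May 2023 paper. Here a one-dimensional “generative model” is simply a Gaussian fit to its training data; each new generation trains only on synthetic samples drawn from the previous generation’s fit. Over many generations the fitted distribution drifts and its spread shrinks, the same qualitative tail-forgetting the TechTarget article describes at LLM scale:

```python
import random
import statistics

def next_generation(samples):
    """Fit a Gaussian to the training samples (a stand-in for
    'training a model'), then emit synthetic data from the fit."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return [random.gauss(mu, sigma) for _ in samples]

random.seed(42)

# Generation 0: "human-produced" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(20)]
initial_stdev = statistics.stdev(data)

# Each subsequent model trains only on the previous model's output.
for _ in range(300):
    data = next_generation(data)
final_stdev = statistics.stdev(data)

print(f"gen   0 stdev: {initial_stdev:.3f}")
print(f"gen 300 stdev: {final_stdev:.3f}")
```

Because the fitted standard deviation is re-estimated from a finite synthetic sample at every step, estimation error compounds generation after generation and the variance decays toward zero: rare, tail-of-the-distribution values stop being produced at all, which is the “loss of less common but still important aspects of the data” quoted above.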

This blog first appeared on September 28.
