
After being overwhelmed by AI crawlers, Wikipedia has surrendered

Fyren
Apr 29, 2025

Everyone is familiar with Wikipedia. You’ll often see it cited as a source in research. Whenever I write an article with a historical or popular-science focus, I start by consulting Wikipedia’s explanation, then dig into the references listed at the bottom to uncover more. In many respects, Wikipedia remains one of the most accessible and authoritative resources ordinary people have for grasping complex concepts.

Wikipedia is operated by Wikimedia, a nonprofit organization that also hosts projects like Wikimedia Commons, Wiktionary, and Wikibooks. All these resources are freely available because Wikimedia’s core mission is to enable universal access to knowledge. However, Wikimedia has recently been overwhelmed by AI companies. To train their large language models, these firms have deployed countless AI crawlers to scrape data continuously from Wikimedia’s platforms.

Surprisingly, instead of pursuing legal action, Wikimedia chose to voluntarily surrender the data.
“Dear companies, we’ve pre-packaged all materials for you. Could you please cease your crawling activities?”

Recently, Wikimedia uploaded English and French Wikipedia content to Kaggle, a data science platform, effectively telling AI companies to help themselves. But providing raw data wasn’t sufficient. Wikimedia went further: they specifically optimized the materials for AI model consumption.

Machines don’t process information like humans. While content appears clear and intuitive to us, AI systems require additional structuring to parse each section’s purpose.

To address this, Wikimedia converted pages into structured JSON format, organizing titles, summaries, and explanations into standardized categories. This optimization helps AI systems better understand content relationships and data hierarchy, significantly reducing preprocessing costs for AI companies.
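
To get a feel for what that buys an AI company, here is a minimal sketch of what one structured record might look like and how a training pipeline could flatten it into plain text. The field names (“name”, “abstract”, “sections”) are my own illustration, not the actual schema of the Kaggle dataset:

```python
import json

# Illustrative only: these field names are assumptions for this sketch,
# not the exact schema of the dataset Wikimedia published on Kaggle.
raw_line = json.dumps({
    "name": "Nepenthes",
    "abstract": "Nepenthes is a genus of carnivorous pitcher plants...",
    "sections": [
        {"title": "Description", "text": "The pitchers trap and digest insects..."},
        {"title": "Distribution", "text": "Native mainly to Southeast Asia..."},
    ],
})

def flatten_for_training(json_line: str) -> str:
    """Turn one structured article record into plain text for an LLM corpus,
    with no HTML scraping or wikitext parsing required."""
    record = json.loads(json_line)
    parts = [record["name"], record["abstract"]]
    for section in record.get("sections", []):
        parts.append(f"{section['title']}\n{section['text']}")
    return "\n\n".join(parts)

print(flatten_for_training(raw_line))
```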

This strategic move is like Wikipedia laying out a lavish feast for the predators, served well away from its main servers so the feeding frenzy doesn’t collapse its operations.

I believe Wikimedia had little choice in this matter. As early as April 1st, they lamented in an official blog post: since 2024, multimedia content downloads on their platform had surged by 50%. Initially attributing this to increased public interest in learning, they discovered — to their dismay — that nearly all traffic originated from AI company crawlers harvesting resources for model training.

The crawlers’ impact on Wikipedia’s infrastructure has been profound. Wikimedia operates multiple regional data centers (in Europe, Asia, South America, etc.) alongside a primary data center in Ashburn, Virginia. The US-based core facility stores all data, while regional centers temporarily cache popular entries.

This distributed system offers significant benefits. For instance, if numerous Asian users search for “Speed,” that entry gets cached in the Asian regional center. Subsequent Asian users then access “Speed” through local infrastructure rather than transoceanic cables from the US core. This intelligent routing — high-demand entries using cost-effective regional channels, less-popular content using premium core channels — enhances global loading speeds while reducing Wikimedia’s server strain.
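
Stripped to its essentials, the routing logic is a read-through cache. The sketch below is a toy model (the class and names are invented for illustration, not Wikimedia’s actual software):

```python
class RegionalDataCenter:
    """Toy model of a caching edge: serve popular entries from a local cache,
    fall back to the expensive core data center on a miss."""

    def __init__(self, fetch_from_core):
        self.cache = {}                         # entry title -> page content
        self.fetch_from_core = fetch_from_core  # callable that hits the US core

    def get(self, title: str) -> str:
        if title in self.cache:             # popular entry: cheap regional hit
            return self.cache[title]
        page = self.fetch_from_core(title)  # cold entry: costly core fetch
        self.cache[title] = page            # keep it for the next local reader
        return page

asia = RegionalDataCenter(lambda title: f"<full article for {title}>")
asia.get("Speed")   # first reader in Asia: cache miss, falls through to the US core
asia.get("Speed")   # everyone after that: served from the cheap regional cache
```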

However, AI crawlers disrupt this delicate balance. They systematically scrape every entry indiscriminately, bypassing popularity-based routing. This forces massive traffic through the expensive core channels. Recent Wikimedia analytics revealed that a staggering 65% of costly US core data center bandwidth was being consumed by AI crawlers.

Remember, Wikipedia’s content is free, but its server maintenance costs $3 million annually.

When protests proved ineffective, Wikimedia ultimately organized its data on external platforms for direct AI company access. Wikipedia isn’t alone in this struggle — from content platforms to open-source projects, personal blogs to news sites, all face similar challenges. Last summer, iFixit’s CEO tweeted that Claude’s crawlers had made over 1 million requests to their site in a single day.

At this point, you might ask: What about the robots.txt protocol? Can’t websites block AI crawlers by setting rules there?

After iFixit added Claude’s crawlers to their robots.txt, the scraping activity did decrease for a while, down to once every 30 minutes. In the early internet era, the robots protocol was widely honored, and companies even faced lawsuits for violating it. But today, this “gentleman’s agreement” has become largely symbolic.

Modern AI companies scrape whatever they can access. In this competitive landscape, if competitors are scraping while you abstain, your training corpus becomes inferior, putting your model at a disadvantage. Their solution?

Simply rename the crawler’s user-agent. It’s like banning “Lu Xun” but allowing “Zhou Shuren” (the writer’s real name). Do major AI companies actually employ such tactics? Numerous examples exist. One Reddit user blocked OpenAI’s crawlers via robots.txt, only to have the company modify their user-agent and continue scraping.
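
The mechanics are easy to see with Python’s standard-library robots.txt parser. In this sketch (the bot names and URL are made up), the rules only bind whoever identifies under the blocked name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that singles out one crawler by name.
rules = """
User-agent: HungryBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The named crawler is blocked...
print(parser.can_fetch("HungryBot", "https://example.com/wiki/Speed"))     # False
# ...but the same crawler under a new name sails right through, because the
# rules only match the user-agent string the crawler chooses to send.
print(parser.can_fetch("TotallyNewBot", "https://example.com/wiki/Speed")) # True
```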

Perplexity AI was exposed by WIRED for completely ignoring robots.txt restrictions.

Over time, various countermeasures have emerged. Some site owners plant “honeypot links”: trap URLs that are disallowed in robots.txt and never shown to human visitors, so any client that requests one has outed itself as a crawler that read the rules and ignored them.
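
A minimal sketch of the idea, with an invented trap path and a bare-bones ban list (a real site would wire this into its web server or firewall):

```python
# /secret-archive/ is listed as Disallowed in robots.txt and never linked
# anywhere a human would see it, so any client requesting it is a bot that
# parsed robots.txt and ignored (or mined) the rules.

TRAP_PATH = "/secret-archive/"
banned_ips = set()   # IPs caught in the trap

def handle_request(client_ip: str, path: str) -> int:
    """Return an HTTP status code for the request."""
    if path.startswith(TRAP_PATH):
        banned_ips.add(client_ip)   # flag the scraper for blocking
        return 403
    if client_ip in banned_ips:
        return 403                  # previously trapped clients stay blocked
    return 200                      # normal traffic proceeds as usual
```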

Others deploy Web Application Firewalls (WAFs) to identify malicious crawlers through IP analysis, request patterns, and behavioral detection. Some implement CAPTCHA systems.
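
The behavioral-detection piece can be as simple as counting how fast a single IP asks for pages. Real WAFs combine many more signals; this sketch, with arbitrary thresholds, shows only the rate check:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120                 # both thresholds are arbitrary, for illustration

request_log = defaultdict(deque)   # client IP -> recent request timestamps

def looks_like_a_crawler(client_ip: str) -> bool:
    """Crude behavioral check: flag IPs that request pages far faster
    than any human reader plausibly could."""
    now = time.time()
    log = request_log[client_ip]
    log.append(now)
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()                   # forget requests outside the window
    return len(log) > MAX_REQUESTS      # sustained high rate: likely a bot
```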

Yet it remains an endless cat-and-mouse game. The more resistance websites show, the more sophisticated scraping techniques become. This prompted Cloudflare, the “cybersecurity guardian,” to develop a new defense: when detecting malicious crawlers, grant access — but feed them corrupted data.

Not actual content, but streams of irrelevant pages designed to pollute AI training datasets.

Cloudflare’s approach seems restrained compared to Nepenthes, a tool launched this January. Named after the carnivorous pitcher plant, Nepenthes traps AI crawlers in a self-replicating maze of generated pages whose links lead only to more trap pages, never back out to real content.

Moreover, Nepenthes generates Markov chain-based nonsense text to poison AI training data. Reports suggest only OpenAI’s crawlers have escaped this trap so far. Clearly, the AI arms race begins at the data collection phase.
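
For the curious, a Markov chain babbler of the kind Nepenthes describes fits in a few lines. This is not Nepenthes’ actual code, just a sketch of the technique:

```python
import random
from collections import defaultdict

def build_chain(corpus: str) -> dict:
    """Map each word to every word that has followed it in the corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def babble(chain: dict, length: int = 40) -> str:
    """Emit text that is locally plausible but globally meaningless."""
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = random.choice(followers) if followers else random.choice(list(chain))
        output.append(word)
    return " ".join(output)

seed = "the free encyclopedia that anyone can edit is free for anyone to reuse"
print(babble(build_chain(seed)))
# Cheap to generate by the gigabyte, and corrosive if it ends up in a training corpus.
```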

Some platforms opt for commercial agreements. Reddit and X (formerly Twitter) offer paid API tiers with monthly data limits. Failed negotiations sometimes lead to lawsuits — The New York Times sued OpenAI after unsuccessful discussions about article scraping.

Why doesn’t Wikipedia pursue legal action? The answer lies in its foundational principles. Wikipedia’s licensing allows free use, modification, and distribution by anyone (including AI companies) under attribution and share-alike terms. Legally, AI companies’ scraping of Wikipedia likely falls within current copyright frameworks.

Even if Wikimedia sued, existing laws lack clear provisions for AI-related copyright infringement. Legal battles would be risky, expensive, and time-consuming — hardly feasible for a nonprofit. More crucially, Wikimedia’s mission emphasizes “empowering global collaboration to develop and share free educational content.” While AI scraping strains resources, restricting access through legal or commercial means would contradict this ethos.

Ultimately, structuring data for free AI access emerges as Wikimedia’s most pragmatic — if somewhat resigned — solution.
