For Web pages, life is short: an estimated 100 days on average. When those pages die, people trying to visit them see a “404 — Not Found” error, and that’s usually the end of the story. But what if we could bring dead Web pages back to life by storing clones in the cloud? That’s exactly what the Internet Archive is doing.
The Internet Archive’s Wayback Machine stores archival copies of 378 billion URLs. It’s a gigantic cache of Web pages that stretches back to 1996. While it doesn’t include every page of every Web site out there—site owners can opt out and even request that old pages be deleted—coverage is surprisingly good. (For example, medium.com’s cache dates back to 1997.) The Wayback Machine’s storehouse is often relied upon in times of crisis: during the recent U.S. government shutdown, the Federal Trade Commission shuttered its Web site and pointed visitors to the Wayback Machine’s archive.
It’s not just the FTC that’s affected by broken pages during a time of crisis, however. A recent Harvard study found that 49 percent of links in Supreme Court opinions are now dead, and 70 percent of links in journals, including the Harvard Law Review, are dead too. It’s hard to study the past when so much of it just rots away.
But pages that have disappeared from Web servers may still persist in the Wayback Machine. Those archival copies can be served up in place of the dead ones. It’s just a matter of connecting people to the archives. Alexis Rossi, head of collections at the Internet Archive, explains, “What we’d really like to do is have the browsers themselves build something into the 404 page that [checks] automatically. That takes a little bit of convincing.” In the meantime, there’s a Chrome extension that does the trick with a few extra clicks.
Rossi’s team is also working with members of the Wikipedia community to prevent broken links before they go dark. “We parse out all the new links that get added to Wikipedia and then we crawl those immediately,” she says. This ensures a cached copy is archived at that moment.
Beyond that crawl of newly added links, Wikipedian Kunal Mehta has written a bot that archives links before they break and automatically fixes them when they do. This matters because there are over 125,000 broken links on Wikipedia today. A similar effort is underway to fix broken links on blogs hosted at WordPress.com. There’s also a new “Save Page” feature in the Wayback Machine that allows on-demand archiving: a one-click way to create a stable URL for use in your dissertation, print article, or Supreme Court decision.
Although a globally recognized fix for broken pages is still on the way, for the technically inclined there are some tools available today. First is the Wayback Machine API, an interface that lets developers’ software ask whether a given URL is available in the archive and, if so, retrieve the archived copy.
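For the curious, here is a minimal sketch in Python of what such a query might look like. It assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and the JSON shape that endpoint has been documented to return (an "archived_snapshots" object holding a "closest" entry). The function names are invented for illustration and are not part of any official client.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL that asks whether `url` has an archived copy.

    `timestamp` is optional, in YYYYMMDD or YYYYMMDDhhmmss form; when given,
    the archive is asked for the snapshot closest to that moment.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url, timestamp=None):
    """Ask the archive over the network; returns a snapshot URL or None."""
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=10) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

A 404 handler could call find_archived_copy() with the dead address and, when it gets a URL back, offer that link to the reader instead of a dead end.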
Another is the WordPress Broken Link Checker Plugin, which monitors the links on your WordPress blog and fixes them when they break. Finally, there’s Memento, a suite of tools (and protocols) to locate past versions of any page; this last one includes that Chrome extension and much more.
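To give a flavor of the Memento protocol, the sketch below builds the kind of request a Memento client sends: an ordinary HTTP request carrying an Accept-Datetime header that names the moment you want to see. The header and its date format come from the Memento specification (RFC 7089). The choice of web.archive.org/web/ as the "TimeGate" address is an assumption, as are the function names, which are made up for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request

# Assumption: the Wayback Machine answers Memento TimeGate requests here.
TIMEGATE_PREFIX = "http://web.archive.org/web/"


def accept_datetime(when):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timegate_request(url, when):
    """Build (but do not send) a request for `url` as it looked at `when`."""
    return urllib.request.Request(
        TIMEGATE_PREFIX + url,
        headers={"Accept-Datetime": accept_datetime(when)},
    )


def fetch_memento(url, when):
    """Send the request; return the final address and its Memento-Datetime."""
    with urllib.request.urlopen(timegate_request(url, when), timeout=10) as resp:
        return resp.geturl(), resp.headers.get("Memento-Datetime")
```

If the server speaks Memento, the reply should point to the archived version closest to the requested date, with a Memento-Datetime header saying when that copy was captured.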
“The Internet echoes with the empty spaces where information used to be,” Rossi said as she announced the Internet Archive’s work to fix broken links. The echoes are quieter now, as those spaces are filled in by archivists.
Chris Higgins writes for Mental Floss, This American Life, and The Atlantic. He was writing consultant for Ecstasy of Order: The Tetris Masters. His new book is The Blogger Abides: A Practical Guide to Writing Well and Not Starving.
This article was produced by The Magazine, an electronic periodical that commissions original articles and essays.