7. Cultural Alzheimer: 404s and the Unstoppable Linkrot

The very short lifespan of online content and resources

Published in Content Curation Official Guide, Aug 1, 2016

“At first glance, the wealth of information we create today should be a boon to archaeologists and historians of the future: they should have no problem to understand who we were and what we thought. After all, we document our lives in countless ways — from movies to blogs to podcasts like this one.
But nothing is as simple as it seems. As we shall soon see, the wealth of digital information we produce — thousands upon thousands of exabytes — will create new and unique challenges to these future archaeologists.”
(The Domesday Book — Curious Mind)

Although there is widespread agreement that the loss of the Library of Alexandria marked a very dark moment for the cultural heritage of this planet, we do not seem much concerned today about the fact (and not simply the probability) that a very significant part of our digital content will be lost forever within a few years.

Notwithstanding the vast quantity of new information being published online daily, there is in fact a huge, growing amount of valuable content that is also literally disappearing, day after day, from the Internet.

Consider this, for example: an estimated 44 percent of Web sites that existed in 1998 vanished without a trace within just one year (Washington Post).

Due to this phenomenon, known as linkrot, a great deal of valuable online information becomes inaccessible. It is forever lost.

Given all the good things that our culture derives from curating content, and the awareness of the flimsiness of digital content and the ease with which it can get lost, what is being done to preserve curated content for the long-term future?

Unfortunately, very little is being done on this front. Though there exist specific initiatives and organizations devoted to this, like the British Library, the U.S. Library of Congress and the Internet Archive, they are still very far from having the resources and technology to preserve all that is relevant.

And one key reason why they are not yet capable of preserving all that is of relevance is that there is no one suggesting where the good stuff is.

As a matter of fact, while we take for granted that anything saved or published online is there to stay forever, we have ample proof that this is not the case at all, and that we gradually lose a great chunk of the information artifacts we create, publish and share online.

It should be a responsibility of the whole of human civilization to preserve our digitized information in a safe and reliable manner, or we risk losing much of our history, knowledge and data.

And that’s where curation plays a very important role.

Digital curation’s own mandate includes the “selection, preservation, maintenance, collection and archiving of digital assets” in ways and with technologies that can endure the test of time.

But until reliably preserving digital content becomes the norm, it is worthwhile to realize how widespread this phenomenon is and how important it is to stop it.

Source: https://twitter.com/worrydream/status/478087637031325697

What Exactly Is Linkrot?

Linkrot, “also known as link death or link breaking, describes the process by which hyperlinks (either on individual websites or the Internet in general) point to web pages, servers or other resources that have become permanently unavailable”. (Wikipedia)

Why does it happen?

The reality is that content disappears for many reasons: much is moved to different online addresses and becomes difficult to find; some is censored or taken down for copyright or legal reasons; some goes down because the author or publisher does not properly maintain the website. Some is lost to malicious attacks, and some goes offline because there are no economic resources to maintain it.

Furthermore, the evolution of file and hardware formats and standards makes it all the more difficult to access and read files and documents that are only 20 to 30 years old (think, for example, of 5¼-inch floppy disks, or of the tons of one-inch analog videotape used in television studios until the 80s). How can you read and access all of that material unless you digitize it?

Although you may never have heard of it, this phenomenon is so big and pervasive that it has been given a name of its own: linkrot. The term signifies the rotting of web links that go bad for one or more of the reasons listed above.

The impact of linkrot is not marginal: different studies and research reports indicate that it can affect 30% or more of all the documents published online.

In addition to this, nobody has any certainty about the future of the content sharing platforms where we publish and share much of our content. We don’t know whether they will remain alive, independent or whether they will restrict or charge for accessing content, be bought, closed down, or be controlled by larger entities or even by governments.

The main causes behind linkrot are:

content being moved and relocated without appropriate redirection mechanisms in place
content being deleted “after the fact” due to publishing or editorial decisions
legal or copyright connected issues
change of domain name and URLs
content being blocked by censoring or restrictive local government filters
content being blocked or made inaccessible by corporate firewalls
unpaid hosting fees
sites being abandoned for lack of economic resources
accidental expiration of domain name
owner of site, whether human or corporate, dies, gets bought, or files for bankruptcy
human errors in typing links
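
Several of the causes listed above leave different traces when a link is requested, so a simple checker can give a rough first diagnosis. The following is a minimal sketch using only the Python standard library. The category labels and the mapping from HTTP status codes to causes are assumptions made for illustration, not an established taxonomy, and a real checker would also need retries and rate limiting.

```python
import urllib.error
import urllib.request


def classify_status(status: int, requested_url: str, final_url: str) -> str:
    """Map an HTTP outcome to a rough linkrot category (labels are illustrative)."""
    if status in (404, 410):
        return "gone"          # deleted, or moved without a redirect
    if status in (401, 403, 451):
        return "blocked"       # access restricted, e.g. legal takedown or filtering
    if 500 <= status < 600:
        return "server error"  # site may be unmaintained or temporarily down
    if 200 <= status < 300 and final_url != requested_url:
        return "moved"         # still reachable, but only through a redirect
    if 200 <= status < 300:
        return "ok"
    return "unknown"


def check_link(url: str, timeout: float = 10.0) -> str:
    """Request a URL once and classify the result. Needs network access."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status, url, response.geturl())
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return classify_status(err.code, url, url)
    except ValueError:
        return "malformed"     # e.g. a typo that leaves the URL unparseable
    except OSError:
        return "unreachable"   # expired domain, DNS failure, host offline, timeout
```

Calling `check_link` on each outgoing link of a page yields one label per link. Note that a "moved" result can also come from harmless redirects (for example from http to https), so it is a hint to look closer rather than proof of rot.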

Hard Facts

One study published in the journal Science reports that 13% of Internet references in scholarly articles were inaccessible after only 27 months.

See: Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, et al. Information science. Going, going, gone: lost Internet references. Science 2003 Oct 31;302(5646):787–788. DOI:10.1126/science.1088234
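
To get a feeling for what a figure like this means over time, one can convert it into a "half-life". The sketch below assumes that links die at a constant rate (simple exponential decay). That model is an assumption made here for illustration; the study itself does not claim it, and real loss rates vary by source and over time.

```python
import math


def half_life_months(fraction_lost: float, months: float) -> float:
    """Months until half the links are dead, assuming a constant decay rate."""
    # Under exponential decay: surviving = exp(-rate * months)
    monthly_rate = -math.log(1.0 - fraction_lost) / months
    return math.log(2.0) / monthly_rate


# 13% of references inaccessible after 27 months (the Science figure above)
years = half_life_months(0.13, 27) / 12
print(f"Implied half-life: about {years:.0f} years")
```

Under this assumption, the 13% figure corresponds to half of the cited links being gone after roughly 11 years.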

The Chesapeake Digital Preservation Group has found that of the original dataset of websites it began working with in 2008, “the content at dot-gov domains showed the highest increase in link rot.

More than 50 percent of the material posted to government domains disappeared from the original documented Web addresses,” according to the 2013 study.

The New York Times reported half the links referenced in Supreme Court opinions were victims of link rot. But the rest of the federal government and state governments are losing data, too.

See: 44 Percent of URLs from Original Data Set (2008) No Longer Work, in “Link Rot” and Legal Resources on the Web: A 2013 Analysis by the Chesapeake Digital Preservation Group (PDF), 2013

“Unfortunately and disturbingly, the Supreme Court appears to have a vast problem with link rot, the condition of internet links no longer working. We found that number of websites that are no longer working cited to by Supreme Court opinions is alarmingly high, almost one-third (29%). Our research in Supreme Court cases also found that the rate of disappearance is not affected by the type of online document (pdf, html, etc) or the sources of links (government or non-government) in terms of what links are now dead. We cannot predict what links will rot, even within Supreme Court cases.”
Source: http://yjolt.org/something-rotten-state-legal-citation-life-span-united-states-supreme-court-citation-containing-inte

In a recent study looking at academic references, Zittrain et al. (2013) discovered that over 70 percent of all web links inside academic publications were broken. The same thing had happened to 50 percent of the links in U.S. Supreme Court opinions. After six years, nearly fifty percent of the URLs cited in those publications no longer worked.

In another study conducted in 2014 at Harvard Law School it was reported that “more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.” (source)

Cultural Heritage Rests on Shaky Pillars

Our cultural heritage rests on very shaky pillars, as we let the digital backup strategy and infrastructure put in place by these social media sharing and content curation platforms dictate the future lifetime of much of our cultural heritage.

While the companies we use to collect, publish and curate information today do have an interest in making sure none of their data will ever be lost, they do not seem to be driven by humanistic ideals, but rather by what Wall Street and their stakeholders dictate.

More than anything, these companies do not even seem to be aware of holding such a great cultural responsibility, and as a natural consequence they are not actively worrying about it.

In such a situation, how much trust can we place in these companies as reliable gatekeepers of our cultural heritage?

Given the not-so-remote possibility of a future cataclysmic event capable of wiping out most of our present-day civilization and technology, there is little hope that whatever survives it could be accessed and read by future generations or by intelligent beings from other galaxies.

But this is where we should put more of our energies, research and attention.

Possible Future Strategies To Save Our Cultural Heritage

There exist at least a few alternative routes of action that could be immediately taken to help preserve our cultural heritage:

a) increase public awareness of the flimsiness of digital content, and of the need to keep improving the technology and tools specifically designed to help us preserve it for the longest time.

b) increase public appreciation of the importance of preserving our cultural heritage, and of the consequences when it gets lost forever.

c) support and incentivize both government-led and individual-led activities that strive to collect, organize, and preserve information artifacts of significant value for society. Empower more organizations and individuals to contribute to finding, vetting, organizing and adding value to valuable information artifacts.

d) create and maintain multiple, redundant indexes for all of the updated collections available out there. A directory of culturally relevant curated directories, so to speak. (Such a curated collection of collections should be completely distributed rather than secured in one single place, easily replicable from device to device, and continuously updated, with a full record of all the changes made to it.)

Technology solutions that would help in this direction would be those that could enable:

a) cloning and replicating vast amounts of data locally,

b) online access via our own distributed resources even when there is no Internet (by utilizing our own network of friends),

c) a way to physically store and preserve such valuable content for very long periods of time, even in harsh or extreme climate conditions (crystals and holographic memory may be some of the solutions we consider soon), and

d) accessibility of this archived information to future generations of computers and intelligent machines.

Three Practical Solutions

The good news on this front is that we have at least three practical solutions to this major issue already available in our hands.

We just need to test and experiment with them, while making them available to every human being on the planet.

The first one is to start curating content seriously, going beyond simple republishing to actual preservation and archiving.

The second one is to develop a federated wiki of web sites: “In a federated wiki, when you find a page you like, you curate it to your own server (which may even be running on your laptop). That forms part of a named-content system, and if later that page disappears at the source, the system can find dozens of curated copies across the web.”

There are many pros and cons to this approach, but it is certainly worth looking into further.
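
The "named-content" idea in the quote above can be illustrated with a small sketch. It assumes that a page's name is simply a hash of its content, so that any server holding a copy can answer for it and any copy can be verified. This is one possible design chosen for illustration, not a description of how the federated wiki actually works, and the names `CuratorServer`, `curate` and `find_copy` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Optional


def content_name(page_bytes: bytes) -> str:
    """Name a page by the SHA-256 hash of its content (an assumed convention)."""
    return hashlib.sha256(page_bytes).hexdigest()


class CuratorServer:
    """A hypothetical personal server holding curated copies of pages."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.pages: Dict[str, bytes] = {}

    def curate(self, page_bytes: bytes) -> str:
        """Store a copy of the page and return its content-derived name."""
        name = content_name(page_bytes)
        self.pages[name] = page_bytes
        return name

    def fetch(self, name: str) -> Optional[bytes]:
        return self.pages.get(name)


def find_copy(name: str, servers: Iterable[CuratorServer]) -> Optional[bytes]:
    """Ask each server in turn; accept only a copy whose hash matches the name."""
    for server in servers:
        page = server.fetch(name)
        if page is not None and content_name(page) == name:
            return page
    return None


# The source publishes a page; two readers curate it to their own servers.
source, alice, bob = CuratorServer("source"), CuratorServer("alice"), CuratorServer("bob")
page = b"<h1>An essay worth keeping</h1>"
name = source.curate(page)
alice.curate(page)
bob.curate(page)

source.pages.clear()  # the page disappears at the source
print(find_copy(name, [source, alice, bob]) == page)
```

Because the name depends only on the content, the curated copies remain findable and verifiable after the original is gone. The trade-off is that any edit to a page produces a new name.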

The third one is biological. DNA will likely be our technological savior. DNA is in fact a near-perfect medium for preserving and archiving information for tens of thousands of years.

Just one gram of DNA can in principle hold an enormous amount of information (published estimates run into the petabytes), and in recent years groups of scientists have reliably demonstrated how to store, archive and retrieve text, audio, image and video files from DNA molecules.

See:

Digital files stored and retrieved using DNA memory
Physics World
Scientists Stored These Images in DNA — Then Flawlessly Retrieved Them
Gizmodo
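
The basic idea of writing digital data into DNA can be shown with a toy example. DNA has four bases (A, C, G, T), so each base can stand for two bits. The sketch below uses that naive mapping. It is a deliberate simplification and an assumption made here for illustration: the schemes used in the published experiments are more elaborate, adding redundancy and error correction and avoiding sequences that are hard to synthesize or read.

```python
BASES = "ACGT"  # A=00, C=01, G=10, T=11 (an arbitrary, illustrative mapping)


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, two bits per base, high bits first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into the original bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    decoded = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        decoded.append(byte)
    return bytes(decoded)


strand = bytes_to_dna(b"A")
print(strand)                # CAAC
print(dna_to_bytes(strand))  # b'A'
```

At two bits per base, a single byte takes four bases, which hints at why such a tiny physical medium can hold so much data.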

Consequences

Wikipedia reports: “To combat link rot, web archivists are actively engaged in collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public.
The largest web archiving organization is the Internet Archive, whose goal is to maintain an archive of the entire Web, taking periodic snapshots of pages that can then be accessed for free.”
Source: https://en.wikipedia.org/wiki/Link_rot
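
For anyone who wants to check programmatically whether the Internet Archive holds a snapshot of a page, the sketch below builds a query for its Wayback Machine availability endpoint and reads the answer. The endpoint URL and the JSON layout (`archived_snapshots` containing a `closest` entry) reflect that API as I understand it and should be treated as assumptions to verify against the current documentation. The sample response is made up for illustration, and the network call itself is left out so the example runs offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived copy's URL from a JSON response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative response text, not real data.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(availability_query("example.com"))
print(closest_snapshot(sample))
```

Fetching the query URL (for example with `urllib.request.urlopen`) and passing the response body to `closest_snapshot` would complete the lookup.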

Much of the content that we publish online is likely to disappear within a relatively short amount of time. This has disastrous consequences for our ability to inform ourselves, for journalism, history and cultural heritage.
The problem is even bigger inasmuch as we a) are overloaded by greater and greater quantities of information, and b) do not realize how much precious information we are losing.

Opportunities

The overall emerging opportunity is not just to identify and preserve valuable content before it gets lost, but to actually organize and make sense of such resources in order to increase their value and benefit to the general public.

1) Opportunity for individuals and organizations to “preserve”, “archive” and organize key valuable content and resources before they are moved, deleted, abandoned or lost. (E.g.: Oldversion.com)

2) Opportunity for new tools and services that focus not just on collecting and organizing valuable existing content but also on preserving it in a reliable, everlasting fashion.

Resources

Tools and web services designed to help avoid linkrot and to store/archive digital content indefinitely.

Archive: a personal version of the Internet Archive's Wayback Machine, allowing anyone to permanently archive any public web page.
Perma.cc and Permamarks.net are two commercial services specifically devoted to creating a permanent copy of any page or document, so that it can be referenced without fear that the original will be moved, deleted, censored or taken down.
Amberlink.org
Free plugin for WordPress, developed by the Berkman Center, that creates a backup copy of any outgoing link from your website, so that if the site or page goes down or becomes inaccessible your readers can still see its contents.
Pinboard
For a very modest yearly amount Pinboard offers an archiving service which saves a copy of everything you bookmark, gives you full-text search, and automatically checks your account for dead links.
Permanent Web Archiving Tools

*Note: The key issue with these services is that most do not seem to be exempt from the key variable that makes them as vulnerable as any other publishing or social media service online: business permanence (their ability to remain alive as a business in the future, and their ability to find ways to permanently store such data on physical media that can be accessed and used even without the Internet).