The Archivist’s Blues

Just as the known universe explodes before our eyes, it may be vanishing irretrievably from sight

Published in King Features Weekly: Local and Thriving
10 min read · Apr 16, 2015

--

By David Cohea

Ask how much information is out there and your head will explode so many times that the question itself becomes a lost byte we may never recover.

The numbers don’t just boggle, they googol-plex:

  • The sum of information from the dawn of human time to the year 2003 was once calculated to be 5 exabytes (an exabyte is a billion gigabytes). By 2010, it was estimated that we create that much information every two days.
  • In 2009, the entire World Wide Web was estimated to contain close to 500 exabytes. By 2013, it had grown to around 4 zettabytes, or 4 trillion gigabytes.
  • With the coming advent of the Internet of Things, the volume of human knowledge is predicted to double every six minutes.
  • Ray Kurzweil says the rate of increase in our knowledge is so fast that the 21st century will see not 100 years but 20,000 years of progress within its span. By the year 2050, human knowledge will be a quadrillion (that's a thousand million million) times more advanced than it is now. He also predicts we are headed for a Singularity, where artificial intelligence, the mind capable of absorbing all of this information, will accelerate past human comprehension.
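To get a feel for the scale of the estimates above, a quick back-of-the-envelope sketch. Under the common reading of the 2010 figure (5 exabytes of new information produced every two days), the arithmetic for a year's output looks like this:

```python
# Back-of-the-envelope: if humanity produces 5 exabytes of information
# every two days (the oft-cited 2010 estimate), how much is that per year?
EXABYTE = 10**18   # bytes, SI definition
ZETTABYTE = 10**21

data_per_two_days = 5 * EXABYTE
per_year = data_per_two_days * (365 / 2)   # ~182.5 two-day periods per year

print(f"{per_year / ZETTABYTE:.2f} zettabytes per year")
```

That works out to just under one zettabyte a year, which is consistent with the web-wide totals cited above reaching the multi-zettabyte range within a few years.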

Indeed, we may be observing the birth of another Big Bang — a digital universe whose reach may dwarf known space the way an ocean exceeds each wave.

***

Most strange of all, though, is that the present explosion is occurring in a vacuum of time where the medium of the past fades into white noise. In so immediate a medium as the Web, last minute's news might as well be last year's. The apotheosis of Now erases both our comprehension of how we got here and of where we are headed.

And the actual record is vanishing fast — not because we want it to, but simply because we have not sufficiently developed the interest and means to preserve it.

As Google vice president Vint Cerf recently said, "We are nonchalantly throwing all of our data into what could become an information black hole." The main reason is that the programs and platforms needed to view our files are becoming obsolete so quickly. E-mails, photos, messages, blogs, tweets, videos: all that stuff floating around out there, much of it in the cloud, and much, much else crammed onto hard drives, defunct servers and failed business models (remember MySpace or GigaOm?) risks being forever lost in digital space, exiled from human history.

Granted, there’s a lot of flotsam there — cat videos, sexts, grandkid pics, Bieber tweets. As dark matter and dark energy make up some 95 percent of the universe, perhaps as much of the Web’s flotsam is dark to present understanding, maybe forever.

But the precipice we’re riding is real, and no one in human history has had to saddle it as we now must. In the midst of such speeding infinitude, we are flooded with data of which we have too little knowledge, and won't until we come up with the means to incorporate, digest, measure and sort it. The same way porn ran off with the early Web, Big Data is the biggest winner of Web 2.0: interests too private to pay back to the whole.

Will we truly learn anything about what happened to us on this climbing roller-coaster ride if the information becomes forever untracked?

* * *

When human history was more linear — say, back in the 20th century — archival of historical material was routine; part of the job. Universities and national institutions like the Smithsonian and Library of Congress maintained huge collections of American life and letters. Corporate entities like Coca-Cola and Motorola kept product archives. (King Features has an archive of comics and features materials going back to the early decades of the 20th century — most of the comics digitized, but tons more in fast-corrupting paper files). Government archives kept at local, state and national levels made their information available to all users. Non-profits like historical societies, hospitals and foundations were established typically with private funds and grants.

Every newspaper had its morgue of filed stories and photos. At the daily where I worked from the ‘80s to the mid-‘90s, its official name was the Library, though no one called it that. Stories with fat histories were wedged into overfilled folders, crammed with veloxes and teletypes and drafts of stories typed on erasable vellum. Some folders were stamped with skull and crossbones, alerting the next cub reporter to threats of suits for libel or slander. (Usually they marked a story too good for a newspaper that was even more a business.)

Converting all this musty, dusty paper and old-school microfilm into far more flexible, indexable and sharable digital archives has been slow. The workflow is arduous, the process costly, and the returns so far academic.

Worse, the use and access of print-to-digital archives remain stubbornly opaque, first due to private interests and copyright law, and second because the programs and hardware developed to sort them go out of use. (The Library of Congress must keep on hand all the hardware and software of the past 40 years in order to read collected manuscripts created in antiquated programs like WordStar.)

When digital files have been created from analog media (usually by scanning), there is at least a chance of starting over when files get erased through human failure, bit rot or natural disaster.
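Archivists guard against this kind of silent decay with fixity checks, a standard digital-preservation practice (not anything specific to the institutions named here): record a cryptographic hash of each file at ingest, then re-hash periodically so bit rot is caught while a good copy still exists. A minimal sketch, with hypothetical file paths:

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the digest recorded at ingest."""
    return fingerprint(path) == recorded_digest
```

Run `verify()` on a schedule against the digests stored at scan time; any mismatch flags a file that must be restored from another copy or re-scanned before the original is gone.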

In the born-digital realm, however, existence is even more precarious. While the Wayback Machine of the Internet Archive is a project to salvage the old Web for future access, its record of 430 billion Web pages is still partial: there are copyright issues, some website owners refuse to have their sites included, and because of how web crawlers work, large areas of the web are simply missed. Many European archives have refused to participate out of suspicion of Silicon Valley absorbing their cultural inheritance. (Think of the European Union’s current antitrust litigation against Google.) And searching it grows increasingly difficult because the tools for dealing with its exploding complexity remain so rudimentary.

This is the nasty challenge of preserving born-digital culture. Digital’s moment is always now. If Washington Post publisher Phil Graham was correct when he said that journalism is “a first rough draft of history,” our sense of that encounter in the 21st century is suffering the worst bit rot. We haven’t even seen much proof that a greater narrative might become apparent were we able to assemble all of our rough drafts together.

It is as if the digital universe were exploding like grains of sand on the beach, its Etch-A-Sketch of history erasing into the next wave of Now.

* * *

Tools for digital retrieval are fast improving. Some of the technology is amazing. Damaged scrolls from the Villa dei Papiri, buried when Mount Vesuvius erupted in 79 AD, can now be read thanks to multi-spectral imaging that allows researchers to digitize the papyri without unrolling them. A recent UV-light analysis of pages of the 750-year-old Black Book of Carmarthen revealed text that had been erased and written over, including poetry never seen before.

Much can be done to save us from our dust. But once it is dust, history vanishes.

In 48 BC, Julius Caesar, during his siege of Alexandria, accidentally burned its great Library to the ground, destroying the largest collection of papyrus scrolls and stored knowledge in the ancient world. Such destructions, accidental and willed, blight our recent history as well:

  • In 1923, earthquake and resulting fires did catastrophic damage to the Imperial University Library in Tokyo. Some 700,000 volumes were lost, including Records of the Counties and Villages of the 19th Century. In the Second World War, bombs destroyed the libraries of National Tsing Hua University in Peking and the University of Ta Hsia in Shanghai.
  • In 1937, floods in Ohio, West Virginia, Indiana, Illinois and Mississippi destroyed hundreds of libraries.
  • In the Second World War, libraries in Europe and Russia lost about 130 million books and manuscripts to bombs, burning and theft.
  • During the Cultural Revolution in China in the late ‘60s, libraries were purged of “reactionary, obscene and absurd” publications.
  • This past February, ISIS militants sacked Mosul’s central library, burning more than 100,000 books and manuscripts.

Worldwide, archives are threatened by fire, war, flooding, terrorism, bad storage, decay, theft — and bad archiving. The microfilm quality of long-destroyed documents is often marginal.

***

Digital erasures sweep by unnoticed, the way we can’t see the devastation of the seas from the still surface.

  • In 2002, a server crash at the Missourian newspaper wiped out an archive of fifteen years of text and seven years of photos. An obsolete software package prevented any real chance of retrieval. With that crash, fifteen years of a community’s record were permanently lost.
  • Know the average lifespan of a website? (Answer: 100 days.) Remember the iPad app The Daily? It no longer exists online. Nor does the Department of Defense’s Marshall Islands Document Collection, or the Hanford Declassified Document Retrieval System. Gone too are all the contents of Megaupload, Pets.com, Yahoo! Geocities, Prodigy and Encarta Online. If you posted your blogs at MySpace or Spinner or New York Press, they were sold off. Eventually someone pulled a switch, and all that was left was an indeterminate void until the cache cleared.
  • In 2009, the social networking site Ma.gnolia lost all of its data in a power failure. There was no backup and no effective recovery. In 2011, Health Net Insurance lost data on 1.9 million customers when nine portable server drives went missing from a data center. They were never recovered.
  • With cloud computing, entire dominions of photos, videos, documents and music are privately held (read your Apple license) and could easily vanish if an Apple or Amazon crashes hard (or someone simply pulls the plug).
  • Vastly networked corporations are especially vulnerable to cyberattack. Last November, the North Korean government hacked Sony Pictures’ network, stealing a vast amount of corporate data and then wiping out some 3,000 computers and 800 servers. The attack was similar to the “Dark Seoul” attack in 2013 against South Korean banks and broadcast networks, which wiped out some 40,000 computers and wrought $700 million in damages. (Last week 60 Minutes quoted a computer security expert who said that there were now “three to four thousand” hackers out there capable of such an attack and that their numbers were multiplying exponentially.)

Once the bytes are lost, the archive is toast.

Who then will have lost the most?

How would we ever know?

* * *

We do have the tools for taking a lasting picture of a culture at the moment of incredible transformation. Without comprehensively applying them, culture may itself vanish (at least as we have known it since the Gutenberg Press).

If we succeed in widely and wisely employing those tools, our remembered path will help us visualize the next straight-up assault. (Do you hear anyone saying that this ziggurat of digital expansion will ever wane?) We can also use the record to learn from our failures.

Our time — strange, isn’t it, even the word seems retro — needs to be preserved, universally accessible and in a permanently retrievable format (using open-source software).

Last week I spoke about the need for newspapers to network together to share traffic data and collectively raise the visibility of their community news on the web.

The same kind of industry collaboration could contribute to the growth of global archives. The availability of a paper’s archived materials can itself draw traffic and open new revenue streams.

Clifford Lynch, director of the Coalition for Networked Information, had this to say about the effort to archive news:

Past experience teaches us that news is a central part of the cultural record; it is used by an enormous range of scholarly disciplines as well as being of lasting importance to the broad public. In the digital world the nature of what constitutes news has already changed in fundamental ways, and the traditional practices for preserving news for future generations will no longer work. This is a major crisis, both for the future of scholarly work and our ability, as a society, to maintain a rich and comprehensive view of our history.

The risk of not doing anything with that news is unacceptable. Given the real chance that a massive cyberattack could wipe server and computer memories clean in one great sweep of the scythe, every day of delay adds to the danger. As with our rising seas, hesitation only increases the exponential reality of disaster.

If the financial incentives don’t seem clear — who’s to pay for all this archiving, who owns it, what’s in it for us? — perhaps we need to understand that our situation is not unlike that of the cancer patient who submits to experimental therapy not so they can live, but so the next generation of cancer patients has a better chance to do so.

Our cultural survival depends on the archivist as a personality trait, a job description, a line on a resume, a tenet of a 21st-century faith that what we do today has a big bearing on whether a house will still be waiting for us at the end of the next long day’s commute.

* * *

All this may be increasingly moot. I mean, who’s to say in the end that all this data was worth saving in the first place? What nuggets of true gold are to be found at the bottom of so great a binary sea? How fraught are the politics of cultural records?

And what yet-to-be-developed technology will be able to sift through present zettabytes and too-soon yottabytes of data with the processing power and algorithms to render it comprehensible? We’re just now discovering what magnitudes of tagging may be required to give sufficient future body to a piece. (Melody Kramer takes this on in this post.)

We still don’t know much about what happened before the Big Bang — all we know is that from a set of very dark, highly combustible data points, a multitude of possible universes came into existence, of which ours was and perhaps is only one possibility. (We just may not be seeing the others yet.)

Archiving may never know itself as we’d like; it seems the more we know about something, the more we sense how achingly much more we don’t. But it can help us with our future: to better identify which universe we come to live in.

Perhaps, even, to decide which next one is best for us.


Entertaining extras for community newspapers — today, tomorrow.