Virginia Leith in “The Brain That Wouldn’t Die” (1962)

Born-Digital News: It’s Not Dead Yet

It has been falling fast into the memory hole, but there are signs that tomorrow’s news will be saved

Published in King Features Weekly: Local and Thriving

16 min read · Jun 25, 2015


By David Cohea

As I wrote in “The Archivist’s Blues,” a massive amount of the information being created by an exploding Web is in danger of disappearing before we remember it was there. For news websites that create “born-digital news,” this is especially perilous, as so much news crucial to a community’s health and identity is likely to vanish from the record.

That was the warning delivered to participants at “Dodging the Memory Hole II: An Action Assembly” (DMH-2), a conference held last May to address the needs of born-digital news preservation. The two-day event involved a wide swath of memory stakeholders — press associations, journalists, technologists, publishers, archivists, librarians and vendors — and it ended with some good tentative first steps now being taken.

Tentative, because amid the wild explosion of born-digital news, nothing yet feels very substantial in reversing the trend down the memory hole.

And there are no experts, no proven solutions. Just a cross-disciplinary crew dedicated to approaching it from every conceivable angle. Out of that comes a tentative view of what’s next.

This post will look at some of the ideas — and solutions — that have percolated up in the time since.

Teri Garr, Peter Boyle, Gene Wilder and Marty Feldman in “Young Frankenstein” (1974)

Stillborn beginnings

One of the first studies to measure how much born-digital news is now being acquired, curated and preserved was conducted by the Educopia Institute, in collaboration with a number of memory institutions in North Carolina. Two surveys were distributed — one for libraries and archives in the state, the other for North Carolina news editors.

The findings weren’t good. While most libraries maintain archives of microfilm and digital print-pdf files, little more than 20% of the libraries collect digital news content in some format (website files, blog posts, or “digital-first” content). Only four respondents said their organizations had a digital preservation program.

Response from the news organizations was worse — only eight people responded to the news editor survey, and none reported having any written policies for born-digital materials.

The results, depressing as they are, serve as a good enough sample of the state of born-digital news preservation around the country.

If born-digital news is everywhere on the Web, isn’t it safe enough out there until someone starts preserving it? Hardly. Andy Jackson of the British Library investigated how much of the UK Web Archive collected between 2004 and 2014 was no longer live on the web. The results were pretty astonishing: after one year, half of the content was either gone or had been changed so much as to be unrecognizable. After ten years almost no content resides at its original URL.

Abbey Potter of the Library of Congress, who spoke to the DMH2 event, concluded from this research: “We have clear data that if content is not captured from the web soon after its creation, it is at risk.”

And it gets worse. Alan Mutter recently reported that in a survey of 141 news ventures started since 2010, one in four has failed. This year alone, the tech site GigaOm was shuttered for financial difficulties and Re/code was sold to Vox Media. The news app Circa shut down after failing to find financing, and Fusion, the collaboration between ABC News and Univision, failed to find enough young Hispanic viewers and repositioned itself as a destination for millennials.

Maybe such a failure rate is par for the course for Internet startups. If so, that makes born-digital news all the more tenuous. And though not all of these transformations represent an erasure, each twist in the road makes preservation a more difficult process.

With news organizations relying increasingly on social platforms to push their news, content is haphazardly archived at the platforms with no guarantee of retrieval. In a recent piece at Poynter, Melody Kramer put it this way:

For libraries to acquire these new digital assets, news organizations must now collect and preserve news published on third-party platforms they do not themselves control. And that gets tricky, because it means working with third parties to preserve material that the third parties have the authority to delete at any given time, according to their Terms of Service — or may not want to preserve.

Who knows what happens if those platforms disappear. And they do (think Geocities and Friendster). And with the volume of news content now dependent upon social platforms, it’s just another brick in the wall of the memory hole into which born-digital news is falling fast.

The problem also lies in newsrooms, where preservation of born-digital news is off the radar of most news organizations. Few apparently have any plan for archiving it beyond internal access. I talked with an editor at the Orlando Sentinel who told me that the newsroom’s CMS gives reporters good archival access; and while you can Google-search stories in the Sentinel’s digital archive going back to 1993 or so, the Orlando Public Library currently has no plans to archive or preserve that born-digital news. Newspapers aren’t thinking far enough ahead.

Maybe newsroom downsizing is partially at fault. Smaller newsrooms mean tighter resource allocations for every task. There are barely enough people on hand to knock out the day’s news.

Ernest Thesiger and Colin Clive in “Bride of Frankenstein” (1935)

Strengthening public-private partnerships

Newsrooms also don’t have much incentive for actively pursuing born-digital news preservation. So far, no one has mined anything close to gold in them thar archives.

Edward McCain, curator of digital journalism at the Reynolds Journalism Institute and one of the co-sponsors of the DMH-2 event, says he is exploring that financial model more carefully. Maybe there are ways to better monetize archived content. “We’re hoping to collect data that shows the potential market, market demand, profit, and monetization of capturing content and re-selling it,” he says.

Crucial to this process is strengthening relations between private and public enterprises. “Building partnerships equals sustainability,” says McCain; “digital preservation means money. By working together to share costs, public and private institutions can preserve and provide access to the sprawling universe created by born-digital news.”

One example of an important public-private partnership can be found between newspapers (and other local news organizations) and libraries. Think about it: journalists create the first draft of the news; libraries preserve that news and make it available to the community, now and in perpetuity. (For a deeper look into the possibilities of this partnership, see my post from last week.)

Other partners include press associations and other trade groups such as the Newspaper Association of America; memory institutions, from local historical societies to the Library of Congress; technologists developing software; and vendors such as TownNews.com and NewzGroup who offer archival as part of their service (though not true preservation solutions).

“Preacher-scientist” George Speake.

Bringing journalists back in

Journalists are another great partner to bring into the preservation conversation. While preservation may barely register amid the daily round, journalists become the loudest advocates for born-digital news preservation when it’s their own news that is lost.

When the Boston Phoenix — an alt-weekly that used to be a mainstay in that city’s downtown culture — folded in 2013, Boston came close to losing an invaluable community resource. Stephen Mindich, the former publisher, promised to keep the paper’s website (online since 1994) up, and that all online and print archives would be preserved. That was two years ago, and sometime last year the site began going dark. In a CJR piece, Valerie Vande Panne — who once worked there — wrote:

Obviously, this was cause for some self-interested concerns among the publication’s writers: Where is my work? When will it be back? More than one former Phoenix journalist I spoke to said something like, ‘My entire career is on that site. I sure hope it comes back.’

Ryan Thornburg, who now teaches digital journalism at the University of North Carolina, had a previous career as a journalist working at newspapers in the 1990s and 2000s. He told me of a similar experience:

The biggest losses I experienced were at The Washington Post, because almost everything we did there was the first time it had ever been done — first election night on the web, Clinton-Lewinsky, 9/11…. Live chats, videos, quizzes, polling databases, email newsletters, mobile apps (before smart phones), etc. … Those are important for me because I need ‘clips’ to get jobs in professional journalism.

He also related how born-digital news is lost in the routine operation of the department:

As we moved from CMS-to-CMS, we intentionally killed a lot of pages because they received no traffic. I can tell you none of us even thought about the importance of archiving them for history. During breaking news, we overwrote files all the time without tracking the minute — but important — changes that would have showed future generations how stories unfolded in real time. Even when we tried to preserve pages, many of them would end up broken.

Born-digital news on the Boston Phoenix website is taking a similar drubbing. The site was created in-house over the decades, and there is no record of the various overlapping generations of code. Mindich is said to be working with a local university on a handover of the archives, but the CJR article ends with this cautionary note to journalists: “archive your own work once it’s published — even if it seems like the publication will function in perpetuity.”

Good to do — but equally important (in this case, for the residents of Boston) is seeing to it that born-digital news lives on in a memory institution. Perhaps only when journalism jobs become tied more closely to foundation funding and civil service (as in Steven Waldman’s Report for America proposal) will the archival task become routine in the professional journalist’s day.

Boris Karloff in “The Man They Could Not Hang” (1939)

Vendors are vital

More often than not, vendors rather than libraries are where born-digital news ends up today. TownNews.com offers website solutions to 1,600 media websites (including newspapers, radio and TV stations) with a combined traffic of some 750 million page views. From the start, there was a question of what to do with aged website content. Chairman and CEO Marcus Wilson told me:

We started hosting newspaper web sites in 1995. Our technical staff asked me early on what should we do with the content that had been posted — ‘how often should we purge it?’ I decided that we should not purge it, at least until we understood the implications better. So, we’ve kept pretty much everything. (We do purge AP content automatically every 14 days).

Known by some in the industry as “The Accidental Archivist,” Wilson says that TownNews keeps digital content for as long as the customer has been with them, and even imports archived material that predates the customer’s signing up.

The web publishing software TownNews uses is called BLOX, and archival is written into the CMS — not simply relegated to backups. Wilson tells me that editors and reporters find it eminently searchable.

Still, there are limitations. TownNews does not promote itself as an archival solution. They don’t promise long-term storage. When TownNews began offering e-editions (publishing print PDFs) and found that storage costs were high, they tried to pass along the cost to customers; most chose not to have TownNews store print-PDFs. Also, they have found it challenging to import archives from other media such as tape drives, floppy disks and hard drives.

“Not enough thought or resources have gone into thinking about saving digital-age content,” Wilson adds.

“Before I Hang,” 1940

Older preservation formats are catching up

“Born-digital news” may be a misnomer, since just about all content is created digitally nowadays.

Digital preservation of older generations of newspaper content, usually on microfilm, is evolving. Microfilm is not a bad preservation medium — it’s quite durable — but the images are not searchable. Digitization techniques that bring microfilm up to industry preservation standards are expensive — as much as $1 a page. Until now, the cheaper solution most have used is to scan and OCR the microfilm to create a somewhat-searchable PDF.

Fortunately, a new platform has been developed by DL Consulting that automates the process (scanning is done through the cloud), formats the digitized image into METS and ALTO standards maintained by the Library of Congress, and does so at a cost of just pennies per page. (Details here.)

Digital output files of print newspapers (also called “print-PDFs”) represent the generation of newspapers since the early 1990s not captured on microfilm. NewzGroup archives print-PDFs for its customers (they have a clipping service that pays participating newspapers back a license fee every time one of their articles is used). In the past, the amount of metadata that could be added to these files was more limited, so the preservation capabilities were not as robust.

But that’s changing. NewzGroup’s production system has evolved its import capabilities, making it possible to add far more metadata to files. A file-naming program converts unconventional filenames into more searchable ones. Their system can strip legal notices (or any other content identifiable by Boolean search strings) out of the print-PDFs and post them to designated websites. Brad Buchanan, CEO of NewzGroup, tells me that with these enhancements he’s confident they will soon be capable of storing born-digital news as well.

On the library side, resources are developing for managing print-PDF archives. The Digital Public Library of America is a good resource for libraries seeking to better preserve these print-PDFs. The University of Kentucky Libraries has developed Paper Vault, a workflow, set of strategies and open-source tool used to manage and provide access to digital newspaper content. It can add a little or a lot of metadata (depending on the organization’s resources), and the content can be accessed through the Internet Archive.

“Frankenstein Meets the Wolf Man,” 1943

Building a better donation agreement

One of the most important actions to come out of the Dodging the Memory Hole II event was a commitment to improve the transmission of born-digital content from the creating news organization to the receiving memory institution. Central to this is the donation agreement.

Under current U.S. copyright law, donation isn’t easy. Written in the days of print, the law is confounded by the complexities of digital. There are provisions for legal deposit to the Library of Congress, but as yet little born-digital news has been requested by the LOC. (The current Librarian of Congress will be leaving his post at the end of this year, so perhaps with a change of leadership the LOC’s role in born-digital news preservation will advance.)

Several recent donations show promise. KXAS-TV (NBC Universal) donated news footage from 1950 to 1979, plus broadcast scripts, to the University of North Texas. And when the Rocky Mountain News folded, all of its archives were donated to the Denver Public Library. The donation agreements for both of those transactions took a while to hammer out, and they show the legal difficulties that need to be resolved to make this an easier process for others.

As it stands now, according to U.S. copyright law and laws for legal deposit, there must be cooperation between news publishers and those who preserve news. This greatly slows the process down. (Part of the problem may be that news organizations, seeking to mine ongoing value from their archives, aren’t willing to let them go to another party.)

In Europe, however, a number of countries (Denmark, Sweden, Norway and the United Kingdom) have changed their copyright laws so that national legal-deposit institutions are mandated to preserve born-digital content. And news publishers are required to furnish that content.

Because the thickets of copyright law are dense, guidance clearly needs to be provided to craft donation agreements that can navigate through the known difficulties. Language can be developed for donors, dealers and archival repositories. A media insurance policy should be included that indemnifies the new owner against using something that was “given” that wasn’t theirs to give. Cumulative experience will help each new agreement.

“Metropolis,” 1927

Just scrape it

With U.S. copyright law still trying to adjust to the digital age, time is wasting. And so, for many, “just scrape it!” is the rallying cry. As one DMH-2 participant put it, “if we don’t collect and preserve it first, all the other rights don’t come into play.”

The San Francisco-based Internet Archive runs on this principle. While the public is invited to upload digital material to its cluster, its web crawlers collect as much of the public web as possible. Its web archive, the Wayback Machine, contains more than 150 billion web captures.

Then there is the UCLA Library Broadcast NewsScape, an archive of Los Angeles-area digitized TV news programs going back to 2005 — some 170,000 hours of programming in all. They’re indexed and time-referenced to enable full-text searching and interactive playback.

Access to these archives is somewhat limited; they’re called “dark” archives because they can be used for research purposes only.

To get to the real value of born-digital news archival, memory institutions will need to do more than scrape sites. Frederick Zarndt, Secretary of the IFLA News Media section and one of the presenters at DMH-2, says that article archival is the real gold mine since web scraping is never more than partial:

Archiving or scraping websites is relatively simple. The Internet Archive does this with no explicit permission from the website owner. However websites, news websites especially, can be updated many, many times in a day, probably more times than one would want to scrape the site even if only the changes since the last scrape are captured.

Article archival may require more up-front work to establish, but operationally it’s a simpler task.

Collecting stories instead of websites requires some cooperation between the website owner and the collecting organization or between the story’s publisher / author and the collecting organization. Collecting and preserving RSS feeds is dead simple. I wonder why no one but the National Library of Sweden is doing this …
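Zarndt calls collecting RSS feeds “dead simple,” and it nearly is — a feed is just an XML list of stories with titles, links and dates. A minimal sketch in Python’s standard library (the inline feed below is a stand-in for XML fetched from a real news site’s RSS URL):

```python
import xml.etree.ElementTree as ET

# Stand-in for XML fetched from a news site's RSS feed URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Gazette</title>
    <item>
      <title>Council approves budget</title>
      <link>http://example.com/news/budget</link>
      <pubDate>Thu, 25 Jun 2015 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Bridge repairs begin Monday</title>
      <link>http://example.com/news/bridge</link>
      <pubDate>Thu, 25 Jun 2015 12:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def collect_items(feed_xml):
    """Extract (title, link, pubDate) records from an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        })
    return records

items = collect_items(SAMPLE_FEED)
```

Run on a schedule and written to stable storage, even a harvester this simple would capture each story’s headline, location and timestamp as it was published — the article-level record Zarndt argues is the real gold mine.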

Mad scientist, source unknown

Advances in software

Another web archival operation is Ben Welsh’s Past Pages project, which saves the shifting homepages of media sites. News sites update rapidly, and one archival interest is to observe the transformation of websites over time. Currently there’s no way to visualize an entire website, but at least by storing past front pages, central changes can be observed.

A progression on that concept, created by Welsh in partnership with the Reynolds Journalism Institute, is StoryTracker, an open-source tool for creating an orderly archive of HTML snapshots, extracting hyperlinks along with metadata that shows each link’s prominence on the page, and generating an animation of how the page changed over time.
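The core of that idea — harvest a homepage snapshot, pull out its links, and note where each one sat on the page — can be sketched in a few lines. This is not StoryTracker itself, just a minimal illustration using Python’s standard library; the sample HTML and the document-order “rank” as a prominence proxy are assumptions for demonstration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hyperlinks in document order; links appearing earlier
    on a homepage are (crudely) treated as more prominent."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Rank = position in the page, a rough prominence proxy.
                self.links.append({"href": href, "rank": len(self.links) + 1})

# Stand-in for one archived homepage snapshot.
SNAPSHOT = """<html><body>
<h1><a href="/top-story">Top story</a></h1>
<ul>
  <li><a href="/second-story">Second story</a></li>
  <li><a href="/third-story">Third story</a></li>
</ul>
</body></html>"""

parser = LinkExtractor()
parser.feed(SNAPSHOT)
```

Comparing these link lists across snapshots taken hours apart is what makes it possible to watch a story rise and fall on the front page.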

Welsh, who works in data journalism at the Los Angeles Times, has also been working on back-end archival solutions — making archival a function of the software itself. One idea Welsh and RJI are working on is a transactional archiving plugin for WordPress, a CMS commonly used by news sites.

Another idea is to make archival a standard function of the CMS, with structured-field interoperability, semantic metadata, an industry-wide text format such as XML, and stack functionality stitched into the front end, the back end, or both. It’s very technical stuff, but if CMS vendors can be convinced to make archival functionality universal, a large part of the born-digital news preservation problem could be resolved out of sight and mind of most people’s daily work.
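What a CMS emitting a structured, metadata-rich record on publish might look like can be sketched simply. The field names and schema below are hypothetical, purely for illustration — real systems would use established standards like those METS/ALTO formats mentioned earlier — but the principle is the same: every story leaves the CMS already wrapped in preservation-ready XML:

```python
import xml.etree.ElementTree as ET

def archival_record(article):
    """Serialize one story into a structured XML record (hypothetical
    schema) of the kind a CMS could emit automatically on publish."""
    root = ET.Element("article")
    for field in ("headline", "byline", "published", "url"):
        ET.SubElement(root, field).text = article[field]
    body = ET.SubElement(root, "body")
    body.text = article["body"]
    return ET.tostring(root, encoding="unicode")

record = archival_record({
    "headline": "Council approves budget",
    "byline": "Jane Reporter",
    "published": "2015-06-25T09:00:00Z",
    "url": "http://example.com/news/budget",
    "body": "The city council on Thursday approved...",
})
```

A memory institution receiving files like this would get searchable, self-describing records rather than raw page scrapes.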

Montage from “Metropolis,” 1927

Widening the appeal of preservation guidelines

The preservation picture for born-digital news is cloudy but improving. There’s so much more content and it’s evolving so much faster — its plight is news to many. Standards of preservation for newspapers do exist, but they aren’t widely known, much less understood or applied.

Katherine Skinner is director of Educopia Institute, one of the co-sponsors of the Dodging the Memory Hole events. She says that moving forward, her goal is to improve that communication. “I want to create a set of preservation guidelines that are aimed at different stakeholder audiences,” she says. “One for libraries/archives, one for news producers/editors, one for CMS vendors like TownNews, and possibly one for press association directors. They would be lightweight, less formal, maybe more interactive.”

But the message has to move out quickly. Someone put it this way at the conference: “The longer we wait, the harder it gets to reverse-engineer the past.”

* * *

Maybe it’s no coincidence that the Sixth Extinction — the result of Anthropocene-era climate-buggering by fossil fuel interests — is eating up animal species at a rate that hasn’t been seen since the fast fade of dinosaurs some 66 million years ago.

That’s the conclusion of Paul Ehrlich and a group of scientists at Stanford University in a recent study published in the journal Science Advances. Some 41 percent of all amphibian species and 26 percent of mammals are now on the verge of extinction. “If it is allowed to continue, life would take many millions of years to recover,” says lead author Gerardo Ceballos of the Universidad Nacional Autónoma de México. “And our species itself would likely disappear early on.”

Why is it that extinction’s Big Bang coincides exactly with born-digital’s Big Thang? Are they the secret twins of the age?

The Jacksonville Times-Union’s vanishing account of northeast Florida over the past ten years is weirdly similar to the disappearance of the Florida panther from the vast scrub acreage now turning into subdivisions.

Is the world and our news of it both disappearing from view because we have not stepped in to preserve either?

The study suggests that as human intervention is largely behind the Sixth Extinction, human participation could do much to reduce long-standing effects. “Avoiding a true sixth mass extinction will require rapid, greatly intensified efforts to conserve already threatened species, and to alleviate pressures on their populations — notably habitat loss, over-exploitation for economic gain and climate change,” write the study’s authors.

Perhaps, too, there’s hope that preservation of born-digital news will follow the lead of the conservation of global resources.

The necessity — even opportunity — for partnership is everywhere.

Interested in helping out? Sign up for the Down the Memory Hole Google group.

Madeline Kahn is the Monster’s smokin’ bride in “Young Frankenstein” (1974)

David Cohea is general manager of King Features Weekly Service, an editorial service for 700 weekly newspapers. Email David at dcohea@hearstsc.com.
