What will yesterday’s news look like tomorrow?

Journalists need to look outside of journalism to reinvent the future of news archives

Apr 4, 2014

To understand how the news industry has transformed in the past decade and a half, it helps to consider our evolving perceptions of the relationship between news and time.

Think of it this way: When’s the last time you saw an A1 above-the-fold headline in print that actually revealed something you hadn’t already heard or seen online? Contemplate the eternity that passes in the 20 minutes between email news alerts from competing media outlets. Even the term “24-hour news cycle” feels obsolete.

What do our changing ideas about the relationship between news and time mean for yesterday’s news—including the really old pre-internet stuff, more recent stories published two site-redesigns ago, and the articles that went live this morning?

And how should newsrooms think about archived material given our expectations about access to information in a networked world?

It all begins with categorization.

Context is everything

At the Library of Congress, a small team has been working since 2011 to reinvent the way libraries around the globe catalogue their resources. Journalists ought to pay attention because newsrooms and libraries are facing many of the same core challenges.

“With the advent of the internet and digitization, the format and the structure of what libraries collect—and therefore what libraries need to describe—has changed dramatically,” said Beacher Wiggins, director of cataloguing at the Library of Congress. “So whatever we create now for this digital world we’re in, it has to be something we can map to the existing records. It is a massive undertaking, and further, we want to then map it to what those in the semantic web world are creating, a standardization or best practices that the web world is using.”

Wiggins joined the Library in 1972, at a time when it was just replacing its previous cataloguing system — the use of descriptive three-by-five-inch notecards filed in physical drawers — with the electronic system it’s now about to replace. “Our dilemma has always been, how do you corral content? And management of content really boils down to organizational management. How do you sustain that? Libraries don’t want to become museums but living institutions,” Wiggins told me.

The current system is built around machine-readable cataloguing — or MARC — records. Library goers can use simple computer terminals linked to a main in-house database of resources to search by title, author, publication date, or other keywords. (Here’s an example of what the MARC record for the 1908 photograph at the top of this article looks like.)

Now the Library of Congress is developing what it’s calling the Bibliographic Framework Initiative, or BIBFRAME, to link library resources to the much larger web of data online. The key difference between MARC and BIBFRAME is that the library is shifting away from a process built on cataloguing descriptive details like “author” or “title,” and toward identifying and establishing all kinds of links between different resources.
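To make that shift concrete, here is a minimal sketch in Python of the difference between a flat descriptive record and a set of linked statements. It is not actual MARC or BIBFRAME syntax; every identifier, field name, and relationship name below is a hypothetical stand-in chosen for illustration.

```python
from collections import defaultdict

# A flat, card-catalogue-style record: descriptive fields only.
# (Illustrative field names, not real MARC tags.)
flat_record = {"title": "Street scene", "creator": "Unknown", "date": "1908"}

# Linked-data style: (subject, relationship, object) statements.
# All identifiers and relationship names are hypothetical.
statements = [
    ("photo:street-scene", "createdBy", "person:photographer"),
    ("photo:street-scene", "depicts", "place:city"),
    ("article:news-story", "illustratedBy", "photo:street-scene"),
    ("book:city-history", "about", "place:city"),
]


def build_index(triples):
    """Index every statement from both ends so links can be followed either way."""
    neighbours = defaultdict(set)
    for subject, relation, obj in triples:
        neighbours[subject].add((relation, obj))
        neighbours[obj].add((f"inverse:{relation}", subject))
    return neighbours


def related(resource, neighbours, hops=2):
    """Return every resource reachable from `resource` within `hops` links."""
    seen = {resource}
    frontier = {resource}
    for _ in range(hops):
        reached = {target for node in frontier for _, target in neighbours[node]}
        frontier = reached - seen
        seen |= frontier
    return seen - {resource}


index = build_index(statements)
print(sorted(related("photo:street-scene", index)))
```

The flat record can only answer questions about its own fields. The linked version can surface the book about the same place, even though nothing in the photograph's own description mentions it, which is roughly the kind of discovery the linked approach is meant to allow.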

It’s a system that reflects existing expectations about seeking information in a Google-able world — and it might help a bit to think of BIBFRAME as a system that operates more like a human brain than a physical card catalogue with unlinked resources in separate drawers.

In journalism terms, BIBFRAME represents the difference between the information you might get from a single printed article and the experience you have reading a heavily-linked piece of journalism online.

“It has to do with relationships, interrelationships and interoperability, and you have to start with the data,” Wiggins told me.

His team is working with several libraries and universities across the world to develop BIBFRAME standards, and he expects to have enough evidence to assess the viability and reliability of the new system by 2015. The project represents a step closer to making sense of the mountains of data that are piling up around us by building bridges between datasets to better contextualize them.

For journalists whose lives revolve around finding and contextualizing information, BIBFRAME could unlock all kinds of data troves and patterns that you would have to specifically seek in order to find today.

Whereas a reporter or researcher once had to hunker down in a basement full of microfiche for a glimpse into the carefully indexed past, networked resources can mean an enhanced research and reporting experience.

One of the Library of Congress’ main principles, according to a 2012 paper about BIBFRAME, is to “provide links as broadly as you can” because “you never know how someone or some machine is going to choose to navigate your Web of data… in ways that were not originally conceived.” This is a critical point because it speaks to enhancing a network of linked data as a way to anticipate future technological advances, an idea that simple search-by-keyword databases miss.

“One thing that libraries have learned is you don’t try to set up a system or a methodology or a process that you then want users to adhere to,” Wiggins told me. “You need to be responsive to how users are evolving and what they come to expect.” This attitude and much of Wiggins’ thinking about the future of cataloguing mirrors how New York Times staffers are approaching the future of their newspaper’s archives.

Comparing Apple and apples

The New York Times is borderline obsessive when it comes to indexing its work. For example, anything about Hillary Clinton is described as, “Clinton, Hillary R.” If an article is about Apple, the company, it’s tagged in an organizational category; if it’s about apple, the fruit, it’s tagged as a description.
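As a rough illustration of why typed tags matter, the sketch below pairs each tag's label with a category, so that the same word can point to different things. The category names, the `Tag` class, and the headlines are all invented for this example; they are not the Times' actual vocabulary or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    kind: str   # hypothetical category: "organization", "descriptor", "person"
    label: str  # the indexed term itself


# Invented headlines, each with a set of typed tags.
articles = {
    "Quarterly earnings beat forecasts": {Tag("organization", "Apple")},
    "A guide to autumn orchards": {Tag("descriptor", "Apple")},
    "On the campaign trail": {Tag("person", "Clinton, Hillary R.")},
}


def find(kind, label):
    """Return headlines carrying exactly this (kind, label) tag."""
    wanted = Tag(kind, label)
    return sorted(title for title, tags in articles.items() if wanted in tags)


def find_by_label_only(label):
    """A plain keyword lookup, which cannot tell the company from the fruit."""
    return sorted(
        title
        for title, tags in articles.items()
        if any(tag.label == label for tag in tags)
    )


print(find("organization", "Apple"))
print(find_by_label_only("Apple"))
```

The typed lookup returns only the story about the company, while the label-only lookup returns both the earnings story and the orchard guide.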

And today, the newspaper is focused on taking its existing indexing vocabulary and mapping that to other vocabularies used online, or “knowledge organization systems,” like the indexed terms used on Wikipedia. “It’s fair to say that the Times is in an enviable position when it comes to metadata around our archives — we’re really vigorous,” said Evan Sandhaus, lead architect of semantic platforms at the Times.

Yet the system isn’t perfect, and that’s partly because of how stories evolve — and also the arbitrary nature of coming up with the right name. “You’ll see a knot of tags around an event that are emblematic of the event, but I think a challenge that all news organizations face is how we can translate this knowledge of the entities involved in the event into a good name that we can aggregate around. It’s an active area of inquiry.”

Looking back often means cobbling together a series of events article-by-article based on searches by date and keyword. But imagine if there were better ways to group related stories as you filed them, so that future reporters could more dynamically navigate a news archive to understand the big picture of what happened.

To figure out how to tag today’s news before it turns into tomorrow’s archives, news organizations have to think about the essential functions of their archives in the mobile internet age.

The New York Times has fully digitized articles — but not complete replicas — going back to its 1851 start. It just introduced the latest iteration of TimesMachine, a complete digitization of old editions as they appeared in print from the paper’s 1851 launch through 1980.

You really have to explore TimesMachine for yourself. It’s stunning.

One of the many impressive things about TimesMachine is that it repurposes the mechanics of online mapping for what staffers designed to be a more immersive archival experience than a standard print replica. “So you really experience quite a lot of data in a really rich way inside of your browser using an interface paradigm that has become really familiar to everyone using the web,” Sandhaus said.

This strategy also makes the TimesMachine archive less of a burden on bandwidth. A single back issue of the Times can be a 13.2-gigapixel image that exceeds 200 megabytes. But there’s no reason for someone to load the entire 13.2-gigapixel image at once. Instead, Sandhaus and his team split up these huge images the way an online map is broken down into tiles, since you really only need to zoom in on a small portion of the larger image at a time.
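The arithmetic behind tiling is simple enough to sketch. The Python below assumes a 256-pixel tile and made-up image dimensions that happen to multiply out to 13.2 gigapixels; neither figure comes from the Times, and real map-style viewers typically also keep lower-resolution zoom levels, which this sketch leaves out.

```python
import math

TILE = 256  # assumed tile edge in pixels, not a figure from the Times

# Hypothetical dimensions: 120,000 x 110,000 pixels = 13.2 gigapixels.
IMAGE_W, IMAGE_H = 120_000, 110_000


def tiles_for_viewport(x, y, width, height, tile=TILE):
    """Return (column, row) indices of every tile that overlaps the viewport."""
    first_col, first_row = x // tile, y // tile
    last_col = (x + width - 1) // tile
    last_row = (y + height - 1) // tile
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Tiles needed to cover the whole image at full resolution.
total_tiles = math.ceil(IMAGE_W / TILE) * math.ceil(IMAGE_H / TILE)

# Tiles needed for a 1280 x 800 browser window somewhere inside the page.
needed = tiles_for_viewport(x=50_000, y=20_000, width=1_280, height=800)

print(f"{len(needed)} of {total_tiles} tiles fetched")
```

Under these assumptions, a reader looking at one window's worth of the page needs 24 tiles out of roughly 200,000, which is why the full image never has to travel over the network at once.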

Unlike the Times’ single-serve-article archives, TimesMachine enables readers to experience the paper as it appeared when it was printed — so you get the editorial context of story placement and headline size, along with advertisements that original readers would have seen. TimesMachine is also a mind-boggling reminder of just how thick and text-heavy the newspaper used to be. You have to get to page 15 of the July 1, 1971, edition before you see a single photograph.

Sandhaus says the internet has forced the Times to confront “challenges that are more often encountered in the library space than they are in the online publishing space.”

But that’s true for any organization that creates content. Most of them just haven’t done anything about it yet.

The newsonomics of nostalgia

One of the key hurdles in building better news archives is to overcome conventional expectations about what an archive or catalogue ought to look like or how it should function.

This is the lesson the journalism industry has had to relearn repeatedly as habits built around old technologies are carried over to new ones. It’s why early radio broadcasters began reading the print newspaper verbatim on air, and helps explain the lingering tendency for newspapers to publish static PDF print replicas online. For some print shops, there’s more of a desire to preserve than there is to provide access.

The New Yorker and Harper’s have complete print-replica archives online dating back to their respective beginnings in 1925 and 1850 — in both cases the full archives are available only to subscribers.

At The Atlantic, editors resurface classic pieces for anniversaries: S.L.A. Marshall’s “First Wave at Omaha Beach” (1960) around the anniversary of D-Day, and Jay Epstein’s “Have You Ever Tried to Sell a Diamond?” (1982) before Valentine’s Day, for example. But the magazine still has “a weird patchwork” of digital availability, The Atlantic’s Bob Cohn told me. (The magazine was founded in 1857 but its complete online archives only go back to September 1995.)

“Here are two reasons a story wouldn’t be digitized,” he said. “One is that we never got around to it, so we’ve done so opportunistically. The second is that we don’t have electronic rights to everything we published [before the internet].” One hint that the magazine is thinking more about how to use its archives? The “Atlantic Archive Party” it apparently hosted for staff members.

Other high-profile news organizations have practically non-existent archives from a usability standpoint. The Washington Post’s archives still run on ProQuest, a clunky third-party microfilm platform that charges for bundled access to old stories. A spokeswoman for the Post told me no one at the paper could speak to its archives or even characterize the extent to which the paper is thinking about news archives in the digital age. Archives for The Wall Street Journal and the Los Angeles Times are also available for purchase on ProQuest, and spokespeople for both papers declined interview requests.

Figuring out how to manage decades of dated content is costly and time-consuming. And there isn’t necessarily a clear monetary payoff for the effort. Archival access is often a perk that comes with subscription, though it’s not clear how many subscribers choose to pay in order to get access.

At the same time, younger news organizations are building entire strategies around the cultural obsession with looking back. There’s BuzzFeed Rewind, a vertical devoted to nostalgia. (This isn’t an entirely new concept: Magazines have devoted special issues to nostalgia for decades. Check out the cover of LIFE magazine’s February 1971 issue or any number of Vanity Fair issues with a Kennedy on the cover.)

The nonprofit Retro Report launched last year as a documentary news organization devoted to following up on big stories from the past — Dolly the cloned sheep and the scalding McDonald’s coffee lawsuit, for example — and has a partnership with The New York Times.

By last fall, Retro Report videos were racking up tons of views, in some cases millions in the first few days after publication. The site’s long-term goal is to build an expansive “library of old news stories,” a plan Retro Report publisher Taegan Goddard calls “extraordinary” because it will require his team to revisit and update existing videos (retro Retro Reports, if you will).

He wants the site to be as much of a go-to as Wikipedia is, only for questions about the outcome of big stories that are no longer making headlines. But Retro Report thinks of itself as an educational service as much as it is a journalistic one, and Goddard says he understands why for-profit news organizations with “more immediate needs in terms of revenue and profit… haven’t figured out exactly what to do” about their archives.

Mapping a networked future for the past

One of the internet’s best tricks is how it creates experiences that feel simultaneously ephemeral and permanent when the reality is actually somewhere in between.

We’re at a point where we expect to find everything we’re looking for online — and very often we do. But what about the things we don’t know to search for? And what about when days, months, and years have passed?

This paradox — let’s call it the ephemeral permanence of networks — is one the news industry ought to relate to. After all, so much of journalism is about getting information on the record for posterity while the simultaneous obsession with newness—you know, the news—almost immediately turns even the boldest-type headlines into fishwrap.

But the power of interconnected networks saps influence from traditional publishers as gatekeepers, a role that news organizations have been reluctant to relinquish. Perhaps that’s why an industry that is fundamentally about change — noticing it, analyzing it, reporting it! — has been so godawful at actually adapting to it.

On some level, staying relevant in an era in which readers have more choices than ever means staking a place in history. The power of a big brand is the credibility it has established over time. But how can news organizations expect anyone to find their stories valuable today if those same organizations are sending the message that their archives aren’t worth showcasing tomorrow?

Meeting readers where they are doesn’t just mean gliding by in their Twitter feeds or alerting them to a big story on their iPhones; it means being the source they find when they’re searching for something else, answering the questions they didn’t know they had.

News organizations need to design archives that better mirror the experience of consuming news in real time, and reflect the idea that the fundamental nature of a story is ongoing. This philosophy helps explain why terms like “breaking news” feel so stilted now. It’s not just because they’re overused and misused—and they are—it’s also because we measure news-time differently now. And it’s time news organizations think more seriously about doing the same. Newsrooms must figure out how to weave past and present journalism into the experiences we as readers are already expecting.

This is no small task.

The New York Times’ Sandhaus says he counted something like 3,620,294 pages in the paper’s archives. And the Times is publishing somewhere in the neighborhood of 381,052 additional words each day, according to its corporate site.

In other words, there is plenty to keep the archive-obsessed journalist awake at night.

“I’m thinking about building systems that make it possible to include our archive on our site in a way that feels as natural as embedding a video player or a sound player,” Sandhaus said. “What does that technology look like? What are the other things we can do to make the archives even more a part of the site? What are the engineering and user-experience challenges? What can we do to delight our audience with an enhanced archive user-experience? I think that right now, the answer is, we’re not quite yet sure.”