Don’t rely on websites to tell the truth about their past: why you need your own archive

Paul Bradshaw
Thoughts On Journalism
Jun 28, 2016

This post was first published on the Online Journalism Blog

Archives image by DRs Kulturarvsprojekt

Previously I wrote about the problem with trusting Twitter to keep a public record of all tweets. But it’s not just social networks; we can’t trust any website to keep information on our behalf.

This week, for example, the group that campaigned for the UK to leave the EU deleted various claims from its website in the wake of its victory. If you were hoping to test politicians’ claims now against what they said then, you might struggle, unless you saved a copy.

Even with the best of intentions, webpages sometimes just disappear. Three articles highlight the problem particularly well.

Google loses interest and links rot

First up, Andy Baio, who wrote of Google’s abandonment of its archiving ambitions in ‘Never trust a corporation to do a library’s job’:

“After a series of redesigns, Google Groups is effectively dead for research purposes. The archives, while still online, have no means of searching by date.

Google News Archives are dead, killed off in 2011, now directing searchers to just use Google.

Google Books is still online, but curtailed their scanning efforts in recent years, likely discouraged by a decade of legal wrangling still in appeal. The official blog stopped updating in 2012 and the Twitter account’s been dormant since February 2013.”

Towards the end of last year Mario Tedeschini-Lalli wrote about how his work on the CNNitalia website was now completely gone from the internet:

“Its full four years of coverage — including 9/11 — are nowhere to be found, just like almost all of the journalism produced by myself or by the digital newsrooms I managed in the last 18 years.”

Then Alexios Mantzarlis wrote about a plugin which aimed to help prevent ‘linkrot’ on WordPress websites:

“Factcheck.org, which launched in 2004 now has almost 6,000 dead links. Roughly one third of all the links on Pagella Politica, the Italian fact-checking website I edited before joining Poynter, are currently broken. At the same time, trying to manually keep tabs on the state of a site’s links is too time-consuming to be feasible.”

How to create your own archive

Installing a plugin is just one approach — but you need to be able to install a plugin on your site, and it only applies to pages that you link to — not, for example, the pages you are writing.
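
If installing a plugin is not an option, a short script can at least report which of the links on a page have already died. The following is a minimal sketch in Python using only the standard library, not the plugin's method: the names `extract_links`, `link_status` and `find_dead_links` are made up for illustration, and it assumes that a missing response or any status of 400 and above means the link is dead. Some servers reject `HEAD` requests, so treat the results as a prompt to check by hand.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the absolute http(s) href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def extract_links(html):
    """Return the absolute links found in an HTML string, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def link_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except (URLError, OSError):
        return None


def find_dead_links(html, check=link_status):
    """Return the links in html whose status is missing or 400 and above."""
    dead = []
    for url in extract_links(html):
        status = check(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

Passing the HTML of one of your own posts to `find_dead_links` gives you a list of URLs to replace with an archived copy. The `check` argument can be swapped for a different function, which also makes the logic easy to test without touching the network.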

Here, then, are some other techniques for reducing the chances that your work and research disappear from the web.

Automated archiving

In my post about Twitter I mentioned the tool IFTTT for automatically storing a copy of someone’s tweets. That same tool can be used to automatically archive documents and webpages.

For example, IFTTT has a number of recipes which will automatically save email attachments to your Dropbox, Google Drive, or other cloud storage services.

But it also has recipes which will save files at a particular URL to the same services. Here’s one which backs up Reddit posts.

You can use these recipes to back up documents or webpages at any link that you share on social media. The example below uses Google Drive but you could also use OneDrive, Evernote, etc.

Unfortunately the results are only given the name of the user and the timestamp, so you’ll need to use other techniques if you need to search the webpages’ contents (here’s one).
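
One way to search the contents of saved pages yourself is sketched below in Python. It assumes the backed-up files have been synced to a folder on your own computer and saved with an `.html` extension. The function names `page_text` and `search_archive` are invented for this example. It strips the markup from each file and looks for a phrase in what remains, ignoring case.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Keeps the visible text of a page, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def search_archive(folder, phrase):
    """Return the saved .html files under folder whose text contains phrase."""
    matches = []
    for path in sorted(Path(folder).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        if phrase.lower() in page_text(html).lower():
            matches.append(path)
    return matches
```

Calling `search_archive` with the path to your synced backup folder and a phrase returns the matching files. Because it only reads HTML, it will not find text inside PDFs or images.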

Social bookmarking services with caching features

Another option is to use a social bookmarking service like Delicious or Pinboard. I am a big fan of these services because they make a significant difference in the time needed to re-find documents and reports that you have previously seen (and indeed in some cases forgotten about).

Some of these services offer ‘archiving’ or ‘caching’ functionality, where they will store a copy of any bookmarked webpages, typically for an annual fee.

I use Pinboard, which works out at $25 per year. Historious says that caching is included in all its plans, including the free one, while Diigo includes ‘unlimited caches’ in its $5 per month ‘Standard’ package.

Pinboard claims in a now five-year-old post that those plans are not directly comparable, however, with some not working with PDFs or embedded content, so check what you need and what’s provided. They still claim that “Pinboard is the only site that stores and indexes full page content, not just HTML.”

Diigo’s plan, meanwhile, offers the facility to upload a page “even if it is dynamic or hidden behind the password protection” while “you can also capture multiple versions of the same URL at different times.”

Combine the two

Of course, Pinboard, Delicious and Diigo are also channels on IFTTT, and you can use that to back up your bookmarks too. Just follow the same instructions as above for uploading URLs to Google Drive or other services.

Indeed, it may be possible to use IFTTT to create a free alternative to the paid-for caching services — but remember this will be harder to search, and will not include embedded content such as images.
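
If you are comfortable running a script, the same idea can be sketched without IFTTT at all. The Python below is an illustration only: `fetch`, `archive_page` and the `index.tsv` log file are names made up for this example, and the `.html` extension assumes you are saving webpages rather than PDFs. Each call saves the raw page under a timestamped file name, so running it again later keeps a second version of the same URL alongside the first.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import Request, urlopen


def fetch(url, timeout=30):
    """Download the raw bytes at url."""
    request = Request(url, headers={"User-Agent": "archive-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def archive_page(url, folder, fetch=fetch, now=None):
    """Save a timestamped copy of url in folder and return the file path."""
    now = now or datetime.now(timezone.utc)
    body = fetch(url)
    # A short hash of the URL keeps file names safe and groups
    # different versions of the same page together.
    url_id = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{url_id}_{now:%Y%m%dT%H%M%SZ}.html"
    path.write_bytes(body)
    # Log the URL next to the copy so each file can be traced to its source.
    with open(folder / "index.tsv", "a", encoding="utf-8") as index:
        index.write(f"{path.name}\t{url}\t{now.isoformat()}\n")
    return path
```

As with the IFTTT approach, this stores only the page's HTML, not embedded images or scripts. Pointing `folder` at a directory that syncs to Dropbox or Google Drive gives you a second copy.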

Still third party

Of course, all of these recipes still rely on a third party, wherever you are storing the files: Google Drive, Dropbox, Pinboard or Diigo.

But it does bring all your material into one place (or more than one place for extra redundancy!), and makes it much more likely that you will have an opportunity to move that material if you need to: if a service closes down, it is likely to notify users so that they can export their material.

Oh, and if IFTTT closes down, check out the alternative, Zapier.

UPDATE: How to save backup copies of webpages and social media

Paul Myers has written two posts on saving online webpage evidence and archiving evidence from social media platforms. Both are well worth a read for a more basic introduction to this area.

Paul Bradshaw
Write the @ojblog. I run the MA in Data Journalism and the MA in Multiplatform and Mobile Journalism @bcujournalism and wrote @ojhandbook #scrapingforjournos