A Guide to Archiving Web Pages

Unlike traditional print media, online data can easily be modified or deleted. Third-party services cannot be trusted to keep relevant data available. Evidence should be archived by the researcher immediately after finding it, even if its relevance is not yet apparent at the time of reading. The rule of thumb when researching the past actions and comments of a public figure or company should be “Archive Everything” (even this guide).

Online Archive Services

There are many online sites that offer to archive web content for you. Some of these services (webcitation.org, the Archive.org Wayback Machine) are well known for deleting archives at the demand of individuals or companies without asking too many questions. The most common methods are baseless DMCA takedown requests and unsubstantiated claims of “harassment” or “online abuse”. Be aware of these tactics before publishing critical archives in an article or blog post.

It is worth noting that a DMCA takedown request implies ownership of the content. Ironically, this can give additional legitimacy to archived evidence (assuming you made backup archives on other sites).

An archival site should ideally show the time and date of archival and the original URL of the archived page. This information can often be relevant evidence in itself. It can also be used to document ongoing editing (e.g. censorship) of relevant web pages over time.
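If you keep your own list of archives, it helps to record the same two pieces of information for every entry. The following is a minimal sketch of such a log in Python; the names `ArchiveRecord`, `make_record` and `append_record` are made up for this illustration, and the one-JSON-object-per-line format is just one possible choice.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ArchiveRecord:
    original_url: str  # the page that was archived
    archive_url: str   # where the archived copy can be found
    archived_at: str   # ISO 8601 timestamp in UTC


def make_record(original_url, archive_url, when=None):
    """Build a record; `when` defaults to the current time in UTC."""
    when = when or datetime.now(timezone.utc)
    return ArchiveRecord(original_url, archive_url,
                         when.isoformat(timespec="seconds"))


def append_record(log_path, record):
    """Append the record as one JSON line to a plain-text log file."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record)) + "\n")
```

Archiving the same page several times and logging each archive gives you a dated trail that shows when the content changed.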

Archive.is

Archives static web content. Does not save Flash videos or PDF files. This archive site is currently blocked in Finland.

Example archive: https://archive.is/QIdpP#selection-105.0-105.5

As you can see in the example, a reference to the selected text is automatically appended to the URL of the archive (#selection-105.0-105.5). Your browser window will automatically scroll to the selected text if that link is opened in a new tab. This feature is especially useful for citation references in articles.

Peep.us

Another archival site. It requires a Google account. Archives can be deleted by the person who requested them, so be wary of archives made by other people. An upside is that you can archive private pages (such as forum threads behind a login barrier).

Tweetsave.com

Useful for archiving tweets on twitter.com. A tweet saved with Tweetsave used to be archived automatically on Archive.org and Archive.is as well. This no longer happens, so archive to Archive.is manually. Make sure to archive all tweets in a relevant conversation. Tweetsave only archives single tweets, so it can make sense to archive long tweet chains manually with a single Archive.is archive.

Automated Archives like Archive.org Wayback Machine and Google Cache

These services use web crawlers to save web pages automatically, which makes them useful for finding deleted or temporarily unavailable web pages. Archives found this way can then be archived again with a more trustworthy manual archival service.

Example: archive of an Archive.org archive: https://archive.is/mZX8j

Example: archive of a Google Cache archive: https://archive.is/3xjnP

Archive of an archive (no lame Inception joke here)

Both of these services are well known for deleting archives, so they should not be used for long-term archival of sensitive evidence. In addition, Google Cache only saves pages temporarily, and under EU law individuals can demand the deletion of Google search results related to their names.

A useful search engine for finding cached versions of a web page is http://cachedview.com/
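Looking for an existing Wayback Machine snapshot can also be scripted. The sketch below assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available?url=...`) and its JSON response with an `archived_snapshots.closest` entry. Check the current Internet Archive documentation before relying on it. The function names are made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the lookup URL for the page you are searching for."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the snapshot URL from a JSON response, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url, timeout=10):
    """Ask the service over the network and return a snapshot URL or None."""
    with urlopen(availability_query(page_url), timeout=timeout) as response:
        return closest_snapshot(response.read().decode("utf-8"))
```

If `find_snapshot` returns a URL, archive that snapshot again with a manual service, as described above.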

Download Archives

Ultimately, every online archival service is corruptible or can be forced to delete archives given enough pressure. In addition, some pages cannot be archived by online archival services because of paywalls, dynamic content or restricted access rights. Important web pages should therefore be archived to your local file system.
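A local copy can be made with a few lines of Python. This is only a minimal sketch: it saves the raw HTML of a publicly reachable page under a timestamped file name, and it does not fetch images or scripts or get past logins and paywalls. The function names and the `snapshot-...` naming scheme are assumptions made for this example.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen


def snapshot_filename(when):
    """File name carrying the UTC time of the download."""
    return when.strftime("snapshot-%Y%m%dT%H%M%SZ.html")


def save_snapshot(content, directory, when=None):
    """Write the downloaded bytes to `directory` and return the file path.

    `when` should be a timezone-aware datetime; it defaults to now.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    target = Path(directory) / snapshot_filename(when)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target


def download_page(url, directory, timeout=30):
    """Fetch a page over the network and store it as a local snapshot."""
    with urlopen(url, timeout=timeout) as response:
        return save_snapshot(response.read(), directory)
```

Calling `download_page` with the page URL and a target folder of your choice leaves one dated HTML file per download in that folder.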

The downside of this method is that the trustworthiness of the archive rests entirely on your own word, since you could easily manipulate the downloaded archive. Still, a questionably trustworthy archive is better than no archive at all, so this method can serve as an additional safety net alongside your online archives.
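One way to make a local archive slightly more verifiable is to record a checksum of the file right after downloading it and to share that checksum somewhere public. This is a suggestion added here as a sketch, not a guarantee: a checksum cannot prove that the file was unaltered before it was computed. It only lets others confirm that the file has not changed since. The function name `sha256_of_file` is made up for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals the end of the file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Anyone who later receives the file can run the same function and compare the result with the checksum you recorded.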

You can also ask another trustworthy person (preferably not a relative or close friend) to make a local archive of the relevant page as well.

Archive.is offers a “Download as zip” feature for their online archives.

Archive.is download feature

There are also various browser plugins for Firefox and Chrome that offer this feature.

It can also make sense to take a full-page screenshot of a website. Various browser plugins are available for this too, and you should easily find them in the extension store of your browser.

One of the many browser plugins for full page screenshots

Archiving Videos and other Media Elements

You can use the freeware JDownloader2 to download videos from YouTube and other comparable sites (Vimeo, Dailymotion, VOD Twitch broadcasts, etc.). It can also download all sorts of embedded media from web pages, which might otherwise not be caught by normal archival procedures.

There are also sites that offer downloads of online videos, if you do not want to install a tool just for that purpose.

Useful Browser Extensions

Chrome

Firefox


Sources

“Useful Resources and Tools” by @BoogiepopRobin http://pastebin.com/nztjK7Jd [A]

Show your support

Clapping shows how much you appreciated Hans Schmitt’s story.