The Internet gets processed here: Archiving the web*

“In order to be a developer — and especially a web developer — it’s incredibly important to understand how the web works. From here on out, you are no longer just a user of the internet. You are a creator of the web…Until today you have always been a client. Moving forward you will be building the server. This means processing requests, creating responses, and sending them back to the client.”

Personally, I feel like everyone is a creator of the web. Maybe not everyone is a creator OF the web, building servers and whatnot, but everyone is a creator of what makes up the web today. The content and behavior of the web right now determine today's (and possibly only today's) experience, because it will inevitably change, ending up completely different, and completely gone, by tomorrow.

WHAT IS WEB ARCHIVING?

Well, according to the Internet, more specifically the International Internet Preservation Consortium (IIPC),

“web archiving is the process of collecting portions of the world wide web, preserving the collections in an archival format, and then serving the archives for access and use.”

There are conventions to the web that creators and developers should follow, and archivists try to use them to their advantage to map out workflows and collections. There is also a standardized preservation format for web archiving, the WARC (Web ARChive) file, that the Internet has agreed upon.
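
To make that concrete, here is a minimal sketch of writing a single WARC record in Python with the open-source warcio library (this assumes you have warcio and requests installed; the URL is just a placeholder):

```python
# Fetch a page and store the HTTP response as a record in a gzipped WARC file.
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # Ask for an uncompressed body so the stored payload matches the headers.
    resp = requests.get('https://example.com/',
                        headers={'Accept-Encoding': 'identity'},
                        stream=True)

    # Preserve the original HTTP headers alongside the payload.
    http_headers = StatusAndHeaders('200 OK', resp.raw.headers.items(),
                                    protocol='HTTP/1.0')

    record = writer.create_warc_record('https://example.com/', 'response',
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```

Replay tools like the Wayback Machine can read records stored this way, which is exactly why having an agreed-upon format matters.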

WHAT IS THE WEB ARCHIVING PROCESS?

Beyond those conventions and the WARC, the world wide web (www) might as well be called the wild wild west, because there isn't really a standardized workflow or process for archiving the type (and quantity) of content on the web. Still, most efforts move through the same general stages:

  1. Identification
  2. Selection
  3. Harvest (acquisition)
  4. Preservation
  5. Access
A lot of the content isn't actually grouped together in one document, either; it's scattered across platforms and people:

“First, people archived their materials as part of a wider information management process, including the content on their social media sites, and their archiving was thus spread across a number of platforms. Second, the process of archiving was not an individual pursuit. Instead, people would, for example, rely on friends or family members to be able to keep a record of certain events. Third, much of the content is neither archived nor backed up since it is thought (often no doubt mistakenly) that it can be easily found again by searching through one’s file systems. Furthermore, much material, for example photos on a photo sharing site that is no longer used, are simply abandoned or discarded as not being worthwhile…Fourth, people regarded different sites or platforms as different facets of themselves, without any need for integration. Hence, while one might expect people to be worried about keeping their personal material in an online storage system or controlled by organizations, in fact, they used diverse methods, abandoning certain sites and maintaining their records in collaboration with others in their networks. This indicates that the practices of curating one’s personal life online as a means of keeping a record has not yet settled down into a consistent and well-organized practice, and perhaps it never will.”

So why web archive? Why even do DIY web archiving when there’s so much to do?! There are honestly multiple reasons.

WHY ARCHIVE THE WEB?

Web as History is important because the average lifespan of a website is between 44 and 100 days! The minimum is only a little over a month, and then [poof] web pages can disappear! And they disappear for multiple reasons:

  • It could’ve been site maintenance
  • The organization might not have paid its bill
  • The website became private and went behind a paywall
[Image: the first webpage, from 1991 (that's almost 30 years ago!)]
[Image: this was only 2016… (literally 3 years ago)]
*Inspiration for my title.

HOW TO ARCHIVE THE WEB?

I mentioned earlier that there isn't a formal, standardized workflow for archiving the web. And there's even less documentation on how to use the web archiving tools available to users. Some of the tools out there:

  • Heritrix is the web crawler we mentioned before, but good luck trying to use it: it's not easy to control or program, and the documentation sucks
  • HTTrack is basically the same thing
  • Command-line tools: curl and wget
  • The Firefox extension DownThemAll
  • twarc: a command-line tool for archiving Twitter JSON
  • Webrecorder: one of my favorite tools, and I wish I could live-demo in blog posts just how awesome it is
  • Chrome and Firefox Wayback Machine extensions, which automatically check for an archived copy in the Wayback Machine
  • Save Page Now: you put a URL into the form, press the button, and the Internet Archive instantly saves the page (there's a rough scripted version of this sketched after this list). Thanks, Internet Archive.
  • Wikipedia JavaScript Bookmarklet: save a web page from any browser
  • Archive-It. Thanks again, Internet Archive.
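
If you want to do the "save a page" dance programmatically, here is a rough sketch against the Internet Archive's public Wayback Machine endpoints (the availability API and Save Page Now); the URL, timeouts, and prints are just illustrative, and a real script would want rate limiting and better error handling:

```python
# Check whether the Wayback Machine already has a snapshot of a URL,
# and ask Save Page Now for a fresh capture if it doesn't.
import requests

def latest_snapshot(url):
    """Return the closest archived snapshot of `url`, or None."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

def save_page_now(url):
    """Ask the Wayback Machine to capture `url` right now."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    resp.raise_for_status()
    return resp.url  # URL of the fresh capture

if __name__ == "__main__":
    page = "https://example.com/"
    snapshot = latest_snapshot(page)
    if snapshot:
        print("Already archived:", snapshot["url"])
    else:
        print("Capturing now:", save_page_now(page))
```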

Here are steps you can totally Do-It-Yourself to make your website archivable (flip them around and you also get a list of how NOT to make your website archivable):

  1. Have a site map listing the pages of your website
  2. Make sure all links on your website work; basically, make sure the links remain stable. One way is to avoid moving or renaming paths. An example of that: how many of you have changed your username on Instagram? Yeah, every time you changed it, posts that tagged your previous username no longer link to you. If you do change something, redirect; if you don't, you end up with broken links.
  3. Use "durable", archival formats like PDF
  4. Timestamp and/or cite those durable formats. For one, it's a courtesy since you're using someone else's work, but the link also might not work in the future, so it's a good reference and a time range for when it did.
  5. Make sure content is explicitly referenced: a.k.a. your bit.ly links ruin my life. It's OK if you're on Twitter, because I know you have to fit that 280-character limit, but if you have the space, type it out.
  6. Use semantically meaningful URLs, and don't go adding special characters like $ just so your tab can read "Ca$h"
  7. Identify your platforms: minimize reliance on external assets, and see if you can keep your photos and videos in one place.
  8. Making sure all the links on your website work goes hand in hand with keeping links stable; developers can help redirect your links
  9. Conform to web standards: validate at http://validator.w3.org/
  10. If you don't want crawlers archiving your page, note that third-party website builders like Wix, Weebly, WordPress, etc. often provide a robots.txt for you.
  11. And what is robots.txt? Robots.txt, or the robots exclusion standard, is a standard websites use to communicate with web crawlers and other web robots. It tells a robot which areas of the website should not be crawled and archived (see the sketch after this list).
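
Here's a small sketch of the other side of that conversation, i.e. how a polite crawler checks robots.txt before archiving anything, using Python's built-in urllib.robotparser (the site URL and user-agent string are just placeholders):

```python
# Read a site's robots.txt and ask whether a given page may be fetched.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the robots exclusion rules

page = "https://example.com/private/notes.html"

# A well-behaved crawler checks this before fetching (and archiving) a page.
if robots.can_fetch("my-archiving-bot", page):
    print("Allowed to crawl and archive:", page)
else:
    print("robots.txt asks crawlers to stay away from:", page)
```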

Under "consulting a web developer" for help, I had recommended the following back in 2016:

  1. Follow accessibility standards from the get-go (developers know this), and if you want to look more into it yourself, the W3C has great resources
  2. If you don't think you or your website builder template can validate, then hire a web developer to conform to web standards for you
  3. Make sure content is explicitly referenced
  4. Allow browsing of collections, not just searching

REMEMBER: THERE ARE ETHICS TO WEB ARCHIVING!

Just because it's on the web doesn't mean you should just take it. Don't be that person! There IS WEB ETIQUETTE!
