The following is a lightly edited and reference-enhanced version of the talk I gave at Web Archives 2015.
Good morning. One of the things that’s both exciting and challenging about presenting at this meeting is that we seem to have a really great mix of participants — archivists, technologists, librarians, researchers and users of web archives. I’m going to be speaking today about what we are actually doing when we archive the web, and how that aligns with or challenges the ever-evolving tenets of the archival profession. To do this, I’m going to have to talk about both the profession of archives and the technology of web archiving, and I need to apologize in advance because I’m going to do both in such a way that I will undoubtedly dumb down and oversimplify subjects that someone in the room is actually an expert in. I’m doing this not to insult anyone’s intelligence, but to create a common understanding of complex ideas that will hopefully spark some cross-disciplinary discussions.
When we speak of web archives, we often use the word “capture” to describe the process by which we, as archivists, obtain a copy of a web-based resource, whether it’s a web page, a PDF, or a series of tweets. It’s one of the steps in the process we call accessioning, whereby something is brought under the intellectual and physical control of archivists in an archive. In this context, the word capture might be easily understood to mean that there is this web-based thing out there, and like 19th-century hunters, we go out and we “capture” it, then bring it back to our zoo where we can put it on display. In common discussions among archivists, I suspect this is often something along the lines of what we think of when we discuss web “captures.” However, there’s another possible connotation to the word “capture,” which I think is more accurate — that of the photographer or other artist who captures an impression of a moment, and in so doing creates something altogether new, but which reflects the realities of both the subject of the capture and the capturer.
Much of the web that we are trying to capture is less like a tiger to be put in a zoo, and more like a performance to be documented. On a fundamental level, the web is a continuous series of calls and responses: one computer yells out “Marco!”, and a computer somewhere yells back “Polo!” That “Polo!” is our experience of the web, and depending on the technology behind it, it’s possible that “Polo!” was sitting on a computer somewhere, just waiting for someone to call out the correct “Marco!” so that it could be served up and presented in response, but it’s also possible — and more and more likely — that the “Polo!” we experience didn’t even exist until we called out “Marco!” In 2013, it was widely reported that 15 percent of all Google searches — at that time, that was about 500 million searches per day — had never been asked before, and so 500 million utterly unique “Polo!”s were being served back per day (and that’s not even counting Google’s personalization of search results, whereby different users get different results to the same query). Much to the consternation of anyone who has ever tried to capture by crawling a database-backed website, a significant portion of the web just doesn’t exist until we call out “Marco!”
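For those who haven’t peered under the hood, the “Marco!”/“Polo!” exchange is, concretely, an HTTP request and response. Here is a minimal sketch in Python of what each side of that exchange looks like on the wire — the host, path, headers, and body are all invented for illustration:

```python
# A minimal sketch of the web's call-and-response: one HTTP exchange.
# The host, path, headers, and body are illustrative, not real traffic.

def build_request(host: str, path: str) -> str:
    """The 'Marco!': a plain-text HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: example-crawler/0.1\r\n"
        "\r\n"
    )

def build_response(body: str) -> str:
    """The 'Polo!': a response that may not have existed until it was asked for."""
    return (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(body.encode('utf-8'))}\r\n"
        "\r\n"
        f"{body}"
    )

request = build_request("example.org", "/search?q=marco")
response = build_response("<html><body>Polo!</body></html>")
print(request)
print(response)
```

For a static page, the body of that response sat on a server waiting to be sent; for a database-backed site or a search engine, it was assembled only after the request arrived — which is exactly why a crawler that never calls out the right “Marco!” never sees it.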
Perhaps not surprisingly, it is a group of artists who are, in some ways, at the forefront of putting this understanding of the web as performance into practice. Rhizome, the New York-based organization that supports the creation and preservation of digital and online art, has developed a web archiving tool, WebRecorder.io, that embraces the idea of the web as experience or performance. WebRecorder is a “WYSIWYA” application — what you see is what you archive — that records the experience that a person at a browser has of the web. Rather than claiming to archive “a website” or “a resource,” it instead archives one person’s experience of the web as they browse and search it, following links, calling out “Marco!” and getting back a sometimes unique “Polo!”
WebRecorder, like most web archiving tools, uses WARC files to manage and store its recordings of the web. We talk about WARC files a lot in web archiving, but I wonder how many practicing archivists — particularly those of us who rely on wonderful tools like Archive-It — actually know about WARC files. Much like the term “capture,” the idea that we have a file — a WARC file — that results from our web archiving activity can lend a sense of objectivity to the entire enterprise: the web resource exists, we captured it, and here it is wrapped in a WARC file. However, at its core, a WARC file is more like a photograph than a tiger — it holds recordings of calls that were issued over the Internet, including the metadata about who issued each one, using what software, at what second, from what IP address, and what was received in response. Each record even receives a globally unique WARC identifier to differentiate it from similar calls and responses that may have been issued at different times or places.
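To make that concrete, here is a sketch of the headers that precede one response record inside a WARC file. This is assembled by hand for illustration — real crawlers and tools (Heritrix, wget, WebRecorder) write records like this for every call and response — and the target URI is an invented example:

```python
import uuid
from datetime import datetime, timezone

# A simplified, hand-assembled sketch of one WARC response record's headers.
# Real web archiving tools generate these for every exchange they record.
def warc_response_headers(target_uri: str, payload: bytes) -> str:
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"WARC-Date: {timestamp}\r\n"
        # The globally unique identifier that distinguishes this capture
        # from any other capture of the same URI at another time or place:
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )

record = warc_response_headers("http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nPolo!")
print(record)
```

Note what the record actually asserts: not “this is the resource,” but “this is what was received, at this moment, in answer to this call” — a timestamped photograph, not a tiger.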
The uniqueness of a particular call and response held in a WARC file comes into high relief when we start talking about the capture of something like Twitter not through the browsed experience but through the API, as Social Feed Manager — the tool that my colleagues from George Washington University will be speaking about tomorrow — does. I’m not going to attempt to give a full explanation of the tool here — among other reasons, I don’t want to steal their thunder — but I do want to note how working with this tool has shaped my own thinking about web archiving.
SFM, as we call it, makes calls to Twitter’s API and records the responses it gets back in WARC files. However, as time passes, the information transmitted in those API exchanges and recorded in the WARC files begins to differ more and more from the current experience available through Twitter — Tweets are given more or fewer stars or hearts, retweets happen and are retracted, accounts are made private and then reopened, and Tweets and entire profiles can be deleted from existence. We’re currently working through the legal and ethical implications of how Twitter wants us to treat these changes when we hold an older version obtained through the API, and this exercise has brought the entire question of what is a capture and what does it mean to archive a web resource to the forefront of my thoughts.
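That drift between the archived response and the live resource is easy to see if you put two snapshots side by side. The sketch below compares an invented tweet as recorded at capture time with the same tweet as it might appear later — the field names are loosely modeled on Twitter’s API, and all the values are fabricated for illustration:

```python
import json

# Two snapshots of the same (invented) tweet: one as recorded in a WARC
# at capture time, one as the live API might return it later.
captured = json.loads(
    '{"id": 1, "text": "Hello", "favorite_count": 2, "user": {"protected": false}}'
)
current = json.loads(
    '{"id": 1, "text": "Hello", "favorite_count": 9, "user": {"protected": true}}'
)

def drift(old: dict, new: dict, prefix: str = "") -> list:
    """List the fields whose values differ between two snapshots."""
    changes = []
    for key in old:
        path = f"{prefix}{key}"
        if isinstance(old[key], dict):
            changes += drift(old[key], new.get(key, {}), path + ".")
        elif old[key] != new.get(key):
            changes.append(f"{path}: {old[key]!r} -> {new.get(key)!r}")
    return changes

print(drift(captured, current))
```

The archive still says the favorite count was 2 and the account was public — and both statements were true at the moment of capture, which is precisely what the WARC record documents.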
My point, in going through all of this discussion of “captures,” calls and responses, the web as experience, and WARC files is to argue that when we as archivists engage in web archiving, it is not at all as neutral or passive collectors or stewards of received objects, but as active co-creators of the archive. The archivist and theorist Terry Cook — among others — argued persuasively that archivists are always co-creators of the archives we steward, through our choices about what to bring into the archive, what we do with it once it’s there, and what we remove from the archive. So acknowledging the fingerprint of the archivist on the archive is already a well-established principle of archival theory.
Yet for some reason, when we get to working with web archives, our language about the archiving process and the mechanisms by which we present the results often obscure the role of archivist as co-creator of the archives. As Ian Milligan, who will be speaking tomorrow, has memorably recounted, we do our users no favors by failing to both acknowledge and document the work that archivists do in creating the objects we present as web archives. In a particular case that I’ve heard Ian speak about, a corpus of web captures he worked with lacked fundamental information about selection and appraisal that rendered the resulting resource nearly unusable for peer-reviewed analyses.
I would propose that we, as web archivists, should adopt the stance of the late-20th-century archivists, including Helen Samuels, who developed the idea of documentation strategy — that as archivists, we should identify the areas, functions, or groups we wish to document, focus our energies on collecting archival materials related to those subjects where they already exist, and not shy away from creating archival materials where none currently exist.
As web archivists, we should not pretend that we are doing anything less than carrying out some of the best of documentation strategy’s theoretical underpinnings in a 21st century environment by creating archives where none existed, by calling out “Marco!” and recording what comes in response. However, we must also be radically transparent about that work, and do our best to educate all users about the context of creation both for the content that they may be seeing in a web archive, as well as the archive itself.
Information literacy, as we all know, is a real skill, and teaching people how to find and access information online is a role that librarians have successfully engaged with in many contexts. Archival literacy, the ability to find and make knowledge out of the information in archives, is similarly a skill and concept that archivists are engaging with. And web archival literacy layers these, requiring fluency both with the web and with the archival practices that created and continue to shape the web archives. Archivists must actively engage in defining what skills are necessary to successfully interrogate and develop knowledge from the information in web archives, and in teaching those skills to our users.
Furthermore, the ways that we describe and present web archives must support these activities. What this looks like, I’m not entirely sure; but I do know that these are challenges we must actively engage, that we can’t engage them if we don’t acknowledge they exist, and therefore that we must embrace our role as co-creators of web archives — something our current practices have done only tentatively and awkwardly, if at all.