Web-archives and big data: managing the messiness

Interview with Niels Brügger by Laura Skouvig, University of Copenhagen, Denmark.


How to record a Skype conversation? Embarrassing as it is, I admit that I didn’t know when I needed to. Luckily my dialogue partner was Niels Brügger, who offered all possible technical assistance. Despite this, I never got hold of my digitally recorded interview. The challenges of preserving the digital world suddenly became quite present to me. In a luddite-like manner I had scribbled some notes on paper, and they make up the main constituent of this essay.

I had set up an interview with Niels Brügger, Professor in Internet Studies and Digital Humanities (MSO) at the University of Aarhus, to elaborate on the points he introduced in a lecture on web-archives and big data in September 2015. Web-archives raise questions about how to preserve the digital present, and they are, and will remain, an important aspect of a digital (research) infrastructure. The lecture raised many questions about what could be termed the nature of web-archives. In the interview I wanted Niels to go further into two aspects that I, as a historian, find really important:

1. How to archive the internet, and what kind of research infrastructure do web-archives offer? People perform all sorts of daily activities on the internet, and whereas sociologists may study it right now, the question is how coming generations of historians can use the archived web for historical studies.

2. The similarities between the current situation and the period around the invention of the printing press seem obvious for a historian to discuss. How is the present situation different?

My first question stems from a kind of “brave new world” thinking in which digitally preserved archives offer historians great gains in efficiency, sparing them archive trips to odd places around the world.

What are web-archives? And how do they work? How much do they preserve?

- The Danish web-archive, Netarkivet, consists of material from the Danish internet: the .dk domain and Danish material on other web domains. The archive uses three different strategies: a broad harvest, where everything is collected; selective strategies; and finally event strategies, e.g. in connection with sports events, political events etc.

He continues that web-archives are, however, not uniform, especially when it comes to accessibility. In Denmark researchers can be granted online access when they have a research project, but at other national web-archives you need to be on site.

In your lecture in 2015 you discussed the relation between web-archives and big data. When big data is defined, one of the distinctive features is the amount itself. Mayer-Schönberger and Cukier (2013) illustrate this by contrasting the gigantic volume of information at present with how much was generated before and after the invention of the printing press. But is it necessary to discuss web-archives in relation to big data? Isn’t it just a buzzword?

- Netarkivet contains approx. 600 TB at present, and it grows by at least 30 TB with every broad harvest, which is conducted four times a year. So that is an enormous amount of data. Yet humanity has faced big data at other times, or perhaps at all times. Size is not everything when it comes to the challenges of the web-archives.
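The growth figures quoted here can be turned into a rough lower bound. The sketch below is only an illustration: it assumes that the 600 TB and 30 TB figures are exact, that the growth per broad harvest stays constant, and it ignores selective and event harvests. The names `projected_size_tb` and the constants are hypothetical and not part of Netarkivet.

```python
# Figures quoted in the interview; treating them as exact constants is an assumption.
CURRENT_SIZE_TB = 600          # approximate size of Netarkivet "at present"
GROWTH_PER_HARVEST_TB = 30     # "at least" this much per broad harvest
HARVESTS_PER_YEAR = 4          # broad harvests conducted per year


def projected_size_tb(years: int) -> int:
    """Lower-bound archive size after `years`, assuming constant growth."""
    return CURRENT_SIZE_TB + years * HARVESTS_PER_YEAR * GROWTH_PER_HARVEST_TB


for years in (1, 5, 10):
    print(f"after {years} years: at least {projected_size_tb(years)} TB")
```

On these assumptions the broad harvests alone add at least 120 TB a year, so the real archive would be expected to grow faster than this projection.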

Another problem is the messiness, which is another characteristic of big data. Can’t we just sort out those parts that we do not want? Like discarding archival material and books in libraries? No, that is the short answer. The web does not work like that, and it would simply take too much work to sort the material. I realized that I had just asked from within a solidly analogue, “printed-world” view. So my next question is a bit more elaborate, trying to link the past and present together.

Since at least Roman Antiquity scholars have striven to find means of containment in the sea of information. The American book historian Ann Blair investigates how early modern scholars tried to cope with information in the periods surrounding the invention of the printing press. Behind the need for new tools, Blair identifies a fear of losing information. The attitude towards information was characterized by an “info-lust” (Blair, 2010, p. 6) that defined how the scholars approached finding, selecting and storing information. The scholars of early modern Europe knew of the great loss of Classical texts and focused on protecting and storing information. For me one of the central questions relates to how to preserve the web for posterity and what kind of material about the web historians will meet in 100 years.

- The form of the digital data in the web-archives is of a special kind. There are three kinds of digital material: the digitized material, the born digital material and the re-born digital material. The term re-born digital refers to the changes that arise through the actual archival process when storing e.g. homepages.

What you find in the web-archive is in a large number of cases not an exact copy of what was originally found at the homepage. It is only fragments. And this only concerns the collection of the material; what is shown on request in the web-archive is another story. The real difference is the digital form. Big data can be approached in two ways: either as small data or as big data. The former is how researchers mostly think, asking conventional questions. The latter means that researchers can ask new sorts of questions, questions they have not yet conceived of.

I am interested in this distinction between the copy and the original when it comes to the archived homepages. Could you compare it to a medieval palimpsest?

A medieval manuscript is identical to itself over time. The manuscript ordered at the library desk is exactly the same one that was written 600 years ago. And two people asking for it at the library desk would get the same manuscript: you get what you see. It is different with an archived web-page, where you have different layers and different points in time in the same (archived) web-page. This lack of simultaneity derives from the timespan of the archiving process itself. Moreover, if two people ask for the same material, they would not get the same view, because they ask different questions. Looking at a medieval manuscript is (unless you wish for a material analysis) seeing what you get.
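The lack of simultaneity can be made concrete with a small sketch. It is a hypothetical illustration and not how any particular web-archive stores its data: the `Fragment` class, the `capture_span` function, the example URLs and the timestamps are all invented for the purpose. It assumes only that each piece of an archived page carries its own capture time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Fragment:
    """One archived piece of a web-page (hypothetical model)."""
    url: str
    captured_at: datetime


def capture_span(fragments: list[Fragment]) -> timedelta:
    """Time between the earliest and the latest capture of a page's pieces."""
    times = [f.captured_at for f in fragments]
    return max(times) - min(times)


# Invented example: one "page" whose pieces were harvested at different moments.
page = [
    Fragment("http://example.dk/index.html", datetime(2015, 9, 1, 10, 0)),
    Fragment("http://example.dk/style.css", datetime(2015, 9, 1, 10, 7)),
    Fragment("http://example.dk/logo.png", datetime(2015, 9, 3, 22, 30)),
]

print(capture_span(page))
```

A span greater than zero means that the page shown in the archive combines layers from different points in time, so it may never have existed in exactly that form on the live web.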

Historians investigating the archived web will have to include the concealed layer of the code. In general, historians need to extend their well-known text-critical apparatus into a website philology that covers the different aspects of an analytical approach to using websites as historical material. One of the (many) frustrations of web-archiving seems to reflect the fear of the early modern scholars: the fear of losing information?

- Researchers of today need to propose tools and ways of building up a digital information infrastructure.

In a way we hope today to preserve the present better than the past has done. It could be hoped that digital material could be stored without gaps. But the dream of preserving everything for posterity is impossible, and the odd-sized digital material forces researchers and cultural heritage institutions to rethink the analogue information infrastructure because it does not correspond to the form of the digital material.

Blair, A. (2010). Too much to know: managing scholarly information before the modern age. Yale University Press.

Mayer-Schönberger, V. and Cukier, K. (2013). Big data: a revolution that will transform how we live, work and think. John Murray.

About the author:
Laura Skouvig is associate professor at the University of Copenhagen. With a background in history, she combines information studies and history in discussions of the current information society as a historical construction. Her other major research themes are surveillance history and information networks and cultures in 19th-century Denmark, with a focus on the absolutist culture of information.