Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files

The following is the narrative of my presentation at the Preservation and Archiving Special Interest Group #pasignyc Fall 2016 meeting hosted by The Museum of Modern Art. The talk was part of the “Political and Social Responsibility, Impacts, Activism, Ethical, Anonymity, etc.” session that was moderated by Erin O’Meara and included presentations from Jasmine Jones, Micha Broadnax, and T-Kay Sangwand. Special thanks to Jarrett M. Drake for his guidance through developing this talk and to Chris Bourg for permission to use her image and to #citeherwork.


Introduction / Argument:

I want to begin my talk by framing it around a quote that has stuck with me ever since I read it. The quote comes from Chris Bourg’s talk titled “Never neutral: Libraries, technology, and inclusion” that she gave at the Ontario Library Association Conference in 2015. During her talk Bourg states that “despite the democratizing promise of technology… the digital tools we build and provide are likely to reflect and perpetuate stereotypes, biases, and inequalities.”

She specifically talks about search retrieval and catalog specific technologies, but the same can be said about the technologies archivists use to process born-digital materials. Further in her talk, Bourg makes the sobering point that “without active intervention we end up… classifying and arranging our content in ways that further marginalizes works by and about people of color, queer people, indigenous peoples, and others who don’t fit neatly into a classification system that sets the default as western, white, straight, and male.”

In the case study I am about to present, I argue that it takes critical awareness, consciousness, and ethical responsibility to uphold the cultural and political integrity of archival collections that are located outside of the (in)visible default of “western, white, straight, and male” — and, I add here, collections written in languages that are not English.

Without this critical awareness, archivists run the risk of projecting the (in)visible default onto these collections, which, in turn, influences the outcomes of our processes and the way we provide access to, and (mis)represent, information.

Background:

The papers of Argentine poet and human rights activist Juan Gelman arrived at Princeton University in July 2015 and were processed and open for research in June 2016. The link to the finding aid is here. His papers are typical of a writer’s archive and contain handwritten, typewritten, and printed copies of his writings, correspondence, notes, research files, and personal photographs. However, about half of Gelman’s papers contain analog and born-digital files relating to the human rights investigations he conducted on the forced kidnapping and death of his son and daughter-in-law.

María Claudia García Iruretagoyena and Marcelo Ariel Gelman; circa 1976; Juan Gelman Papers, Department of Rare Books and Special Collections, Princeton University Library.

In 1976, the far right government in Argentina kidnapped Gelman’s daughter, Nora, 19; his son, Marcelo, 20; and son’s partner, María Claudia, 19, who was 7 months pregnant at the time. Nora Gelman was released three days later, but Marcelo and María Claudia became two of the period’s estimated 30,000 kidnapped or disappeared, in a chilling period known as the Argentine Dirty War. The couple’s child was born in captivity, and became one of the approximately 500 children whose biological identities were kept secret, and who were adopted by couples sympathetic to the military dictatorship.

3.5" floppy disks from the Juan Gelman Papers, Department of Rare Books and Special Collections, Princeton University Library.

Gelman’s search for truth, and the pressure he put on the Argentine and Uruguayan governments to expose the fate of his deceased and missing family members, is evidenced in the contents of the 164 3.5” floppy disks that came with his papers. The floppy disks contain saved Word documents and email files dating from 1995–2004 of conversations Gelman had with other human rights activists, lawyers, and victims of torture. Among these files are court documents pertaining to the various charges Gelman filed against the murderers of his son and daughter-in-law. There are also files relating to Gelman’s search for his missing grandchild, whom he successfully found living in Uruguay in 2000. This is all to say that the archival content of the 164 floppy disks provides a deeply painful but remarkably important insight into Argentina’s dark and not too distant past. This archive, in and out of all of its varying identities as a Latin American literature archive, is, as Michelle Caswell defines it, a human rights archive in that it was created out of a lack of an official government record and used to bring truth and justice to the people who were most affected by crimes against humanity.

Processing the 164 3.5" floppy disks:

In June of this year, with the guidance of my colleague Jarrett M. Drake, who is our Digital Archivist in the University Archives, I began disk imaging each floppy disk in the BitCurator environment and hit a couple of snags along the way. The biggest challenge that we encountered was an “invalid encoding” issue due to the Spanish-language diacritics present in the file and folder names.

Screenshot of our BitCurator laptop at the Manuscripts Division in the Department of Rare Books and Special Collections, Princeton University

Because of this issue I was unable to open files that had diacritics in their names. I was also unable to open any file, with or without diacritics in its name, that was nested in a folder whose name had them. Initially I was concerned, but I continued along the workflow because I noticed I could copy files to my Windows desktop and was able to view and access them there. I proceeded, with caution, knowing that a). the files were not corrupted and b). the file names actually displayed correctly, with their diacritics, using another OS.

It wasn’t until Jarrett and I were attempting to bag the disk images and their corresponding metadata files for deposit into our preservation storage environment that we were stopped in our tracks. Bagger was unable to create the bag due to the invalid encoding errors. We calculated that about 15–20% of the approximately 5,000 files contained diacritics in their filename and a much larger percentage of the files were nested in folders that did; so the issue was present in a large portion of the files and folders. Unable to bag, we were unable to continue to the last step of our processing workflow.

Community Advice:

We took to a couple of online discussion lists to ask other professionals in the field whether they had ever encountered this issue and, if so, whether they had any advice or resources to share. We received many responses with different approaches to resolving the issue. Some folks suggested not using Bagger but trying BagIt instead; some folks had more questions about the encoding set and file system the floppy disks were originally encoded in. However, the bulk of responses pointed to replacing or removing the accented characters with scripts or tools that “cleaned” or “scrubbed” the “illegal” characters by replacing them with underscores or their unaccented characters. Below is a compilation of some of the responses we received:

While completely appreciative of the wealth of responses we received on our inquiry, I was bothered by a). some of the language that was used to describe the issue and b). by some of the recommended approaches to rectifying the issue.

Furthermore, I was bothered by what these recommendations can tell us about where we are as nascent and not so nascent technologists in the field of born-digital processing. As I read and considered some of the responses, a couple of questions popped in my head:

1). Why did folks perceive that the diacritics of a file or folder name have to be “scrubbed” or “cleaned” in order to be “validated” by the tools and systems that we use to process born-digital archives?

2). Why do we have names like “detox” for tools to remove, essentially, markers of a language’s non-Englishness before we consider them “sanitized” enough for our preservation environments?

3). Why were the Spanish-language diacritic glyphs confounded with “illegal characters” in file and folder titles such as ? < > : * | “ ^?

At this point, I took a couple of steps back and concluded that the issue did not originate with Bagger or with the file and folder names themselves, but with the Debian-based OS we were working on and its inability to recognize the way the file and folder names were originally encoded. It took us a bit of time, but we were able to find an online resource that explained that the diamond-shaped symbol with a question mark “�,” and the trailing “(invalid encoding)” in the filename, can appear when moving files from Windows to Ubuntu Linux. To fix this we had to utilize two Linux tools, “convmv” and “dos2unix,” to repair any operating system incompatibility. The following commands will install them on an Ubuntu Linux operating system:

sudo apt-get install convmv
sudo apt-get install dos2unix

These two commands took care of any future files and folders that our Linux laptops would read from then on, but the issue with the Gelman files persisted since they had already been read with the diamond-shaped symbol with a question mark “�” and the trailing “(invalid encoding)” in the filename. To remove these and to be able to read the original diacritics we tried the following commands:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

Each of the three commands assumes one of three common legacy character encodings for Latin-alphabet languages and converts file and folder names from it to UTF-8, which is what Debian-based operating systems like Ubuntu are set to. By default convmv only performs a dry run (it renames nothing until it is given the --notest option). The dry run showed us that the following file path:

mv "./Desktop/cdo/disk062/Gonzalo Dossier/Datos sobre Ricardo Medina extra�dos de distintas fuentes.doc"

will be changed to

"./Desktop/cdo/disk062/Gonzalo Dossier/Datos sobre Ricardo Medina extraídos de distintas fuentes.doc"

These commands worked beautifully and we were able to read the original file and folder names with their diacritic marks, open the files, move on to the last steps of our processing workflow and complete the bagging process for deposit onto our server.
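To make the underlying mismatch concrete, here is a minimal Python sketch. It is an illustration only, not part of our workflow. It assumes the word “extraídos” was originally stored in Windows-1252, where “í” is the single byte 0xED, and the names `raw`, `as_utf8`, and `candidates` are made up for the example. It shows why a UTF-8 system displays the replacement symbol, and what each of the three source encodings we tried would make of the same bytes.

```python
# Assumed original bytes of "extraídos" under Windows-1252:
# the accented "í" is the single byte 0xED.
raw = b"extra" + b"\xed" + b"dos"

# Read as UTF-8, the lone 0xED byte is invalid, so a tolerant decoder
# substitutes U+FFFD, the diamond with a question mark.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decode the same bytes under the three source encodings passed to convmv
# (Python's codec names for windows-1252, ISO-8859-1, and code page 850).
candidates = ["cp1252", "iso-8859-1", "cp850"]
decoded = {name: raw.decode(name) for name in candidates}

print(as_utf8)
for name, text in decoded.items():
    print(f"{name}: {text}")
```

For this particular byte, Windows-1252 and ISO-8859-1 agree and recover the intended word, while code page 850 maps it to a different letter. This is why a dry run is worth reading closely before committing to a conversion.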

In the end the issue was as simple as incompatible operating systems. But both Jarrett and I recognized that the true lessons in this scenario are the following reflections:

What does removing or scrubbing the accent marks of a file or folder name do to the very identity of it? How can it potentially change the meaning behind the original intent of the creator? How can it potentially change the way researchers consume information?

Let’s take an example from the Gelman case. Had we decided to replace the accented letters with their non-accented characters, or replaced the affected glyphs with underscores, we could have run the risk of completely changing the folder or file name and thus misrepresenting information. Take the word “campaña,” for example, a word that Gelman in fact used to name many of his files and folders because of the many human rights campaigns he conducted to generate interest in the case of his missing granddaughter. Had we run a script that replaced the eñe with an n, it would have changed the word to “campana,” which means “bell.”

Campana (bell) does not equal campaña (campaign).
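The effect of the two kinds of “scrubbing” that were suggested to us can be sketched in a few lines of Python. The helper names `strip_diacritics` and `underscore_non_ascii` are hypothetical stand-ins written for this illustration. They are not the code of any tool that was recommended.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Hypothetical 'scrubbing': decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def underscore_non_ascii(name: str) -> str:
    """Hypothetical 'cleaning': replace each non-ASCII character with '_'."""
    composed = unicodedata.normalize("NFC", name)
    return "".join(ch if ord(ch) < 128 else "_" for ch in composed)


original = "campaña"
stripped = strip_diacritics(original)
underscored = underscore_non_ascii(original)

print(stripped)
print(underscored)
# The scrubbed "campaign" is now indistinguishable from the word for "bell".
print(stripped == strip_diacritics("campana"))
```

Neither transformation is reversible. Once the name has been scrubbed, nothing in the result tells a researcher whether the creator wrote “campaña” or “campana.”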

This is all to say that keeping the linguistic integrity of these file and folder names shouldn’t be a “compromise on an ideal” — it is the very practical responsibility of processing archivists and other digital curation laborers to preserve as much of the original content and context of the collections we’ve been entrusted to preserve and provide access to. Diacritic glyphs are an inherent part of the content that distinguishes certain words from others; ridding the file and folder names of them will misrepresent the collections they are meant to describe and misinform the users who are researching them.

Perceiving diacritics as a compromise on an ideal born-digital processing scenario, or thinking it is an acceptable practice to purge, sanitize, or cleanse file and folder names of them, is representative of the amount of work the profession needs to do. We need to reflect and think critically and conscientiously about how our own perceptions of what is possible are so influenced by the invisible defaults we all operate on as citizens working in and alongside industries that are dominated by cisgendered, heterosexual, English-speaking, white men.

And I understand that there will be times when there is no way we can work through certain issues and we must intervene to try to capture as much content as possible. At times when we must intervene and remove something that is inherent to the context or content of an archival collection, it is our ethical responsibility to document and articulate our need to act.
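One way such documentation could look, offered only as a sketch and not as something we did for the Gelman files, is a small log that keeps the original name, its exact bytes, and the reason for the change alongside the altered name. The file name “campaña.doc”, the column names, and the `record_rename` helper below are all hypothetical.

```python
import csv
import io


def record_rename(writer, original: str, replacement: str, reason: str) -> None:
    """Write one row documenting an intervention on a file or folder name."""
    writer.writerow({
        "original_name": original,
        # Hex of the UTF-8 bytes, so the exact original survives any display issue.
        "original_utf8_hex": original.encode("utf-8").hex(),
        "replacement_name": replacement,
        "reason": reason,
    })


fields = ["original_name", "original_utf8_hex", "replacement_name", "reason"]
buffer = io.StringIO()  # stand-in for a real log file kept with the collection
log = csv.DictWriter(buffer, fieldnames=fields)
log.writeheader()
record_rename(
    log,
    "campaña.doc",
    "campana.doc",
    "hypothetical example: target system rejected non-ASCII names",
)

print(buffer.getvalue())
```

Keeping the original bytes next to the altered name lets a later archivist or researcher reverse the intervention, or at least understand what was lost and why.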

At my home institution, we have a small working group of processing archivists, like myself, whose processing responsibilities are expanding to include born-digital content. We meet every other week to workshop our current born-digital processing issues and questions. This working group has become an incredibly empowering and thought-provoking real-life discussion group, as it has given us the space to slow down and think out loud and critically about some of the issues we are facing as nascent digital archivists. With more archivists processing a diverse array of born-digital material, we recognize the need for a flexible workflow that is best treated as guidelines to consider rather than a hard-lined set of instructions. In fact, Jarrett just renamed our workflow documentation from “instructions” to “guidelines.” More and more we are seeing that each collection engenders a unique set of processing issues, particularly manuscript and personal papers collections. To best facilitate our individual and collective workflows, members of the working group are beginning to write reflections on the unique issues we encountered while processing born-digital collections. These reflection documents serve as offshoots of the guidelines, and we hope they will help future troubleshooting when we meet similar issues in other collections.

Conclusions:

As archivists we have the responsibility to “do no harm” (I am pretty sure I can indirectly quote Jarrett here) to the collections we are trusted to manage and keep safe. But often in our archival processes, especially in the newer frontier of born-digital processing, we must juggle many constraints and responsibilities (i.e. time, resources, knowledge, expertise) that force us to make exceptions and workarounds to this “no harm” principle. I am fortunate to be able to meet biweekly with a group of talented colleagues who are also managing these responsibilities and are learning the processes while asking questions on how to best proceed. This is very important, though I recognize the privilege in this and know that not all archivists and other digital curation laborers are afforded this space or time at their home institutions.

The questions I leave you today are:

Which exceptions are we willing to make, or which processes are we willing to work through (as opposed to work around), based on the perceived value or implications of the issue at hand? Where do social and political implications fall on the spectrum of other pressures that archivists deal with in their everyday practices?

With this case study I did not wish to focus on the actual limitations of technology, because I believe technologies will always be limited. Nor was it my goal to be in any way accusatory toward the people who took the time to respond to our inquiry and give us their thoughts on the matter. My desired focus was on the perceived limitations of technologies and how these perceptions hold a mirror to us as a profession. As we adopt tools, collaborate with colleagues in other fields, and learn the skills to build the processes that will aid in keeping digital collections safe, we must be conscientious of the technological needs of collections that fall outside of the invisible default. Relating back to Chris Bourg’s point about the democratizing promise of technology: without active intervention, we will end up incongruously pushing that default onto these collections.