SAVING SCIENCE, ONE DATASET AT A TIME
“Save the data, save the planet.”
Scores of volunteers gathered in an MIT dining room on a sunny, unseasonably warm February Saturday with a singular mission. They were going to save scientific data they feared the Trump administration, through ideology or neglect, would remove from public access.
DataRescue Boston @ MIT is part of a rapidly growing movement of volunteers — scientists, computer engineers, librarians and archivists — determined to copy and archive public scientific data before it might disappear. The data rescue movement is shepherded by the Environmental Data and Governance Initiative (EDGI) and DataRefuge, both of which are networks of scientists concerned about preserving scientific infrastructure. Work began at a “guerrilla web archiving” event in Toronto on December 17th and has grown rapidly. On the same day as the MIT rescue event, groups in Washington, DC; Boulder, CO; and Haverford, PA were also hard at work. While the original emergency focus was climate data, seen as especially at risk, the scope has grown to include larger sets of environmental data from the Environmental Protection Agency, the Department of the Interior, the Department of Energy and the National Oceanic and Atmospheric Administration.
Despite being less than two months old, Data Rescues are highly organized events. There are “surveyors”, people tasked with reviewing target web sites, mapping out their organization, and developing primers analyzing what was found. “Seeders” take the surveyors’ primers and systematically go through each web site and, using a Chrome web browser extension, either nominate a dataset for archiving or mark it for special attention. “Harvesters,” in turn, review each dataset that requires attention and determine methods to capture it. Datasets that can be harvested by routine methods end up at the Internet Archive’s End of Term archive. Datasets that require special handling are destined for the DataRefuge archive. Lastly, there are “storytellers”, including this writer, assigned to document the event and its people, as well as develop stories around the data being saved.
Volunteers at the MIT Rescue ranged from first-year students to chairs of academic departments. There were climate and health scientists, concerned about data in their fields, and computer scientists who were there to help build tools and harvesters. And there were the librarians and archivists, for whom preservation of information is a calling. While some attendees were self-interested — their research careers depend upon the data they were rescuing — many came for a broader set of reasons. “I don’t want to see measles killing 1,000 children a year like it used to,” said one participant who was focusing on FDA data. Another called data deletion “the modern form of book burning.”
In the afternoon, a group of volunteers and leaders met to talk about long-term sustainability. It’s one thing to identify datasets and get them safely archived. But datasets that can’t be found after being rescued are worthless. Along with the data, its metadata — the context and description of the data — is necessary. And lastly, the data’s provenance must be maintained. There has to be a means to prove that the archived data is the same as the data that was originally on a government web site. The librarians and archivists in the group were there to caution the technologists that this wasn’t a new problem and that the solutions would end up being more complicated than they might imagine. However daunting, this discussion began transforming a rescue mission into a barn raising: creating an infrastructure to provide safe harbor for endangered data. As University of Pennsylvania librarian Laurie Allen told the Data Rescue DC group, the vulnerabilities of federal data are not new to the Trump administration, just newly exposed.
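To make the provenance problem concrete: one common archival technique (a minimal illustrative sketch, not a description of DataRefuge’s actual tooling) is a fixity manifest — recording a cryptographic checksum for each file at capture time, so anyone can later re-hash the archived copy and confirm it is byte-for-byte identical to what was downloaded. The file and manifest names below are hypothetical.

```python
import hashlib
import json

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(paths, manifest_path="manifest.json"):
    """Record a fixity manifest: file path -> digest at capture time."""
    manifest = {p: sha256_of_file(p) for p in paths}
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

def verify_manifest(manifest_path="manifest.json"):
    """Re-hash each file; True means it still matches the recorded digest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {p: sha256_of_file(p) == d for p, d in manifest.items()}
```

A checksum alone only proves the archive hasn’t changed since capture; linking it back to the original government web site also requires recording where and when the file was retrieved, which is part of why the librarians warned that real provenance is harder than it looks.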
By the end of the day, primers had been written for six government agencies and 16 sub-agencies. Close to 4,000 URLs were seeded from the Department of the Interior and the Department of Energy. And 53 datasets were harvested, adding 35 gigabytes of data to the archive. The next scheduled Data Rescue Boston will be at Northeastern University on March 24th.
Participant interviews conducted by Amanda Axel.