Project Svalbard: A metadata vault for research data
For the last two months Code for Science (creators of the Dat Project) has been working with the teams at Data.gov, Data Refuge, the Internet Archive and the California Digital Library to aggregate the government data that has been downloaded so far as part of #datarefuge and create a single metadata dataset. Today we are releasing 38GB of metadata: over 30 million hashes and URLs of research data files.
We are calling this initiative Project Svalbard. The goal is to create a global metadata vault for public research data, especially data at risk of disappearing. The initiative is named after the Svalbard Global Seed Vault in the Arctic, which is an underground physical vault whose mission is:
to preserve a wide variety of plant seeds that are duplicate samples, or “spare” copies, of seeds held in gene banks worldwide
A metadata vault that spans the globe
There are currently volunteers across the country working to discover and preserve publicly funded research, especially climate data, so that it is not deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across the US that you can attend to help out in person, and are organizing the library community to band together to curate the data and ensure it’s preserved and accessible.
To aid in this effort we have assembled the following metadata as part of Svalbard v1:
- 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
- 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
- All metadata from Data.gov, about 2.1 million datasets
- A list of ~750 .gov and .mil FTP servers
There are additional sources, such as Archivers.Space, EDGI, Climate Mirror and Azimuth Data Backup, that we are working on adding metadata for in future releases.
Following the principles set forth by the librarians behind Data Refuge we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With Project Svalbard we are working on curating metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.
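As a rough illustration of how a published hash supports a chain of custody, the Python sketch below hashes a local mirror copy and compares it with a published value. The function names and the idea of passing the hash as a hex string are assumptions made for this example; they do not describe the actual Svalbard file layout or tooling.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks
    so that very large data files never have to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path, published_hex, algorithm="sha256"):
    """Return True if the local copy hashes to the published value.

    `published_hex` is assumed to be a hex string taken from a metadata
    release; surrounding whitespace and letter case are ignored.
    """
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

Passing `algorithm="sha1"` would cover the hashes published for the Internet Archive collection, since that set uses SHA-1 rather than SHA-256.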
We are hoping the community can use this data in the following ways:
- To independently verify that the mirroring processes that produced these hashes can be reproduced
- To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
- To seed additional web crawls or scraping efforts with additional dataset source URLs
- To encourage other archiving efforts to publish their metadata in an easily accessible format
- To cross reference data across archives, for deduplication or verification purposes
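The cross-referencing and deduplication use cases above can be sketched in a few lines of Python. The example assumes each archive's metadata has already been reduced to `(hash, url)` pairs, which is a hypothetical format chosen for illustration and not the layout of the Svalbard release. Both inputs must use the same hash algorithm: a SHA-256 hash from one archive cannot be compared directly with a SHA-1 hash from another without rehashing the files.

```python
from collections import defaultdict


def index_by_hash(records):
    """Map each hash (lowercased) to the set of URLs recorded for it."""
    index = defaultdict(set)
    for digest, url in records:
        index[digest.lower()].add(url)
    return index


def cross_reference(records_a, records_b):
    """Return hashes present in both archives, with the URLs from each side."""
    index_a = index_by_hash(records_a)
    index_b = index_by_hash(records_b)
    shared = index_a.keys() & index_b.keys()
    return {d: (sorted(index_a[d]), sorted(index_b[d])) for d in shared}


def duplicates_within(records):
    """Return hashes that appear under more than one URL in a single archive."""
    return {
        digest: sorted(urls)
        for digest, urls in index_by_hash(records).items()
        if len(urls) > 1
    }
```

A hash that appears in both archives suggests two independent mirrors hold byte-identical copies of a file, which is useful both for verification and for deciding which copies are redundant.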
What about the data?
This initial release of 30 million hashes and URLs is just the metadata. The actual content (from which the hashes were derived) is stored either on the Internet Archive or on California Digital Library servers. The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to servers maintained by California Digital Library.
During this process we discovered that of the ~2.5 million datasets on Data.gov, about one third of the linked HTTP resources point at HTML web pages. The other two thirds point at so-called ‘raw data’, e.g. CSV, XLS, ZIP and PDF files. The 40TB figure comes from the total size of the directly linked resources. We did not crawl any of the HTML landing pages, which we expect contain many petabytes of information (based on estimates from NASA), but accessing that data may require custom scrapers or more manual data access methods.
We are working on a Dat-based system for access to all ~160TB of data in the future. If you’d like to help, consider building a DataSilo home data mirroring system and keep an eye on the @dat_project Twitter account for details.
We are using the Dat Protocol for distribution so that we can publish new Svalbard metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify integrity of updates.
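To give a feel for how a cryptographic ledger can verify the integrity of updates, here is a deliberately simplified hash chain in Python. It is not the Dat Protocol's actual data structure or API; the entry layout, the `GENESIS` constant and the function names are all assumptions made for this sketch. Each entry commits to the hash of the previous one, so altering any earlier release changes every hash after it.

```python
import hashlib
import json

# Placeholder "previous hash" for the first entry (an assumption of this sketch).
GENESIS = "0" * 64


def entry_hash(prev_hash, payload):
    """Hash an entry's payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Append a new entry that links back to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})
    return log


def verify(log):
    """Recompute every link; return False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because entries are only ever appended, an older release is simply a prefix of the log, which is one way to keep old versions around while publishing new ones.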
If you are a technical user you can report issues or get involved at the Svalbard GitHub.
If you have suggestions or questions, you can ask a question in the Code for Science Community Chat.