How to save 20 terabytes of scientific public goods

LabDAO
Aug 18, 2023

Meta’s Metagenomic Atlas was at risk of disappearing forever. A distributed storage collaboration came together to preserve the data.

Researchers rely heavily on a number of public scientific data libraries, but what happens when an organization hosting open-source scientific data disbands? Last week a group of organizations began an emergency collaboration to save 20TB of scientific data that was about to disappear forever.

Protein-folding layoffs

Until last week, Meta’s research department included a BioML team. The group released projects such as a free inference API for ESMFold (Meta’s folding algorithm, powered by a protein language model) and provided free, publicly accessible scientific tools such as the ESM Metagenomic Atlas, a 20TB dataset of more than 700 million predicted protein structures.

Meta’s decision to dissolve its BioML team has led to the discontinuation of the ESMFold inference endpoint, a widely used tool. Now, the future of the Atlas, highly regarded among scientists, is also uncertain. Many speculate it will be terminated in the near future.

Saving the Metagenomic Atlas

LabDAO believes in open, accessible scientific data. While the publication of free and publicly available data is valuable, ideally data accessibility shouldn’t be revoked when an organization disbands. The situation raises a broader question: How do organizations ensure that their publicly accessible data outlives any potential restructuring, layoffs or disbandment?

We consider these data a public good and an essential resource for the Lab Exchange, which is built on Bacalhau and IPFS.

Distributed storage is not guaranteed to last forever, but it is much more likely to outlive an organization shutting down. Last week, when news broke of the Meta layoffs, we quickly reached out to Protocol Labs (the team behind IPFS and Filecoin) with a proposal to collaborate and save this important resource.

Pinning the Atlas

Filecoin Data Infrastructure was chosen to host the IPFS node that would contain the Atlas data.

The Network Growth team within Protocol Labs took on the challenge and was eager to get involved in preserving the data, which is no small undertaking. In total, the data includes 20GB for the metadata catalogue, 1TB for the core set and 20TB for the complete catalogue.

The project started by pinning a first “hello world” chunk of data, Martin Steinegger’s 100GB Foldseek data, and making it retrievable on IPFS.
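For readers unfamiliar with pinning, here is a minimal sketch of what asking an IPFS node to pin a piece of content can look like. It is an illustration under stated assumptions, not a record of how the Atlas was actually pinned: it assumes a Kubo (go-ipfs) daemon running locally with its RPC API on the default port 5001, and it uses the public ipfs.io gateway for retrieval. The function names and the `EXAMPLE_CID` placeholder are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Kubo daemon exposing its RPC API on the default address.
DEFAULT_API = "http://127.0.0.1:5001/api/v0"


def pin_add_url(cid: str, api: str = DEFAULT_API, recursive: bool = True) -> str:
    """Build the RPC URL that asks the node to pin the content behind `cid`."""
    query = urllib.parse.urlencode(
        {"arg": cid, "recursive": str(recursive).lower()}
    )
    return f"{api}/pin/add?{query}"


def gateway_url(cid: str, gateway: str = "https://ipfs.io") -> str:
    """Build an HTTP gateway URL from which the content can be fetched."""
    return f"{gateway}/ipfs/{cid}"


def pin(cid: str, api: str = DEFAULT_API) -> dict:
    """Send the pin request (the Kubo RPC API expects POST) and return its JSON reply."""
    request = urllib.request.Request(pin_add_url(cid, api), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a daemon running, `pin("EXAMPLE_CID")` (with a real content identifier in place of the placeholder) would ask the node to keep that content and everything it links to, so it is not garbage-collected. Anyone can then request the same identifier from any gateway or node, which is what makes the data retrievable independently of the original host.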

Benjamin Arntzen, a Solutions Architect at Protocol Labs and contributor to Bacalhau, has been working on the problem. As of publication, they have created a 17TB tarball collection, which will soon be organized to contain all 34TB of the artifacts, sorted into folders.

UPDATE: You can now explore the Metagenomic Atlas data
