If You Work in “Small Science,” Are You Leveraging Data Repositories?

Data repositories can help scientists with minimal resources make their work findable and citable

Andrew Mckenna-Foster
The Startup
Aug 6, 2020

Photos by: Goran Ivos and fabio on Unsplash

If you’re a scientist, especially one who does a lot of research alone, you probably have more than one spreadsheet of important data that you just haven’t gotten around to writing up yet. Maybe you never will. Sitting idle on a hard drive, that “dark data” could prove very useful to someone in the future (or even someone in the present), especially as our climate and society change.

What are you going to do with those files? How are you going to preserve them?

If you’re like me, maybe you’ve felt the terror of losing data every time you moved your files to a new computer or moved your research to a new job. Did you remember to back up that spreadsheet from your brilliant pet project from 7 years ago? If you did back it up, are you sure you backed up the most recent version? It’s sobering to imagine other people have gone through this and lost potentially valuable species records, survey data, and field observations.

Digital Data Repositories to the Rescue

In the years before I returned to graduate school, I worked for a science nonprofit on Nantucket Island, Massachusetts, and this problem haunted me all the time. Over nearly a decade there, I accumulated spreadsheets filled with very localized ecological data, but had no way to organize, save, and share them. Fortunately, a solution is emerging in the form of digital repositories backed by robust metadata schemes and indexing services. Importantly, some of these repositories are accessible to everyone, and no university affiliation is required.

In May 2020, Meghan Mitchell, Christopher Tillman Neal and I launched a digital repository for the Nantucket Biodiversity Initiative (NBI). The repository stores and protects environmental and ecology research data from around Nantucket, but it is focused on projects funded by NBI. Visit the Nantucket Biodiversity Digital Repository and browse through the files to learn about bat counts, spider surveys, sandplain grassland research, and much more.

A snapping turtle on Nantucket. Over half of Nantucket Island is conservation land, and scientific species inventories date back to the late 1800s. There is a wealth of information that would benefit from being published to a repository. Photo: Andrew Mckenna-Foster

We used Zenodo, a free platform that allows anyone to upload research-related files. Zenodo commits to storing the files for the long term, makes them searchable on the internet, and even gives them a digital object identifier (DOI). However, uploading your files to a repository is the easy part of the solution; to make data useful far into the future, it is crucial to follow the core principles of data publishing and sharing. Data uploaded with no context is just one more piece of junk in the vastness of the internet.

Documenting Data Is Difficult but Absolutely Essential

Published data should be FAIR: Findable, Accessible, Interoperable, and Reusable. In practice, this means

  • Describing the data with a solid description, useful keywords, and author information (metadata)
  • Using a standard metadata scheme so that the information can be easily shared
  • Uploading the files in an open format (like CSV)
  • Licensing the data so that people and machines will understand how the data can be used.

That is only the bare minimum. While Zenodo and other free repository platforms like figshare and Dataverse simplify this process, it still requires work and planning.
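As a rough illustration of that bare minimum, the sketch below checks a metadata record for empty fields and writes tabular data to CSV. The field names are an assumption loosely modeled on common repository metadata (title, description, creators, keywords, license), not the exact schema NBI uses, and the example record is hypothetical.

```python
import csv
import io

# Assumed minimum set of fields; adjust to the metadata scheme you adopt.
REQUIRED_FIELDS = ("title", "description", "creators", "keywords", "license")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def rows_to_csv(header, rows):
    """Serialize tabular data to CSV text, an open format most tools can read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record for illustration only; the license is left empty on purpose.
example_record = {
    "title": "Example spider survey",
    "description": "Illustrative counts from a made-up survey.",
    "creators": [{"name": "Doe, Jane"}],
    "keywords": ["spiders", "survey"],
    "license": "",
}

print(missing_metadata(example_record))
print(rows_to_csv(["species", "count"], [["Example species", 3]]))
```

Running a check like this before every upload is one simple way to catch a dataset that is about to be published without a license or keywords.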

The meat of our project was working with NBI to create a workflow that curates and applies metadata to all reports and datasets before publication. If you want to set up a repository for yourself or your organization, this is where you should focus most of your energy. We built a documentation site on GitHub that describes the process in detail and is free to copy.

So, What Are the Outcomes?

The repository is growing as we curate and upload reports and data going back to 2005. More importantly,

  • NBI now has a permanent, accessible, and shareable library of the research it has supported.
  • Researchers who work on or near Nantucket now have a way to publish their data and reports.
  • People looking for data and information for the area can now browse current and past research. Importantly, they can cite any information they use, giving authors the credit they deserve.
  • I can sleep at night knowing the data I spent years collecting has a permanent home.

Charts showing what types of files have been uploaded to the digital repository: a summary of the repository as of August 2020. We use Zenodo’s API to harvest metadata from the Nantucket Biodiversity Digital Repository for visualization using Python. These charts are only possible because the workflow we designed controls how keywords are assigned.
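A harvest like the one described in the caption can be sketched with Python's standard library. This is a minimal sketch, not our production script: it assumes Zenodo's public records search endpoint and its JSON layout (records under `hits.hits`, keywords under `metadata.keywords`), and the community identifier `example-community` is a placeholder rather than the repository's real identifier.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_RECORDS_URL = "https://zenodo.org/api/records"


def build_query_url(community, page=1, size=25):
    """Build a search URL for the records in one Zenodo community."""
    params = {"communities": community, "page": page, "size": size}
    return f"{ZENODO_RECORDS_URL}?{urlencode(params)}"


def fetch_records(community, page=1, size=25):
    """Download one page of records (requires network access)."""
    with urlopen(build_query_url(community, page, size)) as response:
        payload = json.load(response)
    return payload["hits"]["hits"]


def count_keywords(records):
    """Tally keywords across records, normalizing case and whitespace."""
    counts = Counter()
    for record in records:
        for keyword in record.get("metadata", {}).get("keywords", []):
            counts[keyword.strip().lower()] += 1
    return counts


# Offline demonstration with hypothetical records shaped like the API response.
sample_records = [
    {"metadata": {"title": "Report A", "keywords": ["Bats", "survey"]}},
    {"metadata": {"title": "Dataset B", "keywords": ["bats "]}},
    {"metadata": {"title": "Report C"}},
]
print(count_keywords(sample_records))
```

The resulting counts can be passed to any plotting library. Note that the normalization step only papers over small inconsistencies; consistent keywords assigned at upload time are what make the tallies meaningful.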

As NBI continues to support research and add files to this repository, publishing the raw data, not just a project report, will be especially important. With that data in hand, researchers in 10, 50, or 100 years will be able to reproduce and directly compare data from species surveys, population surveys, and management regimes.

The Repository Is Already Being Used

The icing on the cake is that since the repository became operational, it has already proven useful: I recently shared a dataset on Nantucket tarantulas with another spider researcher who was looking for a way to cite our observations.

I hope you consider publishing your data whenever possible and choose to follow the FAIR principles. The open science community is growing rapidly and offers numerous resources for anyone to get started. I am always open to questions and collaborations, so please contact me if you’re interested in working together.
