What Makes a Good Dataset?

KJ Schmidt
Jul 3, 2023

There is a lot of data in the world. The toughest part is making sense of it. What can be done to take a dataset from merely existing to being a valuable resource?

A good place to start is considering FAIR data principles.

FAIR data principles image from https://biosistemika.com/

Data should be Findable, Accessible, Interoperable, and Reusable. Data that follows FAIR data principles will have a much more significant impact on the scientific community. Let’s move science forward with FAIR!

That all sounds great, but how exactly can you make data FAIR? We have a few tips to get you started.

Describe wherever you can

Metadata matters. If you’re submitting your data to a platform and there is suggested metadata, use it! This is a great way to provide information that you may not think to include on your own. Users often drive the inclusion of certain fields, so you can make your dataset more useful and discoverable using fields that have a proven track record of enhancing datasets.

Descriptive title

Let’s move past final_FINAL_forRealThisTime_v5. That doesn’t tell me anything about your dataset except that you’ve had a difficult time exporting the right version. Keep your title concise and informative. Remember that people looking at your dataset don’t have all of the context that you have.

Descriptive keys

Have you ever looked at a dataset and not been able to figure out what’s going on? Let’s avoid that with your dataset. The best way to do that is to use keys and labels that make sense to people other than yourself. Are there standard labels in your domain? Use them! Again, think of people who don’t have all the context you have. Help them out so your data can be useful to all.
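
As a small illustration, the sketch below renames terse keys to descriptive ones using only Python's standard library. The key names, units, and the KEY_MAP mapping are hypothetical and made up for this example; substitute the standard labels from your own domain.

```python
# Hypothetical mapping from terse keys to descriptive ones (illustrative only).
KEY_MAP = {
    "id": "sample_id",
    "t": "temperature_kelvin",
    "p": "pressure_pascal",
}


def rename_keys(record, key_map):
    """Return a copy of `record` with keys renamed according to `key_map`.

    Keys that do not appear in `key_map` are kept unchanged.
    """
    return {key_map.get(key, key): value for key, value in record.items()}


# Made-up records standing in for rows of a real dataset.
raw_records = [
    {"id": "A1", "t": 300.0, "p": 101325.0},
    {"id": "A2", "t": 350.0, "p": 99000.0},
]

renamed_records = [rename_keys(record, KEY_MAP) for record in raw_records]
print(renamed_records[0])
```

Putting the unit in the key name is one possible convention, not a requirement; the point is that a newcomer can read the key without asking you what it means.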

Description and documentation

A general description of your dataset is extremely helpful. This can be included in a README or whatever documentation you have available. If someone wasn’t a part of collecting the data, they won’t be familiar with your process and therefore won’t know the ins and outs of the dataset (e.g., that clever shorthand you came up with or what unit of measurement you used). Think about including:

  • When and how the data was collected
  • Size of the dataset
  • Any processing you’ve done to the data
  • Data fields and a description elaborating on each field
  • Explanations of any acronyms and abbreviations
  • Author names, institutions, and contact information
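
One way to make sure none of these items gets forgotten is to fill in a small metadata dictionary and render the README from it. The sketch below is only an illustration: the build_readme function, the dictionary keys, and every value in the example metadata are hypothetical, not a required format.

```python
def build_readme(meta):
    """Render a plain-text README from a metadata dictionary."""
    lines = [meta["title"], "", meta["description"], ""]
    lines.append("Collected: " + meta["collected"])
    lines.append("Size: " + meta["size"])
    lines.append("Processing: " + meta["processing"])
    lines.append("")
    lines.append("Fields:")
    for name, description in meta["fields"].items():
        lines.append("  - " + name + ": " + description)
    lines.append("")
    lines.append("Abbreviations:")
    for short, full in meta["abbreviations"].items():
        lines.append("  - " + short + ": " + full)
    lines.append("")
    lines.append("Contact: " + meta["contact"])
    return "\n".join(lines) + "\n"


# Every value below is made up for illustration.
meta = {
    "title": "Example Alloy Hardness Measurements",
    "description": "Hardness measurements for a set of example alloy samples.",
    "collected": "2023, benchtop hardness tester (hypothetical)",
    "size": "2 samples, 1 CSV file",
    "processing": "Duplicates removed; temperatures converted to kelvin",
    "fields": {
        "sample_id": "Unique identifier for each sample",
        "hardness_hv": "Vickers hardness (HV)",
    },
    "abbreviations": {"HV": "Vickers hardness number"},
    "contact": "Jane Doe, Example University, jane.doe@example.org",
}

readme_text = build_readme(meta)
print(readme_text)
```

Because every key is looked up directly, a missing item raises a KeyError instead of silently producing an incomplete README.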

Think of other users

Get your dataset to a place where someone besides yourself can use it. Put yourself in their shoes: if you were seeing this dataset for the first time, how would you most likely be using it? Prepare the data in a way that makes that possible.

Clean the data

Do you run any preprocessing steps before using the data yourself, such as removing duplicate values, removing incomplete entries, or converting values to the same unit of measurement? Consider whether this preprocessing would also be useful to others and make the dataset more usable.
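
The three cleaning steps named above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the field names (sample_id, temperature, unit), the choice of kelvin as the common unit, and the rounding to two decimals are all hypothetical choices for this example.

```python
def clean(records):
    """Drop incomplete and duplicate records and convert Celsius to kelvin."""
    seen = set()
    cleaned = []
    for record in records:
        # Step 1: skip incomplete entries (any missing value).
        if any(value is None for value in record.values()):
            continue
        # Step 2: convert every temperature to the same unit (kelvin).
        temperature = record["temperature"]
        if record["unit"] == "C":
            temperature = temperature + 273.15
        # Round so that float noise does not hide duplicates.
        row = (record["sample_id"], round(temperature, 2))
        # Step 3: skip rows that duplicate one already kept.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append({"sample_id": row[0], "temperature_kelvin": row[1]})
    return cleaned


# Made-up records: the second repeats the first in a different unit,
# and the third is missing its temperature.
records = [
    {"sample_id": "A1", "temperature": 26.85, "unit": "C"},
    {"sample_id": "A1", "temperature": 300.0, "unit": "K"},
    {"sample_id": "A2", "temperature": None, "unit": "K"},
    {"sample_id": "A3", "temperature": 350.0, "unit": "K"},
]

cleaned = clean(records)
print(cleaned)
```

Converting units before checking for duplicates matters here: the first two records only look identical once both are expressed in kelvin. Whatever steps you apply, note them in your documentation so users know what was changed.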

Structure the data

If applicable, adding structure to your dataset can be incredibly useful. If you are adding structure to your data to use it, consider keeping that structure when sharing the data with others.

Data Prep Checklist

At the Materials Data Facility, we’ve seen what makes a dataset high quality and FAIR, and what doesn’t. Here’s a checklist that summarizes the points above more concisely and comes right from our dataset standards. We’ve found that these standards make a significant difference in usability and interpretability.

Our Dataset Standards at the Materials Data Facility

  • Provide a README File and Description: This should describe the contents of the dataset, the layout of the directories, file naming schemes, the size of the dataset, and links to any related publications, codes, etc.
  • Describe Data Provenance: Include in the description information about who, what, where, when, how, and why the data were collected.
  • Detail Data Quality: Document data collection methods, validation procedures, and any known biases or limitations to provide context and support for users.
  • Use Open File Formats: When possible, data should be shared in formats that are open and readable by common software packages.
  • Provide Examples: When possible, include examples of how to load, analyze, and plot your data. These examples could be included in the repository or linked from elsewhere, e.g., a GitHub repository.
  • Detail Data Privacy and Ethical Considerations: Address any privacy concerns or ethical considerations related to the dataset, and ensure compliance with relevant regulations.
  • Add Licensing Information: Specify the license under which the dataset is distributed, detailing any usage restrictions or requirements.
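
A few of these items can be checked automatically before you upload. The sketch below is a hypothetical helper, not an MDF tool: the check_dataset function, the README and LICENSE file name conventions, and the short list of extensions treated as open formats are all assumptions made for illustration. It only covers the items that can be read off the file listing; provenance, quality, and ethics still need a human.

```python
from pathlib import Path
import tempfile

# Assumed, non-exhaustive set of open formats for this example.
OPEN_EXTENSIONS = {".csv", ".json", ".txt", ".md"}


def check_dataset(folder):
    """Return a list of human-readable problems found in `folder`."""
    folder = Path(folder)
    files = [path for path in folder.rglob("*") if path.is_file()]
    names = {path.name.lower() for path in files}
    problems = []
    if not any(name.startswith("readme") for name in names):
        problems.append("missing README")
    if not any(name.startswith("license") for name in names):
        problems.append("missing LICENSE")
    for path in sorted(files):
        is_doc = path.name.lower().startswith(("readme", "license"))
        if not is_doc and path.suffix.lower() not in OPEN_EXTENSIONS:
            problems.append("non-open format: " + path.name)
    return problems


# Build a throwaway example dataset folder to run the check against.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_text("Example dataset\n")
    (root / "measurements.csv").write_text("sample_id,temperature_kelvin\nA1,300.0\n")
    (root / "notes.xyzbin").write_text("opaque")
    problems = check_dataset(root)

print(problems)
```

An empty list means the folder passed these particular checks, not that the dataset meets every standard above.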

Your dataset is prepped and ready for the world! Now how do you share it?

Congratulations on getting your dataset ready and documented. That’s hard work! Time to pick a platform to share your work.

Find a platform that makes sense for sharing and accessing in your community. Not sure where to start? Don’t worry, we have a few we recommend.

Globus

For easily sharing data with others, Globus is a great option. All you need to do is set up an endpoint on your personal laptop (or whatever computer the data is stored on) and transfer the data to another endpoint via Globus. This requires the person you’re sharing the data with to set up their own endpoint where they would like the data to live.

The transfer page on Globus. Looking at data on my own endpoint before selecting an endpoint to transfer to.

Required:

  • To publish data: A (free) Globus account
  • To use data: A (free) Globus account

Materials Data Facility (MDF)

If you’d like to make your data more discoverable and accessible, consider using the Materials Data Facility (MDF). MDF uses Globus to transfer data, but doesn’t require people using data to create their own endpoints. Each dataset on MDF can be found through the search page, which allows your data to reach more people. Datasets also get their own unique page on the website, making it easy to share with others by simply sharing the link.

A dataset page on MDF with dataset information and a “Get the Data” button

Required:

  • To publish data: A (free) Globus account
  • To use data: Nothing

Foundry-ML

Is your data structured and ready to be used programmatically? Sounds like a fit for Foundry-ML!

Foundry-ML is built with MDF, so Foundry-ML datasets get all the same benefits as MDF datasets (appearing in MDF search and getting their own page on the website) with even more accessibility features. Foundry-ML datasets can be loaded directly into a DataFrame with a Python SDK. They also get a more detailed page on the Foundry-ML website, with step-by-step instructions on how to use them.

A dataset page on Foundry-ML with dataset information and instructions on how to use

Required:

  • To publish data: A (free) Globus account
  • To use data: Nothing

The FAIRest data of them all

Making a good dataset is all about following FAIR data principles: describing and documenting your data, making it usable by others, and hosting it on platforms that make it accessible. Use these tips with your data and let us know how it goes!

Acknowledgements

The Materials Data Facility is built with Globus.

CHiMaD Phase I: This work was performed under financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD).

CHiMaD Phase II: This work was performed under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD).

Foundry-ML is built with Globus and the Materials Data Facility.

This work was supported by the National Science Foundation under NSF Award Number: 1931306 “Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure” and under the following financial assistance award 70NANB19H005 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD).

KJ Schmidt

ML Scientist at UChicago + Argonne National Lab. Talking too much to my dog. Being a “cool” aunt. I like knowing things. She/Her