The Future of Sharing Large, Scientific Datasets is Here

KJ Schmidt
Jan 25, 2023

You can finally share your data headache-free.

Galaxy Brain meme image by Jon Manning from Wikimedia Commons

It’s taken months, but you’ve finally done it. You’ve created a dataset. Not just any dataset: it’s extremely relevant to your field and could be a catalyst for further developments in science. You’ve collected the data, cleaned it, formatted it, and double-checked everything a million times. It’s beautiful!

Good news/bad news: it’s massive. That’s great for research, but not so great for sharing it with anyone. You’re also not sure how to get the word out about it. How can people use it if they can’t discover or access it?

Welp, all that effort for little impact

Not so fast! We’ve filled many a gripes.txt with these complaints (thank you, Logan Ward) and created a platform with accessibility, reproducibility, and collaboration as the driving forces. Introducing: Foundry-ML.

Foundry-ML is an open-source machine learning platform for scientists. We host large scientific datasets and make them easy to find and use.

Our catalogue of datasets is available to browse on our website. Once someone finds your dataset, they can load it with just a couple lines of Python code. (We have another article that goes into how to access datasets on Foundry-ML if you’d like more details.)
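To give a rough idea of what "a couple lines of Python" can look like, here is a minimal sketch. It assumes the `foundry_ml` package is installed and that its client exposes `Foundry(index="mdf")`, `load(..., globus=False)`, and `load_data()`, which is how the package worked around the time of writing; other versions may differ, so treat our documentation as the source of truth. The helper `load_foundry_dataset` and the dataset name are placeholders made up for illustration.

```python
def load_foundry_dataset(dataset_name, client=None):
    """Load a Foundry-ML dataset by name or DOI and return its data.

    `dataset_name` is a placeholder: use the name or DOI shown on the
    dataset's page. `client` is optional and mainly there so the helper
    can be tried out without network access.
    """
    if client is None:
        # Assumption: installed with `pip install foundry_ml`. The import
        # sits here so the helper can be defined without the package.
        from foundry import Foundry

        # Assumption: this constructor argument matches the package
        # version you have installed.
        client = Foundry(index="mdf")

    # globus=False asks for a plain HTTPS download rather than a Globus
    # transfer (assumed keyword; check your installed version).
    client.load(dataset_name, globus=False)

    # Return whatever the client hands back for the loaded dataset.
    return client.load_data()


if __name__ == "__main__":
    # Hypothetical dataset name: replace with a real one from the catalogue.
    data = load_foundry_dataset("your_dataset_name_or_doi")
    print(type(data))
```

Without the helper, the core really is just three calls: create the client, `load` the dataset, then `load_data()`.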

Why is this worth my time?

🗄 Our infrastructure is made with large datasets in mind. We can host data that can’t be shared with common tools (email, Slack, GitHub, Dropbox), saving you the frustration (and time!) of figuring out how to share your data.

👀 Get your work in front of more people. Our interface makes it easy for people to discover your dataset.

👩‍💻 Your dataset gets its own page on our website. Share all the details about your data (including how to use it!) with a simple link.

🔑 Accessibility is extremely important to us. We make sure that people can actually access and use your dataset. No gatekeeping. Publishing and accessing data are both free.

How difficult is it to publish my data to Foundry-ML?

I’m sure this all sounds great in theory, but you’re wondering how difficult the publication process is. We’ve made that as easy as possible. We have a notebook that can be run in Google Colab or locally with Jupyter that walks you through the entire publishing process. You can fill it in with your information and publish from there. Our documentation can answer any questions as you go.

The basics you’ll need are:

What are you waiting for?

Get your data on Foundry-ML and start sharing!

Foundry-ML is open source. Come help us build 💪 by contributing to our codebase, or give us feedback on GitHub 📢. We love hearing from you!

Acknowledgements

Foundry-ML is built with Globus and the Materials Data Facility.

This work was supported by the National Science Foundation under NSF Award Number 1931306, “Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure,” and under financial assistance award 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD).

KJ Schmidt

ML Scientist at UChicago + Argonne National Lab. Talking too much to my dog. Being a “cool” aunt. I like knowing things. She/Her