Releases Data on Over 25m Open Source Software Repositories

Today’s software relies on a core set of free, openly licensed components, frameworks and systems. But our shared, digital infrastructure is under threat. It’s overburdened and under-supported.

Nadia Eghbal’s Roads and Bridges study for the Ford Foundation gave us a series of personal vignettes on the state of open source — stressed maintainers, fractured communities and financial trouble. The stories we read resonated with our own experiences, our concerns legitimised and amplified.

Something has to change.

For nearly three years, we have been gathering data on the complex web of interdependency that exists in open source software. We’ve published a series of experiments using harvested metadata to highlight projects in need of assistance, projects with too few contributors and too little attention.

Many of these are projects that are essential to making much of today’s software work. And these are projects that need our help. This problem is already attracting the attention of our community, as shown by this month’s Sustain conference in San Francisco.

Today we are releasing data on over two million unique infrastructure-type projects under a permissive licence. We believe that this information will empower and accelerate the work of those seeking solutions, and kickstart the conversation as to what will lead to a sustainable economy for key open source projects.

The data is available in its raw format on Zenodo, and soon we’ll be publishing a structured, queryable dataset on Google’s BigQuery. This data is published under a Creative Commons BY-SA-4.0 licence. It’s an open and permissive licence that commits the user to re-distributing their work, and their understanding.

We’ll be updating this data on a regular basis. Find out more at

As a community, we’re yet to agree a way to ensure the continuing health of individual open source projects. But we hope this will be another step in the right direction.