GitHub and the Power of Open Source Data

Jesus Rodriguez
Google Cloud - Community
3 min read · Jun 30, 2016

Yesterday, GitHub announced that it was making the activity data for 2.8 million open source repositories available as public datasets in Google BigQuery. This will allow users to execute SQL queries against GitHub data and build sophisticated, near-real-time analytics about open source projects.
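As a minimal sketch of what that looks like in practice, the query below counts repositories by license using the public bigquery-public-data.github_repos.licenses table. It assumes the google-cloud-bigquery Python client and a Google Cloud project with credentials already configured; the table and column names reflect the public dataset's schema at the time of writing.

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS (or equivalent) is configured
# for a project that can bill BigQuery queries.
client = bigquery.Client()

# Count open source repositories by license in the public GitHub dataset.
query = """
    SELECT license, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.licenses`
    GROUP BY license
    ORDER BY repo_count DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.license}: {row.repo_count}")
```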

Yesterday’s announcement is another example of GitHub’s commitment to open source data. In 2012, GitHub announced the release of the GitHub Archive project, which provided an initial set of analytics and insights about the way developers use GitHub. The BigQuery datasets can be considered, in many ways, an extension of the GitHub Archive project.

GitHub’s BigQuery dataset is an incredibly valuable source of data for understanding the characteristics of open source projects. By analyzing the data using BigQuery, we can determine interesting usage patterns for a specific project, its contributors, user preferences, etc. However, I think the GitHub announcement has a more profound meaning if we think about what can be achieved if other companies follow the same path.

Yesterday, GitHub did a little more than just release a collection of historical datasets. By leveraging BigQuery, GitHub released the data on a platform that is optimized for analytical workloads and that also interoperates with some of the most popular analytics platforms in the market.

I certainly applaud GitHub for the thoughtful way they approached the release of their proprietary data sources. I think this release is going to expand the conversation about open source data. Here are a few points that I think are worth considering:

Other Companies Should Consider Open Sourcing Data Sets

If more companies follow GitHub’s example, we could soon have a public marketplace in which data can be aggregated in real time using simple SQL constructs. Imagine combining GitHub’s project data with data from LinkedIn to correlate a developer’s job history with their open source contributions.
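A purely hypothetical sketch of that kind of join is shown below. The LinkedIn-style table your-project.linkedin.profiles and its github_email and current_employer columns are invented stand-ins for a dataset a company like LinkedIn could publish; only the GitHub commits table is real.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical join: `your-project.linkedin.profiles` is an invented table
# standing in for LinkedIn-style data; only the GitHub side is real.
query = """
    SELECT p.current_employer,
           COUNT(DISTINCT c.commit) AS commits
    FROM `bigquery-public-data.github_repos.commits` AS c
    JOIN `your-project.linkedin.profiles` AS p
      ON p.github_email = c.author.email
    GROUP BY p.current_employer
    ORDER BY commits DESC
"""

for row in client.query(query).result():
    print(f"{row.current_employer}: {row.commits}")
```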

APIs are not Always Enough

Many companies make their data available through APIs, but that’s rarely enough to enable sophisticated analytics. For starters, most APIs just provide the current view of a specific data asset and don’t focus on historical data. Additionally, most APIs don’t use data access protocols that are compatible with analytics tools.
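To make the contrast concrete, here is a minimal sketch of what pulling commit history through GitHub’s REST API looks like: results arrive page by page, subject to rate limits, and any aggregation still has to happen client-side. The same question is a single GROUP BY in BigQuery.

```python
import requests

# Page through a repository's commit history via the GitHub REST API.
# In practice an auth token is required to avoid strict rate limits.
commits = []
page = 1
while page <= 10:  # bounded for illustration; real histories can be huge
    resp = requests.get(
        "https://api.github.com/repos/torvalds/linux/commits",
        params={"per_page": 100, "page": page},
    )
    batch = resp.json()
    if not batch:
        break
    commits.extend(batch)
    page += 1

# Any analytics (e.g., commits per author) must now be computed client-side.
print(len(commits), "commits fetched")
```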

A Common Platform to Access and Combine Data Sources

Open sourcing data is more than just releasing a bunch of CSV files for download. By leveraging a platform like Google’s BigQuery, companies can now combine and aggregate data using simple SQL constructs and expose those queries as new data sources that can, in turn, be used in other queries to enable more sophisticated analytic workloads.
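As one hedged example of exposing a query as a new data source, the sketch below creates a BigQuery view over the public GitHub languages table. The your-project.analytics dataset and the top_languages view name are placeholders; the underlying table and its schema are real.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: `your-project.analytics` must be a dataset you own.
view = bigquery.Table("your-project.analytics.top_languages")

# Rank languages by the number of repositories that use them; the view can
# then be queried (and joined) like any other table.
view.view_query = """
    SELECT l.name AS language, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.languages`,
         UNNEST(language) AS l
    GROUP BY language
    ORDER BY repo_count DESC
"""

client.create_table(view)
```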

Open Sourcing Data and Obtaining More Intelligence

Data is one of the most precious assets of any company. From that perspective, it might seem crazy for many institutions to release their data sources. However, by making their data available and allowing data scientists to combine it with other public or private data sources, organizations can obtain new insights and intelligence about their business at levels that were not possible before.

We Heard this Argument About Code a Few Decades Ago

From privacy to IP challenges, there are many arguments that can be made against open sourcing data. For the most part, those arguments are very similar to the ones that were used against open sourcing code for decades. While today the benefits of open source code are undeniable, they were not well understood a few decades ago. Similarly, I believe open source data will have to go through several iterations before we fully capitalize on its benefits.
