Is Microsoft the biggest OSS contributor?

Disclosure: I’m a Google employee and OSS lover and these are my personal views.

You may have seen the news this week: Microsoft is the biggest OSS contributor!

an example of the kind of news I’m talking about

I celebrate the great involvement of Microsoft and how they’ve evolved to really embrace open source. I use VSCode every day and I’m looking forward to the next products they’ll open source.

But such an absolute headline made my statistical sense tingle. If you read the sources and look at the queries GitHub used to obtain these rankings, you’ll see that what they actually measured is how many users interacted with a repository owned by Microsoft.

the relevant graphic from the octoverse study

GitHub published the queries used to obtain these rankings; you can see them here. They did not count how many employees Microsoft has on GitHub, but how many GitHub users interacted in some way (comments, issues, PRs, etc.) with a repository owned by Microsoft. You can see the results here too.

This is an interesting metric, but it was reported incorrectly by journalists.
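
To see why the two readings differ, here is a toy sketch in Python. The events, user names, and helper functions are all invented for illustration, and the email domain is used as a rough stand-in for employer, which is itself a simplification.

```python
# Toy events, invented for illustration: (user, user_email_domain, repo_owner).
EVENTS = [
    ("alice", "microsoft.com", "Microsoft"),
    ("bob", "gmail.com", "Microsoft"),
    ("carol", "example.org", "Microsoft"),
    ("dave", "google.com", "golang"),
    ("alice", "microsoft.com", "golang"),
]


def users_interacting_with(owner, events):
    """Distinct users who touched any repository owned by `owner`."""
    return {user for user, _, repo_owner in events if repo_owner == owner}


def users_with_domain(domain, events):
    """Distinct users whose email domain is `domain` (a proxy for employer)."""
    return {user for user, user_domain, _ in events if user_domain == domain}


interacting = users_interacting_with("Microsoft", EVENTS)
employed = users_with_domain("microsoft.com", EVENTS)
print(len(interacting), len(employed))
```

In this made-up data, three users interact with repositories owned by Microsoft, but only one of them has a Microsoft address. The first number is the kind of thing the ranking measured; the second is closer to what the headlines implied.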

number of legs is a bad predictor of speed: graphical example

We should also take into account that not all open source projects choose GitHub as their code hosting platform. Notable examples are GNU, Mozilla, and the Apache Software Foundation. But let’s ignore that for now and measure only what we can see on GitHub.

Fortunately, it’s possible for us to do some quick research to explore what journalists thought they were reporting. We have a dataset of all the open source code on GitHub in BigQuery. Felipe has written about this dataset, and I also used it to analyze the popularity of Go packages. Let’s see what we can do with it.

Using this dataset, let’s try a different analysis: counting the commits authored by people from each organization. As a first step, I’m just going to count how many commits the dataset contains.

SELECT count(*) as n
FROM [bigquery-public-data:github_repos.commits]

OK, so we have 155 million commits. That’s a lot! What about the last year only?

SELECT count(*) as n
FROM [bigquery-public-data:github_repos.commits]
WHERE DATE_ADD(committer.date, 1, 'YEAR') > NOW()

Wow! Still 35 million commits!
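
The filter in that query adds one year to each commit date and keeps the commit if the result is still in the future, which is the same as asking whether the commit is less than a year old. A rough Python sketch of that check, assuming a year is approximated as 365 days and using arbitrary example dates (the function name is made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def in_last_year(commit_date, now):
    """True if commit_date falls within the 365 days before `now`."""
    return commit_date > now - timedelta(days=365)


# Arbitrary, fixed dates so the example is reproducible.
now = datetime(2020, 6, 1, tzinfo=timezone.utc)
recent = datetime(2020, 1, 15, tzinfo=timezone.utc)
old = datetime(2018, 12, 25, tzinfo=timezone.utc)

print(in_last_year(recent, now), in_last_year(old, now))
```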

I’m going to make some assumptions now:

  • Google employees commit using their @google.com emails (probably true for most, not me since I mostly use my @golang.org)
  • Microsoft, Facebook, and other companies do the same.

So … how many of these commits were made by people with @google.com email addresses?

SELECT count(*) as n
FROM [bigquery-public-data:github_repos.commits]
WHERE DATE_ADD(committer.date, 1, 'YEAR') > NOW()
AND REGEXP_EXTRACT(author.email, r'.*@(.*)') == 'google.com'
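
To get a feel for what that regular expression pulls out, here is a small Python sketch using the same pattern. The helper name and the sample addresses are invented for illustration, and Python's `re` module stands in for BigQuery's regex engine, which should behave the same way for a pattern this simple.

```python
import re

# Same pattern as the query. The leading ".*" is greedy, so the captured
# group starts after the LAST "@" in the address.
ORG_PATTERN = re.compile(r".*@(.*)")


def org_of(email):
    """Return everything after the last '@', or None if there is no '@'."""
    match = ORG_PATTERN.match(email)
    return match.group(1) if match else None


for email in ["someone@google.com", "a@b@c.org", "root@localhost", "no-at-sign"]:
    print(email, "->", org_of(email))
```

Note that the comparison with 'google.com' in the query is an exact string match, so an address written as someone@Google.com would not be counted under this approach.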

What about others? Let me extract the organization from each commit, group the commits per org, count them, and sort the orgs from most to least active!

SELECT REGEXP_EXTRACT(author.email, r'.*@(.*)') as org, count(*) as n
FROM [bigquery-public-data:github_repos.commits]
WHERE DATE_ADD(committer.date, 1, 'YEAR') > NOW()
GROUP BY org
ORDER BY n DESC
LIMIT 1000

Note: emails on GitHub commits are not verified, which is why values such as localhost show up in the results. For an explanation of users.noreply.github.com, visit this.

The biggest OSS contributor is people! Not orgs 🎉
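
The group, count, and sort steps of that query can be sketched locally in Python. The sample addresses below are made up and their counts mean nothing; the function name is also hypothetical. Addresses without an "@" are skipped here, whereas the query would presumably group them under a NULL org.

```python
import re
from collections import Counter


def top_orgs(emails, limit=1000):
    """Count commits per email domain and return the most active first."""
    counts = Counter()
    for email in emails:
        match = re.match(r".*@(.*)", email)
        if match:  # skip addresses that contain no "@"
            counts[match.group(1)] += 1
    return counts.most_common(limit)


# Invented author emails, one per commit.
SAMPLE = [
    "a@gmail.com", "b@gmail.com", "c@gmail.com",
    "d@users.noreply.github.com", "e@users.noreply.github.com",
    "f@google.com", "root@localhost", "not-an-email",
]

print(top_orgs(SAMPLE, limit=2))
```

Running the same function over real data would also surface unverified values such as localhost, which is one reason the raw ranking needs to be read with care.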

Is counting commits the best way of measuring OSS contributions? Absolutely not! Open source is about more than commits: it’s about helping each other on issues, commenting on pull requests, and taking part in forums beyond GitHub and even in offline communication.

So, is Microsoft the biggest OSS contributor? In a way, yes, if you take only organizations into account, and depending on the metric you choose. As usual, data can be used to show many different things. Let’s be careful to choose metrics that will not encourage a climate of competition over collaboration.