Analyzing GitHub issues and comments with BigQuery

(Docker, Kubernetes, TensorFlow and other top projects)

Some people have asked “can we analyze GitHub issues on BigQuery?”. Good news: While that data is not part of the new open source GitHub code on BigQuery dataset, you can find this in the classic GitHub Archive dataset.

Let’s quickly find the 10 repos with the most comments in their issues:

SELECT repo.name, COUNT(*) c 
FROM [githubarchive:month.201606]
WHERE type IN ( 'IssueCommentEvent')
GROUP BY repo.name
ORDER BY c DESC
LIMIT 10
GitHub repos with the most comments on issues June 2016

Kubernetes, Spark, OpenShift, Docker, … makes sense — except what is that “sauron-demo” project? Let’s look instead at how many different people are commenting on these projects:

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors
FROM [githubarchive:month.201606]
WHERE type IN ( 'IssueCommentEvent')
GROUP BY repo.name
ORDER BY authors DESC
LIMIT 10
GitHub repos with more authors commenting on issues June 2016

The ranking has changed: Docker was the repo with more people commenting on issues, followed by Font Awesome, React Native, VS Code, Angular, Kubernetes, npm, TensorFlow and Ansible. In this group Kubernetes is the highlight — it’s the one with more comments per user. Let‘s dig deeper into the number of comments by author:

SELECT repo.name,
ROUND(COUNT(*)/COUNT(DISTINCT actor.login),2) comments_per_author,
EXACT_COUNT_DISTINCT(actor.login ) authors,
COUNT(*) comments
FROM [githubarchive:month.201606]
WHERE type IN ('IssueCommentEvent')
GROUP BY 1
HAVING authors>400
ORDER BY 2 DESC
LIMIT 10
GitHub projects with more comments per author within top commented projects

That’s a huge difference! Within projects that have more than 400 active commenters, Kubernetes records ~35 comments per author, while other projects don’t come close.

Turns out the Kubernetes repo has 2 bots that comment a lot. I wonder what happens when we remove every author that posted more than 1000 comments during this period:

SELECT repo.name, 
ROUND(COUNT(*)/EXACT_COUNT_DISTINCT(actor.login),2) comments_per_author,
EXACT_COUNT_DISTINCT(actor.login ) authors,
COUNT(*) comments
FROM [githubarchive:month.201606]
WHERE type IN ('IssueCommentEvent')
AND actor.login NOT IN ( SELECT actor.login FROM (
SELECT actor.login, COUNT(*) c
FROM [githubarchive:month.201606]
WHERE type IN ('IssueCommentEvent')
GROUP BY 1
HAVING c>1000
ORDER BY 2 DESC
))
GROUP BY 1
HAVING authors>400
ORDER BY 2 DESC
LIMIT 10
GitHub projects with more comments per author within top commented projects (removing bots)

Kubernetes is still the top one! Now we can safely declare that the projects with the most active authors (for projects with more than 400 people commenting) are Kubernetes, TensorFlow, and Docker.

2017–02 Standard SQL update:

Top 10 big projects on GitHub by user engagement — 2017–02.
#standardSQL
SELECT repo.name,
ROUND(COUNT(*)/COUNT(DISTINCT actor.login),2) comments_per_author,
COUNT(DISTINCT actor.login ) authors,
COUNT(*) comments
FROM `githubarchive.month.201702`
WHERE type IN ('IssueCommentEvent')
AND actor.login NOT IN ( SELECT login FROM (
SELECT actor.login, COUNT(*) c
FROM `githubarchive.month.201702`
WHERE type IN ('IssueCommentEvent')
GROUP BY 1
HAVING c>1000
ORDER BY 2 DESC
))
GROUP BY 1
HAVING authors>400
ORDER BY 2 DESC
LIMIT 10

How efficient are these projects at closing issues?

Let’s look at who’s doing the best job at closing issues within these 3 top projects:

SELECT repo.name,
JSON_EXTRACT_SCALAR(payload, '$.action') action,
COUNT(*) c,
FROM [githubarchive:month.201606]
WHERE type IN ('IssuesEvent')
AND repo.name IN ('kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1,2
ORDER BY 2 DESC
Issues closed/opened/reopened for kubernetes/docker/tensorflow June 2016.

Docker shows way more issues being opened than closed in June 2016, while TensorFlow closed more issues than it opened. Kubernetes exhibits optimal behavior: Almost as many issues closed as issues opened.

This brings us to the question: Which project is more efficient at closing bugs? During June 2016 TensorFlow closed more issues than opened, while Docker exhibits the worst performance — but it’s unfair to look at only one month. This is how fast they closed issues over time:

SELECT repo.name, month,
SUM(IF(action IN ('closed'), c, null)) / SUM(IF(action IN ('reopened','opened'), c, null)) efficiency,
SUM(IF(action IN ('reopened','opened'), c, null)) opened,
SUM(IF(action IN ('closed'), c, null)) closed,
FROM (
SELECT IF(repo.name='GoogleCloudPlatform/kubernetes', 'kubernetes/kubernetes', repo.name) repo.name,
LEFT(STRING(created_at), 7) month,
JSON_EXTRACT_SCALAR(payload, '$.action') action,
COUNT(*) c,
FROM [githubarchive:month.201606],[githubarchive:month.201605],[githubarchive:month.201604],[githubarchive:month.201603],[githubarchive:month.201602],[githubarchive:month.201601], [githubarchive:year.2015]
WHERE type IN ('IssuesEvent')
AND repo.name IN ('GoogleCloudPlatform/kubernetes', 'kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1,2,3
HAVING month<'2016-07'
ORDER BY 2 DESC
)
GROUP BY 1,2
ORDER BY 1,2
Efficiency at closing issues: 1 is ideal, over 1 means more issues closed than opened.

What’s the best way to get your issue closed on GitHub?

What’s the best way to write an issue on GitHub? On the following chart we’ll see the most common ways used to open an issue — and which ones are getting more closure:

SELECT REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(payload, '$.issue.body'), r'^([A-Za-z]+ [A-Za-z]+ [A-Za-z]+ [A-Za-z]+)') start, 
COUNT(DISTINCT actor.login ) authors,
ROUND(100*SUM(JSON_EXTRACT_SCALAR(payload, '$.action')='closed')/SUM(JSON_EXTRACT_SCALAR(payload, '$.action')='opened'),1)
percentage_closed
FROM [githubarchive:month.201506]
WHERE type='IssuesEvent'
GROUP BY 1
HAVING start IS NOT null
ORDER BY 2 DESC
LIMIT 20
Most common ways to open an issue on GitHub, and % that get closed

A lot of people start their issues with “It would be nice…”, but if you want your issue closed, try something like “I am trying to”, “When I try to”, “Not sure if this”, and “I get the following”.

In summary

You can easily use BigQuery to find patterns in GitHub’s code, and in GitHub’s timeline. In this article we focused on the timeline and issues, but take a look around and analyze more trends. Follow me on Twitter and other BigQuery news on reddit.com/r/bigquery. If you haven’t tried BigQuery yet, get started in 5 minutes!

Felipe Hoffa