Exploring GitHub events with Azure Data Explorer

In this post, I’m going to demonstrate an easy way to explore GitHub events and discover interesting insights within the billions of events they publish.

I will use Azure Data Explorer (ADX) which was just released to public preview at Microsoft’s Ignite 2018 (you can read more about it here).

For those of you who have watched the keynote and saw the demo live, use the instructions below to recreate the demo on your own Azure account. You’ll also need Python installed and a GitHub account.

There are two ways to get and query GitHub events:
1. Live Stream (GitHub’s official API)
2. Archive Data (credit to GHArchive)

All code samples (Python) are free and can be conveniently accessed in this Gist.

Exploring Live Stream

Stack: Azure Event Hub, Azure Data Explorer, Python

Most of the steps are covered in the ADX documentation.

  1. Set up your Azure Data Explorer cluster
    Follow the steps and create your cluster and database.
  2. Set up EventHub and integrate it with Azure Data Explorer
    Follow these steps for creating an EventHub.

At the “creating a target table” section, use the following script to match the table with GitHub’s Event Type:

.create table GithubEvent ( Id:int64, Type: string, Actor: dynamic, Repo: dynamic, Payload: dynamic, Public:bool, CreatedAt: datetime)

Define the mapping:

.create table GithubEvent ingestion json mapping "GitMapping" '[{"column":"Id","path":"$.id"},{"column":"Type","path":"$.type"},{"column":"Actor","path":"$.actor"},{"column":"Repo","path":"$.repo"},{"column":"Payload","path":"$.payload"},{"column":"Public","path":"$.public"},{"column":"CreatedAt","path":"$.created_at"}]'

Now you can query the table and make sure it is properly set up.

3. Use GitHub’s Public API

Use a simple Python script to poll for new events and send them to EventHub.(Link to working code sample.)

You can of course write your own, but there are some edge cases handled in the script that can be tricky (event size, duplicate events, etc.)

To run the script, install the following requirements:

pip install requests azure-eventhub

As you can see in the script, to get a proper throttle limit you need to have an API token (otherwise you only get a few requests per hour), which can be created as explained here.

While running the script you should see a similar output:

python crawl_send_eventhub.py

4. Wait for data 
Make sure that data is flowing. When checking the Azure portal, you can see that events are arriving:

Event Hub overview in Azure Portal

5. Query

Query Azure Data Explorer:

GithubEvent | count

We can also make sure events are well formatted:

GithubEvent | limit 1

6. Explore

Learn a bit more about Azure Data Explorer Query language or just enjoy some of our example queries.

Exploring Archive Data

Stack: Azure Storage, Azure Data Explorer, Python

I prepared some scripts to get archived data and ingest it quickly. Since data isn’t streaming, we can skip setting up EventHub and do the following (you can also skip steps 1 and 2 if you completed them in the previous section):

  1. Setup Azure Data Explorer cluster: Skip this step if you already completed it)
  2. Create table with the following script (skip this step if already defined):
.create table GithubEvent ( Id:int64, Type: string, Actor: dynamic, Repo: dynamic, Payload: dynamic, Public:bool, CreatedAt: datetime)

And define mapping:

.create table GithubEvent ingestion json mapping "GitMapping" '[{"column":"Id","path":"$.id"},{"column":"Type","path":"$.type"},{"column":"Actor","path":"$.actor"},{"column":"Repo","path":"$.repo"},{"column":"Payload","path":"$.payload"},{"column":"Public","path":"$.public"},{"column":"CreatedAt","path":"$.created_at"}]'

3. Create an Azure Storage Container

We need to store the data on Azure Storage so we can easily ingest them to Azure Data Explorer afterwards, and so that we will have them persisted.

Follow the steps here and then create a container following these steps.

4. Download data from GHArchive and upload to Azure Storage

To run the script, install requirements:

pip install requests azure-kusto-ingest

This uses a simple Python script to download data and upload it to Azure Storage. Files are first downloaded locally to your machine, so it may run out of space. (Link to working code sample.)

The script is built so it can resume upon failure. If needed, re-run the script and it will look up previously downloaded data.

You can also change the data range you wish to download by easily adjusting the ranges in this code.

5. Once data is uploaded to storage, ingest it to Azure Data Explorer

To run the script, install requirements:

pip install requests azure-kusto-ingest

Link to working code sample

Now make sure that data is ready. Note: This can take a while, as these are massive amounts of data. You can query the data while it is being ingested using Azure Data Explorer!

We can query Azure Data Explorer:

GithubEvent | count

Make sure events are well formatted:

GithubEvent | limit 1

When loading the entire archive, we should see quite a large dataset:

entire gharchive since 2016

Samples

Now, lets look at some cool graphs!

let watchedRepos =  
GithubEvent
| where Type == "WatchEvent"
| extend RepoName = tostring(Repo.name)
| summarize hint.shufflekey=RepoName WatchEvents=count() by RepoName;
let issuesRepos =
GithubEvent
| where Type in ("IssueCommentEvent", "IssuesEvent")
| extend RepoName = tostring(Repo.name)
| summarize hint.shufflekey=RepoName IssueEvents=count() by RepoName;
watchedRepos
| join hint.shufflekey=RepoName (issuesRepos) on RepoName
| extend Popularity=(WatchEvents * 0.3) + (IssueEvents*0.7)
| top 5 by Popularity
| project RepoName, Popularity
| render barchart

The above will produce the following chart:

// What upcoming languages should I concentrate on learning?
// notice: subqueries, CTEs, join, top, calculated columns, ...
let languages =
GithubEvent
| where Type == "PullRequestEvent"
| extend language=tostring(Payload.pull_request.head.repo.language);
let lastYear = languages | where CreatedAt between(datetime(2017-08-01) .. 30d) | summarize LastYearCount=count() by language;
let thisYear = languages | where CreatedAt between(datetime(2018-08-01) .. 30d) | summarize ThisYearCount=count() by language;
lastYear
| join (thisYear) on language
| extend diffPct = (ThisYearCount - LastYearCount) * 100.0 / LastYearCount
| where diffPct > 50.0 and ThisYearCount > 2000
| top 5 by ThisYearCount
| project-away language1
| project language , diffPct
| render columnchart

The above gives us the increase since last year:

And one last query; let’s see Microsoft’s rising tech on GitHub:

GithubEvent 
| where CreatedAt >= ago(180d)
| where Type == "WatchEvent"
| where (Repo.name startswith 'Azure' or Repo.name startswith 'Microsoft')
| summarize Count=count() by CreatedAt=bin(CreatedAt, 30d), Repo=tostring(Repo.name)
| order by Repo asc, CreatedAt asc
| serialize
| extend PrevCount = prev(Count), PrevRepo = prev(Repo)
| extend DiffRatio = iif(Repo == PrevRepo, (Count - PrevCount)*100.0 / PrevCount, 0.0)
| where DiffRatio > 30.0 and Count > 150
| summarize arg_max(CreatedAt, DiffRatio, PrevCount, Count) by Repo
| top 10 by DiffRatio
QuantumKatas is on fire!

Conclusion

That was a quick walk trough into using Azure Data Explorer to query GitHub Events. Please feel free to suggest new interesting queries and ask questions in the comments section!