Unloading GitHub data by Xplenty

Tamás Srancsik
Bitrise
Published in
4 min readOct 5, 2018

We showed a possible way to create ad-hoc reports on GitHub data, but if you regularly work with information about the repositories it is worth building a package that would collect this information in a regular relational database. At Bitrise we use Xplenty for ETL tasks. Let me guide you building a package processing this data.

Since values in response from GitHub API are URLs themselves we need a couple of steps to process all the data: get, flatten and parse.

GET

Bitrise’s GitHub repositories are organized into multiple organizations. We should collect all the repositories belonging to the organizations and then apply transformations on the results of related endpoints.

REST API source is needed to get the body from the /repos endpoint. Please spend a minute looking at the settings below.

GitHub encourages us to explicitly refer to the API version by adding theapplication/vnd.github.v3+json value to the Accept header. The other header we need is Authorization. Create an OAuth application on GitHub in order to avoid rate limiting. Add its token here with token YOUR_APP_TOKEN. It is even more elegant if you set your token as a package variable. Naturally, in this case token $YOUR_VARIABLE should be the value.

$[*] JSON Expression tells Xplenty that it should process the unnamed array from the response. This array, although lacking a name, contains all repositories. The original response from the API looks like

HTTP/1.1 200 OK
Server: GitHub.com
Date: Wed, 26 Sep 2018 13:52:02 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Status: 200 OK
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1537973522
Cache-Control: private, max-age=60, s-maxage=60
Vary: Accept, Authorization, Cookie, X-GitHub-OTP
ETag: W/"edc260dca5441d160a7891d657016c89"
X-OAuth-Scopes: read:org, repo
X-Accepted-OAuth-Scopes:
X-GitHub-Media-Type: github.v3; format=json
Link: <https://api.github.com/organizations/16227531/repos?page=2>; rel="next", <https://api.github.com/organizations/16227531/repos?page=3>; rel="last"
Access-Control-Expose-Headers: ETag, Link, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval
Access-Control-Allow-Origin: *
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
X-Frame-Options: deny
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin
Content-Security-Policy: default-src 'none'
X-Runtime-rack: 0.370470
Content-Encoding: gzip
X-GitHub-Request-Id: FB6A:0AB1:578731:A0F710:5BAB8F01
[
{
"id": 33329444,
"node_id": "MDEwOlJlcG9zaXRvcnkzMzMyOTQ0NA==",
"name": "steps-github-status",
"full_name": "bitrise-steplib/steps-github-status",
"private": false,
"owner": {
"login": "bitrise-steplib",
"id": 16227531,
"node_id": "MDEyOk9yZ2FuaXphdGlvbjE2MjI3NTMx",
"avatar_url": "https://avatars3.githubusercontent.com/u/16227531?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/bitrise-steplib",
"html_url": "https://github.com/bitrise-steplib",
"followers_url": "https://api.github.com/users/bitrise-steplib/followers",
"following_url": "https://api.github.com/users/bitrise-steplib/following{/other_user}",
"gists_url": "https://api.github.com/users/bitrise-steplib/gists{/gist_id}",
"starred_url": "https://api.github.com/users/bitrise-steplib/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/bitrise-steplib/subscriptions",
"organizations_url": "https://api.github.com/users/bitrise-steplib/orgs",
"repos_url": "https://api.github.com/users/bitrise-steplib/repos",
"events_url": "https://api.github.com/users/bitrise-steplib/events{/privacy}",
"received_events_url": "https://api.github.com/users/bitrise-steplib/received_events",
"type": "Organization",
"site_admin": false
},
...

We named all the organizations the repositories of which we are interested in and collected them into one list by sequential union steps. Unfortunately, this step can only use two input tables.

The response header contains links to next pages so we can tick Use pagination at the default settings.

Even more GET

Actually, we are often not only interested in the repositories. Although this endpoint tells the number of open issues we do not know how many we’ve closed. In order to cope with such a problem or to collect information on pull requests, we initiate a new request on the endpoints named in repository data.

To process JSON data we should add the following Select components.

Curl function with a GET parameter is used to request data from an endpoint which is stored as a value. For example, pull requests against a specific repository:

Curl(
CONCAT((chararray)url, '/pulls'),
'GET',
'{"Accept":"application/vnd.github.v3+json",
"Authorization":"token $access_token"}'
)

This function stores the whole JSON response into the field of our table. In the next step, we have to extract the body from this JSON. If we named the field storing the JSON response the function extracting the array of pull requests is the one below.

JsonExtract(pulls#'body', '$[*]')

Note that just like we did with the repositories, we need to extract all in the unnamed array from the body of thepulls field. The type of pulls is a map, a sequence of key-value pairs, and we can reference the map key with #.

Flatten

Flatten() generates rows for each element of a bag. It also means that the final function in this step is

Flatten(JsonStringToBag(pulls))

JsonStringToBag() is applied first since pulls is a JSON string.

The result of this step is a new field, and each record contains one pull request. Let’s call the new field pull.

Parse

There is only one step remaining: parsing the JSON key-value pairs into new fields and values. Each field needs

  1. pull converted to JSON
  2. JSON string converted to map
  3. the reference of the field key
JsonStringToMap(ToJson(pull))#'title'

The only thing you still need is a destination table in a relational database you favour. Do not forget to add Post-action SQL commands creating indexes on your new table and records.

Originally published at blog.bitrise.io.

--

--