Unloading GitHub data by Xplenty
We showed a possible way to create ad-hoc reports on GitHub data, but if you regularly work with information about the repositories it is worth building a package that would collect this information in a regular relational database. At Bitrise we use Xplenty for ETL tasks. Let me guide you building a package processing this data.
Since values in response from GitHub API are URLs themselves we need a couple of steps to process all the data: get, flatten and parse.
GET
Bitrise’s GitHub repositories are organized into multiple organizations. We should collect all the repositories belonging to the organizations and then apply transformations on the results of related endpoints.
REST API source is needed to get the body from the /repos
endpoint. Please spend a minute looking at the settings below.
GitHub encourages us to explicitly refer to the API version by adding theapplication/vnd.github.v3+json
value to the Accept
header. The other header we need is Authorization
. Create an OAuth application on GitHub in order to avoid rate limiting. Add its token here with token YOUR_APP_TOKEN
. It is even more elegant if you set your token as a package variable. Naturally, in this case token $YOUR_VARIABLE
should be the value.
$[*]
JSON Expression tells Xplenty that it should process the unnamed array from the response. This array, although lacking a name, contains all repositories. The original response from the API looks like
HTTP/1.1 200 OK
Server: GitHub.com
Date: Wed, 26 Sep 2018 13:52:02 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Status: 200 OK
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1537973522
Cache-Control: private, max-age=60, s-maxage=60
Vary: Accept, Authorization, Cookie, X-GitHub-OTP
ETag: W/"edc260dca5441d160a7891d657016c89"
X-OAuth-Scopes: read:org, repo
X-Accepted-OAuth-Scopes:
X-GitHub-Media-Type: github.v3; format=json
Link: <https://api.github.com/organizations/16227531/repos?page=2>; rel="next", <https://api.github.com/organizations/16227531/repos?page=3>; rel="last"
Access-Control-Expose-Headers: ETag, Link, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval
Access-Control-Allow-Origin: *
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
X-Frame-Options: deny
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin
Content-Security-Policy: default-src 'none'
X-Runtime-rack: 0.370470
Content-Encoding: gzip
X-GitHub-Request-Id: FB6A:0AB1:578731:A0F710:5BAB8F01[
{
"id": 33329444,
"node_id": "MDEwOlJlcG9zaXRvcnkzMzMyOTQ0NA==",
"name": "steps-github-status",
"full_name": "bitrise-steplib/steps-github-status",
"private": false,
"owner": {
"login": "bitrise-steplib",
"id": 16227531,
"node_id": "MDEyOk9yZ2FuaXphdGlvbjE2MjI3NTMx",
"avatar_url": "https://avatars3.githubusercontent.com/u/16227531?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/bitrise-steplib",
"html_url": "https://github.com/bitrise-steplib",
"followers_url": "https://api.github.com/users/bitrise-steplib/followers",
"following_url": "https://api.github.com/users/bitrise-steplib/following{/other_user}",
"gists_url": "https://api.github.com/users/bitrise-steplib/gists{/gist_id}",
"starred_url": "https://api.github.com/users/bitrise-steplib/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/bitrise-steplib/subscriptions",
"organizations_url": "https://api.github.com/users/bitrise-steplib/orgs",
"repos_url": "https://api.github.com/users/bitrise-steplib/repos",
"events_url": "https://api.github.com/users/bitrise-steplib/events{/privacy}",
"received_events_url": "https://api.github.com/users/bitrise-steplib/received_events",
"type": "Organization",
"site_admin": false
},
...
We named all the organizations the repositories of which we are interested in and collected them into one list by sequential union steps. Unfortunately, this step can only use two input tables.
The response header contains links to next pages so we can tick Use pagination
at the default settings.
Even more GET
Actually, we are often not only interested in the repositories. Although this endpoint tells the number of open issues we do not know how many we’ve closed. In order to cope with such a problem or to collect information on pull requests, we initiate a new request on the endpoints named in repository data.
To process JSON data we should add the following Select
components.
Curl function with a GET
parameter is used to request data from an endpoint which is stored as a value. For example, pull requests against a specific repository:
Curl(
CONCAT((chararray)url, '/pulls'),
'GET',
'{"Accept":"application/vnd.github.v3+json",
"Authorization":"token $access_token"}'
)
This function stores the whole JSON response into the field of our table. In the next step, we have to extract the body from this JSON. If we named the field storing the JSON response the function extracting the array of pull requests is the one below.
JsonExtract(pulls#'body', '$[*]')
Note that just like we did with the repositories, we need to extract all in the unnamed array from the body
of thepulls
field. The type of pulls
is a map, a sequence of key-value pairs, and we can reference the map key with #
.
Flatten
Flatten()
generates rows for each element of a bag. It also means that the final function in this step is
Flatten(JsonStringToBag(pulls))
JsonStringToBag()
is applied first since pulls
is a JSON string.
The result of this step is a new field, and each record contains one pull request. Let’s call the new field pull
.
Parse
There is only one step remaining: parsing the JSON key-value pairs into new fields and values. Each field needs
pull
converted to JSON- JSON string converted to map
- the reference of the field key
JsonStringToMap(ToJson(pull))#'title'
The only thing you still need is a destination table in a relational database you favour. Do not forget to add Post-action SQL
commands creating indexes on your new table and records.
Originally published at blog.bitrise.io.