Serverless git repo history analysis

Serverless time travel is possible now :)

I have just published a sample project that demonstrates how to perform an automatic code review using the GitHub API.

It is based on another post I wrote some time ago.

In this particular example I am analysing project history looking for “hot spots”.

You can find an introduction in the README, but to summarise briefly:

Hot spots are the files that are edited most frequently over the project history.
Every modification of such a file carries a risk of introducing a bug. Frequent edits also signal that the file may violate good design practices such as the single responsibility principle, especially if it is a large file.
Analysing a history of changes in a project may lead to many more interesting discoveries than just static source code analysis.
The subject is called “code forensics” and more about it can be found in the great book Your Code as a Crime Scene by Adam Tornhill.

In this post I’d like to concentrate more on the code explanation and configuration.

GitHub webhook

First, let's see what a GitHub webhook is and how to configure it.
In short, it is an HTTP request that GitHub sends to a predefined URL when certain actions happen, e.g. when a pull request is created or edited.

You can configure it independently for each project in the project settings.

Webhook configuration

The configuration needs two things: an endpoint, which we will develop using the Serverless Framework and AWS API Gateway + Lambda, and a random token called the “Secret”.

Webhook creation

We can generate the secret, e.g. with Ruby in a shell:

ruby -rsecurerandom -e 'puts SecureRandom.hex(20)'
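
If Ruby is not at hand, a secret of the same shape can be produced in Java. This is only a minimal sketch: the class and method names are hypothetical and are not part of the sample project.

```java
import java.security.SecureRandom;

// Hypothetical helper, not part of the sample project.
final class WebhookSecret {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Returns 2 * numBytes lowercase hex characters taken from a secure random source.
    static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        new SecureRandom().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(numBytes * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 20 random bytes give 40 hex characters, like SecureRandom.hex(20) in Ruby.
        System.out.println(randomHex(20));
    }
}
```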

Once we develop the service we will update the URL.

Webhook endpoint

We can generate a Java application with the Serverless Framework using:

sls create --path webhook-service --name webhook-service --template aws-java-gradle

It will generate the sources in the webhook-service folder.
By default the service is called “hello”, which we will probably want to rename. The name has to be updated in two places:

  • in build.gradle
// set the base name of the zip file
baseName = "hello"
  • in serverless.yaml
package:
  artifact: build/distributions/hello.zip

The two values have to stay in sync: once we change baseName in build.gradle, the package is no longer called hello.zip, so the artifact path must point to the new file name.

What we need to do next is to configure two Lambda functions in serverless.yaml and give permission for Lambda to call another function.

The file will look as follows:

service: webhook-service
provider:
  name: aws
  runtime: java8
  region: us-east-1
  # configuring permissions to invoke one lambda from another
  iamRoleStatements:
    - Effect: Allow
      Action:
        - lambda:InvokeFunction
      Resource: "*"
package:
  artifact: build/distributions/webhook-service.zip
functions:
  webhook:
    handler: com.serverless.ApiGatewayHandler
    timeout: 30 # max API Gateway timeout
    events:
      - http:
          path: webhook
          method: post
          cors: true
  job:
    handler: com.serverless.Job
    timeout: 900 # 15 min.

I set the maximum possible timeout for each function.

Then we can move on to the Java implementation.
When testing the logic for the first time, we can forget about the second function and put all the logic in the first one. This speeds up development, because we do not have to scan the logs of both functions or wait for the asynchronous execution to complete. We just need to pick a small git repo for the analysis, so that it completes within the maximum timeout of the first function, which is 30 seconds.

Then we will move the implementation to the second function (called “job”), and in the first one (called “webhook”) we will only invoke the second one asynchronously using the AWS SDK.

Here is the logic of webhook processing:

For security reasons, check if all required request headers are present:

Map<String, Object> headers = (Map<String, Object>) input.get("headers");
String sig = (String) headers.get("X-Hub-Signature");
String githubEvent = (String) headers.get("X-GitHub-Event");
String id = (String) headers.get("X-GitHub-Delivery");
validate(notEmpty(sig), "No X-Hub-Signature found on request");
validate(notEmpty(githubEvent), "No X-GitHub-Event found on request");
validate(notEmpty(id), "No X-GitHub-Delivery found on request");
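
HTTP header names are case-insensitive, so depending on the client or a proxy in between, the names may not arrive in exactly the case used above. A minimal sketch of a case-insensitive lookup follows, assuming the headers come in as a plain map as in the snippet above. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the sample project.
final class WebhookHeaders {
    // Copies the headers into a map whose key lookups ignore case.
    static Map<String, Object> caseInsensitive(Map<String, Object> headers) {
        Map<String, Object> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        if (headers != null) {
            copy.putAll(headers);
        }
        return copy;
    }

    // Returns the header value or fails with a message like the ones used in the post.
    static String required(Map<String, Object> headers, String name) {
        Object value = headers.get(name);
        if (!(value instanceof String) || ((String) value).isEmpty()) {
            throw new IllegalArgumentException("No " + name + " found on request");
        }
        return (String) value;
    }
}
```

With this, `required(caseInsensitive(headers), "X-Hub-Signature")` finds the value even if it was sent as `x-hub-signature`.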

Then we take the signature and compare it with the value we calculate ourselves:

String calculatedSig = "sha1=" + calculateRFC2104HMAC(body, WEB_HOOK_TOKEN);
validate(sig.equals(calculatedSig), "X-Hub-Signature incorrect. Github webhook webHookToken doesn't match");

If they do not match, we throw an exception.

The algorithm for calculating the signature is simple and uses only standard Java classes:

import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
...
private static final String HMAC_SHA1_ALGORITHM = "HmacSHA1";
private static final char[] hexCode = "0123456789abcdef".toCharArray();

public static String calculateRFC2104HMAC(String data, String key)
        throws NoSuchAlgorithmException, InvalidKeyException {
    SecretKeySpec signingKey = new SecretKeySpec(key.getBytes(), HMAC_SHA1_ALGORITHM);
    Mac mac = Mac.getInstance(HMAC_SHA1_ALGORITHM);
    mac.init(signingKey);
    return printHexBinary(mac.doFinal(data.getBytes()));
}

private static String printHexBinary(byte[] data) {
    StringBuilder r = new StringBuilder(data.length * 2);
    for (byte b : data) {
        r.append(hexCode[(b >> 4) & 0xF]);
        r.append(hexCode[(b & 0xF)]);
    }
    return r.toString();
}
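
The check shown earlier compares the two signatures with `equals`, which can stop at the first differing character. A possible hardening, sketched here as a variation and not as the project's actual code, is to compare with `MessageDigest.isEqual` and to fix the charset to UTF-8 instead of relying on the platform default. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, not part of the sample project.
final class SignatureCheck {
    // HMAC-SHA1 of data under key, as lowercase hex, with an explicit UTF-8 charset.
    static String hmacSha1Hex(String data, String key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] raw = mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(raw.length * 2);
            for (byte b : raw) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA1 is not available", e);
        }
    }

    // True if the X-Hub-Signature header value matches the signature of the body.
    static boolean matches(String headerValue, String body, String secret) {
        String expected = "sha1=" + hmacSha1Hex(body, secret);
        // isEqual is meant to take the same time wherever the first difference is.
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                headerValue.getBytes(StandardCharsets.UTF_8));
    }
}
```

In the handler this would replace `sig.equals(calculatedSig)` with `SignatureCheck.matches(sig, body, WEB_HOOK_TOKEN)`.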

Next, we can extract the body from the webhook request.

String body = (String) input.get("body");
PushEvent pushEvent = OBJECT_MAPPER.readValue(body, PushEvent.class);

I use the Jackson library to deserialize the JSON string into my Java bean `PushEvent`.

We need only the following data from the request:

  • ref — a branch name, e.g.:
"refs/heads/test-1"
  • repository.name — a repository name
  • repository.url — a repository url, e.g.:
"https://github.com/john/reponame"
  • repository.owner.name — a github user name
  • compare — a diff url, e.g.:
"https://github.com/john/reponame/compare/bd31aff3fd8c^...53b54347ec75"

First, we need to clone the repository into the Lambda temp folder. For this we need ref and repository.name.

Next, we walk the git history from the very beginning to build a list of the 10 most frequently modified files, each with its modification count.
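
The counting step can be sketched on its own, independent of how the history is read. The sketch below assumes we already have, for every commit, the list of paths it touched (in the project this comes from JGit, which is left out here). The class name, method name and input shape are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the sample project.
final class HotSpots {
    // Counts how many commits touched each path and keeps the `limit` most frequent ones.
    static Map<String, Integer> top(List<List<String>> pathsPerCommit, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> paths : pathsPerCommit) {
            for (String path : paths) {
                counts.merge(path, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken by path so the result is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(limit, entries.size()))) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Calling `HotSpots.top(pathsPerCommit, 10)` gives the top 10 in descending order of modification count.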

Then we take the list of files edited between the two commits in the compare URL, so we have to extract the hashes from it. We will get values like bd31aff3fd8c^ and 53b54347ec75, which we can use to make a diff.
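
Extracting the two values is plain string handling. Here is a minimal sketch, assuming the compare URL always ends with a `base...head` segment as in the example above. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of the sample project.
final class CompareUrl {
    // Returns {base, head} taken from the last path segment of a compare URL.
    static String[] revisions(String compareUrl) {
        String range = compareUrl.substring(compareUrl.lastIndexOf('/') + 1);
        int sep = range.indexOf("...");
        if (sep < 0) {
            throw new IllegalArgumentException("Not a compare URL: " + compareUrl);
        }
        return new String[] { range.substring(0, sep), range.substring(sep + 3) };
    }
}
```

The trailing `^` on the first value is git's notation for the parent of that commit, so the value can be passed on as a revision expression unchanged.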

If any of these files belong to the most frequently edited ones, we will create a comment using the GitHub API in each of the pull requests for the analysed branch.
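
The matching itself is a simple intersection. The sketch below assumes the hot spots are available as a map from path to modification count and the diff as a collection of paths. The class name and the wording of the comment are hypothetical and may differ from what the sample project posts.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the sample project.
final class HotSpotComment {
    // Builds the review text, or returns empty if no changed file is a hot spot.
    static Optional<String> build(Map<String, Integer> hotSpots, Collection<String> changedFiles) {
        StringBuilder lines = new StringBuilder();
        for (Map.Entry<String, Integer> e : hotSpots.entrySet()) {
            if (changedFiles.contains(e.getKey())) {
                lines.append("- ").append(e.getKey())
                     .append(" (").append(e.getValue()).append(" changes)\n");
            }
        }
        if (lines.length() == 0) {
            return Optional.empty();
        }
        return Optional.of("This change touches frequently modified files:\n" + lines);
    }
}
```

An empty result means there is nothing to report, so no comment needs to be posted.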

The webhook payload does not contain information about the pull requests for the branch, so we need to call another GitHub API endpoint to get it.

Finally, for each pull request we will add a comment.

The curl commands for the GitHub API calls mentioned above look like this:

  • to get pull requests
curl -X GET -H 'Authorization: token {personalApiToken}' \
'https://api.github.com/repos/{githubUser}/{githubRepo}/pulls?state=open&head={head}'
  • to add a review comment
curl -X POST -H 'Authorization: token {personalApiToken}' \
'https://api.github.com/repos/{githubUser}/{githubRepo}/pulls/{pullRequestNumber}/reviews' \
-d '{"event": "COMMENT", "body": "{reviewContent}"}'
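
The placeholders in these URLs can be filled from the webhook data. The sketch below only builds the URLs and leaves the HTTP calls out. It assumes, following my reading of the GitHub API documentation, that the head filter has the form user:branch, and that the branch name is what remains of ref after the refs/heads/ prefix. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the sample project.
final class GithubUrls {
    private static final String API = "https://api.github.com/repos/";

    // "refs/heads/test-1" -> "test-1"; anything else is returned unchanged.
    static String branchOf(String ref) {
        String prefix = "refs/heads/";
        return ref.startsWith(prefix) ? ref.substring(prefix.length()) : ref;
    }

    // URL listing the open pull requests whose head is the pushed branch.
    static String openPullsUrl(String owner, String repo, String ref) {
        String head = owner + ":" + branchOf(ref); // assumed format: user:branch
        return API + owner + "/" + repo + "/pulls?state=open&head=" + encode(head);
    }

    // URL for posting a review on the given pull request.
    static String reviewsUrl(String owner, String repo, int pullRequestNumber) {
        return API + owner + "/" + repo + "/pulls/" + pullRequestNumber + "/reviews";
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

The owner, repository name and ref map to repository.owner.name, repository.name and ref from the push event.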

I am using a lightweight REST client (Unirest) to make these calls.

This example has taught me that the JGit API is really powerful and that there is no need to have the git executable installed to use it. We can do everything we would do with git commands, and even more.
It would be interesting, for example, to compare the changed fragments of files, analyse them for violations of coding rules and add specific comments next to each line, the way people do in code reviews.

There are already such services on the market. Some of them also use artificial intelligence to make predictions about how the project code will evolve.

Knowing how to play with git yourself opens up new possibilities. You never know when you will come up with an idea for an analysis that no one has done yet.