As a large-scale tech company with many Android developers, we at Tokopedia use Jenkins CI for a faster and smoother process to create Android APK and AAB builds for integration testing before merging each feature branch into the main one. All the Android developers were happy with the system. Then, everything changed when the Fire Nation attacked.
Everything seemed fine in the beginning, but many problems arose once Jenkins usage got very high. One of them was missing Jenkins build results, due to the limitation that only the last 30 jobs are kept. Yes, we cannot keep all the jobs, because that would take a huge amount of storage space on the Jenkins server.
The most common incident goes like this. Let’s say Foo is an Android developer. Foo runs a Jenkins job to create an APK of his feature branch for integration testing. While the Jenkins job is running, he does other stuff (code reviews, meetings, etc.) because a job that builds a full APK usually takes a while. The job eventually completes and the resulting APK is ready to be downloaded. But no one tells him the job has completed until it is too late: the job is gone, overridden by the 30 newest jobs. Foo cannot access the resulting APK anymore. He has to run the job all over again and, of course, wait again.
The day is not going well for Bar, another Android developer, either. He also runs a Jenkins job to create an APK. His job eventually fails due to a compile error, but he doesn’t even know that it failed, and he cannot see the error message. The error log is already gone along with the corresponding job, overridden by the 30 newest jobs.
Facing this terrible situation, my brain started to think about what could be changed and what could not. I knew that removing the “only the last 30 jobs are kept” restriction was not a good idea, because the machine would blow up due to insufficient storage.
Let’s do a little math here. A single APK is around 100 MB. If developers and test engineers run 500 builds each day, that is about 50 GB per day, so 1 TB of space is gone in 20 days, and that is if this branch-builder job were the only job on Jenkins. In reality, we have many other Jenkins jobs: performance monitoring, Play Store bundling, unit testing, PR checking, etc. Without a limit on the number of jobs to keep, the Jenkins server would run out of space in just a few days. We don’t want to clean up the storage manually every few days, right?
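The back-of-the-envelope estimate above can be sketched in a few lines of shell arithmetic (the 500-builds and 100 MB figures are the ones from this post, not measurements):

```shell
# Rough storage math: how fast would unpruned build artifacts fill 1 TB?
BUILDS_PER_DAY=500
APK_SIZE_MB=100
DAILY_GB=$(( BUILDS_PER_DAY * APK_SIZE_MB / 1000 ))  # 50 GB of APKs per day
DAYS_TO_FILL_1TB=$(( 1000 / DAILY_GB ))              # 1 TB gone in 20 days
echo "${DAILY_GB} GB/day -> 1 TB gone in ${DAYS_TO_FILL_1TB} days"
```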
Then an idea sparked like lightning in my mind. The root cause is actually this: a developer doesn’t know whether the build has succeeded or failed until it’s too late, unless they refresh the Jenkins page on a regular basis. So the next thing to do was figuring out how to get developers notified about the build result.
Knowing that the best and fastest way to notify developers about something is through Slack, more than email or anything else, I planned to create an automated system that notifies each developer through Slack about the build result on Jenkins.
This cool API from Slack really helps a lot: it can be used to automatically send a message to a specific Slack channel, mention a specific user, and reply to a specific thread. All of it can be done via a simple HTTP request.
All we need is to create a Slack bot and grab that bot’s access token. And voilà, we can perform all Slack activities on behalf of that bot.
To create the Slack bot and get the token, simply open this site.
After creating the bot, defining its scopes, and installing the bot to the workspace, we can get the access token here:
The good thing is that a Jenkins script can easily be crafted with a regular shell script, and a simple HTTP request can be triggered from a shell script via the curl command.
Here is the curl command.
curl --request POST \
  --header "Authorization: Bearer insert_your_token_here" \
  --data 'channel=slack_channel_name&text=hello%20world' \
  https://slack.com/api/chat.postMessage
Here is the result.
The Jenkinsfile script in the builder job needs to be modified to integrate this Slack notification system. At the beginning of a builder job, it mentions the user who ran the build job on a dedicated android-dev-branch-builder-info Slack channel, to make him/her aware that the job is already running and in progress.
Fortunately, Slack and Jenkins users use the same company email, so this shared email can be used to get the Slack user id to mention. From the Jenkins script, we can retrieve the job runner’s email address, then use that email to retrieve the Slack user id via this Slack API.
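A minimal sketch of that lookup step, assuming the email-to-id endpoint is Slack’s users.lookupByEmail method. The curl call is shown as a comment because it needs a real token; the sample JSON response and the sed-based extractor (used here instead of assuming jq is installed) are purely illustrative:

```shell
# Pull the "id" field out of a Slack JSON response (crude sed parser, no jq needed).
extract_user_id() { sed -n 's/.*"id":"\([^"]*\)".*/\1/p'; }

# In the real Jenkins script the response would come from Slack, e.g.:
#   curl -s --header "Authorization: Bearer ${SLACK_TOKEN}" \
#     "https://slack.com/api/users.lookupByEmail?email=${BUILD_USER_EMAIL}"
RESPONSE='{"ok":true,"user":{"id":"U012ABCDEF","name":"foo"}}'  # sample response
SLACK_USER_ID=$(printf '%s' "$RESPONSE" | extract_user_id)
echo "<@${SLACK_USER_ID}>"   # the mention syntax Slack expects inside message text
```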
Here is what the message looks like.
After the job finishes running, the job runner is notified again whether it failed or succeeded, via a reply in a thread under the earlier beginning-of-the-job message. When we sent that earlier message notifying that the job had started, its thread id was retrieved from the Slack API response. This stored thread id is later used to reply to the message in a thread.
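The start-then-reply flow can be sketched like this. In chat.postMessage responses the message id is the "ts" field, which doubles as the thread id when passed back as thread_ts; the curl calls are commented out (they need a real token and channel, both placeholders here) and the sample response is illustrative:

```shell
# Extract the message timestamp ("ts") that identifies the thread.
extract_ts() { sed -n 's/.*"ts":"\([0-9.]*\)".*/\1/p'; }

# At job start, post the "build started" message, e.g.:
#   START_RESPONSE=$(curl -s --request POST \
#     --header "Authorization: Bearer ${SLACK_TOKEN}" \
#     --data "channel=${CHANNEL}&text=Build+started" \
#     https://slack.com/api/chat.postMessage)
START_RESPONSE='{"ok":true,"channel":"C0123","ts":"1618033988.000200"}'  # sample
THREAD_TS=$(printf '%s' "$START_RESPONSE" | extract_ts)

# When the job finishes, replying in the same thread only needs thread_ts added:
#   curl -s --request POST \
#     --header "Authorization: Bearer ${SLACK_TOKEN}" \
#     --data "channel=${CHANNEL}&thread_ts=${THREAD_TS}&text=Build+finished" \
#     https://slack.com/api/chat.postMessage
echo "$THREAD_TS"
```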
Here is what it looks like.
If it’s a success, it sends the download link so the corresponding developer can download the APK right away, before it’s gone and overridden by other newer jobs.
If it’s a failure, it also sends a preview of the error message, so the developer doesn’t need to go to the Jenkins console output, scroll, scroll, scroll, and get frustrated trying to find out what the error is.
Here is what the preview looks like.
This preview message can be retrieved by piping the stderr of the Gradle build command to a log file via the tee shell command, which splits the stream so the output goes to both the console and the log file in parallel.
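A small sketch of the tee idea. A stand-in function plays the role of the real build command (which in the Jenkinsfile would be something like ./gradlew assembleDebug); 2>&1 folds stderr into stdout so tee can both print the stream and save it:

```shell
# Stand-in for the real Gradle build command, just for illustration:
build_cmd() {
  echo "Task :app:compileDebugKotlin"
  echo "e: Unresolved reference: fooBar" 1>&2   # pretend compile error on stderr
}

# 2>&1 merges stderr into stdout; tee streams it to the console AND to build.log.
build_cmd 2>&1 | tee build.log
```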
Whoa, the error log is bloated? Don’t worry, we can condense that huge log file by using the grep shell command to find the “What went wrong” phrase, then using grep’s -B and -A flags (before-context and after-context, i.e. lines above and lines below the match) to keep only the subset from n lines above to n lines below that phrase.
Also, it would be nicer to provide an instant rebuild button that can trigger a rebuild on Jenkins with the exact same parameters (branch name, version, etc.) without going to the Jenkins page at all. Simply enable the remote build trigger option in the Jenkins job configuration, and we’re good to go!
How about the parameters? Well, they can be stored indirectly inside that build-failed notification Slack message, in the form of a hyperlink behind the instant rebuild button.
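A sketch of how that rebuild link might be assembled. The Jenkins URL, job name, branch parameter, and token below are all placeholder values; buildWithParameters is Jenkins’ standard remote-trigger endpoint once “Trigger builds remotely” is enabled:

```shell
# Placeholder values, for illustration only:
JENKINS_URL="https://jenkins.example.com"
REMOTE_TRIGGER_TOKEN="my-trigger-token"

# Compose the URL that goes behind the Slack "rebuild" button.
# $1 = job name, $2 = branch parameter baked into the link.
rebuild_url() {
  echo "${JENKINS_URL}/job/${1}/buildWithParameters?token=${REMOTE_TRIGGER_TOKEN}&BRANCH=${2}"
}

# Clicking the link (or curling it) kicks off the rebuild:
#   curl -X POST "$(rebuild_url android-branch-builder feature/login)"
rebuild_url "android-branch-builder" "feature/login"
```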
After this system was implemented, harmony came back to the world. No developer loses his/her build anymore. Henry Pri once again saves the world (I guess?). Yeah, that’s the life of a Principal Engineer: solving a complex problem in the development world with the simplest yet most impactful solution possible.