Evolving our Android CI to the Cloud (1/3): From Jenkins to GitLab CI

Victor Caveda · Published in BestSecret Tech
8 min read · Jan 26, 2024



As an Android engineer, you’re passionate about crafting innovative features that captivate your users, not about untangling the complexities of infrastructure management. Well, guess what? Your awesome features won’t reach their full potential without the unwavering support of a robust and cutting-edge CI/CD infrastructure powering your development workflow 😉

At BestSecret, we ship a new version of our mobile apps every week, making our pipelines a central piece of our Continuous Delivery strategy. Any glitch has a high impact on our daily work. Investing time and effort in keeping your CI/CD reliable and efficient always pays off because, ultimately, the pipeline is the judge that decides whether a feature can be safely put in the hands of your customers.

Join us in this series of three articles where we share our experience in transforming our legacy Android CI into a modern Docker-based CI running in the Cloud 🚀 We’ll delve into the core aspects of a high-performing CI/CD and explore the hands-on details of dockerizing Android tasks.

Overview of our Android Jenkins CI

The previous CI/CD used by our Android teams consisted of two on-premise bare-metal servers running Jenkins in a Controller/Agent setup. The pipelines included different steps, ranging from building the artifacts to publishing the app on the different distribution channels.


The Controller simply orchestrated the Agent and hosted the main console. The Agent, a 40-core Intel Xeon with 96 GB of RAM running Ubuntu, did most of the heavy lifting. Among the tasks it took care of were:

  • Executing the build processes
  • Running the tests
  • Hosting the emulators for the instrumented tests
  • Pushing the artifacts to the distribution channels

Other elements also played a role in our CI/CD infrastructure, such as the SonarQube server and Firebase App Distribution. The overall system looked like this:

Overview of the Android Jenkins CI

The tasks that required a device (e.g. instrumented tests) could spin up and shut down, on demand, stateless Android emulators installed on the Agent.

Different Types of Pipelines

The Android teams follow the GitFlow approach with pipelines covering every step of the process. Even though our system is a bit more intricate, the diagram below breaks down the most important pipelines, outlining their steps and when they kick into action:

Relevant Types of Pipelines in our Workflow

As you can see, the most comprehensive one is the Feature pipeline, which is triggered when the team wants to merge a feature into the Main branch.

How long did a Jenkins pipeline take to complete?

With the setup described above, this was the average duration of the most relevant pipelines:

Performance of Jenkins Pipelines Measured in Minutes

In a nutshell, developers had to wait around 90 minutes to get a green pipeline that allowed them to merge their features into the Main branch. The step that took most of the time was the UI tests, at approximately 45 minutes. It’s also important to stress that all pipelines were executed entirely by the Agent, and the CI was configured to allow a Subtask and a Feature pipeline to run concurrently.

Why Was The Jenkins CI Problematic?

Let’s describe the most relevant problems associated with the Jenkins setup as we had it.


Fragility: a hardware failure disrupts the team’s work

The whole CI relied on the bare-metal Agent server to keep working, which essentially turned it into our single point of failure. Should the Agent hardware fail, the workflow of the Android teams would be completely disrupted. To make things even worse, restoring the Agent configuration wasn’t a trivial task because the whole toolchain needed for building and testing the Android app was installed locally on the host OS.

It didn’t scale up easily

The only way our Jenkins setup could scale up was by adding more on-premise machines and registering them as workers. However, this required installing the whole software toolchain on each of them. While doable, this can easily backfire: the moment any piece of the toolchain needs an update, it has to be reconfigured on every single server. So the more hardware we throw at the environment, the more expensive it becomes to maintain.

It required complex configuration

Running instrumented tests involved emulators. Since multiple concurrent jobs used them, we needed to make sure that no port collisions took place; otherwise, an emulator couldn’t launch and its testing job failed. To sort this out, we assigned different port ranges depending on the type of pipeline. This logic added complexity to the Jenkinsfiles.

Lack of Isolation

Sometimes emulators didn’t shut down properly, leaving their ports blocked and making subsequent pipelines fail. This violates the principle of isolation that every pipeline should follow: the pipeline’s results shouldn’t be affected by previous executions or by other pipelines. Having all pipelines share the same machine and OS increased the likelihood of them impacting one another negatively and eventually ending up with a failure.

Scarcity

The duration of the pipelines was way higher than the teams would have liked. Even worse, since the system didn’t scale well, the restriction on concurrent pipelines meant that, for example, a developer who wanted to run a Feature pipeline had to wait for the current one to finish. As a result, developers often faced delays and bottlenecks, disrupting their workflow and slowing down the progress of the project.

For all these reasons, we sat down and decided that it was time to come up with something better 🤔

Designing the New System

Before rolling up our sleeves and diving headfirst into the task, we had to figure out the answers to these questions:

  • How do we want to run our Android builds?
  • Can we reduce the impact of the flaky tests?
  • How would we scale up in case the teams grow?
  • Would it be possible to improve the performance of the pipeline?

The brainstorming led us to the following list of characteristics we wanted our new CI/CD to have:


Robustness: removing the dependency on on-premise hardware

The first thing we want our system to be is robust. As a high-performing team that ships new versions every week, the last thing we want is to be impacted by an on-premise hardware meltdown. We aimed to run our pipelines primarily in the cloud so that we wouldn’t have a single point of failure. This doesn’t mean we can’t run jobs on on-premise hardware; it means that the pipelines should be able to run anywhere, without being tied to specific hardware with a particular toolchain installed. If a node fails, it should be quick and easy to replace.

Better performance through parallelization

The Android teams were not happy with the performance of the Jenkins CI/CD, so we aimed to improve it in two ways:

  • Increasing the concurrency of the jobs to shorten the overall duration. When there are free execution agents (runners), we want to parallelize as many pipeline steps as possible.
  • Removing the constraint on the number of pipelines running. A pipeline shouldn’t have to wait for the previous one to finish; instead, it should begin executing as soon as there are free runners available (see the sketch after this list).
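
To make this concrete, here is a minimal, hypothetical sketch of how this kind of parallelization can be expressed in GitLab CI (the platform we eventually chose, as described below). The stages, job names and Gradle tasks are purely illustrative, not our actual configuration; the point is that jobs declaring their dependencies with `needs` start as soon as those dependencies finish, rather than waiting for the whole previous stage, and separate pipelines run concurrently whenever runners are free:

```yaml
# .gitlab-ci.yml (illustrative sketch, not our production pipeline)
stages:
  - build
  - test

assemble:
  stage: build
  script:
    - ./gradlew assembleDebug        # build the debug artifact

unit_tests:
  stage: test
  needs: [assemble]                  # starts right after 'assemble', in parallel with 'lint'
  script:
    - ./gradlew testDebugUnitTest

lint:
  stage: test
  needs: [assemble]                  # runs alongside 'unit_tests' on any free runner
  script:
    - ./gradlew lintDebug
```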

Isolated environments per job

One of the biggest sources of confusion was when one pipeline created side effects in another. This is a common issue in the realm of unit testing, and it usually comes about when resources are shared by multiple suites. In the case of pipelines, we wanted to isolate unrelated jobs from each other so they wouldn’t produce any undesired side effects.

Repeatable jobs

When you discover that an external condition has broken the pipeline, the last thing you want to do is run the whole thing over again. The new CI/CD must allow retrying jobs independently in those cases where external factors (e.g. connectivity glitches) make them fail.
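
GitLab CI supports this both manually (any failed job can be retried on its own from the UI, without rerunning the rest of the pipeline) and automatically through the `retry` keyword. Here is a small, hypothetical example (the job name and Gradle task are placeholders) that retries a job up to twice, but only for failure reasons that are typically outside the code’s control:

```yaml
# Illustrative job definition; name and script are placeholders
instrumented_tests:
  stage: test
  script:
    - ./gradlew connectedDebugAndroidTest
  retry:
    max: 2                           # retry the job up to two more times
    when:
      - runner_system_failure        # the runner itself had a problem
      - stuck_or_timeout_failure     # the job got stuck or timed out
      - api_failure                  # a GitLab API hiccup
```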

The Android GitLab CI

Taking all of this into account, we went for the following configuration:

Overview of the Android GitLab CI

These are the main aspects of the new CI:

  • GitLab CI platform. Having all our repos hosted in GitLab made GitLab CI our obvious choice for replacing Jenkins.
  • Jobs run inside Docker containers (build, tests, coverage, vulnerability scan…), which provide the desired level of isolation. This requires Docker images capable of running Gradle tasks and emulators (see the job sketch after this list).
  • An elastic pool of generic runners. By generic we mean plain Docker executors that can be cheaply replaced in case of failure. The runners are just regular Linux (virtual) machines.
  • Leveraging the existing on-premise hardware by turning it into GitLab runners too. The new CI uses cloud and on-premise machines interchangeably, so the Agent and the Controller previously used by Jenkins now become GitLab runners.
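
To give a rough idea of what a job looks like in this setup, here is a hypothetical definition (the registry path, image name, runner tag and Gradle task are placeholders, not our actual setup). Each job declares the Docker image it runs in and a tag that any matching runner, whether in the cloud or on-premise, can pick up:

```yaml
# Illustrative job; image and tag are placeholders
build_release:
  image: registry.example.com/android/android-build:latest   # image bundling the Android SDK and Gradle toolchain
  tags:
    - android-docker               # any runner (cloud or on-premise) registered with this tag can take the job
  script:
    - ./gradlew assembleRelease
```

Because the toolchain lives in the image rather than on the host, replacing a failed runner is just a matter of registering a new Docker executor with the same tag.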

Conclusion

Relying on an unstable legacy CI system can eventually bring your development process to a halt, significantly impacting the quality and frequency of your software delivery. Whether you’re aware of it or not, outdated infrastructure hinders your team’s productivity and innovation. Modernizing such a critical piece of your workflow might seem daunting, but the investment always pays off.

In this first part of the series, we’ve described how the legacy Jenkins CI for Android worked at BestSecret and the problems that made us switch to a more scalable and maintainable solution. We presented a general view of the new CI based on Docker images and GitLab CI. 🐳

Don’t miss the next part, where we’ll get into the technical details of the migration, in particular how to dockerize the Android tasks so you can run your CI/CD jobs virtually anywhere.


Victor Caveda
Principal Engineer @BestSecret | Formerly @PhoenixContactE, @Panda_Security