Evolving our Android CI to the Cloud (3/3): Optimizing the Performance

Victor Caveda · Published in BestSecret Tech
9 min read · Feb 9, 2024

We’ve reached the final part of our journey where we’ve explained how we modernized our Android CI. In the first part, we presented the original CI we were running and the problems that drove us to change it.

In the second part, we dived deep into the technical details of how to create a Docker image 🐳 for building and running your Android CI tasks anywhere.

As we bring this series to a close, let’s delve into a critical aspect that resonates with every developer: optimization. No one is happy waiting a long time for their pipeline to finish, so we’re going to share some strategies we adopted to make our Android GitLab CI faster. We’ll explain how we adjusted settings such as the GitLab cache and simplified steps to cut down on waiting times. ✂️

But that’s not all! As we bid farewell to this series, we want to leave you with more than just technical insights. We’ll take a moment to distill the most valuable lessons we’ve learned throughout this journey so you avoid the same mistakes we made 😉. Consider it a roadmap of pitfalls to avoid and shortcuts to success. 🚀

Improving the Pipeline Performance

Let’s go over some effective techniques to reduce your pipeline duration.

Parallelize All You Can

One of the most obvious strategies to make a pipeline faster is running multiple tasks in parallel. In so doing, the overall duration will be determined by the longest task. Our Android app follows the Clean Architecture paradigm, consisting of multiple modules with clearly defined scopes.

Check out this great article by my colleague Jose Angel Zamora describing the architecture of BestSecret’s apps 👇

This is fertile ground for parallelization because you can easily break the work down into smaller tasks. GitLab can then distribute those smaller tasks among the runners in our pool, letting us use the full capacity of our infrastructure.

Below, you can see one of our pipelines where all tests run concurrently. Bear in mind that some steps might need to run sequentially, such as the SonarQube analysis, which computes the coverage produced by the unit tests.
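For illustration, a simplified sketch of this layout in GitLab CI could look like the following. The module names (core, checkout) and job names are made up; the point is that per-module test jobs share a stage and run in parallel, while SonarQube sits in a later stage so it runs after them.

```yaml
stages:
  - test
  - analyze

# Unit tests of each module are independent jobs, so GitLab can schedule them
# in parallel on different runners.
core-unit-tests:
  stage: test
  script:
    - ./gradlew :core:testDebugUnitTest

checkout-unit-tests:
  stage: test
  script:
    - ./gradlew :checkout:testDebugUnitTest

# SonarQube consumes the coverage produced by the unit tests, so it lives in a
# later stage and runs only after all test jobs have finished.
sonarqube-analysis:
  stage: analyze
  script:
    - ./gradlew sonar
```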

Test Only What Changes

By only targeting code changes in each update, we keep our test jobs efficient and relevant. Our secret weapon? A multi-module app architecture and the power of Git. They work together seamlessly to pinpoint exactly what needs testing for each merge request.

Don’t know how to do it? No worries. This can be easily done with a simple Shell script 😎. Have a look at this snippet:

Script to Detect Modules Changed (Gist)

With the names of the changed modules, you can easily tell Gradle which test tasks to execute in the triggered pipeline.
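As a rough illustration, a minimal sketch of a script along those lines could look like this. It assumes each Gradle module lives in its own top-level directory and that the pipeline runs for a merge request (CI_MERGE_REQUEST_TARGET_BRANCH_NAME is a predefined GitLab variable); adapt the details to your project layout.

```bash
#!/usr/bin/env bash
# Sketch: detect which top-level modules changed in this merge request and
# run their unit tests.
set -euo pipefail

TARGET_BRANCH="${CI_MERGE_REQUEST_TARGET_BRANCH_NAME:-main}"
git fetch origin "$TARGET_BRANCH"

# List the files changed against the target branch and keep the first path
# segment, which corresponds to the module directory in a multi-module project.
CHANGED_MODULES=$(git diff --name-only "origin/${TARGET_BRANCH}...HEAD" \
  | cut -d'/' -f1 \
  | sort -u)

for MODULE in $CHANGED_MODULES; do
  # Only treat directories that contain a Gradle build file as modules.
  if [ -f "$MODULE/build.gradle" ] || [ -f "$MODULE/build.gradle.kts" ]; then
    echo "Running unit tests for changed module: $MODULE"
    ./gradlew ":$MODULE:testDebugUnitTest"
  fi
done
```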

Tag the Jobs According to the Needs

Finally, if your CI/CD infrastructure comprises different types of runners with different computing capabilities, you’ll want to use the right runner for the right job. As we previously saw, executing instrumented tests entails spinning up an emulator, which, besides requiring Nested Virtualization, is very expensive in terms of CPU/memory.

Make sure you use the tagging mechanism GitLab offers to run each job on the runner that can best handle it. Using tags is also a way to parallelize your jobs, and it allows you to scale your infrastructure horizontally by not always relying on the same type of runner.

We use a combination of Azure VMs and bare-metal runners for jobs involving emulators because they support Nested Virtualization. For other tasks such as SonarQube analysis, vulnerability scans, unit tests, etc., we rely on regular GitLab Cloud runners.
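As a sketch, that routing can be expressed with tags along these lines; the tag names, job names, and script path below are illustrative and depend on how your runners are registered.

```yaml
# Instrumented tests need an emulator, so they only run on runners registered
# with a tag that guarantees Nested Virtualization support.
instrumented-tests:
  stage: test
  tags:
    - nested-virtualization
  script:
    - ./scripts/run-instrumented-tests.sh

# Unit tests have no special requirements and can go to any regular runner.
unit-tests:
  stage: test
  tags:
    - cloud
  script:
    - ./gradlew testDebugUnitTest
```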

Caching Strategy

In essence, caching is just another optimization strategy, but it has so many levers to pull that it is a topic worth dedicated attention.

GitLab comes with a powerful caching mechanism that can significantly reduce the duration of your Gradle tasks. The GitLab Caching documentation is very comprehensive and packed with great advice we tried to follow.

This is how we leverage the caching to speed up the pipelines.

Gradle Build Cache is your friend

To use the Gradle Build Cache, launch your build/test with the --build-cache and --gradle-user-home parameters so the cache is built in a folder under your control. Then use this folder to feed the GitLab CI caching mechanism so you don’t have to start a Gradle build from scratch on every pipeline run.
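A minimal sketch of what this can look like for a single job, assuming the Gradle user home is redirected to a .gradle folder inside the workspace so GitLab can pick it up:

```yaml
unit-tests:
  script:
    # --gradle-user-home keeps Gradle's caches inside the workspace,
    # --build-cache enables the Gradle Build Cache itself.
    - ./gradlew --build-cache --gradle-user-home .gradle testDebugUnitTest
  cache:
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - .gradle/caches
```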

Enable the GitLab Distributed Cache

In a nutshell, you first need a place to store the cache (e.g. an S3 bucket or Azure Blob Storage) and then configure it in your dedicated runners via the runner’s config.toml. You can find an example in the documentation.
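For reference, the relevant excerpt of a runner’s config.toml for an S3-backed distributed cache looks roughly like this; the bucket name, region, and credentials are placeholders.

```toml
[runners.cache]
  Type = "s3"
  Shared = true                          # share the cache between runners
  [runners.cache.s3]
    ServerAddress = "s3.amazonaws.com"
    BucketName = "android-ci-cache"      # placeholder bucket name
    BucketLocation = "eu-central-1"      # placeholder region
    AccessKey = "<access-key>"
    SecretKey = "<secret-key>"
```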

Use Multiple Caches

The GitLab cache mechanism is based on keys so you can have multiple caches with different policies (pull/push). In our case, we have two levels of caching: Project and Branch/Tag.

Pipelines Using Two Levels of Caching

We use a Project-level cache with a unique key for the static dependencies/libraries of the project shared among all pipelines. The main reason is that they don’t frequently change. In so doing, all branches can benefit from using this global cache in their jobs.

The Branch-level cache uses a key with an identifier associated with the branch, so we can narrow its scope down to the jobs running on that branch. A GitLab variable commonly used for this purpose is CI_COMMIT_REF_SLUG, which uniquely identifies the branch. In this type of cache, we store the Gradle Build Cache we mentioned in the previous point. Hence, the first execution of the pipeline for a branch runs a clean build that feeds the Gradle cache for the subsequent pipelines, similar to your local environment, where building a freshly cloned project takes longer than the subsequent builds.

This is how you can configure the two levels for a job in your yaml file:

Job Configured with Two Different Levels of Caching
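A sketch along those lines, with a fixed project-wide key for the dependencies and a branch-scoped key (via CI_COMMIT_REF_SLUG) for the Gradle Build Cache; the key names and cached paths are illustrative.

```yaml
unit-tests:
  stage: test
  script:
    - ./gradlew --build-cache --gradle-user-home .gradle testDebugUnitTest
  cache:
    # Project-level cache: a fixed key shared by every branch, holding the
    # dependencies that rarely change. This job only reads it (pull).
    - key: "project-dependencies"
      paths:
        - .gradle/caches/modules-2
      policy: pull
    # Branch-level cache: scoped to the branch and holding the Gradle Build
    # Cache, which this job both reads and updates.
    - key: "gradle-build-cache-$CI_COMMIT_REF_SLUG"
      paths:
        - .gradle/caches/build-cache-1
      policy: pull-push
```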

Adjust the Pull/Push Policy Carefully

The Cache Policy allows you to decide whether a job will only read the cache or also update it. While you’ll want your jobs to benefit from the cache, you probably don’t want all of them to update it after they finish: in big projects such as ours, updating the cache might take a long time 🐌, adding some precious minutes to the overall pipeline time. Choose smartly which jobs will be updating the cache and switch the rest to a Pull policy.
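One way to wire this up is to define the cache once in a hidden job with a pull policy and let a single designated job override the policy to pull-push; the job names below are made up.

```yaml
# Shared cache definition: read-only (pull) by default.
.gradle-branch-cache:
  variables:
    GRADLE_USER_HOME: "$CI_PROJECT_DIR/.gradle"
  cache:
    key: "gradle-build-cache-$CI_COMMIT_REF_SLUG"
    paths:
      - .gradle/caches
    policy: pull

# Most jobs only consume the cache...
lint:
  extends: .gradle-branch-cache
  script:
    - ./gradlew --build-cache lintDebug

# ...while one designated job pays the cost of uploading the updated cache.
assemble:
  extends: .gradle-branch-cache
  cache:
    policy: pull-push    # overrides only the policy; key and paths are inherited
  script:
    - ./gradlew --build-cache assembleDebug
```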

Metrics after migration

So, let’s cut to the chase and look at the numbers. Although the new CI brought many benefits (stability, resilience, isolation, etc.), the most interesting metric is how much the pipeline time decreased after the migration. The following chart was built from a sample of pipelines run over a 15-day period and shows the average duration as well as the standard deviation.

Metrics of Jenkins vs GitLab Pipelines

As you can see in the chart, the average time has been reduced by ~13% for the Subtask pipelines and by ~27% for the Feature ones. The standard deviation for the Feature pipeline has also decreased, which means that the durations are closer to the mean time.

Another interesting effect is that the standard deviation of the Subtask pipelines’ duration increased a little after migrating. Although the sample taken might explain this, it’s also because we’ve adopted a new method of running tests exclusively for the modules with modifications. Different subtasks introduce changes in different modules, which necessarily results in more variability in the duration of this type of pipeline.

Five Key Lessons We Learned

Finally, let’s swiftly reflect on the journey and share some takeaways we learned, sometimes the hard way.


1. Aim for No Big Bangs

Migrating a CI/CD system while the teams keep working on the app means that both systems, Jenkins and GitLab CI, have to live together for some time, and that’s ok. Do not aim for a Big Bang; instead, begin with the simplest possible pipeline (maybe a single build plus unit tests) and let the teams start using it. You’ll receive invaluable feedback and test your assumptions against the real world. At the same time, developers will progressively get used to the new system and learn it bit by bit. As David Farley writes in the last chapter of his great book Modern Software Engineering:

Engineering is about making rationally informed decisions, often with incomplete information, and then seeing how our ideas play out in reality based on the feedback that we gather from real-world experience.

2. Do not Underestimate the Complexity

Migrating a complex CI system is not a low-hanging fruit. It’s quite easy to fall into the Planning Fallacy and underestimate the work required to replicate the existing jobs (build, test, deploy, etc.). If the current system has been working for some time, I bet it has a bunch of hacky workarounds and patches you’ll discover during the migration. Allocate time for the unknown unknowns and manage expectations with your stakeholders.

3. Challenge your Assumptions

During the process, we introduced optimizations that didn’t bring any benefit, although they looked like great ideas up front. For example, increasing the number of emulators beyond a certain point doesn’t bring any significant reduction in the integration test time. Depending on the type of test (e.g. e2e, unit, etc.), the optimum number of emulators is different for us. Measure things to challenge your assumptions: without measuring the duration of the jobs, it is impossible to prove your hypotheses right or wrong. I assure you that you’ll be surprised more than once.

4. Your Architecture Matters (a lot)

There are approaches we couldn’t possibly take if our app were a single-module monolith. The architecture of your app has a direct impact on the CI/CD. Thanks to having a decoupled multi-module architecture, we can run only the tests of the changed modules, thereby optimizing the jobs of our Subtask pipeline.

5. Developers Must Own the Pipeline

At BestSecret, the app teams are responsible for their CI/CD. While we occasionally seek assistance from our incredible DevOps team, we maintain full ownership of our CI/CD infrastructure. The pipeline serves as a cornerstone of our software development process, dictating when features, bug fixes, refactors, and more are ready to be released to our users. There are no shortcuts: developers have to know how it works, how to fix problems, and how to evolve the system to fit their future needs.

Conclusion

We’ve reached the first stop in our journey. The migration of our Android CI to GitLab is now complete, but this is just the beginning. There’s plenty of room for improving our CI/CD, particularly in dealing with the occasional test flakiness, further speed optimizations, etc.

It’s an ongoing process, a continuous journey that never really concludes. But looking back, we know we’re in a much better position to keep evolving such an important part of our daily workflow.

If you’ve enjoyed this content so far, be sure to hit that Follow button to stay tuned for more updates on this blog. Drop a comment to share your experiences with your Android CI, or if you have questions, tips, or anything else you’d like to add to the conversation. Your input is invaluable, and we’re excited to keep this dialogue going! 🙌

This journey is the outcome of exceptional teamwork by great engineers. A heartfelt thank you to everyone who contributed, especially to my colleagues Ismael Torres and Jose Angel Zamora.

Victor Caveda
BestSecret Tech

Principal Engineer @BestSecret | Formerly @PhoenixContactE, @Panda_Security