CI/CD at SwiftKey (Part 4)

Tehmur Khan
Microsoft Mobile Engineering
5 min read · Jun 7, 2021

In this final part of the series (Part 1, Part 2, Part 3) we will cover some of the issues we faced post-migration, how we went about improving our infrastructure to address stability and scalability, and some of our learnings and plans for the future.

Infrastructure improvements

A few months after migrating our team’s CI/CD onto Azure DevOps (ADO) we faced some recurring issues. They mostly centred around the stability and scalability of both agents and emulators.

Agent stability

Our initial agents were stateful, meaning we created each agent once and then reused it across multiple runs and different pipelines. This allowed us to benefit from incremental code checkout times, but there were downsides. One major issue was that state built up on the agent over time, eventually causing out-of-storage build failures. While we attempted to clean up after each job, some state was always left behind because it was impractical to undo every change (e.g. reverting package updates). After such build failures we would need to manually redeploy the agents. Sharing state across runs also caused build failures through unexpected side effects, for example when one job updated a package but a later run required an older version.

To resolve these issues, we moved over to using one-shot self-hosted agents. These agents are configured to gracefully terminate after completing a single ADO job, which ensures the workspace stays clean between builds. Something to note is that Microsoft-hosted agents achieve this automatically, but depending on your needs, managing your own agents can offer more flexibility, as discussed earlier.

With our new one-shot setup, when an agent completes its job it sends a signal that kills its container. Nomad (a workload orchestrator used to deploy and manage containers) then recreates the container immediately, due to the count constraints defined in the agent config, adding a fresh agent to the agent pool.

Due to the ephemeral nature of these agents, a full repository checkout is performed every time a job runs on them. This is a trade-off we felt was worth making to achieve better stability.
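As a rough illustration, a one-shot agent container entrypoint could look something like the sketch below. This is not our exact setup: the PAT-based registration and the AZP_URL, AZP_TOKEN and AZP_POOL environment variables are placeholders. The key part is the agent's documented --once flag, which makes it exit after a single job so the container stops and Nomad can schedule a fresh replacement.

#!/bin/bash
# Entrypoint sketch for a one-shot agent container.
# AZP_URL, AZP_TOKEN and AZP_POOL are placeholder environment variables.
set -e

# Register the agent against the ADO agent pool.
./config.sh --unattended \
  --url "$AZP_URL" \
  --auth pat --token "$AZP_TOKEN" \
  --pool "$AZP_POOL" \
  --agent "$(hostname)" \
  --replace \
  --acceptTeeEula

# --once tells the agent to exit after completing a single job,
# so the container stops and the orchestrator can replace it.
./run.sh --once

Because the agent exits cleanly after one job, keeping the pool at the desired size becomes purely a scheduling concern — in our case, handled by Nomad's count and restart settings.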

[Diagram] How one-shot agents are assigned to jobs: 1. ADO submits a job to a free agent. 2. A free agent in the agent pool runs the job. 3. The agent terminates after finishing job execution. 4. Nomad deploys a new agent to the agent pool.

Agent scalability

Previously, we had the full Android emulator setup for every API level installed on the agents. Whilst this allowed us to avoid downloading dependencies at runtime, saving job execution time, it resulted in longer agent deployment times.

It was also not scalable: agents with high storage requirements meant we could run fewer of them while staying within our cost constraint. A trade-off had to be made, so we slimmed down the agent image by removing the emulator setup and performing it at runtime instead. This worked well for us because only our Android test run across all API levels needed multiple emulator configurations. That run happens nightly, so its total duration matters less than that of CI jobs triggered during office hours, e.g. test runs on PRs.
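For illustration, installing a system image and creating an AVD at runtime could look roughly like the following; the EMULATOR_API_LEVEL variable and the AVD name are placeholders rather than our actual configuration.

# Install the system image for the requested API level at runtime.
echo y | sdkmanager "system-images;android-${EMULATOR_API_LEVEL};google_apis;x86_64"

# Create an AVD for that API level; "echo no" skips the custom hardware profile prompt.
echo no | avdmanager create avd \
  --name "test-api-${EMULATOR_API_LEVEL}" \
  --package "system-images;android-${EMULATOR_API_LEVEL};google_apis;x86_64" \
  --force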

Emulator stability

In certain circumstances we observed that builds were not reproducible: one build might pass and a subsequent run fail, even when no code changes had been made. After digging into this, we found it was caused by the emulator package being updated whenever we redeployed agents. As new emulator versions are released, certain parameters may be deprecated or removed, so the same command can behave differently across versions. In our case the emulator failed to start up.

The package was being updated because we used the following command:

# Installs the latest release of the emulator tool
RUN echo y | sdkmanager emulator

This installed the latest version of the emulator package each time we built our agent image. Our solution was to set up these emulator packages manually, pinning them to specific versions via the direct download URL.

# Pin a specific version of the emulator tool
RUN wget "https://dl.google.com/android/repository/emulator-linux-${EMULATOR_VERSION}.zip" -O emulator.zip \
    && unzip emulator.zip -d ${ANDROID_HOME} \
    && rm emulator.zip

Emulator scalability

When we moved to creating Android Virtual Devices (AVDs) at runtime (as opposed to bundling them in the agent image as mentioned before) we faced issues where the VM hosting the agents would crash, bringing down all agents hosted on it. This was highly problematic as it caused downtime until those agents were brought back up by our Cloud Infrastructure (CLI) team.

We needed to create emulators at runtime because our nightly tests require emulators for multiple API versions. Interestingly, Microsoft-hosted agents did not suffer from this issue. We couldn't get to the crux of the crash because no logs were available when it occurred. The solution we derived was a pseudo-caching mechanism: we built the emulator for each API level on Microsoft-hosted agents, packaged the system images and AVD files, and published them to Azure Artifacts feeds. We could then download these images on demand and use them anywhere we needed an emulator. This was a more complete solution that also brought other benefits, such as using known emulator versions.
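As a minimal sketch of this pseudo-caching idea, assuming Azure Artifacts universal packages are used as the storage format (the feed name, package name, version, API level and organisation URL below are all placeholders), the publish side could look roughly like this:

# On a Microsoft-hosted agent: install the system image and create the AVD.
echo y | sdkmanager "system-images;android-30;google_apis;x86_64"
echo no | avdmanager create avd --name "cached-api-30" \
  --package "system-images;android-30;google_apis;x86_64" --force

# Package the AVD files together with the system image they use.
mkdir -p package
tar -czf package/avd-api-30.tar.gz \
  -C "$HOME/.android" avd \
  -C "$ANDROID_HOME" system-images/android-30

# Publish the archive to an Azure Artifacts feed as a universal package.
az artifacts universal publish \
  --organization "https://dev.azure.com/<org>" \
  --feed emulator-cache \
  --name avd-api-30 \
  --version 1.0.0 \
  --path package

Self-hosted agents could then pull the matching package with az artifacts universal download and unpack it before starting the emulator, instead of creating the AVD from scratch on the agent.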

Learnings and next steps

To summarise, there were a few learnings we wanted to share from our experience.

Emulator problems are a perennial issue, and we still face adb connection failures and timeouts.

We also learnt that CI/CD migrations and updates need to be phased carefully, as issues can have a wide impact. Take your time when rolling out new CI changes: test them over time in a development environment or with a small use case before proceeding, because if issues reach the production CI, every developer could be affected.

CI infrastructure investment should be a continuous effort. Even if things are working smoothly there will always be ways to improve and optimise. Investing more time in this space can result in big gains for the whole team.

Within the SwiftKey Android team, some areas we would like to invest in include:

  • Sharding our tests to allow Android tests to run in parallel on multiple emulators.
  • Automating more pre-release checks. Some parts of our release process still rely on manual checks, such as verifying various stability data and telemetry metrics. We are in the process of automating these checks via back-end queries integrated into our release pipeline, to ensure we have the core functionality working as expected before rolling out the release further.
  • Improving testing between development and production CI deployment. This will give us an opportunity to test changes in development before deploying them to production, avoiding the need to roll back if there are any issues.
