How we manage 200 Mac Minis for iOS CI at Agoda

Vitaliy Gorbachov
Agoda Engineering & Design
14 min read · Jul 22, 2022

At Agoda, we run over 23,000 automated tests per build chain, with a chain averaging 30 minutes, to maintain the quality of our iOS application and provide the best experience for our users. On our iOS pipeline alone, we run around 800 builds daily, which requires more than 120 hours of CI time.

Like any other company, we faced the choice of managing our own Mac infrastructure or paying someone else to do it. Running hundreds of builds as quickly as possible also brings new problems at our scale. We chose to manage our own hardware.

This article shows how we manage around 200 Mac machines in our server rooms. It gives a general overview of our hardware inventory, explains how we maintain it, and briefly looks at the software we use to utilize it.

How it all started and the problem we solved

We started with a single Mac machine to carry out routine iOS build and release processes. The model was an Apple Mac Mini 2010 that still had an optical drive and an easily removable bottom cover, which made it simple to repair damaged components or upgrade the RAM.

Mac Mini Mid-2010 (on the MacBook Pro 2022)

Later, in 2016, when we began switching from manual to automated testing, the farm (as we refer to our collection of Mac Mini infrastructure) began to grow. Additionally, the cylindrical 2013 Mac Pro rounded out the first iteration of our decent-sized infrastructure.

Who needs KVM when you can manually plug in the keyboard and screen to every machine?

Unfortunately, our automated tests quickly outgrew the farm's resources. Running a full regression of all available tests would now take up to a week. We had to make a choice: keep growing the farm and develop better software to utilize it, or move to a cloud provider. Back then, there weren't many cloud companies offering macOS, so the choice was much easier.

Even those available were rather pricey and far from our main development center in Bangkok.

Total number of automation tests by the year-end

Review of the hardware in our device farm

The workhorses of our fleet are Mac Mini 2018s with 64GB of RAM and a 256GB SSD. However, over half of our inventory is still Mac Mini 2014s with 16GB of RAM and a 256GB HDD; these will be the first to go when we make hardware upgrades. We have also added a few Apple Silicon machines for testing, but they are in the minority.

Hardware distribution chart

Xcode and its SDKs are known to be quite large, and you can easily run out of space if you want to keep several versions around. We learned that it's better to invest in a disk of at least 512GB.

For the physical infrastructure, we generally use popular off-the-shelf solutions. Our rackmounts are Sonnet RackMac minis, each of which accommodates two Mac Minis in a single rack unit.

Sonnet RackMac mini

Besides Mac Mini rackmounts, each of our racks has:

  • A UPS (uninterruptible power supply)
  • A KVM switch (a device that connects a keyboard, screen, and mouse to multiple computers)
  • USB power hubs (required by the KVM adapters)
  • Network switches

Each of our racks holds up to 40 Mac Minis for easy maintenance. Each rack could physically accommodate more machines, but we picked 40 because our KVM switches (an ATEN KVM-over-IP solution) only have 40 ports. The UPS prevents our hardware from shutting down abruptly during a power disruption.

One of our six racks that house 40 2018 Mac Minis, back and front

Much care has gone into cable management. Each machine has five to six cables attached to it: power, Ethernet, an external SSD, a KVM HDMI cable, a KVM USB cable, and a rackmount USB cable (for the front USB ports).

As for the KVM, we find it very helpful for logging in to our machines and seeing what's on their screens. They sometimes become unresponsive over the network, and the KVM lets us check on them without going into the server room.

One of the KVM switches that we can access even during work from home times (yes, most of these are not running macOS)

The KVM adapter itself has an HDMI cable and two USB cables. One USB cable goes to the machine for I/O devices, and the other goes into the USB power hub. The powered hub is required so the adapter can send signals to the machine before the machine starts supplying power to its built-in USB ports, which is essential if you need to boot a Mac Mini into recovery mode over the KVM.

ATEN KVM adapter for our precious Mac machines

The importance of server room cooling cannot be overstated. An industrial-grade air conditioner is essential to keep the room cool; we typically keep the server room at 20–22°C. However, there were a few additional issues to address. The rack mounts we use are designed to let air flow through them, but that airflow wasn't always there, since our air conditioners are installed in the ceiling.

Airflow design of the rack mounts we use

The fact that we packed our shelves tightly to save space most likely didn't help either. Altogether, this led many Mac Minis to overheat during peak times and throttle their CPUs to cope, significantly reducing performance. Because of how cooling works on these machines, the CPU stays at reduced performance until it has thoroughly cooled down for some time or the device is restarted manually, and neither was a good option for us.

Our tightly packed rack

To increase airflow in the room, we installed industrial-grade fans that alternate on and off at set hours. In doing so, we considerably reduced overheating, and it is now rare for our Mac Minis to enter throttle mode.

Industrial Fan to maintain airflow in the server room

To ensure that our Mac machines always stay at a suitable temperature, we set up monitoring dashboards and threshold alerts in Grafana. Each Mac Mini reports its current CPU temperature to Prometheus every minute via the Telegraf metrics collector.

Example of the dashboard to monitor CPU temperature (green lines on the heatmap are barely heating M1s)
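For illustration, here is a rough Swift sketch of what such per-machine reporting could look like if written by hand. Our production setup uses Telegraf rather than custom code; the osx-cpu-temp helper path and the Pushgateway URL below are assumptions for the example, not our actual configuration.

```swift
import Foundation

// Read the CPU temperature by shelling out to a helper tool
// (hypothetical path; osx-cpu-temp prints something like "61.2°C").
func currentCPUTemperature() -> Double? {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/local/bin/osx-cpu-temp")
    let pipe = Pipe()
    process.standardOutput = pipe
    do { try process.run() } catch { return nil }
    process.waitUntilExit()
    let output = String(data: pipe.fileHandleForReading.readDataToEndOfFile(),
                        encoding: .utf8) ?? ""
    return Double(output.filter { "0123456789.".contains($0) })
}

// Push one sample in the Prometheus text format to a Pushgateway
// (hypothetical URL), labeled with this machine's hostname.
func pushTemperatureSample() {
    guard let celsius = currentCPUTemperature() else { return }
    let host = ProcessInfo.processInfo.hostName
    let url = URL(string: "http://pushgateway.internal:9091/metrics/job/mac_ci/instance/\(host)")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.httpBody = "cpu_temperature_celsius \(celsius)\n".data(using: .utf8)
    URLSession.shared.dataTask(with: request).resume()
}

// Report once a minute, mirroring the per-minute Telegraf collection interval.
Timer.scheduledTimer(withTimeInterval: 60, repeats: true) { _ in pushTemperatureSample() }
RunLoop.main.run()
```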

Test utilization: Unit vs. UI tests

Before we examine how we utilize the 200 Mac computers, let's briefly discuss testing. As previously mentioned, we shifted from manual to automated testing in 2016. Since then, the number of tests has grown exponentially, reaching slightly over twenty thousand by the end of 2020.

Most of these are unit tests, which come in many different flavors, like general unit, snapshot, and integration tests. This post will not go into depth on the iOS testing approach. Nonetheless, it is vital to note that unit tests should precede UI testing in most iOS projects because they are efficient, reliable, and do not depend excessively on the simulator's GUI. They are also easy to scale; it takes little more than adding a few build machines, and if they get too large, there is always the simple option of splitting the test targets to enable parallel execution.

iOS tests type distribution at our project

iOS UI tests are an effective tool but generally come with more overhead. More resources are needed to run them because of the simulator’s UI, and scaling is more difficult. Writing and maintaining them is considerably more complex, and they also break more often with each new release of Xcode.
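As a simplified, hypothetical illustration of that difference (not code from our project), the first test below exercises a small type directly in-process, while the second has to launch the whole app in a simulator and drive its interface. The type and the accessibility identifiers are made up for the example.

```swift
import XCTest

// Hypothetical production type, defined inline so the sketch is self-contained.
struct DiscountCalculator {
    func finalPrice(original: Double, discountPercent: Double) -> Double {
        original * (1 - discountPercent / 100)
    }
}

final class DiscountCalculatorTests: XCTestCase {
    // Unit test: fast, deterministic, runs without touching the simulator's GUI.
    func testAppliesPercentageDiscount() {
        let calculator = DiscountCalculator()
        XCTAssertEqual(calculator.finalPrice(original: 200, discountPercent: 25), 150)
    }
}

final class SearchFlowUITests: XCTestCase {
    // UI test: boots the app in the simulator and taps through real screens,
    // which costs far more time and resources per test.
    func testSearchShowsResults() {
        let app = XCUIApplication()
        app.launch()
        app.searchFields["Destination"].tap()
        app.searchFields["Destination"].typeText("Bangkok")
        app.buttons["Search"].tap()
        XCTAssertTrue(app.cells.firstMatch.waitForExistence(timeout: 10))
    }
}
```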

For many years our test base kept growing, and we still have many relatively old tests that aren't up to today's standards, with issues like shared resources and poor success rates caused by external dependencies. We also had to introduce retries, because some of our UI tests are pure end-to-end tests with multiple external dependencies.

With all of these problems, it was hard for us to use the parallel simulator runs introduced in Xcode 10. We tried our own version of that for years using tricks with xcodebuild. Still, after running a single-simulator experiment and noticing significant improvements in UI test stability, we switched to one simulator per macOS instance.

Hardware utilization

Now that we have a general idea of how our tests work, it's time to look at how the hardware fits in. We split our 200-machine inventory into two categories: bare metals and virtualization. Let's start with the bare metals.

Hardware utilization chart

Bare Metals

The bare metals category is small and consists of machines running plain macOS. Its 35 devices account for around 15% of our inventory, the majority being 2018 Intel Mac Minis. With the migration of our project to natively support Apple Silicon finally completed, we have also started evaluating a few Apple Silicon machines, such as the 2020 M1 Mac Mini and the 2022 M1 Ultra Mac Studio.

One of our newest toys we are evaluating doesn’t have a rackmount yet.

These machines act as build agents in our CI system. The bare metals handle the regular jobs of building and testing, where everything happens on a single machine: the build agent that the CI system selects. We use them for everything that requires macOS but does not require iOS Simulator scaling, such as build validations, beta publishing, and unit tests.

We primarily used TeamCity build configurations to manage these machines. Our organization recently started migrating to GitLab for source code management, including its CI/CD. However, GitLab does not support many of TeamCity's infrastructure management features, and running all of the automation bash scripts over SSH wasn't scalable.

As a result, we started looking at various methods of managing these machines. Around the same time, our company started using JAMF to manage our developers' machines. It's a tool built around the official Apple Business Manager system, and one of its best features is that it lets you configure any device out of the box through the Device Enrollment Program and manage it directly afterward.

Managing Mac Minis using JAMF

Just by connecting a machine to a LAN cable, it gets provisioned with anything you want installed. Additionally, scaling up automation and management is simple thanks to official Apple support. But what about the rest of the machines?

Why we do virtualization

As mentioned before, UI tests are typically difficult to scale, particularly in our case, where we must also account for their unreliability. Of course, you can always use a scaling approach similar to unit tests: split your UI test target into many small ones and utilize many different build agents. That approach has its ups and downs.

However, in 2018 we also wanted a similar way of running UI tests for Android and iOS that would be easy to scale at the infrastructure level, so we went with the device farm approach. The general idea of a device farm is to have a set of devices hosted somewhere, whether simulators/emulators or actual devices, and connect to them to execute your UI tests.

General flow of utilizing device farm from CI

Android makes it relatively easy to deploy and use its emulators. You can launch them with the QEMU virtualizer on any Linux machine that supports the Kernel-based Virtual Machine module. However, that is not the case on iOS: there are no emulators per se, and to use the iOS Simulator from Xcode, you need a full-blown macOS installation.

The iOS device farm part of our infrastructure uses the same QEMU virtualizer, powered by Kernel-based Virtual Machine technology, but instead of iOS we have to virtualize macOS. It's essentially a set of machines we provision into a Kubernetes cluster, onto which we deploy macOS snapshot images with the Xcode and Simulator versions we need.

Software stack that we deploy to our Kubernetes cluster to get iOS simulators for UI tests

We utilize both 2014 and 2018 Mac Minis in this part of the infrastructure, which takes up 85% of our inventory. The only essential requirement here is Apple hardware. The primary resource for good iOS Simulator performance on virtualized macOS is RAM, and it needs quite a lot of it, which is why we spec our 2018 Mac Minis with 64GB of RAM.

With this, we achieved a similar approach for both the Android and iOS UI testing farms at Agoda, including deployment, monitoring, and utilization, even though the hardware and the problems around it are entirely different.

Drawbacks of in-house device farm

  • The cost of maintenance

There is a general maintenance cost for this in-house solution. You will need developers to support it or even build teams around it, which is the biggest deal-breaker for many companies and is a reasonable concern.

In our case, it took a dedicated team of three people around a year to build the initial solution for both mobile platforms. Since then, we have restructured the team into Engineering Tooling. Along the way, the team has delivered improvements, optimizations, and new features for the mobile solution, and its overall scope now encompasses other platforms.

  • Concerns around virtualization

Performance is the usual concern around virtualization. It's often said that virtualization with Kernel-based Virtual Machine delivers bare-metal-level performance, and for the most part that's a valid claim, until it isn't. The place where it isn't is native iOS builds. Over time, we experimented with virtualizing our whole inventory, including build agents, as we tried to achieve an entirely ephemeral macOS infrastructure.

Even with every optimization we tried, we consistently lost at least 40% of the time on each build. This is a well-known issue in the industry: quite a few companies have moved from virtualization back to bare metal and vastly improved their build performance. It is, of course, less of a problem when the VM only provides an iOS simulator for UI testing, since that generally does not require much CPU, but it is definitely something to keep in mind.

Generic build time tests of bare metal vs virtualization

Although we successfully built and continuously supported a Kubernetes cluster with Apple hardware for our needs, virtualizing macOS using Kubernetes to scale for UI tests is probably not something we would recommend nowadays.

There are many pain points in provisioning Mac hardware, especially the 2018 models, due to the T2 security chip. For example, you need to buy an external SSD for every T2-powered machine in order to boot Linux on it, which you need for Kubernetes provisioning.

It also requires a lot of specialized knowledge, such as general virtualization technologies and even the Linux kernel, which is difficult to find in our field. Not to mention that each new software and hardware release from Apple will break something around such custom tooling; the latest Apple Silicon machines are one example.

There are much easier solutions available on the market, like Anka, which can virtualize macOS at scale via a CLI on bare metal machines using the native Hypervisor framework. And if you want to save on software, Apple recently introduced an easier way to create and manage macOS virtual machines with Swift code on macOS itself.

It uses the same native Hypervisor framework but with a better abstraction in the form of the Virtualization framework. The only downside is that it requires Apple Silicon. And while it goes against our initial goal of having similar tooling for Android and iOS, going the official way is always best with Apple.
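To give a feel for what the official way looks like, here is a minimal sketch of booting a macOS guest with the Virtualization framework. It assumes an Apple Silicon host on macOS 12 or later and a VM bundle (auxiliary storage, hardware model, machine identifier, and disk image) that was already created during macOS installation; the bundle path and file names are placeholders, not our actual tooling.

```swift
import Foundation
import Virtualization

// Build a configuration for a macOS guest from a previously installed VM bundle.
func makeMacVMConfiguration(bundleURL: URL) throws -> VZVirtualMachineConfiguration {
    let config = VZVirtualMachineConfiguration()
    config.cpuCount = 4
    config.memorySize = 8 * 1024 * 1024 * 1024  // 8 GiB

    // macOS guests require the Mac-specific platform configuration and boot loader.
    let platform = VZMacPlatformConfiguration()
    platform.auxiliaryStorage = VZMacAuxiliaryStorage(
        contentsOf: bundleURL.appendingPathComponent("AuxiliaryStorage"))
    platform.hardwareModel = try VZMacHardwareModel(
        dataRepresentation: Data(contentsOf: bundleURL.appendingPathComponent("HardwareModel")))!
    platform.machineIdentifier = try VZMacMachineIdentifier(
        dataRepresentation: Data(contentsOf: bundleURL.appendingPathComponent("MachineIdentifier")))!
    config.platform = platform
    config.bootLoader = VZMacOSBootLoader()

    // Attach the guest disk image and a NAT network interface.
    let disk = try VZDiskImageStorageDeviceAttachment(
        url: bundleURL.appendingPathComponent("Disk.img"), readOnly: false)
    config.storageDevices = [VZVirtioBlockDeviceConfiguration(attachment: disk)]
    let network = VZVirtioNetworkDeviceConfiguration()
    network.attachment = VZNATNetworkDeviceAttachment()
    config.networkDevices = [network]

    try config.validate()
    return config
}

// Boot the guest; VZVirtualMachine must be used from a single queue (main here).
do {
    let config = try makeMacVMConfiguration(
        bundleURL: URL(fileURLWithPath: "/Users/ci/CIVirtualMachine.bundle"))
    let vm = VZVirtualMachine(configuration: config)
    vm.start { result in
        if case .failure(let error) = result { print("Failed to start VM: \(error)") }
    }
    RunLoop.main.run()  // keep the process alive while the guest runs
} catch {
    print("Invalid VM configuration: \(error)")
}
```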

As for us, for the longest time we used an open-source minimal Kubernetes solution. Recently, though, we switched to a custom in-house Kubernetes provisioning solution that fully automates provisioning of both the 2014 and 2018 Mac Minis.

The software we run on top of it will be discussed in a different article, but let's briefly examine how it works here to better understand device farm utilization.

Let’s run a Marathon.

Device farms are usually an excellent solution for scaling UI tests, especially when paired with a good test runner. In our case, that test runner is an open-source tool called Marathon. Its development started at Agoda in 2018, when we were trying to solve the same problem mentioned before: similar tooling for Android and iOS. The tool has since moved outside the company and is now one of the most popular open-source test runners for mobile.

With a similar configuration for both mobile platforms, it works around flakiness (by setting up retries) and does a fantastic job of parallelizing across a considerable number of devices. Most importantly, it does this for iOS as well. For example, today we have around one thousand UI tests, and running them on a single simulator would take more than 24 hours. But by splitting 70 iOS Simulators between the UI test build configurations in one regression chain, they take less than 25 minutes to complete; 24-plus hours of single-simulator time divided across 70 simulators works out to roughly 20 minutes, which matches what we see in practice.

Marathon test report of one of our build configurations running 80 tests in less than 10 minutes

You only need to provide Marathon with a few things. First, you need the app binaries: we give it the derived data folder created by xcodebuild's build-for-testing mode, and Marathon parses the list of tests in it and executes all of them unless filters are specified. Second, you need a list of IP addresses of macOS machines, for which we provide our virtualized macOS machines from the Kubernetes cluster.

Lastly, it requires either an SSH key or a password to connect. Using these, Marathon distributes the app binaries to all of those machines and utilizes their iOS simulators, cleverly parallelizing all the tests to optimize the total execution time. As shown in the screenshot above, Marathon executes lengthy tests first, but all of this is quite configurable.

General Marathon test runner flow

Conclusion

And this concludes our overview of how we manage our iOS CI hardware at Agoda. The actual total is currently 228 machines, which power more than 800 builds daily with more than 23,000 tests per build chain. As we add new Apple Silicon machines, we will most likely decommission the less powerful 2014 models.

As mentioned before, we recently completed the migration of our iOS project, which has more than 300 modules and 2 million lines of code, to support fully native development on Apple Silicon without any extra compatibility layer. This finally enabled us to start evaluating Apple Silicon as a replacement for our bare metal CI. We will talk about the challenges we faced during this migration in another article.

