Migrating iOS GitHub Actions to Self-Hosted M1 Mac Runners

Adrian Śliwa | Engineer, iOS

Since creating CI/CD automation with GitHub Actions, we’ve been happy with the improvements and simplification it has brought to the delivery of new builds for the Whatnot iOS app. GitHub Actions has become an integral part of our workflow, handling not only building and delivering releases but also tasks such as nightly pull request testing and running UITests. However, while everything seemed to be running smoothly, one significant issue posed a challenge for us: our workflows were notably slow.

In our iOS repository, we used the default GitHub macOS-based machines (each equipped with a 3-core CPU (x86_64) and 14GB of RAM). When executing workflows to build and test our PRs, though, jobs often took longer than 50 minutes to complete. The slow pace made it painful to re-run a job after fixing a typo in a PR or when a flaky unit test failed.

Improving caching

In our project, we use the Swift Package Manager to manage dependencies. Our initial attempt at improving build performance centered on optimizing caching: we keyed the cache on the hash of the Package.resolved file and stored SPM’s SourcePackages directory as its contents, aiming to skip dependency resolution on subsequent runs and save time.

- name: Restore SPM packages from cache
  id: spm-cache-restore
  uses: actions/cache/restore@v3
  with:
    path: DerivedData/SourcePackages
    key: ${{ runner.os }}-spm-${{ hashFiles('Whatnot.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved') }}
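
For completeness, the corresponding save step is not shown above. A sketch of how it can be wired up, using the documented outputs of actions/cache/restore@v3 together with actions/cache/save@v3 (the exact step layout here is illustrative rather than taken from our workflow):

- name: Save SPM packages to cache
  if: steps.spm-cache-restore.outputs.cache-hit != 'true'
  uses: actions/cache/save@v3
  with:
    path: DerivedData/SourcePackages
    key: ${{ steps.spm-cache-restore.outputs.cache-primary-key }}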

While this solution was functional, regrettably, we did not observe the significant improvements in build times that we were expecting.

An alternative approach

We knew that our project could be built locally, on our M1/M2-based laptops, in under 10 minutes. So it was especially frustrating that the default GitHub runners were so slow. We decided to try self-hosted runners and throw a more powerful machine at the problem.

To put this approach to the test, we configured a Mac mini with an M2 chip that was available at our office. The setup process itself is remarkably straightforward. Beyond installing Xcode and the necessary tools required by the workflows, you’ll need to download the actions-runner code, unzip it, perform the configuration, and initiate the runner.
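
The steps mirror GitHub’s own instructions for self-hosted runners; roughly (the release version, organization, repository, and token below are placeholders, not our actual values):

mkdir actions-runner && cd actions-runner
# Download and unpack an arm64 build of the runner (version is illustrative;
# use the latest release from github.com/actions/runner/releases)
curl -o actions-runner-osx-arm64-2.308.0.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.308.0/actions-runner-osx-arm64-2.308.0.tar.gz
tar xzf ./actions-runner-osx-arm64-2.308.0.tar.gz
# Register the runner with the repository and start listening for jobs
./config.sh --url https://github.com/<org>/<repo> --token <registration-token>
./run.sh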

The results exceeded our wildest expectations: building and testing our application on that Mac took around 12 minutes. At that moment, we were certain this was the path we needed to follow. The remaining question was, where should we host these machines?

Hosting the runners

Our iOS team is distributed across various regions worldwide. One potential solution for hosting our runners was to deploy machines in our different office locations: even if power were disrupted in one location, runners elsewhere would keep working, and the likelihood of all offices experiencing simultaneous outages is minimal. However, this approach posed challenges. With a significant number of team members working remotely, keeping every runner constantly online would have been an operational burden. Furthermore, the ability to easily shut down EC2 instances, should GitHub offer M1/M2 Macs in the future, factored into our decision-making. In light of these considerations, we ultimately opted for Amazon EC2 Macs.

Thanks to our infrastructure team, creating these instances has been exceptionally straightforward. Currently, we have two instances in place, and scaling up the number if the need arises is easily achievable. The configuration process for these machines closely resembled that of a typical Mac, with just a few additional steps required to address certain issues we encountered, which we lay out next.

Challenges

Storage

Despite our configuration assigning 500GB of storage to these machines, not all of this space was accessible to logged-in users. A significant portion was allocated to other partitions that utilized only a small fraction of the available capacity. When we reached the capacity limit and one of our runners went offline, we were compelled to take action. Fortunately, we were able to fix the situation using diskutil. This tool enabled us to resize our partitions:

diskutil list                        # find the identifier of the disk where macOS is installed
diskutil repairDisk /dev/disk1       # required before resizing the container
diskutil resizeContainer disk1s2 0   # 0 means use all available space
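
A quick sanity check afterwards (not part of the original fix, just a way to confirm the container picked up the extra space) can be done with standard tools:

df -h /                                      # free space on the boot volume
diskutil info disk1s2 | grep -i container    # container total and free space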

After implementing this modification, we encountered no further issues with available space.

DNS

The next challenge arose during the migration of the job responsible for running our nightly UITests. Following the migration, all of our tests began to fail. Upon investigating, it became evident that the application was unable to establish a connection with our servers. An error message reading Could not resolve host pointed to a DNS problem: the DNS server was unable to resolve the whatnot.com domain, even though other domains resolved as expected. To address this, we had to configure a custom DNS server. After deliberation, we opted for the public DNS server provided by Cloudflare:

networksetup -setdnsservers Thunderbolt\ Ethernet\ Slot\ 0 1.1.1.1
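
To double-check the change (this isn’t part of the original setup, just a quick verification; the network service name must match what networksetup -listallnetworkservices reports on the instance):

networksetup -getdnsservers Thunderbolt\ Ethernet\ Slot\ 0   # should print 1.1.1.1
dscacheutil -q host -a name whatnot.com                      # should now resolve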

After implementing this configuration, the application’s connection with our servers was successfully restored.

Terminal

After initiating the runner, it became necessary to find a method to detach from the terminal without halting the runner itself. Additionally, we observed occasional cases where the runner command finished with errors, requiring an automated restart mechanism. Furthermore, situations arose where a runner would remain offline until manually relaunched. To tackle these challenges, we devised a daemon configuration and leveraged launchctl for its management.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.whatnot.selfHostedRunner</string>
    <key>ServiceDescription</key>
    <string>GHA M1 Self Hosted Runner</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/user/Developer/actions-runner/start.sh</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/self-runner.log</string>
    <key>UserName</key>
    <string>user</string>
</dict>
</plist>

Then we just had to run the command below, and we could forget about keeping the runner alive.

launchctl load /Library/LaunchDaemons/com.whatnot.selfHostedRunner.plist
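
The plist invokes a start.sh script. A minimal sketch of such a script, assuming all it needs to do is launch the runner in the foreground so launchd (via KeepAlive) can supervise and restart it, could look like this:

#!/bin/bash
# Minimal sketch of start.sh (assumption, not the exact script we run):
# change into the runner directory and keep the runner in the foreground
# so launchd can relaunch it whenever it exits or errors out.
cd /Users/user/Developer/actions-runner
exec ./run.sh

Afterwards, sudo launchctl list is a quick way to confirm that com.whatnot.selfHostedRunner is loaded and running.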

Versioning

Our final challenge arose from the gradual nature of our transition, with jobs being migrated one by one. This led to a disparity in Xcode versions between our self-hosted runners and the default GitHub Actions runners. As a result, we hit compiler errors a few times during TestFlight builds due to incompatible changes between Swift compiler versions. We addressed the issue by migrating all of our jobs to self-hosted runners. For any future transition, we would ensure the same Xcode version is installed on every machine, updating Xcode as each job is moved to its new environment.
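
One way to keep the versions aligned is to pin the active Xcode explicitly on each runner; the path and version below are illustrative rather than our actual setup:

# Select the Xcode every machine should build with (path/version illustrative)
sudo xcode-select -s /Applications/Xcode_14.3.1.app/Contents/Developer
# Confirm each runner reports the same version
xcodebuild -version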

Did it affect our costs?

Cost savings weren’t the primary driver behind our migration. Our main objective was to improve our day-to-day work, eliminating concerns about whether a PR would be merged before a release cut. We wanted fast and reliable CI/CD workflows. With these goals in mind, we performed rough calculations to determine whether hosting two M1 Mac instances on Amazon EC2 would result in increased expenses.

We started by calculating the expenses we incurred when using the Intel-based Macs from GitHub. Executing one minute of workflow on a GitHub-hosted macOS runner cost $0.08. Multiplying the roughly 1,500 workflow executions per month by 50 minutes and $0.08, we arrived at a total of about $6,000 per month. After the migration, our PR validation job incurred zero billable minutes, and the combined cost of our two AWS M1 Macs was $1,000. The transition delivered a clear cost reduction: we were able to save around $5,000 per month!

Summary

We’re very happy with our current CI/CD setup. Our transition to self-hosted runners resolved all the challenges we had been facing before. Our CI/CD pipeline has now become fast and dependable, significantly improving the workflow for every contributor to our iOS project. While we encountered a few configuration issues when setting up Macs on Amazon EC2, we managed to successfully resolve all of them. Currently, our setup is very stable, and we haven’t experienced new issues for a long time. Though our primary motivation wasn’t financial, we have realized significant monthly savings.

We strongly recommend considering a similar migration for any iOS team that isn’t yet using M1/M2 Macs for its CI/CD.
