Moving Our Unity DevOps to the Cloud

Khalid
Published in Mighty Bear Games
7 min read · Jan 17, 2023
Photo by An Tran on Unsplash

For most of our game titles, we’ve relied on our on-premises Jenkins duo, a Windows machine and a Mac Mini, to serve our DevOps needs. For Disney Melee Mania, we pushed them to the very limit and they were clearly a bottleneck in our development, so in 2022 we began transitioning our DevOps to the cloud. Here are the reasons why we did it, and our progress so far.

Scalability

We used Jenkins to run unit tests and integration tests, and to make playtest builds, nightly builds, and release builds. Any change to our development branch requires a pull request (PR). Before a PR can be merged, its tests need to pass and the latest develop branch needs to be merged into it. As the number of tests grew, and at times when there were many active PRs, it could take up to 3 hours for Jenkins to complete the tests. Also, whenever a PR was merged, other PRs would then need to merge in the latest changes from the develop branch and be re-tested, causing a cascade of PR test runs. There were instances when we needed to make a build for a playtest but the Jenkins queue was clogged with jobs, and we had to manually abort jobs in order to get the build out faster. It felt like a traffic jam on the expressway and we were the traffic controllers.

Photo by Iwona Castiello d’Antonio on Unsplash

With cloud computing, we can easily provision more resources to handle an increasing workload and have the flexibility to choose what hardware to run it on.

Reliability

We had to actively manage the Jenkins machines. Sometimes a machine would reboot but fail to initialise the Jenkins services correctly. Sometimes it ran out of disk space. There was a period when we were in between offices and the machine resided in our CTO’s home, so any power outage or loss of internet connection caused us to lose connection to Jenkins and stalled development. If anything was wrong with the machine, only a couple of us were able to figure out what was wrong and fix the issue.

With the cloud, we don’t have to manage the infrastructure. Since the infrastructure for the GitHub runners is set up through code, if there is an unhealthy instance, we can easily terminate it and deploy a new one.

Ease of Use

We connected remotely to the Jenkins machine through Google Remote Desktop since we didn’t want the machine to be publicly accessible. Only one user could log in at a time and, as you can imagine, it was awkward to coordinate logins. We had to organise ourselves and wait our turn just to check on the status of a job, and that wasn’t the best for productivity.

One of the things that we wanted to achieve as an engineering team was for non-engineers to be able to make builds, so that they could test out whatever feature they wanted without engineering assistance. Having to check if anyone is using Jenkins, use Google Remote Desktop, sign in to Jenkins, click through the Jenkins UI to find the build job, select the correct build settings and remember to sign out of Jenkins is not an intuitive experience by any means.

It is a much simpler experience to log in to GitHub, navigate to the Actions tab and trigger a build. We’ve simplified the build options to include only essential parameters with sensible defaults so that even non-engineers can trigger a build.

Security

The build machines are more secure without public internet access. All interactions with them go through GitHub Actions, and GitHub handles user authentication.

Why not Jenkins in the cloud?

We had a pretty good experience with using GitHub Actions in a smaller project. We liked how easy it was to set up the workflows and how easy it was for non-engineers to navigate the UI. GitHub Actions’ .yaml files were easier to understand than Jenkins’ Groovy scripts. Most importantly, it integrates nicely with GitHub.

The transition was simple in large part because of Game.CI. It’s a GitHub Action developed by the community that does the heavy lifting of building and testing a Unity project. Using Jenkins, we had to manually upgrade the Unity installations on each machine. With Game.CI, it’s just a matter of specifying the version in the config and it’ll download the appropriate version to use. Game.CI’s community on Discord was also super helpful. Thank you!

What about Unity Cloud Build?

Unity Cloud Build was the easiest to get up and running with. Hosting of the WebGL builds was already part of their build process and it was one less thing we had to set up. This was perfect for builds that are temporary.

But integration with GitHub was lacking. It could not run tests every time a change was made in a PR, nor notify GitHub whether the tests had passed or failed.

It also did not support the new Dedicated Server build target, which can strip out code and assets that are unnecessary for a server build.

Fundamentally, we didn’t want to be overly dependent on Unity to support our needs. We liked the flexibility of being able to fork the Game.CI action and customise it for our specific needs. I’ve even contributed a fix.

Self-Hosting

We initially tested GitHub Actions with a smaller project and it worked fine for that use case. But when we tested it on a more complex project like Disney Melee Mania, the build took more than 1.5 hours, even with caching. That was much slower than our builds using Jenkins. This was not a surprise since the GitHub-hosted runners run on dual-core CPUs.

We needed to host the runners on a more powerful machine, but self-hosting meant having to manage how the runners scale. GitHub recommends two methods of implementing auto-scaling: https://github.com/actions-runner-controller/actions-runner-controller or https://github.com/philips-labs/terraform-aws-github-runner. We chose the latter as it is closer to how GitHub sets up its own runners.

Auto-Scaling With Terraform

Autoscaling setup using terraform-aws-github-runner

The setup for auto-scaling runners using Terraform consists of A LOT of steps, but it’s well documented at terraform-aws-github-runner, so I’ll just detail what we changed for our own setup.

  • We changed the default instance type to a more powerful machine. In main.tf:
instance_types = ["c5.2xlarge"]
  • We stored the deployment state in S3:
terraform {
  backend "s3" {
    bucket = "github-runner-state-linux"
    key    = "path/to/my/key"
    region = "ap-southeast-1"
  }
}

The state needs to be stored because it maps the currently deployed resources to the configuration. Without it, Terraform would deploy new resources instead of modifying the existing ones. The state should not be committed to GitHub as it may contain sensitive data, so it’s best to store it in a private S3 bucket.

  • We modified templates/userdata.sh to install the GitHub CLI, which our workflow requires. Any other packages that need to be installed and aren’t part of the base image can be added here. In the future, we’ll explore setting up an image with all these packages baked in to improve the runner’s start-up time.
# Install the GitHub CLI (user data runs as root, so no sudo is needed)
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null
apt-get update
apt-get install -y gh

The module’s documentation lists all the configuration options and their default values. For example, the maximum runner count is 3 by default, and we can change it by setting the following in main.tf:

runners_maximum_count = 5
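
After editing main.tf, the change still has to be planned and applied. The following is a minimal sketch of that step using standard Terraform commands, not the exact procedure from the module's documentation. It assumes Terraform 0.14 or newer (for the -chdir option) and that the module's prerequisites are already in place; the function name apply_runner_config and the directory name are hypothetical. Setting DRY_RUN=1 prints each command instead of running it.

```shell
#!/bin/sh

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Initialise, plan and apply the configuration found in directory $1.
# The plan is saved to a file so that apply executes exactly what was reviewed.
apply_runner_config() {
  run terraform -chdir="$1" init
  run terraform -chdir="$1" plan -out=tfplan
  run terraform -chdir="$1" apply tfplan
}
```

For example, with DRY_RUN=1 set, apply_runner_config infra lists the three commands that would run against a hypothetical infra directory, which is a cheap way to check the sequence before touching real infrastructure.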

GitHub Workflow

For our WebGL client build workflow, the steps are:

  • check out the branch
  • using Game.CI, make the build with a custom build method that specifies which environment the client should be built for, e.g. dev, qa, live
  • upload the addressable assets to Unity Cloud Content Delivery
  • upload the WebGL build to an S3 bucket
  • invalidate the CloudFront distribution pointing to the S3 bucket
  • trigger a Slack notification with the status of the build
  • cache assets
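
The three deployment steps in this list (S3 upload, CloudFront invalidation, Slack notification) could be sketched in shell roughly as follows. This is an illustrative sketch, not the exact commands from the workflow: the function names, bucket name, distribution ID and webhook URL are placeholders, and it assumes the AWS CLI and curl are available on the runner. The Unity Cloud Content Delivery upload is left out. Setting DRY_RUN=1 prints each command instead of running it.

```shell
#!/bin/sh

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Upload the build in directory $1 to bucket $2.
# --delete removes objects from the bucket that no longer exist locally.
upload_build() {
  run aws s3 sync "$1" "s3://$2" --delete
}

# Invalidate every path of CloudFront distribution $1
# so that the new build is served instead of cached files.
invalidate_cdn() {
  run aws cloudfront create-invalidation --distribution-id "$1" --paths "/*"
}

# Post the build status $2 to the Slack incoming webhook URL $1.
# The status is assumed to be a plain word without quotes (e.g. success).
notify_slack() {
  run curl -fsS -X POST -H "Content-Type: application/json" \
    --data "{\"text\": \"WebGL build finished: $2\"}" "$1"
}
```

For example, with DRY_RUN=1 set, upload_build build/WebGL example-bucket prints the aws s3 sync command without touching AWS.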

For our automated tests, we have steps to:

  • check out the branch
  • using Game.CI, run the test runner
  • format the test results .xml using Extent Framework
  • upload the formatted results to S3
  • trigger a Slack notification with the results of the tests
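
For the Slack notification, a one-line summary can be derived from the results file. The sketch below assumes the test runner writes NUnit 3 style XML, where the root test-run element carries total, passed and failed attributes on a single line. That format, the function names and the file path are assumptions for illustration, not details taken from our workflow.

```shell
#!/bin/sh

# Print the value of attribute $2 from the <test-run ...> element in file $1.
# Assumes the element's opening tag sits on one line, with name="value" attributes.
test_run_attr() {
  sed -n "s/.*<test-run[^>]* $2=\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}

# Print a one-line summary suitable for a chat message.
summarize_results() {
  total=$(test_run_attr "$1" total)
  passed=$(test_run_attr "$1" passed)
  failed=$(test_run_attr "$1" failed)
  echo "Tests: $total total, $passed passed, $failed failed"
}
```

Calling summarize_results artifacts/results.xml (a hypothetical path) produces the text that would go into the Slack message. A real XML parser would be more robust than sed if the layout of the file varies.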

Conclusion

DevOps Overview

Everything is running much better than before, but it was not all smooth sailing. We had to iron out some kinks with obtaining floating Unity Build Server licenses, and we just noticed an issue with auto-scaling the runners to deal with PRs, so things are still in development. That being said, farewell Jenkins duo: you have served us well.

