Kicking the habit: Jenkins detox in production

Matic Miklavčič
Outfit7
Mar 2, 2023 · 11 min read

TL;DR: Moving the O7 game build system into the cloud opened up new possibilities for individual game development teams around scaling, testing and fixing bugs. We improved build times, reduced system complexity, increased stability, and most importantly, increased the number of builds, ensuring finer-grained testing and bug tracking. Best of all, we did all this without interrupting regular production.

Ever wondered what it’s like to be an engineer changing the undercarriage of a bullet train…

…with nothing but a multitool, blindfolded with your hands tied?

…after not sleeping for a week?

…while the train is trying to break a speed record?

Ummm, wait a second. You guys test stuff in production?!

Why, of course we do!

Hi, my name is Matic Miklavčič and I own a multitool! Erm. I mean… I am a DevOps engineer at Outfit7. I’ve been a part of the company for about six years now and I can proudly claim that there are bugs of my own making in most of the codebases across the company.

Below, I’ll talk about how we turned the entire build infrastructure upside down without affecting schedules or causing downtime.

But first of all, let’s get cooking!

In all seriousness, a game is basically a bunch of code, with some art and sounds thrown into the mix. A game needs to monetize, which is just a fancy word for… yet another bunch of code. A game also needs to report some numbers back to its makers, so that they can make wild claims about the huge numbers of users, downloads, and awesome stats. This is — you guessed it — just another bunch of code. Putting it all together is like making a lasagna: A layer here, a layer there, bake it, eat it. Easy enough, right?

Now, imagine that each layer is provided by a different chef. To top it off, each chef does their thing in their own time. Some may add too much salt; some may decide to cook another recipe entirely; others might forget their stuff in the oven and deliver a pile of smoldering embers. This is further complicated by the fact that the lasagna needs time to bake before you can check if it is any good. In our case, this means combining stuff like in-app purchases, analytics, ads, different supporting libraries (and the occasional game code) into a — hopefully — working final product.

To add to the pressure, the lasagna we’re making needs to appeal to everybody. It needs to be delivered at the correct time, at the right temperature, to the right consumer.

[To monetize: A product must provide income to the company. Money is a substance used to periodically feed the developers, ensuring regular code output. ]

In this analogy, our build system is the oven.

An oven can accommodate a single lasagna at a time.

If there are many chefs working without proper coordination, the lasagna will be terrible, because everybody will be scrambling to use the oven at the same time.

If there’s only a single master chef, he’ll likely quit due to the pressure.

In both these cases, consumers get bad lasagna. It’s either late, cold, poorly made, or all of the above.

Our old build system grew with the company: organically, without much thought put into it. It was hosted on our own premises and worked fine for a while. Then our appetites grew. And our company grew, too. Suddenly, we were no longer building two or three projects, but two or three times that. Our build machines were either bored or overburdened. Our build engineers were becoming stressed out like that master chef I mentioned earlier. The oven was never clean enough, never at the correct temperature, or simply not available because somebody else was baking stuff in it. We needed to do something.

The legacy build system

Scaling up (after eating all that lasagna)

It is just so convenient, this lasagna analogy — I’ll push it a bit more.

To make sure each lasagna is perfectly cooked, we must allow each chef to manage their own cooking. Having a master chef keep an eye on everything, or having many chefs share a single oven, is not optimal. So we can take advantage of the fact that we are not really cooking here — instead, we’re writing software. And the best thing about software is that it can be copied. Multiplied. Infinitely.

So — why not make more ovens? In fact, let’s make a million. Furthermore, let’s make it so that each oven is available at exactly the right time, preheated to the perfect temperature and ready to accommodate a chef instantly. At the same time, the oven will take care of alerting the chef when the meal is done, and clean up after itself. Instead of opening the oven, taking out the hot tray and cutting up the fresh lasagna, let’s make the oven do that automatically. The only thing left is for the waiter to do a quick check and take the fresh delicacy to the consumers.

If we put this in the perspective of the Outfit7 build system, it means that instead of using a few worker machines stored somewhere on our office premises (the exact location lost to history, because one of our IT specialists refuses to tell us unless we buy him a puppy), we now rent worker nodes on an as-needed basis. Instead of having to care for the machine environment (and inevitably breaking stuff during simple updates), we now use a tested, predetermined environment in the form of a Docker image, specific to the job at hand. The Docker image is now our oven: we can instantly rent a bunch of worker nodes, fire up the images on them, wait for them to quickly warm up, and give each of them a single task to perform — be it a single library build, or even an entire game. Better yet, the build system takes over all the heavy lifting, which means we have reduced a complex game build to a few lines of easy-to-read YAML.

Code (complexity) reduction with the new system due to redesign.
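To give a feel for those “few lines of YAML,” here’s a minimal sketch of what a CircleCI-style job for a single library build might look like. The image name, build command and paths are hypothetical placeholders, not our actual configuration:

```yaml
version: 2.1

jobs:
  build-library:
    docker:
      # Hypothetical pre-baked image: our "oven", with all build tools already installed.
      - image: example/android-build-env:2023.02
    steps:
      - checkout
      - run:
          name: Build the library
          command: ./gradlew :mylib:assembleRelease   # hypothetical build command
      - store_artifacts:
          path: mylib/build/outputs/aar               # hypothetical artifact path

workflows:
  build-everything:
    jobs:
      - build-library
```

Each such job runs in its own freshly started container, so building ten libraries at once simply means ten ovens baking side by side.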

We have now provided our chefs with the ultimate tool, and they can focus on building the perfect lasagna, putting the correct layers together, and trusting the system to take care of the rest.

It is important to empower our developers to do what they do best, with easy-to-use tools that need as little maintenance as possible. They are great at making games, and they should spend their time doing just that, instead of fighting with the tools.

Enter the cloud

We needed to do several things to make our builds scalable.

Firstly, we needed to standardize. Our games at the time were really individual products, developed by individual teams. Even the build scripts were written by the developers themselves, which meant the code was not consistent between projects — so having a standardized set of commands to build each project proved impractical. If we wanted one system to rule them all, we needed to use the same consistent method to build each and every game. We slowly transitioned from individual games using their own customized build scripts to something generic — meaning it was used in the same way across our portfolio.

After a few planning sessions with the game teams, we started to standardize our environments. A natural progression was to containerize them, which enabled us to scale better. At the same time, it made the environment predictable, so our developers no longer needed to worry about it. (Imagine arriving at work bright and early at 10:45am, only to find that somebody installed Java 17 on the worker machines and made it the default option. Now your builds don’t run and you are giving me a hard time.)

If we wanted our builds to be fast, we needed to reuse certain parts of previous build jobs. In the old setup, whenever something went wrong, we did a “workspace clean,” which erased the whole workspace for that game on the worker machines and made sure that a fresh batch of code was checked out from the repository. Now, every single build starts in a “clean” container, because the container starts up without any trace of the game code. We do a fresh checkout every time and just restore cached pieces of it on the fly. This had the added benefit of reduced coffee consumption, less stress and so on.
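For the curious, here’s a rough sketch of what that fresh-checkout-plus-cache dance might look like in CircleCI-style configuration. The cache key, lock file, build script and cached path are made up for illustration, not our actual setup:

```yaml
steps:
  - checkout                  # fresh copy of the game code in a brand-new container
  - restore_cache:
      keys:
        # Hypothetical cache key; falls back to the newest partial match.
        - deps-v1-{{ checksum "dependencies.lock" }}
        - deps-v1-
  - run:
      name: Build the game
      command: ./build.sh     # hypothetical standardized build entry point
  - save_cache:
      key: deps-v1-{{ checksum "dependencies.lock" }}
      paths:
        - ~/.cache/game-deps  # hypothetical cached dependency directory
```

The container itself is the clean workspace, so there is no separate “workspace clean” step to run anymore.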

At the same time, we were looking for a cloud-based solution that would replace our on-prem Jenkins, which we’d been using since forever. The idea is that if our developers want to run a thousand builds at any given moment, they must be able to do so without us having to maintain a bunch of host build machines. If we need only one machine, we’ll rent one — if we need a million, we will rent a million (and risk facing the inevitable wrath of upper management after they see the monthly expenses).

Testing in production

At the start of this blog post, I was bragging about changing critical systems while in active production. This was initially a big challenge: as the mass of people grows, their flexibility diminishes (he he). Having a large number of people switch over to a completely different build system overnight, without causing headaches and downtime, was no small feat. At the same time, we needed a better way to redistribute the built apps within the company, so that we could also keep our QA busy. We needed to simplify the installation process somewhat: instead of distributing files that had to be installed manually, we wanted something easier. And an easier solution had already been invented, one that we use daily, in the form of app stores! People sometimes prefer playing games to working (weird, right?), so what we wanted was to improve the ease of access to all of our builds. This gave birth to our very own internal app store, which we call AppHub.

What app hub would an AppHub hub, if an AppHub could hub apps?

While we all use different app stores regularly, the ease of use comes at a cost. If you want to create your own store, you need to overcome several technical challenges. It’s not just serving a list of downloadable games (or artifacts, as we like to call them) and having them downloaded to your phone and magically installed. There’s more to it. Like our lasagna.

Besides having to deal with authentication and user privileges, you need to handle different installation methods for different platforms.

For iOS, this means having to deal with its app installation system, which, instead of an application, requires a .plist file that points to the application, with no option to provide authentication (yay, it almost rhymes). So, when serving the .plist to the requesting client, we need to create a pre-signed URL that exposes the installable application package, and have it invalidated after a certain amount of time.

For Android, it gets even more complex. Here, we have to deal with easily and directly installable .apk files, in addition to the much more complicated handling of .aab files (Android App Bundles), which cannot be installed directly. Since we might decide to build our application as either a bundle or an APK on the fly, the distribution system never knows what to expect and must handle both equally. This means that upon receiving a client request for an installation, the server has to unpack the bundle, generate an installable package for the exact device that requested it, and sign it with the correct certificate.

What’s more, we wanted to migrate to the new build system and the new distribution system seamlessly. When we decided to introduce AppHub, we were still using Jenkins to build most of our apps, which meant AppHub would need to serve artifacts from both Jenkins and the upcoming CircleCI-based build system. We wanted to spare our users from having to switch build systems and distribution systems at the same time.

Looking at the problem from a few steps back made the solution obvious. Because our DevOps team is awesome, we decided to run not one but two build systems in parallel. Old grandpa Jenkins would be forced to keep working for a while after its retirement, while we set up a fresh system based on CircleCI beside it. We would then migrate, team by team, game by game. In effect, a game team would use both systems for a certain period, providing valuable feedback on the new build system as it received incremental buffs, while still being able to fall back to old reliable Jenkins if something went wrong. In this way, we moved the first two games over to the new build system, which eventually took some of the burden off the old, arthritis-ridden Jenkins. It also meant that our sparkling new distribution system started filling up with artifacts, and our people got familiar with it. Eventually, everybody saw the benefit and started asking questions about migrating. Yay, no more pushback! Instead, we had positive reinforcement!

As we got the ball rolling, we could afford to move the other game teams over more quickly. Because the scripts were now standardized, game teams were able to perform certain tasks themselves, which ended the over-reliance on the DevOps team.

What is the benefit of the migration?

The DevOps team is now able to assume a supporting role, helping game teams take ownership of their projects in a totally new way. Plus, game teams have started to take care of their own builds as well.

Distributing the load means that the DevOps team can now focus on developing new features for our game teams to use, as well as on researching ways to improve build performance and reduce costs. Having builds run in the cloud also means that there is no more waiting in a queue. Teams no longer affect each other by hogging resources; they are completely independent. We now spend less time (and money) than we did with Jenkins, while running more builds and making those builds a lot more stable. And since the systems are now so much simpler, they can be maintained by mere mortals, instead of those build engineer ghouls that everybody feared not so long ago.

That’s all from me! Now, I’m off to try and explain to my manager that my job only looks easy…

How did you solve your build and deployment scaling problems? Let us know in the comments!
