Getting a handle on hosting costs for CI/CD systems

Photo by Alif Ngoylung

By Akshay Kilaru, Site Reliability Engineer at MPB

Today’s continuous integration and deployment systems allow software developers to work more efficiently than ever before. Or at least, they do when they’re fully optimised.

One big efficiency gain for us here at MPB has been the use of feature environments. Once, developers would have built new features on Staging or Dev servers, then dealt with the inevitable unforeseen bugs when moving to Production. Instead, they now work on a temporary facsimile of Production from the start. Every branch gets its own auto-configured feature environment, and a whole class of bugs (those caused by dissimilar environment stages) is automatically squashed.

It’s a great leap forward, but not one that’s easy to explain to your accounts department. Cloud hosting charges can be difficult to predict and manage at the best of times, but introducing new environments on a daily (or even hourly) basis makes things much more challenging. Not to mention potentially expensive.

Some optimisations suggest themselves immediately. You can auto-delete feature environments after their allotted lifespan. Through trial and error, you can pare them down to reduce disk space and virtual hardware, while still ensuring enough parity to prevent major issues when features are promoted.
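
To make the auto-deletion concrete, here is a minimal sketch of a scheduled clean-up job. It assumes each feature environment lives in its own Kubernetes namespace labelled env-type=feature and uses the official Python client; the label, lifespan and names are illustrative rather than a description of our actual tooling.

```python
# Hypothetical clean-up job: delete feature-environment namespaces that have
# outlived their allotted lifespan. Label, lifespan and names are illustrative.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

MAX_AGE = timedelta(days=3)  # illustrative lifespan for a feature environment


def delete_expired_feature_environments() -> None:
    # Authenticate with the current kubeconfig context
    # (use config.load_incluster_config() when running inside the cluster).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    feature_envs = v1.list_namespace(label_selector="env-type=feature")
    now = datetime.now(timezone.utc)

    for ns in feature_envs.items:
        age = now - ns.metadata.creation_timestamp
        if age > MAX_AGE:
            print(f"Deleting expired feature environment {ns.metadata.name} (age: {age})")
            v1.delete_namespace(name=ns.metadata.name)


if __name__ == "__main__":
    delete_expired_feature_environments()
```

Run on a schedule (a nightly CronJob or CI job, say), something like this keeps abandoned environments from quietly accruing charges.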

But what can you do if, after all that, your hosting bill still has your colleagues in finance slowly shaking their heads?

Quite a lot more, it turns out.

At MPB, we found the best way to troubleshoot costs and go beyond the lowest-hanging fruit was to prioritise the issue over a one-week sprint. We focused on three approaches, which together have resulted in a remarkable 71% cost reduction.

1. Pre-sanitised database backups

The problem: We were deploying feature environments, which included Cloud SQL database snapshots taken from Production. These databases contained transactions and legacy application data, and were sanitised so developers wouldn’t be working with vast amounts of real customer data.

The snapshots, being images of the original Cloud SQL disks, were the same size as the Production disks regardless of how much data they actually contained. As a result, each feature environment ended up with a 1TB+ database disk even if it held only 50GB of data.

So we faced excessive disk space requirements for feature environments, but the problem didn’t end there. Because the database snapshots were taken manually and ad hoc, schemas grew old and outdated. Bringing them up to date required longer and longer schema migrations, making set-ups increasingly complicated and time-consuming.

In addition, our database stack comprises a number of parts sitting on different technologies with complex interdependencies, making restoration of an individual part more complicated still.

The solution: We created a pipeline, run automatically once a week, that snapshots and sanitises all our databases simultaneously. After sanitisation, a SQL dump is created from each database and stored in a bucket. These dumps let us create new Cloud SQL instances with appropriately sized disks, restore the dumps to them from the bucket, then snapshot the new instances.

The snapshots are then used to build feature environments, ensuring consistency between interdependent databases and saving on storage costs. Start-up times are also reduced by a third as feature environments’ schemas are never more than a week old.
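
For illustration, here’s a rough sketch of the restore-and-snapshot stage of such a pipeline, driving the gcloud CLI from a short Python script. The instance names, bucket path, database list, tier and disk size are placeholders rather than our production configuration, and the sanitisation step is assumed to have already produced the dumps in the bucket.

```python
# Rough sketch of the weekly restore-and-snapshot step, driving the gcloud CLI
# from Python. Names, bucket paths, sizes and versions are placeholders.
import subprocess

BUCKET = "gs://example-sanitised-dumps"          # bucket holding sanitised SQL dumps
DATABASES = ["orders", "catalogue", "legacy"]    # illustrative database names


def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)


def rebuild_template_instance(db: str) -> None:
    instance = f"feature-template-{db}"

    # 1. Create a right-sized Cloud SQL instance instead of cloning a 1TB Production disk.
    run("gcloud", "sql", "instances", "create", instance,
        "--database-version=POSTGRES_14", "--tier=db-custom-2-7680",
        "--storage-size=50GB", "--region=europe-west1")

    # 2. Restore the sanitised dump from the bucket.
    run("gcloud", "sql", "import", "sql", instance,
        f"{BUCKET}/{db}.sql.gz", f"--database={db}", "--quiet")

    # 3. Take an on-demand backup; feature environments are built from this snapshot.
    run("gcloud", "sql", "backups", "create", f"--instance={instance}")


if __name__ == "__main__":
    for db in DATABASES:
        rebuild_template_instance(db)
```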

2. Kubernetes cluster optimisation

The problem: We were running feature environments with node autoscaling switched on and an unoptimised machine type. As a result, autoscaling kicked in unpredictably, making billing hard to forecast, and the cluster consistently carried more capacity than it needed.

The solution: This piece involved a lot of fine-tuning, but it created substantial cost savings. We were already tearing down our K8s cluster overnight and rebuilding it in the morning. We now create feature environments with a six-node limit and a standard machine type chosen for optimum scalability (optimising Production usage based on what we learned is now on our backlog). On the back of this work, we also spent time tweaking the resource limits of the various charts. The result was a 23% cost saving from this piece of work alone, and we’ve yet to enable Google’s own GKE cost optimisation dashboards.
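
As a rough illustration of those cluster settings, the sketch below provisions a feature-environment cluster with a single deliberately chosen machine type and a hard cap of six nodes, and tears it down again in the evening. Cluster name, zone, machine type and node counts are placeholders; reading the six-node limit as an autoscaling ceiling is one interpretation, and pinning the node count instead is an equally valid choice.

```python
# Illustrative morning build / evening teardown of a feature-environment cluster.
# Cluster name, zone, machine type and node counts are placeholders.
import subprocess

CLUSTER = "feature-envs"        # placeholder cluster name
ZONE = "europe-west1-b"         # placeholder zone


def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)


def create_feature_cluster() -> None:
    # Morning rebuild: one standard machine type, autoscaling capped at six nodes.
    run("gcloud", "container", "clusters", "create", CLUSTER,
        f"--zone={ZONE}",
        "--machine-type=e2-standard-4",   # placeholder standard type
        "--num-nodes=2",                  # initial size
        "--enable-autoscaling",
        "--min-nodes=1",
        "--max-nodes=6")                  # the hard ceiling that keeps billing predictable


def delete_feature_cluster() -> None:
    # Evening teardown: the cluster is rebuilt from scratch the next morning.
    run("gcloud", "container", "clusters", "delete", CLUSTER,
        f"--zone={ZONE}", "--quiet")


if __name__ == "__main__":
    create_feature_cluster()
```

A scheduled job (cron, or a pipeline schedule) can call the two functions at the start and end of the working day.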

3. Cloud region optimisation

The problem: We sometimes talk about the cloud as if it were an insubstantial entity, evenly distributed on a global scale. The reality, of course, involves internet backbone links and data centres dotted at strategic points around the world, each with a unique set of local conditions. And that means they face different costs and charge accordingly. Moving infrastructure between these locations might cut hosting costs.

The solution: Of course, it isn’t sensible to simply look at the provider’s list of data centres and pick the cheapest. If you aren’t where your customers are, they could face latency issues. And besides, there’s more to consider than simply cost. There’s environmental impact, legal and compliance issues, distance from your own base, and the logistical headache of shifting your infrastructure.

MPB had always used the closest data centre, europe-west2, located in London. Upon investigation, we found that europe-west1, in Belgium, is not only cheaper but also runs on low-carbon electricity. Moving our feature environments there would not only mean a 16% cost saving but would also help meet our company’s sustainability goals. Win-win.

We were looking at feature environments in this sprint, but there’s clearly room to widen the scope for greater carbon savings all round. So we are now planning to move more of our infrastructure.

Final thoughts

If you’re trying to reduce hosting costs, you could clearly look at these three headline items and leave it there. But I think the real learning is about making a FinOps workstream part of your regular agile development process.

Experienced site reliability engineers and software architects know the focus areas most likely to achieve results, but may not have the time or support to prioritise the work. Finance teams know what they’d like to prioritise in order to hit their targets, but not how feasible their ideas might be or what else could be done instead.

The timing of our FinOps work was significant: it came just ahead of our hosting agreement renewal and helped us determine our future commitment level.

There were certainly more ideas that didn’t make the final cut for this sprint but which might be revisited. What we found is that bringing these different parts of the business together in an atmosphere of trust can make great things happen.

Useful links

Akshay Kilaru is a Site Reliability Engineer at MPB, the world’s largest platform for buying, selling and trading used photography and videography kit. https://www.mpb.com
