DevOps in Review 2023.1

Dimitrios Raviolos
Published in tech-gwi · Oct 2, 2023



A lot of interesting things have been happening on the DevOps team lately, so we've decided to share a review of the first semester of 2023 in this blog post, presenting the latest and greatest features coming to our infrastructure and how we've got your back.

Adopting a DevOps mindset brings a range of benefits and processes to an organisation; this semester, however, our focus was on reducing costs while also experimenting with new technologies. In this post, we're going to present in greater detail what we focused on in the first half of the year and what the tangible results were.

However, reducing cost doesn't give you the whole picture of the awesome stuff we worked on this semester. We played with cutting-edge technology, ranging from OPA policies to Terragrunt and Linkerd. Let's dig into each item in more detail.

Cost optimization

One of the processes covered by DevOps is monitoring, which includes cost management and process gatekeeping. Cost management is a very direct lever here, because GCP resources are billed on a pay-as-you-go model: there were a lot of savings to be made by removing obsolete resources or tweaking misconfigured ones.

Infracost

We’ve integrated Atlantis with Infracost, which is self-described as the cloud’s checkout screen (if you want to learn more about Atlantis and how we use it, please refer to this earlier blog post).

The integration surfaces on pull requests against our Terraform code and provides cost information about the resources being added.

One of the first things I noted down when I joined GWI, after hearing my colleagues talk about it in our daily meeting, was a question on a pull request review we did as a group (a practice that sprang from our guild): "Is the cost increase approved?"

It's very important to remember that companies using a cloud provider for their infrastructure need to have budget reviews; reviewing Terraform code therefore isn't only about correct code, it is also a budget review. And it's not only about approving the cost increase; it's also about doing everything you can to properly configure the limits.

Infracost thus provides a handy little checkout screen that empowers engineers to do their work properly and helps identify resource misconfigurations before anything is applied.
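To make this concrete, here's a hedged sketch (the resource name and values are hypothetical, not taken from our actual code) of the kind of change Infracost highlights: bumping a machine type in a pull request would show up in the Infracost comment as a monthly cost delta before Atlantis applies anything.

```
# Hypothetical example of a change Infracost would surface on a PR.
# Bumping machine_type from e2-small to e2-standard-4 appears in the
# Infracost comment as a monthly cost increase before anything is applied.
resource "google_compute_instance" "worker" {
  name         = "example-worker"
  zone         = "europe-west1-b"
  machine_type = "e2-standard-4" # was: e2-small

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```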

OPA

OPA, short for Open Policy Agent, is our tool of choice for implementing policies in our GitHub workflow when dealing with Terraform code.

Thankfully, OPA can be integrated with Atlantis, which reaffirms our decision to introduce Atlantis last year in an effort to bring some much-needed structure to our infra workflow; you can read a more in-depth article about it here.

One example policy we've created enforces cost restrictions based on the estimates generated by Infracost.

Another policy introduced recently blocks the merging of custom (non-module) resources in our Terraform code. As mentioned previously, the migration to modules is almost done and most projects have already been migrated, which is why we need a policy to remind people not to use standalone resources but to switch to a module instead. The process is 99% the same anyway, and it comes with important additional benefits.

More OPA policies are in the pipeline, mostly sane defaults for working with Terraform code, such as only allowing apply on mergeable PRs, only allowing apply on approved PRs, and so on.

Kubernetes Resource Optimization

This is a technical directive kicked off by the DevOps team, which included building a Grafana dashboard to help teams identify their services and then fine-tune them.

An example of this dashboard is visible in our Grafana. It pulls data from several different sources, such as Google Cloud Monitoring, Kubernetes Prometheus metrics and GKE's cost allocation, which breaks down cost per Kubernetes namespace.

Of course, there's a big gotcha here, and this is where we need our engineering teams to get involved in analysing the data we produce: cost allocation is calculated from pod resource requests, which implicitly assumes that those requests are set correctly. Without proper analysis, this may produce misleading numbers.
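For context, GKE's cost allocation has to be switched on at the cluster level before the per-namespace breakdown becomes available. A minimal sketch in Terraform, assuming a recent hashicorp/google provider and a hypothetical cluster name, could look like this:

```
# Hedged sketch: enabling GKE cost allocation so billing can be broken down
# per Kubernetes namespace. The cluster name and location are hypothetical,
# and cost_management_config assumes a recent hashicorp/google provider.
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "europe-west1"
  initial_node_count = 1

  cost_management_config {
    enabled = true
  }
}
```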

Cost saving via resource efficiency

AKA: DevOps' new and shiny toys

In their whitepaper "How to Measure ROI of DevOps Transformation", Google identified two avenues for cost reduction. The first was improved efficiency through the reduction of unnecessary rework, and the second was the potential revenue gained by reinvesting the time saved into new offerings.

Keeping this in mind, our role as DevOps is to remove clutter and unblock the other teams so that they can focus on faster development without sacrificing quality or security.

Here are some of the new technologies we implemented that help the development teams focus on actual development:

Linkerd

Linkerd is a CNCF-graduated service mesh, which means it's secure and mature. It comes with great features out of the box, with no extra implementation work needed, such as mTLS on the communication between applications in a cluster, as well as metrics like success rate, latency and request volume.

Linkerd also brings additional benefits such as latency-aware load balancing and blue-green deployments.

Istio is another solution we reviewed (as well as Anthos Service Mesh, Google Cloud's managed Istio), but after careful consideration we decided that Linkerd was the way to go due to its simplicity.
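As a rough illustration of how little is needed to mesh a workload, the sketch below opts a namespace into Linkerd via its standard injection annotation, expressed here with the Terraform Kubernetes provider to match the rest of our tooling; the namespace name is hypothetical.

```
# Hedged sketch: Linkerd's proxy injector adds the sidecar (and therefore
# mTLS and the golden metrics) to every pod created in an annotated
# namespace. The namespace name is hypothetical.
resource "kubernetes_namespace" "orders" {
  metadata {
    name = "orders"

    annotations = {
      "linkerd.io/inject" = "enabled"
    }
  }
}
```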

Terragrunt

A Terragrunt PoC is the next big thing we're currently working on. Terragrunt is a tool that helps keep Terraform code DRY (Don't Repeat Yourself) and supports deploying all of a project's modules in dependency order, which will simplify deployments and reduce errors. Terragrunt has been very high on our list of new technologies to adopt and has been the driving force behind investing resources in the Terraform modules transformation.

Terragrunt will prove useful in various ways; for example, we expect it to reduce our Terraform code by around 30%! It's also going to help with next quarter's disaster recovery tests, as its 'apply-all' (which runs modules in dependency order) will remove a lot of manual intervention by the DevOps engineers.
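As a rough sketch of what this looks like, a unit in a Terragrunt layout is just a small terragrunt.hcl that points at a module and declares its dependencies; the paths, module source and variable names below are hypothetical, not our actual setup.

```
# Hedged sketch of a terragrunt.hcl for one unit of a project; the module
# source, paths and variable names are hypothetical.

# Pull shared settings (remote state, provider config) from a root file.
include "root" {
  path = find_in_parent_folders()
}

# The Terraform module this unit deploys.
terraform {
  source = "git::https://example.com/terraform-modules.git//gke-cluster?ref=v1.0.0"
}

# Declared dependencies let Terragrunt apply the whole project in order:
# the network is created first and its outputs feed into this module.
dependency "network" {
  config_path = "../network"
}

inputs = {
  network_id = dependency.network.outputs.network_id
}
```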

Technical Debt

Terraform Modules

In the first semester of 2023 we did a great deal of work on tackling technical debt by migrating more than 85% of our Terraform code into Terraform modules, self-contained, group-managed packages of Terraform configuration, which translates to many thousands of Terraform resources!

Of course, we've got a strategy in mind. This is a big investment, and we've carefully weighed the benefits against the time needed to make the changes.

First of all, the transformation greatly benefits our way of working. Ours is a concurrent environment, and we need to be mindful that when multiple people push code affecting the same environment, their work might interfere with each other's. By breaking the code down into modules, the scope of that problem shrinks: a conflict now only arises when two people alter exactly the same module. It doesn't make the problem disappear completely, but it makes it far less likely to happen.

Second, speed matters, especially when you have to push a lot of changes. The module transformation breaks a monorepo down into multiple state files, which in turn makes Terraform plan a breeze: if you change a single access policy, you no longer have to wait for the plan of a whole project to finish; you only wait for the resources within the module you are altering!

Third, this is a much-needed step before adopting other tools, like Terragrunt, described above, which will in turn provide additional benefits and abstract away difficulties we're currently facing.
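To give a feel for what the migration looks like in practice, here's a hedged before/after sketch; the bucket, module source and inputs are hypothetical, not our actual code.

```
# Before: a standalone resource declared directly in the project.
resource "google_storage_bucket" "reports" {
  name     = "example-reports-bucket"
  location = "EU"
}

# After: the same bucket expressed through a shared, group-managed module
# that bundles our naming, labelling and access conventions. The module
# source and inputs are hypothetical.
module "reports_bucket" {
  source   = "git::https://example.com/terraform-modules.git//gcs-bucket?ref=v2.3.0"
  name     = "example-reports-bucket"
  location = "EU"
}
```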

Final Words

AKA FinOps?

Our cost optimization initiative was kicked off last semester, and our ideas mostly involved low-hanging fruit. After working on it for about a month, we achieved a reduction in infrastructure costs of about $7600 per month, which is approximately 10% of our monthly Google Cloud Platform bill!

This is by no means the end of cost optimization, as it's an ongoing effort, but at the same time it's not an activity we'll just revisit every six months. It was a great experiment that shows our infrastructure is mature enough to benefit greatly from the adoption of FinOps. FinOps is not about reducing spending; it's about cultivating a culture of financial accountability, getting the most out of every buck spent and enabling our people to make more informed financial decisions.
