GitOps for multi-cluster K8s environments

A single repository approach for scalability and transparency

Gleb Vazhenin
Bumble Tech
Feb 1, 2023 · 7 min read



Kubernetes (K8s) has recently become popular as a solution for managing ML workflows. It provides features such as load balancing and automatic scaling, and ensures that the model-serving infrastructure can handle a large number of requests without performance degradation. Considering the volume of real-time predictions we make at Bumble Inc., moving to K8s felt like a natural step.

Whenever a company deploys Kubernetes at an enterprise level, the topic of multi-region operation often surfaces. Sometimes services can’t afford transatlantic round trips because of latency SLAs but, to a greater extent, a multi-region setup contributes to the disaster tolerance of the system. When it comes to operating across multiple continents, the prevalent solution is a cluster-per-region approach: Kubernetes is designed so that a single cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Taking into account region-specific configurations and resource upgrades, managing these environments can prove challenging.

Enter GitOps — a paradigm that provides a convenient way to manage infrastructure and applications. By treating a Git repository as the single source of truth for declarative infrastructure and workloads, GitOps helps software development teams simplify deployments of cloud-native applications, particularly in the context of ML Operations.

The Machine Learning Engineering team of Bumble Inc — the parent company of Bumble, Badoo and Fruitz — is focused on empowering our Data Science team in their efforts to improve the overall efficiency and user experience across our portfolio of apps. We strongly believe that a properly designed platform architecture can open a new world of opportunities for Data Scientists’ models, experiments, and analyses.

Using ArgoCD with various configuration transformation tools, we’ve incorporated the GitOps approach in the multi-cluster setting for the Data Science platform at Bumble Inc. This is how we did it and what we learnt.

GitOps for applications and infrastructure

GitOps modernises software management by allowing Engineers to declaratively manage infrastructure and software code using a single source of truth — typically a Git repository.

Potential benefits of using GitOps include:

  • Easier definition and maintenance of the desired state of your environment, through the use of declarative configuration.
  • Having Git as the source of truth for your application’s configuration makes it easier to manage workloads and reduces the risk of unexpected errors.
  • Git’s version control features help you manage and track configuration changes, while its collaboration features make teamwork easier.
  • Continuous integration and continuous delivery (CI/CD) pipelines to automate the deployment and management of your applications, improving the speed and reliability of your deployments.

ArgoCD and Flux are well-known GitOps tools that keep your Kubernetes applications aligned with the configurations declared in a Git repository. Both projects have graduated from CNCF incubation, are widely used, and have proven their reliability.

GitOps usage is not limited to application resources. Terraform has recently moved closer to a typical GitOps workflow — a new continuous validation feature was released, increasing its overlap with ArgoCD and Flux. Moreover, a recently introduced tool called Crossplane provides an exciting example of GitOps practices in infrastructure governance. Since Crossplane resources are defined as Kubernetes manifests, combining ArgoCD and Crossplane gives you a full solution for applying GitOps principles to infrastructure.
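As an illustration, here is roughly what a Crossplane-managed cloud resource looks like as a plain Kubernetes manifest — the bucket name and provider configuration below are hypothetical, not something we run:

```yaml
# A hypothetical Crossplane managed resource: an S3 bucket declared as a
# Kubernetes object, which ArgoCD can then sync like any other manifest
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: ml-artifacts-example
spec:
  forProvider:
    region: eu-west-1
  providerConfigRef:
    name: default   # references an AWS ProviderConfig installed in the cluster
```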

ArgoCD & Kustomize & Helm

In the Bumble data science team, we’ve deployed ArgoCD on top of Kustomize and Helm, both of which are tools that are commonly used to manage and deploy applications in Kubernetes.

Kustomize lets you customise raw, template-free YAML files for multiple purposes, leaving the original YAML untouched and usable as-is. It allows users to create specialised “overlays” that are used to inject environment-specific variables into a common base. Helm, on the other hand, is a package manager for Kubernetes that allows users to easily install, upgrade, and manage complex applications. Together, these tools provide a powerful way to manage and deploy applications in Kubernetes.
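To make the overlay idea concrete, here is a minimal, hypothetical Kustomize layout: a common base plus an overlay that patches the replica count for one environment. The my-app Deployment and file names are illustrative, not taken from our repository:

```yaml
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replicas-patch.yaml

# overlays/production/replicas-patch.yaml
# A strategic merge patch: only the fields below override the base
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app       # must match the name used in base/deployment.yaml
spec:
  replicas: 5        # production-specific value; the base YAML stays untouched
```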

A smooth GitOps experience

Design principles

A well-designed repository structure unlocks the full potential of GitOps. Here are some tips for building one:

  1. Make the proposed repository structure clear and transparent.
  2. Make sure you can easily promote changes between environments, without introducing any security risks (e.g. exposing secrets from one environment to another).
  3. Consider having a single source of truth for all dependencies to avoid duplicated code (make use of overlays).
  4. Evaluate the ease of bringing up new environments.
  5. Consider possible privilege separation issues.

Data Science at Bumble Inc. approach

Design overview

We have continuously re-evaluated our GitOps approach and arrived at a solution that suits all our current needs. The key principles are captured in the high-level repository structure below:

High-level repository structure. Source © Bumble

This structure makes it easier to operate a multi-region cluster setup without introducing any privilege escalation problems. But the main advantage of the suggested structure is its transparency.

Check out an example dummy application built following the described structure: https://github.com/bumble-tech/demo-gitops-repository

Deep Dive: The two-level overlays approach

Now it’s time to look at a hands-on example. Here, we need to deploy zone-specific configurations of Istio on different clusters.

Repository structure (Istio example). Source © Bumble

1. All the applications belonging to a specific cluster are defined in kustomization.yaml located at the very root of a zone overlay

overlays/zone-1/kustomization.yaml
overlays/zone-2/kustomization.yaml
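Such a file might look like this — a sketch rather than our exact manifest:

```yaml
# overlays/zone-1/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - argocd-applications   # every ArgoCD Application this cluster should run
```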

Therefore, each zone has its own set of resources to deploy.

2. kustomization.yaml, in turn, points to an argocd-applications folder, where resources of kind: Application are defined.

overlays/zone-1/argocd-applications/istio.yaml
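The file itself might look roughly like this — the repo URL is a placeholder, and the sync policy is just one sensible choice:

```yaml
# overlays/zone-1/argocd-applications/istio.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops-repo.git  # hypothetical repo
    targetRevision: main
    path: overlays/zone-1/istio    # zone-specific overlay for Istio
  destination:
    server: https://kubernetes.default.svc
    namespace: istio-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```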

Applications in the overlays/zone-1/argocd-applications folder contain instructions on how to build specific applications for a specific cluster. In the resource above, we see that the overlays/zone-1/istio folder should contain overlay instructions for deploying Istio for zone-1.

If a new application is being added to the cluster, its base resources are specified in the “base” folder, while zone-specific resources are declared in the zone overlay.

3. The resource overlay folder contains instructions on how to combine base and overlay resources of the specific application to be deployed in a specific zone.

overlays/zone-1/istio/kustomization.yaml
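A sketch of what this could contain — the patch file name is hypothetical:

```yaml
# overlays/zone-1/istio/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/istio/bumble   # the cluster-agnostic Bumble flavour of Istio
patches:
  - path: zone-patch.yaml        # hypothetical zone-1-only tweaks
```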

4. The home for Bumble’s Istio resources, base/istio/bumble, contains instructions on how to combine the base Istio installation with a Bumble-specific, but cluster-agnostic, overlay.

base/istio/bumble/kustomization.yaml
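Again, a sketch under the same assumptions — the patch file name is illustrative:

```yaml
# base/istio/bumble/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                      # upstream Istio manifests, vendored as published
patches:
  - path: bumble-patch.yaml      # company-wide, cluster-agnostic customisations
```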

Introducing another overlay level brings great value when it comes to upgrading third-party resources and libraries. To upgrade manifests, you simply replace the base/<resource>/base folder with the new one published by the community and roll the changes out to all clusters with a single commit, keeping all your overlay tweaks. It can be trickier if changes in the new version overlap with your environment-specific patches, but at least you’re explicitly declaring how your version of the resource differs from the one provided by the community. This contributes a lot to team collaboration: next time you’re enjoying your holiday, you can rest assured that you won’t need to help your team upgrade cluster resources you deployed previously, since all changes to the original source are explicit.

All of this isn’t limited to Kustomize: the same overlay structure can be achieved using Helm charts as well. For example, here is how kube-prometheus-stack values are separated between zones:

overlays/zone-1/monitoring/argocd-applications/kube-prometheus-stack.yaml
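A sketch of such an Application, with zone-specific values inlined — the chart version and Grafana hostname below are hypothetical:

```yaml
# overlays/zone-1/monitoring/argocd-applications/kube-prometheus-stack.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 44.3.0          # hypothetical pinned chart version
    helm:
      values: |                     # zone-1-specific overrides live here
        grafana:
          ingress:
            enabled: true
            hosts:
              - grafana.zone-1.example.com
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```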

This structure allowed us to improve the overall readability and transparency of the cluster resources.

For further automation, take a look at the ArgoCD ApplicationSet controller, which allows you to dynamically generate ArgoCD Applications from templates.
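Here is an illustrative sketch using the cluster generator, assuming the cluster names registered in ArgoCD match the overlay folder names — the repo URL is a placeholder:

```yaml
# An illustrative ApplicationSet: one istio Application per registered cluster
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: istio
  namespace: argocd
spec:
  generators:
    - clusters: {}                  # yields {{name}} and {{server}} per cluster
  template:
    metadata:
      name: 'istio-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/gitops-repo.git  # placeholder
        targetRevision: main
        path: 'overlays/{{name}}/istio'   # assumes cluster name == zone folder
      destination:
        server: '{{server}}'
        namespace: istio-system
```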

Conclusion

As Machine Learning Engineers at Bumble, our goal is to provide Data and Machine Learning Scientists with a platform that empowers them and speeds up their delivery, allowing them to focus on the user experience rather than figuring out how to deploy things to production.

For this reason, we’ve decided that our GitOps environment will follow these three pillars:

  • Repository-per-environment (development, staging, production)
  • Region-specific cluster resources are managed as “overlays”, sharing the same environment repository.
  • Region-agnostic localisations of base resources declared as “overlays”, sharing the same environment repository.

The proposed structure brings:

  • Transparency: Each cluster has a list of the resources intended to run there.
  • Scalability: New environments can be added easily.
  • Extensibility: The process of adding a new resource is transparent.
  • Collaboration: Both cluster-specific and cluster-agnostic changes to a common base are described in a declarative way.
  • Convenience: Resource upgrades require a single commit to be rolled out to all overlay clusters; overlay-specific parameters are declaratively defined.
  • Security: Repo-level access control improves security over a branch-based setup.

We would love to hear more about your approach to handling extensive ML workloads. Let us know what you think in the comments section, or reach out to me on LinkedIn.

Special thanks to Matthew Healey, Stephen O’Farrell and the Bumble MLE team.
