ReX: With GitOps, stop playing with your cluster!

Aurélie Vache
Published in ADEO Tech Blog · 10 min read · Apr 1, 2021

These days, there are terms we hear very often: DevOps, GitOps, Kubernetes … but what is behind these terms? Is there an existing tool/technology that can meet all the needs? Is GitOps a magic wand?

At ADEO, we have implemented a technical solution that aims to allow us to apply our configurations, ensure traceability of all changes made to our environments, and guarantee the homogeneity of our clusters.

On November 19, 2020, three members of our team (Vincent Guardiola, Franck De Graeve, and Maxime Triquenaux) presented our ReX (feedback) at the CloudNord conference on the implementation of the infrastructure dedicated to the Leroy Merlin European marketplace.

Video of the talk is available on YouTube: https://www.youtube.com/watch?v=nSho1GwfFOw

Context

Let’s start this ReX by explaining the context of Adeo.

Adeo represents 500 million customers and 26 independent but interconnected companies (Leroy Merlin, Weldom, Zodio…) present in 15 countries. And the most important thing at Adeo is people: we have 120,000 employees who are there to help and advise our customers in their projects.

And thanks to them (and a little to you for your purchases at Leroy Merlin ^^), Adeo is the European leader and number 3 worldwide in “home improvement”.

Adeo is a worldwide company, with autonomous BUs (Business Units) that have their own products (more than 50) and their own teams (hundreds of developers). The main goal is to make everyone work together on the same project, moving from a local mode to a global mode, and therefore to build a European/global marketplace in inner source.

You should know that to build this kind of project, we need to harmonize the way of working and the code, and therefore the technical platforms, so naturally we started with standards like Docker containers and Kubernetes.

Constraints

So our challenge was to set up a new Kubernetes technical platform but with fairly strong constraints:

  • Response times: one of the technical objectives of this European website is to be able to handle huge traffic (100,000 req/s).
  • Operational excellence: we are talking about a European website here; we cannot afford service interruptions, which would mean loss of turnover, brand image, and customers.
  • Multi-cluster, multi-region, and multi-cloud-provider deployments, for robustness, high availability, and to address several countries around the world, including France, Russia, and Brazil.

The team had been using containers for years, and we offer Red Hat OpenShift-based platforms. For this project, we chose to manage as few services as possible and focus on our added value: since we have a Google partnership, we chose their managed GKE offer.

We already had experience in cluster management with OpenShift, but it was based on CI/CD pipelines, so we wanted to build on that experience to manage our GKE clusters.

Our story

We started our Kubernetes experience with 2 clusters, based on OpenShift. Our first instinct was to put everything in Git.

Everything went in: our YAML manifests, our application deployments, and our scripts. It worked well: when we needed to make a change, we started from our Git repository and applied our scripts and manifests. Cool, but we performed a lot of manual actions.

We quickly grew to 6 clusters. At that point, performing manual actions was no longer possible, so we chose to create CI/CD pipelines with GitLab CI to deploy our manifests and operate our clusters. The problem is that despite the pipelines, we still performed manual operations on the clusters. And when manual manipulations are made on the clusters, despite CD pipelines, you end up with an inconsistency between what is versioned in Git and what is really deployed and running in the cluster. A deployment job from that era could look like the sketch below.
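
As an illustration, here is a minimal sketch of such a push-mode job (the job name and image are hypothetical, not our actual pipeline):

```yaml
# Hypothetical GitLab CI job: the pipeline "pushes" the manifests
# to the cluster with kubectl on every change merged to master.
deploy-manifests:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl apply -f manifests/
  only:
    - master
```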

The certainty we had gained through this experience is that everything had to be in Git, and that Git had to be the central point.

Then, in September 2020, we started thinking about GKE to meet the requirements of the European platform, and we knew what we wanted:

  • no more manual deployments
  • no more manual actions on our clusters (except for debugging needs)
  • resilience of our applications
  • self-healing of our applications
  • immutable deployments

It’s for these reasons and needs that we chose to move to GitOps.

GitOps?

Disclaimer: GitOps is not a magic wand ;).

The term GitOps first appeared at the KubeCon conference in 2017, coined by Alexis Richardson, the CEO of Weaveworks, who explained that GitOps is not about “pushing containers” but “pushing code”.

GitOps is based on 4 fundamentals:

  • Declarative
  • Versioned configuration in Git
  • Pull Requests
  • Automation

With GitOps you can say goodbye to executing kubectl commands and scripts in your clusters. Everything is in your Git repository. Git becomes your only source of truth.

When we merge a Pull Request (or a Merge Request), a deployment is made in a cluster. The advantage is that each change is verifiable and observable.

But to validate compliance between what runs on the cluster and what is in Git, Git alone is not enough. The solution is to use a “controller”.

If you are interested in GitOps, we recommend the “Guide to GitOps” written by Weaveworks.

GitOps implementations

With GitOps two approaches exist: Pull or Push.

(Diagram: the push and pull GitOps approaches; author: SudoNull)

The more traditional “push” strategy consists of using your software forge’s pipelines to “push” to the cluster the changes you just made in Git.

Tools using this push mode:

Jenkins X, GitLab CI/CD, GitHub Actions

The “pull” strategy is to let the infrastructure manage itself through an operator that will “pull” (synchronize) the Git repository and apply it automatically. In particular, this makes it possible to do without the forge for the CD part.

There are 3 main technologies for pull mode: Anthos Config Management (ACM), Weaveworks Flux, and Argo CD.

Pull strategy

At Adeo, we chose the pull mode because it is more suitable for Kubernetes and above all it is self-managed.

There are two types of controllers for the Pull strategy:

  1. In-cluster controller
  • Tools: Flux, ACM
  • One controller per cluster
  • One Git repository per instance
  • Cons: more complicated maintenance and updates

  2. External controller
  • Tools: Argo CD
  • Multi-cluster
  • Pros: centralized management (which allows deployment on several types of clusters)

Argo CD: our choice for GitOps!

Why?

  • As we said, it supports multiple clusters and Git repositories
  • Provides a UI.
  • API / CLI is intended to be invoked from pipelines
  • Designed for multi-tenant access / control (SSO, OIDC bindings, RBAC, projects)
  • Its CRDs (Custom Resource Definitions) operate at a finer-grained level
  • Flexible templating: Kustomize, Helm, vanilla Kubernetes YAML manifests, Ksonnet, Jsonnet, Replicated Ship
  • Automatic or manual synchronization

How?

We deployed Argo CD in one cluster, dedicated to replicating our configuration across all of our clusters.
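
To give an idea, here is a minimal sketch of an Argo CD Application (the repository URL, names, and paths are illustrative, not our real setup): Argo CD pulls the Git repository and keeps the target cluster in sync with it.

```yaml
# Illustrative Argo CD Application: keep the "cluster-a" manifests
# from a Git repository in sync with a target cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-cluster-a
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/gitops-repo.git
    targetRevision: master
    path: clusters/cluster-a
  destination:
    server: https://cluster-a.example.com   # any cluster registered in Argo CD
    namespace: platform
  syncPolicy:
    automated:
      prune: true     # remove resources that were deleted from Git
      selfHeal: true  # revert manual changes made directly on the cluster
```

With selfHeal enabled, any manual change made directly on the cluster is reverted to what Git declares, which is exactly the guarantee we were looking for.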

How does Argo CD work?

That’s all? No!

OK, it’s cool: we versioned everything in Git because it’s the source of truth, and everything is deployed through GitOps, but… something is still missing in our stack…

When we started doing GitOps, the first problem we encountered was the huge number of manifest files.

One manifest per object per cluster is not manageable. If we change the version/tag of the Docker image or the number of replicas (Pods/workloads), we have to do it in each manifest of each cluster… No, that’s not a sustainable solution over time.

So we decided to look at another new tool: Kustomize!

Kustomize

Kustomize has been built into kubectl since Kubernetes 1.14 and is declarative, like Kubernetes.

In order to understand Kustomize:

The aim is to add modification layers on top of a common base in order to add the functionalities we want.

It works like Docker or Git: each layer represents “an intermediate system state”.

Each YAML file remains valid/usable outside of Kustomize.

Kustomize allows us to customize the configuration of our deployments while keeping a common base, and to generate a manifest with just the specificities of each cluster.

Concretely, we have a common base, which is the deployment of our application, and then we have one file per cluster with just its specificity: an IngressRoute, for example, or a number of replicas. The sketch below illustrates this layout.
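
Here is a minimal sketch of such a layout (file names and values are illustrative):

```yaml
# base/kustomization.yaml: the common base shared by all clusters
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/cluster-a/kustomization.yaml: only cluster-a's specificities
resources:
  - ../../base
patchesStrategicMerge:
  - replicas-patch.yaml
---
# overlays/cluster-a/replicas-patch.yaml: override only the replica count
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
```

Rendering the manifests for a given cluster is then just `kubectl kustomize overlays/cluster-a` (or `kustomize build`).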

Let’s do this!

Perfect: we chose awesome tools, we will do GitOps, it will solve all of our problems, and everything will work perfectly… or not…

Here is an overview of the challenges we faced during our migration to GitOps and of the components we manage:

CMDB Operator

We developed and deployed a Kubernetes operator that feeds a CMDB (configuration management database) in order to do usage-based billing. We need it because we charge our users for their use of the cluster according to the CPU load.

Ingress Controller

After a phase of studying the performance of the various existing Ingress Controllers (Istio, Gloo, Traefik, HAProxy…), we chose TraefikEE (the Enterprise Edition of Traefik) because it fits our needs and responded in less than 20 ms at 100,000 requests per second in HTTPS. It also handles Let’s Encrypt certificates.

Security and control

We are on shared Kubernetes clusters, so we have components for security and control. We developed a Kubernetes operator that applies a NetworkPolicy at the creation of each namespace in order to isolate them at the network level (see the sketch below). We also created PSPs (Pod Security Policies), which prevent privilege escalations, and an operator allowing us to manage secrets in GitOps.
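
For illustration, a policy in the spirit of what such an operator creates could look like this (a sketch; the namespace name and exact rules are hypothetical):

```yaml
# Hypothetical NetworkPolicy created by the operator in each new
# namespace: only Pods of the same namespace can talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-namespace
  namespace: team-a
spec:
  podSelector: {}        # applies to every Pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # allow traffic only from within the namespace
```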

Google Config connector

We use Config Connector (a Kubernetes add-on) that allows us to manage Google Cloud resources through Kubernetes, mainly IAM resources, in order to give our users access to our clusters.
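
As a hedged sketch (the user, role, and project are illustrative), an IAM binding managed this way could look like:

```yaml
# Hypothetical Config Connector resource: granting a user a GKE viewer
# role on a project, declared as a Kubernetes object and synced via GitOps.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: jane-container-viewer
spec:
  member: user:jane.doe@example.com
  role: roles/container.viewer
  resourceRef:
    apiVersion: resourcemanager.cnrm.cloud.google.com/v1beta1
    kind: Project
    external: projects/my-gcp-project
```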

Test applications

We also created and deployed test applications that run end-to-end tests every 5 minutes, in order to guarantee our SLOs on our clusters (see the sketch after the list below).

These tests included:

  • namespace creation
  • deployment creation
  • public and private exposed routes for our applications
  • certificates validity check
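
A minimal sketch of such a test runner (the image name and arguments are hypothetical; on clusters older than Kubernetes 1.21, the apiVersion would be batch/v1beta1):

```yaml
# Hypothetical CronJob running our end-to-end checks every 5 minutes.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: e2e-tests
  namespace: platform-tests
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid   # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: e2e
              image: registry.example.com/platform/e2e-tests:1.0.0
              args:
                - "--checks=namespace,deployment,routes,certificates"
```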

Logs

At Adeo, each team manages a product, and we delegate the management of logs, monitoring, etc. to third-party teams via GitOps. So we create namespaces, ServiceAccounts, Roles, and RoleBindings for those third-party teams through GitOps.
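
For illustration, the kind of objects we commit for a third-party team could look like this sketch (names and permissions are hypothetical):

```yaml
# Hypothetical RBAC objects committed to Git for a third-party
# logging team that needs read access to Pod logs in our namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: logging-agent
  namespace: my-product
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: my-product
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: logging-agent-log-reader
  namespace: my-product
subjects:
  - kind: ServiceAccount
    name: logging-agent
    namespace: my-product
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader
```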

Houston, we have problems!

Yes, the theory always seems easy, but in the real world there is always trouble. To do all the things we’ve talked about, we ran into difficulties, problems, and questions:

Versioning in GitOps

The first problem was how to test, how to deliver a change every week, and of course, how to roll back in one click.

We have implemented a Release Management process:

  1. Each User Story (US) is developed in a specific branch, then tested by another member of the team.
  2. At the beginning of every week, we deliver the validated US on master, then we create a GitHub release with an immutable tag.
  3. We qualify this new version on our staging environments, then we deliver it in production clusters as a second step.
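
With Argo CD, one way to make this concrete is to point an Application at the immutable tag rather than at master, so that promoting or rolling back a release is a one-line change in Git (a sketch; the names and tag are illustrative):

```yaml
# Hypothetical production Application pinned to an immutable release tag.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/gitops-repo.git
    targetRevision: v1.4.2   # rollback = change this line and merge
    path: clusters/prod-cluster-a
  destination:
    server: https://prod-cluster-a.example.com
    namespace: platform
```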

Secret management

Another problem was the management of Kubernetes Secrets in a secure way. We use GitHub in inner source, and our repositories are therefore visible to everyone, so committing Kubernetes Secrets publicly in Git repositories was out of the question. A Secret in a Kubernetes cluster is encoded in base64 but not encrypted!

Secrets are cool but what about encrypted secrets with kubeseal?

The goal of kubeseal, a Bitnami tool, is to encrypt your Kubernetes Secret into a SealedSecret. Thanks to that, it is safe to store in public Git repositories.

The SealedSecret can be decrypted only by the controller running in the target cluster and nobody else.

A sealed-secret-controller runs in the Kubernetes cluster. It watches for new SealedSecret objects, unseals them (thanks to its known certificates), and creates a Kubernetes Secret in the same namespace as the SealedSecret.
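
Here is a sketch of the workflow (names and the encrypted value are illustrative): you seal a plain Secret locally with the kubeseal CLI and commit only the resulting SealedSecret.

```yaml
# Input: a plain Kubernetes Secret, which is never committed to Git.
# Sealed locally with: kubeseal --format yaml < secret.yaml > sealed-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: my-product
stringData:
  password: s3cr3t
---
# Output: the SealedSecret, safe to commit to a public repository;
# only the controller in the target cluster can decrypt it.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: my-product
spec:
  encryptedData:
    password: AgBy3i4OJSWK...   # truncated, illustrative ciphertext
```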

TraefikEE installation & configuration

Another difficulty we encountered was applying the Traefik configuration. The configuration could only be applied via the CLI (command line), through the teectl binary. And as we saw, with GitOps we can’t deploy things through scripts or CLI tools.

So we collaborated with the Traefik team, and as a workaround, we applied the Traefik configuration via a Python script.

It worked, but it was “just” a workaround. Fortunately, since version 2.3 of TraefikEE, the configuration can be applied via a Kubernetes ConfigMap, which is a better solution :-).
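
As a purely illustrative sketch (the exact keys and the way TraefikEE mounts the ConfigMap depend on your version; check the Traefik documentation), a static configuration carried by a ConfigMap could look like:

```yaml
# Illustrative ConfigMap carrying a Traefik static configuration,
# which can now be versioned in Git and deployed through GitOps.
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefikee-static-config
  namespace: traefikee
data:
  static.yaml: |
    entryPoints:
      websecure:
        address: ":443"
    providers:
      kubernetesCRD: {}
```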

Conclusion

As you can see, GitOps is a very interesting topic. Our journey was not easy, but we’re very happy to have made this transition, and we continue trying to improve it every day.

Aurélie Vache, ADEO Tech Blog

Cloud Developer · Google Developer Expert on Cloud · CKAD · Duchess France, Toulouse Data Science, DevFest Pitchouns Leader · writer, speaker & sketchnoter