How we scaled our staging deployments with ArgoCD

Victor Boissiere · The Qonto Way · Dec 15, 2020

How do we deploy a full environment composed of ~100 containers in around 3 minutes?

We adopted Kubernetes two years ago. It allows us to deploy quickly, scale easily, improve our fault tolerance, and lower our infrastructure maintenance costs. We’ve been leveraging this level of infrastructure abstraction across all of our workloads, including our internal tools. To help our engineers ship quality software fast, we have to structure our deployments. Today, we will show how ArgoCD makes this possible!

Feature branch environments

At Qonto, we have two main environments:

  • production: our main production cluster, used by our beloved customers
  • staging: a cluster consisting of feature branch environments (acting as QA/development environments) and a static environment used as the last environment before production (nearly always synced with production versions)

Feature branch environments are on-demand platforms used by developers to test their code in almost real time. Every Git branch has its own associated environment, which can be used to debug, test, reproduce bugs, showcase new features, or even load test some scenarios. It also acts as a gateway to share knowledge and gather feedback on ongoing development work before it reaches production.

This model has been in place since the early days of Qonto. However, as the developer team grew and the number of microservices drastically increased, the original pipeline model did not scale: a deployment took up to 30 minutes, which reduced teams’ productivity.

Our previous deployment model

We use Helm as an abstraction layer to template our Kubernetes YAML resource specifications. The previous model relied on keeping all of our Helm resources in a centralized configuration repository and deploying them manually through Jenkins.

It worked for a few microservices, but as their number increased, so did our codebase complexity. Some services were dependent on each other: for instance, service A needed service B to be available to successfully start. As such, the pipeline had to be divided into synchronous stages, leading to increased deployment time.

Old manual Jenkins pipeline to deploy a feature branch

Introducing GitOps

We started searching for the right tool. As our full configuration was already stored in Git and pull requests were already part of our workflow, linking our deployment model to our code was a natural continuation. GitOps seemed to be exactly the path we wanted to take.

What is GitOps?

  • Git as a single source of truth of a system
  • Git as a single place where we operate (create, change, destroy)
  • All changes are observable/verifiable

In other words, GitOps is a model in which deployments and Git are tightly tied together, so that Git becomes the main gateway to operate platforms.

For Kubernetes, we found two GitOps operators: Flux and ArgoCD. After comparing the two, we chose ArgoCD for two main reasons:

  1. its web interface, which allows us to debug and see our deployments in real time
  2. its hierarchy representation, with applications associated with projects

Adopting ArgoCD

To migrate our model to GitOps and reduce our deployment time, some preparatory work needed to be done first:

  • Independent: a configuration change in one application should not have any side effects on other applications
  • Time: each application should be deployable in parallel, which means removing dependencies between services
  • Infrastructure aware: each application should manage its own infrastructure needs: PostgreSQL database, SQS queue, Kafka topics, etc.
  • Data ready: the environment should be bootstrapped with data to make it usable for internal users (fake accounts, transactions, bills, etc.)

Changing the project structure

To ensure configuration changes are independent from one service to another, we moved each service’s configuration into its own application Git repository, under a dedicated charts folder containing the Helm charts used to deploy the service. This gives engineers more ownership of their applications in production, as the configuration is now directly tied to the source code (instead of living in a central repository owned by the operations team).

Even if this duplicates some shared configuration, it ensures that only service-specific configuration is injected into each microservice and prevents unwanted settings from being mistakenly applied to other services. Previously, changing a shared configuration or default value could impact many services at the same time, which is an anti-pattern in a microservice architecture.
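
To make this concrete, a service repository could be laid out as in the hypothetical example below. The charts folder and the default.yaml / staging.yaml value files match the ArgoCD application specification shown later; the rest of the tree is purely illustrative.

qonto-pdf/
├── app/                  # application source code
├── charts/
│   ├── Chart.yaml        # Helm chart metadata
│   ├── templates/        # Kubernetes resource templates
│   │   ├── deployment.yaml
│   │   └── service.yaml
│   ├── default.yaml      # base Helm values
│   └── staging.yaml      # staging-specific overrides
└── .gitlab-ci.yml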

Managing the infrastructure resources

Instead of hardcoding this logic in pipeline code, we use Helm hooks to bootstrap the infrastructure services a microservice needs to run, such as PostgreSQL databases or SQS queues. Hooks are executed before a service deployment and automatically destroyed by Kubernetes. They consist of Python and Bash scripts, with their associated encrypted permissions managed directly by Helm, allowing us to use a single tool and simplify the deployment logic. In a future iteration, we could replace Helm hook jobs entirely with a Kubernetes operator and CRDs to make the creation process easier and more abstract. Third-party tools such as the Pulumi Kubernetes operator already exist for this.

Helm deployment using init hooks
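
To give an idea of what such a hook looks like, here is a minimal sketch of a pre-install job that would provision a PostgreSQL database. The image name and script are hypothetical stand-ins for our internal Python and Bash scripts; the helm.sh/hook annotations are standard Helm.

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-create-database
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": hook-succeeded   # Kubernetes removes the job once it succeeds
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: create-database
          image: infra-bootstrap:latest            # hypothetical image containing our bootstrap scripts
          command: ["./create_postgresql_database.sh", "{{ .Release.Name }}"]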

Making the environment data ready

Applications use seeds to hydrate their databases with data, so that the environment can be used by anyone as a production-like environment: developers, product teams, sales teams, etc.

In our previous model, some applications relied on API calls to bootstrap data. This increased exactly the kind of dependencies we needed to remove. We began by configuring a brand new environment with no data and adding data piece by piece. We then exported it as individual SQL dumps stored in an S3 bucket. We reused the same strategy as our infrastructure Helm hooks for seeds, with new Kubernetes jobs that restore the SQL dumps. It is fast, it meets our goal, and we can regenerate the dumps anytime we want.
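
A seed hook following the same pattern could look like the sketch below; the bucket name, image, and exact pg_restore invocation are assumptions, not our actual scripts.

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-restore-seeds
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "1"                     # runs after the infrastructure hooks (lower weights run first)
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore-seeds
          image: seed-restore:latest               # hypothetical image with awscli and postgresql-client
          command:
            - /bin/sh
            - -c
            - |
              # DATABASE_URL is assumed to be injected from the service's secrets
              aws s3 cp s3://qonto-seeds/{{ .Chart.Name }}.dump /tmp/seed.dump
              pg_restore --clean --no-owner -d "$DATABASE_URL" /tmp/seed.dump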

Assembling the pieces together

Now that we’ve moved all of the service deployment logic into Helm, we can deploy our services using ArgoCD. ArgoCD deployments are managed through custom resource definitions, or CRDs.

Each application is a set of resources, in our case managed by a Helm chart. Each application is then associated with an ArgoCD project attached to a Kubernetes namespace. Below is a small example of what an application specification for Helm looks like.

project: qonto-pdf
source:
  repoURL: 'https://gitlab.com/qonto/qonto-pdf.git'
  path: charts
  targetRevision: master
  helm:
    valueFiles:
      - default.yaml
      - staging.yaml
    parameters:
      - name: global.image.tag
        value: $ARGOCD_APP_REVISION
    releaseName: service-a-master
destination:
  server: 'https://kubernetes.default.svc'
  namespace: default
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - Validate=true
revisionHistoryLimit: 1

When a new branch is pushed to GitLab, the pipeline automatically starts: it configures the new Kubernetes namespace, creates the ArgoCD project, runs tests, builds the application, and deploys all services using ArgoCD CRDs. Below is a small diagram of the pipeline.

The pipeline is simple enough: a few kubectl commands run on our CI, and the rest is managed directly by Helm through ArgoCD. Everything is deployed in parallel and available in a few minutes.
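
As an illustration, the deploy job of such a pipeline could be as small as the hedged sketch below; the job name, manifest paths, and project layout are hypothetical, not our exact pipeline.

deploy:
  stage: deploy
  script:
    # Create the namespace for this branch if it does not exist yet
    - kubectl create namespace "$CI_COMMIT_REF_SLUG" --dry-run=client -o yaml | kubectl apply -f -
    # Create the ArgoCD project, then register the Application CRDs;
    # ArgoCD takes over and syncs the Helm charts automatically
    - kubectl apply -f argocd/project.yaml
    - kubectl apply -f argocd/applications/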

With this new GitOps pipeline, deploying a new environment is as simple as the commands below.

git clone https://gitlab.com/qonto/qonto-pdf.git
git checkout -b [environment_name]
git push
sleep 180 # Wait 3 minutes
open https://environment-url

On the web interface, ArgoCD displays all applications, which we can filter by project name (the environment name).

And we can quickly see the Kubernetes resources in a dedicated application view (deployments and animations happen in real time).

ArgoCD performs up to around 4,000 deployments a day across about 45 feature branch environments.

Observability

ArgoCD exposes Prometheus metrics. We configured Prometheus to scrape the /metrics endpoint and added the associated Grafana dashboard. This allows us to monitor our deployments and the overall health of the environments, to make sure CD operations are working as they are supposed to. We also watch the total number of feature branch environments in order to investigate possible delays or performance issues.

Grafana dashboard for ArgoCD
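
As an example, a minimal static scrape configuration targeting ArgoCD's default argocd-metrics service (the application controller metrics) could look like this; a production setup would more likely rely on Kubernetes service discovery.

scrape_configs:
  - job_name: argocd-application-controller
    metrics_path: /metrics
    static_configs:
      - targets:
          # default metrics service exposed by the ArgoCD application controller
          - argocd-metrics.argocd.svc.cluster.local:8082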

Cleanup

Each Git branch has its own associated environment. Once a Git branch is no longer used (no activity, merged, or deleted), its environment needs to be deleted. We created a custom controller, triggered every 10 minutes, which uses the ArgoCD and GitLab APIs to decide whether each environment is still in use. This keeps our costs low while still giving us the ability to create many new environments.
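
A hedged sketch of how such a cleanup controller can be scheduled is shown below; the image name and flags are hypothetical, and the actual deletion logic (calling the ArgoCD and GitLab APIs) lives inside the controller itself.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: feature-branch-cleaner
  namespace: argocd
spec:
  schedule: "*/10 * * * *"         # run every 10 minutes
  concurrencyPolicy: Forbid        # never run two cleanups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleaner
              image: feature-branch-cleaner:latest   # hypothetical controller image
              args:
                - --argocd-url=https://argocd.internal      # illustrative endpoints
                - --gitlab-group=qonto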

Performance

We had some issues with ArgoCD performance. When we first deployed it, it could not handle more than 1,000 applications. This is mainly due to the reconciliation loop and the lack of horizontal scalability. Small iterative releases helped fix issues, but we are waiting for the v1.8 release, which will hopefully bring drastic performance improvements on instances with a large number of applications.

The internal ArgoCD controller work queue also overflowed with infinitely looping tasks when orphaned resources monitoring was enabled on each project. This feature adds warnings on the application page if a Kubernetes resource is present in a namespace but not managed by ArgoCD. To fix the performance issues, we decided to drop it. Even if it is a nice-to-have, we can live without it for now.
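
For reference, orphaned resources monitoring is enabled per project through the AppProject specification; disabling it amounts to dropping the block below from the project definition (a trimmed, illustrative snippet).

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: feature-branch-example     # illustrative project name
  namespace: argocd
spec:
  # removing this block disables orphaned-resource warnings for the project
  orphanedResources:
    warn: true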

Conclusion

Adopting Kubernetes was the first step toward managing our platforms in a more abstract way, allowing us to focus more on the what rather than the how. However, with ArgoCD, we took it a bit further by managing all workloads with GitOps, from our own internal tools to the staging, feature branch, and production environments.

The community is large, and with Argo Workflows and Argo Events, two other dedicated tools from the Argo project for managing more complex workflows and environments, our journey into operating platforms with GitOps has only just begun. We will also keep watching for new improvements and contributing to the project.

About Qonto
Qonto is a neobank for SMEs and freelancers founded in 2016 by Steve Anavi and Alexandre Prot. Since its launch in July 2017, Qonto has made business banking easy for more than 120,000 companies. Thanks to Qonto, business owners save time (streamlined account opening, a day-to-day user experience with unlimited history, accounting exports, and an expense management feature), have more control while giving their teams autonomy (real-time notifications, user rights management system), and have improved visibility on cash flows (smart dashboards, transaction auto-tagging, and cash-flow monitoring). They also enjoy stellar customer support, at a fair and transparent price.

Interested in joining a challenging and game-changing company? Check out our job offers!
