How Qonto moved to Kubernetes

This post is about Qonto’s journey migrating our infrastructure from Docker Swarm to Kubernetes, in order to ensure the scalability and security of a service used by tens of thousands of daily users.

When we started working on Qonto in 2016, Kubernetes was in its infancy and nobody really anticipated its lightning success. At that time, we chose to kickstart our infrastructure with the then-leading container deployment solution, Docker Swarm. It worked very well at the beginning, but in less than 18 months Qonto went from 0 to 40,000 clients and now deals with hundreds of thousands of requests a day. Facing this exponential growth, we started to experience some of the drawbacks of this choice.

For instance, Docker Swarm has a very limited permissions system, so only a few engineers were allowed to connect to production systems. Like many Docker Swarm deployments, most of our configuration was not encrypted, since there is no built-in system to manage secrets (service credentials, API keys, database access, …), and as a bank this was clearly a limiting factor for us. Finally, there were no out-of-the-box scaling features. These are some of the reasons why we decided to migrate to Kubernetes.

The Kubernetes migration was a long trip, and our learning curve toward building a high-availability infrastructure was full of pitfalls and challenges. Here is our story.


Step #1: Face and tackle the technical challenges

As you may expect, we faced various technical challenges when we started to dig into this migration work. Here are some of the biggest challenges and questions we had to overcome to reach our goal.

How do we deploy a Kubernetes cluster?

The first step was to choose a Kubernetes deployment tool. Maintaining a Kubernetes cluster can be time-consuming and requires dedicated engineers if you do it the hard way.

Qonto’s infrastructure is hosted in AWS European regions. When we started the Kubernetes migration project, AWS EKS was not available in European regions, and migrating to another cloud provider would have been a huge challenge. After benchmarking several deployment tools (Kops, Rancher, …), we selected Kubespray.

Kubespray is a collection of Ansible playbooks that deploy and configure all Kubernetes components (including the etcd cluster). We already used Ansible to manage our AWS infrastructure and systems, so the team was comfortable with it.
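As a sketch, a Kubespray deployment boils down to describing the machines in an Ansible inventory and running the cluster.yml playbook. The host names below are purely illustrative, not our actual topology:

```yaml
# inventory/qonto/hosts.yaml (hypothetical) -- deployed with:
#   ansible-playbook -i inventory/qonto/hosts.yaml cluster.yml
all:
  children:
    kube-master:        # nodes running the Kubernetes control plane
      hosts:
        master-1:
        master-2:
        master-3:
    etcd:               # etcd members (co-located with masters here)
      hosts:
        master-1:
        master-2:
        master-3:
    kube-node:          # worker nodes
      hosts:
        node-1:
        node-2:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
```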

How to lower downtime during Kubernetes upgrade?

The Kubernetes cluster upgrade process is neither friendly nor graceful, which is why we chose to deploy our Kubernetes clusters with a blue/green deployment method. Instead of upgrading Kubernetes components in place (Kubernetes API, kubelet, etcd cluster, …), we deploy a brand-new cluster for each Kubernetes upgrade. This is largely automated, since we use Ansible to deploy on AWS EC2 infrastructure.

Kubernetes released 4 major versions while we were migrating

We recently switched from the “green” to the “blue” cluster during the Kubernetes 1.11 upgrade. It was convenient to be able to ensure that the service worked well before migrating our production traffic.

As an abstraction layer for our developers, we defined a DNS record that targets the current active cluster, so cluster migrations are fully transparent for our engineers and don’t require any changes to their configuration.
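In essence, the abstraction is a single alias record that only the DevOps team repoints during a blue/green switch. The names below are hypothetical, but the idea looks like this:

```text
; developers and CI always target the stable alias;
; only its target changes during a cluster switch
kubernetes.qonto.internal.  300  IN  CNAME  blue-cluster-api.eu-west-3.elb.amazonaws.com.
```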

How to keep settings and secrets secure?

We quickly chose Helm to deploy Kubernetes resources. For security purposes, we keep settings and secrets outside of the Helm charts, and we developed a simple tool to inject them into Helm during deployments.
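Our tool is internal, but the principle can be sketched with plain commands: decrypt the secrets at deploy time and hand them to Helm as an extra values file. The file names below are hypothetical:

```shell
# Hypothetical wrapper logic, not our actual tool: settings and
# secrets live outside the chart and are merged in at deploy time.
sops -d secrets/api.production.enc.yaml > /tmp/api.secrets.yaml
helm upgrade --install api ./charts/api \
  -f config/api.production.yaml \
  -f /tmp/api.secrets.yaml
rm /tmp/api.secrets.yaml
```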

One of our main concerns about the Kubernetes migration was encrypting application secrets end-to-end: Kubernetes has a built-in secret management feature, but it does not encrypt secrets at rest by default.

Kubernetes stores secrets in etcd, which is not encrypted by default

We use Mozilla SOPS to encrypt our secrets in Git repositories. It’s very helpful, since secret keys remain searchable in clear text while encryption and decryption operations are handled by AWS KMS.
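This is what a SOPS-encrypted settings file looks like in Git: the keys stay readable and diffable, and only the values are ciphertext. The contents below are illustrative placeholders:

```yaml
database_url: ENC[AES256_GCM,data:8dQp...,iv:Yx2b...,tag:kF3d...,type:str]
mail_api_key: ENC[AES256_GCM,data:Jm9w...,iv:P0aR...,tag:z7Lq...,type:str]
sops:
  kms:
  - arn: arn:aws:kms:eu-west-3:111111111111:key/00000000-0000-0000-0000-000000000000
```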

In our legacy deployment stack, developers were not able to see production settings at all. With this system, our developers can see and edit settings via Git pull requests and see secret keys, while only the DevOps team can change secret values. It is really helpful for enforcing security policies and role isolation at Qonto.

How do we process logs?

As developers already used Kibana to visualize their applications’ logs when running on Swarm, we decided not to change their habits and stuck with it. With this prerequisite, we needed to find software that could:

  • Communicate with our existing ELK stack
  • Be lightweight and run easily in a container, as it would be treated like any other piece of software
  • Be stoppable at any time without losing logs, as imposed by the container lifecycle

With that in mind, we decided to go with Filebeat from elastic.co, running as a DaemonSet in each Kubernetes cluster. Filebeat auto-discovers and reads the log files created by the Docker daemon and forwards them to Logstash, adding metadata about our cloud provider (AZ, instance ID) and Kubernetes (labels, namespace, etc.).

Final design of our “FELK” stack

For high availability and scaling purposes, we configured a load balancer in front of the Logstash instances, but since Filebeat establishes long-lived TCP connections, each Logstash restart led to all connections being reshuffled onto a single instance.

To mitigate this, we enabled Filebeat’s TTL option to re-establish connections every five minutes.
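The relevant filebeat.yml fragment is small; the host name below is hypothetical. Note that the Logstash output only honors `ttl` when pipelining is disabled:

```yaml
output.logstash:
  hosts: ["logstash.internal.example:5044"]
  ttl: 5m          # tear down and rebuild the connection every 5 minutes
  pipelining: 0    # ttl is not supported on async (pipelined) clients
```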

How to manage permissions at scale?

We enabled Kubernetes Role-Based Access Control (RBAC) from day one to enforce our permissions policy. There is no built-in authentication system in Kubernetes, only mechanisms to plug in external ones (e.g. static files, X.509 certificates, OpenID Connect); we chose OpenID Connect, integrated with our Google Apps account.

We use k8s-oidc-helper to generate OpenID Connect tokens transparently for kubectl commands. Obviously, our Kubernetes clusters can only be reached from our VPN.

We defined a ClusterRole with specific permissions for each team. This is a major enhancement over our legacy setup, where only a few people were authorized to connect to the Swarm clusters. For instance, our product managers can now execute some Ruby commands on staging environments to generate specific cases for their QA.
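As an illustration of the product-manager case, a role like the following would allow exec access to staging pods and bind it to a group coming from the OIDC tokens. All names here are hypothetical, not our actual roles:

```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: staging-exec
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
---
# A RoleBinding scopes the ClusterRole to the staging namespace only
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: product-staging-exec
  namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: staging-exec
subjects:
- kind: Group
  name: product@example.com   # group claim from the OIDC token
  apiGroup: rbac.authorization.k8s.io
```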

Last but not least: how do we manage Kubernetes monitoring?

We chose the Prometheus Operator as our main Kubernetes monitoring solution, since it deploys a full Prometheus environment (Prometheus, Alertmanager, exporters and Grafana) with many relevant built-in alerts in each cluster.

The provided Grafana dashboards helped us troubleshoot our early deployments and size our cluster capacity. There is a straightforward dashboard comparing the sum of all CPU/memory container requests with the maximum capacity of the cluster.
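Under the hood these panels are simple ratios over kube-state-metrics series. As a sketch, they can be captured as Prometheus recording rules (the metric names below are those of kube-state-metrics at the time; later versions renamed them):

```yaml
groups:
- name: cluster-capacity
  rules:
  # fraction of allocatable CPU already reserved by container requests
  - record: cluster:cpu_requests:ratio
    expr: sum(kube_pod_container_resource_requests_cpu_cores)
          / sum(kube_node_status_allocatable_cpu_cores)
  # same ratio for memory
  - record: cluster:memory_requests:ratio
    expr: sum(kube_pod_container_resource_requests_memory_bytes)
          / sum(kube_node_status_allocatable_memory_bytes)
```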

Having one Prometheus per cluster differs from our original monitoring architecture, where we had a single Prometheus server hosted on an EC2 instance managed by Ansible. It also forced the team to rethink the way it manages alerts, using the new Custom Resources provided by the operator. We are now considering long-term metrics retention solutions.


Step #2: Let’s start the migration process

One of the main expectations for the migration was to switch to Kubernetes seamlessly for the clients of the DevOps team: Qonto users AND Qonto developers. That’s why the migration itself took several months to complete, even though all the technical points had already been tackled.

The first step of the migration was to adapt Qbot, our release Slack bot, so it could put new versions of Qonto applications into production on Kubernetes, the same way we did on Swarm.
Since Qbot launches Jenkins jobs, all we needed to do was rename the jobs, so developers kept the same interface.

Qbot deployment Slack message

Once done, we migrated the staging cluster. This allowed us to spot and fix the first problems with our Kubernetes setup (more information about how we deal with problem fixing can be found in this article). For three months, we froze modifications to the cluster to make sure we had not missed any underlying long-term problems before moving on to the production migration.

At first we only deployed stateless microservices, not the cronjobs, since their code was not idempotent (we could not have the same cronjob running twice at the same time).
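Kubernetes does offer a guard for the double-run case we feared: a CronJob can declare `concurrencyPolicy: Forbid`, so a new run is skipped while the previous one is still going. The manifest below is a sketch with hypothetical names (batch/v1beta1 was the CronJob API version at the time):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical job name
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid     # never run two instances concurrently
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: example.ecr.amazonaws.com/worker:latest
```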

The cluster switch itself was straightforward. Each application communicates over HTTP, so all we needed to do was change the application’s CNAME record in Route 53 to point to the new load balancer, and traffic shifted! To be sure we did not miss anything, we ran QA scripts.

The final step was to deploy workers, cronjobs and Qbot itself!


Step #3: Fix when things go wrong!

Even if the migration was overall a huge success, we had some failures along the way.

DevOps = Dev + Ops

The first failure is not technical but reflects the split between developers and infrastructure teams. Even though Kubernetes is a service we provide to the developer teams, the migration work had to be done by members of the DevOps team.

This led to frustration on our side when we needed to be sure that apps were running correctly but did not have end-to-end tests. We ended up developing Postman scripts for each application to ensure all services were fully functional.

To avoid concentrating all this knowledge in one team, we schedule tech exchange meetings and workshops to introduce new changes.

Issues with deployment history

Since we use the “feature branch” development method and deploy a full staging environment (one instance of each component) per development branch, our staging cluster uses more resources than our production cluster!

We had several issues with the Kubernetes API related to these massive, concurrent deployments.

The Kubernetes API server was restarting frequently because it reached its memory limit (OOM kill). We tried to figure out whether this was related to the number of pods or namespaces, but that was not conclusive. We also tried raising the API server pod’s memory limit and increasing the control plane’s memory, but it didn’t change the situation.

We finally found out that Helm keeps a history of each deployment for rollback purposes, and that Kubespray sets Helm’s max history to “unlimited” by default. This left us with hundreds of revisions after just a few days, all of which the Kubernetes API server keeps cached in memory. We fixed the problem by changing the tiller_max_history parameter from “-1” (unlimited) to “3”.
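The fix itself is a one-line Kubespray variable override, for instance in the cluster’s group_vars (path below is illustrative):

```yaml
# inventory/qonto/group_vars/k8s-cluster.yml
tiller_max_history: 3   # keep only the last 3 Helm revisions per release
```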

When Docker sinks

We also had some issues with the Docker daemon itself on our journey to Kubernetes. We used Docker 17.03.2, which was the latest version officially supported by both Kubernetes and Kubespray at that time.

We randomly had deployment failures due to an odd Docker error in our staging environment:

Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "api-master-6d4899c47d-cnrrv": Error response from daemon: grpc: the connection is unavailable

After a few attempts to reproduce this issue, we identified a bug fix in the Docker 17.06.02 release notes that could be related to our frequent container start and stop operations. Version 18.06.1 was already out and fixed many issues, but was not officially supported by Kubernetes at the time. We chose to bypass the Kubernetes recommendations and use Docker 18.06.1 (this version was later validated in Kubernetes 1.12.0). This solved our problem immediately.

The point is that there are still bugs in the Docker engine and, just like Kubernetes, you should keep your Docker version up to date, even if it means trying a version that is not yet officially supported.

Pod scheduling issues

At one point during the migration, we discovered that no default anti-affinity policy is applied to containers.

This led to downtime on the registration service when one of the cluster nodes collapsed while every replica was located on it.

To mitigate this, we use a pod anti-affinity constraint, so that two replicas of the same application cannot be co-located on the same host. It helps us for now, but we know it could cause problems in the future (e.g. when a service has more replicas than there are nodes in the cluster).

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - {{ $appName }}
      topologyKey: "kubernetes.io/hostname"

To stay on the scheduling topic: one thing Kubernetes looks at when scheduling a pod is the resources available on worker nodes. To help the scheduler, you can specify resource requests and limits for your pods in their manifest.

apiVersion: v1
kind: Pod
metadata:
  name: api
  namespace: master
spec:
  containers:
  - name: api-web
    image: 593856183654.dkr.ecr.eu-west-3.amazonaws.com/api:bf14bd3
    resources:
      requests:
        cpu: "1m"
      limits:
        cpu: "3"

Now, what’s next?

We finally migrated our last applications to Kubernetes in January. However, we’re still at the beginning of our journey with Kubernetes!

Now we can focus on Kubernetes’ advanced features. For instance, we are excited to work on pod autoscaling to handle our monthly invoice generation, and on pod security policies to enforce a high security standard for Qonto applications.
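As a sketch of where we are heading, a HorizontalPodAutoscaler for a worker handling invoice generation could look like this. The names are hypothetical, and autoscaling/v2beta1 was the current API version at the time:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: invoice-worker            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: invoice-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80   # scale out above 80% average CPU
```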

Do you like Kubernetes too? Have a look at https://qonto.eu/en/careers

Article co-written with Alexis Sellier, Senior DevOps Engineer at Qonto.