Our Kubernetes journey at TheFork

Quentin BERNARD
TheFork Engineering Blog
12 min read · Nov 30, 2021

We started our Kubernetes journey two years ago, when we decided to move our workloads from Chef and virtual machines to Docker and Kubernetes. This migration took quite some time (almost two years!) and was full of challenges and learnings.

Two years may seem crazy, but such a migration comes with many constraints: tooling (logging, monitoring, alerting, backup) and the deployment process (a generic chart, a way to store your release values), to name only a few.

Also, when you are migrating 150+ applications, 25+ databases, and some other components, you need to automate a few things, which takes time. This blog post tells our Kubernetes story: what we did and what we learned along the way, especially once we had to operate it in production.

Disclaimer: some parts are only relevant for EKS. You can read this article if you want to learn more about the setup of our AWS infrastructure.

Cluster setup

Before even deploying our first app in our Kubernetes cluster, we need to take care of a few things:

  • AWS RBAC: deployment of the aws-auth configmap. This critical file manages RBAC permissions for your cluster (EC2 nodes, IAM users, …). We template this file with the consul-template command and its -once option, so that its content matches each cluster (mainly to add some conditions depending on the environment). This file gives the ops team, our on-call people, and our EC2 instances the proper access to our Kubernetes API (a sketch of such a template is shown right after this list).
  • A service mesh (optional): On our side, we have a dedicated git repository that contains our Istio configuration. The Istio setup will not be detailed here because it deserves a dedicated blog post.
  • Persistent volume management: Dealing with persistent volumes can be time-consuming, especially when you have to deal with multiple Availability Zones. On our side, we manage persistent volumes and storage classes (when possible) with helm template and its --output-dir option to generate valid YAML files, which are then applied using kubectl.
  • A DNS cache mechanism (reference): By default, there is no DNS cache mechanism, so that is probably something you should look at to reduce the load on your core-dns infrastructure.
Number of DNS calls grouped by DNS provider: nodelocaldns (42K req/s) vs kube-dns (956 req/s)
  • AWS daemonset customisation: You probably want to set AWS_VPC_K8S_CNI_EXTERNALSNAT to true (we had DNS latency issues with this parameter set to false when our applications opened several database connections in parallel, see the github issue):
[bash] cat k8s/patch-aws-snat.yaml
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
              value: "true"
[bash] kubectl patch daemonset -n kube-system aws-node --patch "$(cat k8s/patch-aws-snat.yaml)"
  • Autoscaling: We mainly use cluster over-provisioning in combination with the cluster autoscaler to handle our workloads. It automatically scales the number of pods and EC2 instances based on our needs.
  • Logging + monitoring, which we will come back to in the next part.
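
As mentioned in the AWS RBAC bullet above, here is a minimal sketch of such a templated aws-auth configmap. The role and user ARNs, group names, and the environment condition are invented for the example; the real template contains more entries:

# aws-auth.yaml.tmpl, rendered with: consul-template -template "aws-auth.yaml.tmpl:aws-auth.yaml" -once
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/eks-node-role
      # the quoted placeholder is kept literal for the AWS IAM authenticator
      username: system:node:{{ "{{EC2PrivateDNSName}}" }}
      groups:
        - system:bootstrappers
        - system:nodes
{{- if eq (env "ENVIRONMENT") "production" }}
  mapUsers: |
    - userarn: arn:aws:iam::123456789012:user/oncall
      username: oncall
      groups:
        - system:masters
{{- end }}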

We use a Makefile to handle all these steps with different targets (our Terraform process is quite well automated, but for Kubernetes we chose to do things manually at the beginning, to get comfortable with it). It works pretty well when you only have a few clusters! But we are reaching some limitations: it can be scary for new joiners to play with, changes have to be deployed manually on each cluster, and it is hard to keep track of recent updates. We are now automating all these processes to ease cluster creation (Jenkins jobs + argocd coming to the party!).
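
To give an idea, a stripped-down sketch of such a Makefile could look like this. Target names, paths, and chart locations are invented for the example; the real one has more targets and per-cluster conditions:

# Makefile (stripped-down sketch)
ENVIRONMENT ?= staging

aws-auth:
	ENVIRONMENT=$(ENVIRONMENT) consul-template -template "aws-auth.yaml.tmpl:aws-auth.yaml" -once
	kubectl apply -f aws-auth.yaml

persistent-volumes:
	helm template persistent-volumes charts/persistent-volumes --output-dir build/
	kubectl apply -R -f build/persistent-volumes/

aws-snat:
	kubectl patch daemonset -n kube-system aws-node --patch "$$(cat k8s/patch-aws-snat.yaml)"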

From now on, we have a running Kubernetes cluster ready to welcome applications, so it is time to deploy a few things!

What is in the box?

The best way to learn Kubernetes is to use it and iterate step by step to add the missing pieces.

In our case, some tools were required even before we started to think about migrating our applications. We use helm to deploy everything, and you can find below a quick summary of what we did:

  • Vault, to handle our secrets. At that time, back in 2019, there was no native chart doing the whole setup, so we decided to invest in a specific chart with a custom setup: one helm release per Availability Zone (vault-za, vault-zb, and vault-zc), and the same for the backend storage (consul-za, consul-zb, consul-zc). This setup may look a bit complicated, but it helped us avoid any service disruption during upgrades and made such operations more comfortable. Operating statefulsets can be a bit scary, which explains this complex setup: if we break the upgrade of one helm release, it has no impact because the quorum is still respected. It is neither a standard nor a recommendation, but it can help at the beginning of your Kubernetes journey (our own services are part of another release).
Vault architecture
  • Jenkins, to manage both CI + CD. We tried to bring high availability to avoid any service disruption, but it is not built into Jenkins, and the plugins that bring such features are outdated. You can take a look at jenkins x if this matters to you, but in early 2019 the project was too young to be worth it for us. It had too many components to set up and some concepts that we did not like, for example using jx to create an EKS cluster, something we already do as code. Jenkins has its own dedicated Auto Scaling Group (with only one EC2 instance in it).
  • Grafana + Prometheus + alertmanager, for the monitoring part. We collect metrics from multiple sources (Kubernetes, application metrics, istio, and cloudwatch, to name only a few). On the architecture side, we have one Prometheus instance per cluster with its own Elastic Block Store. A promxy is set up in our tooling infrastructure so that we have only one Prometheus datasource in grafana and can filter by environment using a simple label. We did not invest in long-term storage or high availability because Thanos and the other projects bringing these features to Prometheus were a bit too young. Also, a few more gigabytes per Elastic Block Store cost far less than setting up and maintaining such solutions, so we accepted this compromise. Prometheus also has its dedicated ASG on our production, staging, and tools clusters. A dedicated Prometheus sits in our production and staging environments to handle istio telemetry since v2; it has no persistent volume, and our main Prometheus federates it (reference for such a setup). The last piece is a dedicated Prometheus that scrapes the others to notify us if one of our instances is down.
Prometheus architecture
  • Kibana + Logstash + Elasticsearch, for the logging stack. Filebeat is deployed as a daemonset to collect logs from every EC2 instance, while Logstash parses and enriches our logs before indexing them into Elasticsearch. We decided to keep our old indexing mechanism (Filebeat + Logstash + Elasticsearch + Kibana) to ease the transition, instead of relying on Fluentd. We also decided to create a new pattern for our indices:
CONTAINER_NAME.COMPONENT.ENVIRONMENT.DATE

This pattern allows us to index logs outside of the Kubernetes world (component means namespace for our Kubernetes applications). It also makes it easy to find our application logs because the container name matches the application name.
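
As an illustration, an Elasticsearch output producing such indices looks roughly like this in Logstash. The field names are assumptions for the example; in practice they come from the metadata added by Filebeat and our filters:

# logstash output (illustrative sketch)
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[container_name]}.%{[component]}.%{[environment]}.%{+YYYY.MM.dd}"
  }
}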

Logging architecture

With our tooling complete, we can now start looking at how to deploy our applications!

Umbrella, or something like that

In the beginning, we wrote a generic umbrella chart (umbrella definition, and also a worth-reading article about it) to handle both our dependencies and our deployments. It just did not work with our setup. We used to have the memcached and varnish charts as dependencies of this umbrella chart, and our deployment templates lived in its templates folder. It was just too hard to maintain.

An umbrella chart update can either be a sub-chart update (for example, bumping the varnish chart version) or a new feature of our templates (adding support for ClusterRole). You cannot update varnish without updating the umbrella chart as well. Let's take a concrete example: need to update our varnish chart? One Pull Request on the varnish chart repository, a tag to deploy it to our chartmuseum, then the same on the umbrella chart, and finally an update of the project that uses it. Too many steps for something that should be straightforward!
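
To make the coupling concrete, the dependency list of the old umbrella chart looked roughly like this (versions and the repository URL are invented, shown with Helm 3 syntax):

# Chart.yaml of the old umbrella chart (illustrative)
apiVersion: v2
name: umbrella
version: 1.4.0          # had to be bumped for every change below
dependencies:
  - name: varnish
    version: 0.3.2      # a simple varnish bump forces a new umbrella release
    repository: https://chartmuseum.example.com
  - name: memcached
    version: 0.2.1
    repository: https://chartmuseum.example.com

Any bump of varnish or memcached therefore required a new umbrella chart version, even though the umbrella templates themselves had not changed.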

We chose to apply a software development pattern (favor composition over inheritance) to our chart dependencies and removed them from our umbrella chart. This chart now only focuses on our template files (deployments, services, hpa, istio objects, some other Kubernetes components, a proper readme, and even a changelog!). Even if the name is no longer accurate, at least its responsibilities are.

To handle chart dependencies as well as our 150+ values files, we introduced a new component in our infrastructure that is powerful and amazing. Let me introduce you to helmfile!

Number of releases deployed using our generic chart in our staging cluster

Helmfile for the win

Dealing with multiple charts with a few values each is different from managing one generic chart, a lot of values, and a list of chart dependencies. Our generic makefiles were reaching their limit.

Before deploying our first application, we decided to invest in helmfile to handle the values of all our different applications. We structure our helmfile repository using the following conventions:

  • shared/common.yaml: contains our generic helmfile.yaml, which defines our repositories and our templates
  • projects/NAMESPACE/APP/helmfile.yaml: helmfile of the application, which defines the list of releases to deploy for this application (the umbrella one, plus varnish or memcached if needed)
  • projects/NAMESPACE/APP/APP-values.yaml: default values for the application.
  • projects/NAMESPACE/APP/APP-{staging,production}.yaml: to handle environment-specific values
  • projects/NAMESPACE/APP/varnish-{staging,production}.yaml: to handle environment-specific values for varnish (replace varnish with memcached if you need memcached)
  • projects/NAMESPACE/APP/varnish-values.yaml: default values for the varnish dedicated to this application (replace varnish with memcached if you need to configure memcached)
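
Following these conventions, the layout for an application like thefork-api (detailed just below) looks like this:

projects/
  thefork-api/                  # namespace
    thefork-api/                # application
      helmfile.yaml
      thefork-api-values.yaml
      thefork-api-staging.yaml
      thefork-api-production.yaml
      varnish-values.yaml
      varnish-staging.yaml
      varnish-production.yaml
      memcached-values.yaml
      memcached-staging.yaml
      memcached-production.yaml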

We also decided to add the -application suffix to our application release names, so we can distinguish, in helm releases and logs, between the middlewares and the application itself. For example, one of our applications is called thefork-api and uses varnish and memcached, so we have:

  • 3 releases: thefork-api-application, thefork-api-memcached, and thefork-api-varnish.
  • We can find logs of the application using the index pattern thefork-api-application.* in Kibana, or thefork-api-varnish-ncsa.* for our varnish logs.

The deployment is just a Jenkins job that launches a make command to deploy our application, with something like:

SHOW_SECRETS=0 KUBE_CLUSTER=aws-staging ENVIRONMENT=staging REVISION=v1.2.3 APP=thefork-api make deploy
# this make command just calls helmfile apply
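
Under the hood, the deploy target boils down to something like this (the namespace lookup and the exact flags are simplified for the example):

# roughly what "make deploy" runs (simplified sketch)
cd "projects/${NAMESPACE}/${APP}" \
  && helmfile --environment "${ENVIRONMENT}" apply
# REVISION is read by the application's helmfile through requiredEnv (see below)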

Even if this process works well, we are working on integrating argocd into our stack to facilitate the introduction of new environments (multiple staging environments, and a better inventory of what is deployed where, for example).

How to deal with environment variables

Our generic chart allows us to define environment variables in our Kubernetes deployments, but it is not user-friendly.

Also, environment variables act as the interface between application code and its configuration in each environment, so it was essential for us to have a dedicated space for them. You can find below the structure of this dedicated git repository:

  • APP/aws-.env-production.tmpl: production environment variables, using a simple KEY=VALUE format. Values can also come from our vault, thanks to the consul-template command.
  • APP/aws-.env-staging.tmpl: The same for staging.
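
For illustration, such a template can look like this. The variable names and the vault secret path are invented; the vault lookup relies on standard consul-template syntax:

# APP/aws-.env-production.tmpl (illustrative)
NODE_ENV=production
DB_HOST=db.production.internal
DB_PASSWORD={{ with secret "secret/thefork-api/production" }}{{ .Data.db_password }}{{ end }}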

A great feature of helmfile is its ability to define hooks executed before the deployment. So we just had to create a generic hook that clones our environment-files project, runs consul-template on the corresponding file, and then formats it! We updated our default template to handle this case.

You can find our common.yaml (light version) and one of our application helmfiles below:

# shared/common.yaml
repositories:
  - name: stable
    url: https://charts.helm.sh/stable
  ...
templates:
  # this template is never used directly
  common: &common
    # this version doesn't exist, you MUST specify a valid version inside your helmfile
    version: "0.0.0"
    # don't fail if one file is missing
    missingFileHandler: Warn
    # atomic upgrade
    atomic: true
    values:
      # ie: thefork-api-values.yaml, ...
      - "{{ .Release.Labels.applicationName }}-values.yaml"
      # ie: thefork-api-staging.yaml, ...
      - "{{ .Release.Labels.applicationName }}-{{ .Environment.Name }}.yaml"
      ...
      # environment-files
      - "build-umbrella-{{ .Release.Labels.applicationName }}/secret/env-secret.yaml"

We can define a generic hook, executed before the helm call, that generates a values file from our environment-files repository:

# shared/common.yaml part 2
# generate values from environment-files
hooks:
  - events: ["prepare"]
    command: "/bin/bash"
    showlogs: true
    args:
      - -c
      - |
        set -euxo pipefail;
        export APP_NAME="{{ .Release.Labels.applicationName }}";
        export ENV_FILES_REPO="build-${APP_NAME}-environment-files"
        export ENV_FILE="${ENV_FILES_REPO}/${APP_NAME}/aws-.env-{{ .Environment.Name }}.tmpl"
        export TMP_FILE="build-${APP_NAME}-env-secret-tmp.yaml";
        export DST_FILE="build-umbrella-${APP_NAME}/secret/env-secret.yaml";

        if [ -d "${ENV_FILES_REPO}" ]; then
          echo "[${APP_NAME}] [ENVIRONMENT-FILES] cleaning existing env-files found in ${ENV_FILES_REPO}";
          rm -rf $ENV_FILES_REPO;
        fi
        # clone and generate our values using our vault
        git clone git@something.git $ENV_FILES_REPO;
        if [ ! -f "$ENV_FILE" ]; then
          echo "[${APP_NAME}] [ENVIRONMENT-FILES] no environment-variables found, no values will be generated";
          exit 0;
        fi
        # we can probably get rid of consul-template since the support of https://github.com/variantdev/vals
        ENVIRONMENT={{ .Environment.Name }} consul-template -vault-retry-attempts=1 -log-level=info -vault-renew-token=false -template $ENV_FILE:$TMP_FILE -once;

        # generate a valid values file for helm
        echo -e "---\nsecretEnv:\n\n" > $DST_FILE
        cat $TMP_FILE | sed -E 's;^([a-zA-Z\-\_0-9]*)=(.*)$; \1: \2;' >> $DST_FILE
        # the cleanup hook has not been copied, for brevity
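
For reference, the file generated by this hook, and consumed by the last values entry of the common template, looks roughly like this (keys and values are made up):

# build-umbrella-thefork-api/secret/env-secret.yaml (generated, illustrative content)
---
secretEnv:
  NODE_ENV: production
  DB_HOST: db.production.internal
  DB_PASSWORD: s3cr3t

The umbrella chart is then responsible for exposing this secretEnv map to the application as environment variables.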

With our default template properties set, we can instantiate it:

# shared/common.yaml part 3
# generic chart to deploy every app
umbrella: &umbrella
  chart: chartmuseum/umbrella
  labels:
    chart: umbrella
  <<: *common
# generic chart to deploy every memcached
memcached: &memcached
  chart: chartmuseum/memcached
  labels:
    applicationName: memcached
    chart: memcached
  <<: *common

Now that our template is defined, we can declare an application that relies on it!

# projects/thefork-api/thefork-api/helmfile.yaml
# load environments
bases:
  - ../../../shared/environments.yaml
{{ readFile "../../../shared/common.yaml" }}

# set the revision to be deployed (docker image tag)
{{ $REVISION_FORMATTED := requiredEnv "REVISION" | replace "/" "-" | replace "_" "-" | replace "(" "-" | replace ")" "-" }}
{{- if eq $REVISION_FORMATTED "master" -}}
{{- $REVISION_FORMATTED = "latest" -}}
{{- end -}}
{{ $ENVIRONMENT := requiredEnv "ENVIRONMENT" }}

releases:
  - <<: *umbrella
    version: "0.0.57"
    name: thefork-api-application
    namespace: thefork-api
    labels:
      applicationName: "thefork-api"
    set:
      - name: image.tag
        value: {{ $REVISION_FORMATTED | quote }}
    needs:
      - thefork-api/thefork-api-memcached
  - <<: *memcached
    version: "0.0.1"
    name: thefork-api-memcached
    namespace: thefork-api

Conclusion

A fun story before concluding: a few months ago (ok, it was more than a year ago), a coworker wrote a lambda that terminated every EC2 instance matching a specific label. But the behavior of the Jenkins job was different from what he had tested in the aws console. When his lambda triggered, every EC2 instance of our tooling environment went down, and it felt like a complete blackout. Remember, it is a fun story: after a few minutes, everything started to recover except our vault (our auto-unseal mechanism relies on the fact that at least one vault pod is always alive), so we just had to unseal it. Apart from this expected manual action, everything recovered smoothly without any issues.

Kubernetes is great, really, and brings many benefits, but the integration cost can be huge.

Some pieces of advice that we can share regarding the adoption of such a technology:

  • start simple,
  • always have a single source of truth,
  • convention matters,
  • minimize the number of git repositories you have (in particular, keeping every terraform resource in one repository should make your upgrades less painful),
  • automate everything you can: relying on a makefile is a good first step, but an automated process (e.g. a Jenkins job) is always better than a manual action.

Staying in control of your technical debt and iterating is the key to success, so choose your tooling carefully! Also, always take the maintenance cost into account, because a Kubernetes version is only supported for about a year! On our side, we do an upgrade once a year (during winter) using a blue-green deployment, but who knows, that may be the topic of our next article!
