Hands-on Day 1 and Day 2 Operations in Kubernetes using Django and AKS — Part 3

Ousama Esbel
COMPREDICT
Mar 18, 2021

Kubernetes has become the de facto container orchestrator due to its many features and its flexibility. Although the Kubernetes documentation is thorough and provides many examples, it is not straightforward to combine all these tutorials and use them to deploy a real-life application with several services end-to-end. To address that, I will demonstrate how to deploy a real-world application on Azure Kubernetes Service (AKS) as the production platform. Moreover, I will discuss day 1 and day 2 operations of the application lifecycle in Kubernetes in this series of articles. Here are the main headlines:

  • Discuss the application and set up the cluster, container registry and the production namespace. (part 1)
  • Deploy Config Maps, Secrets and Persistent Volumes. (part 2)
  • Deploy, monitor and define update strategies for the services including setting up Traefik as Ingress Controller. (part 3)
  • DevOps and automatic deployment using GitHub Actions. (part 4)

You don’t have to use Azure Kubernetes Service per se; you can easily re-configure the manifests to be compatible with any Kubernetes installation, such as AWS EKS or Linode. However, as a prerequisite, you need basic knowledge of Kubernetes, Docker, YAML and shell scripting. In addition, if you want to run the application along with the tutorial, you need to have set up the cluster, container registry, ConfigMaps, Secrets and volumes as described in parts 1 and 2.

In this article, the application is launched: Django and Flower are exposed to external traffic using a LoadBalancer and Traefik as the Ingress Controller. Moreover, both endpoints are secured using TLS certificates that Traefik generates automatically. To conclude, day 2 operations are discussed, mainly rolling updates and health checks.

Deploy Services

In this application, there are five different services. All of these services are interconnected through the cluster’s internal network, so each one needs an internal IP address. However, two of them, namely the Django and Flower services, should also be accessible from outside the cluster.

All services will be created as Deployments except Postgres, which will be created as a StatefulSet. The Kubernetes manifests of these services look very similar; each manifest should address the following points:

  • Path to the image that we pushed to ACR.
  • Ports that the application is using.
  • The environment variables and secrets.
  • Attached volumes and where to mount.
  • A Kubernetes Service to define the policy on how the Pod is accessed, whether internally or externally.

Let’s see how the above points are defined for the Django service:

Django Deployment and Service

Django’s container image is defined on line 21. Unlike Docker Compose, you cannot use environment variables within the manifests. Therefore, you need to go to each manifest in compose/kubernetes/*.yaml and change the image URL to match your ACR. This shortcoming can be addressed by Helm; however, that is a topic for a different day.

Regarding the port, Django is served using Gunicorn, whose port is defined in compose/production/django/start. With that, we need to expose port 5000.

Then, we define the environment variables and secrets. In the manifest, we refer to the ConfigMaps and Secrets that we created in part 2. Additionally, we add two environment variables that expand on the variables defined in the ConfigMaps.

Next is the volumes section, on line 61, where we define which Persistent Volume Claims (PVCs) the pod needs to use (the PVCs were created in part 2), and, on line 41, we define where the attached volume should be mounted in the Pod. In our case, we want to mount it as Django’s media folder.

Finally, we configure the Service, which is defined as a separate Kubernetes resource. In Django’s case, we want to expose it to the other services and map port 5000 in the pod to port 80 of the Service. Of course, this is not ideal for production, as we want to use TLS for our application. But let’s make this work as a first step; later, we will modify it, put an Ingress in front of it, and use TLS.
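Since the embedded manifest isn’t reproduced here, a condensed sketch of such a Deployment and Service pair is shown below. It is only an illustration: the resource names, labels, ConfigMap/Secret names, environment variables, PVC name and mount path are placeholders, and the line numbers mentioned above refer to the original gist.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: django
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: django
  template:
    metadata:
      labels:
        app: django
    spec:
      containers:
        - name: django
          image: <your-acr>.azurecr.io/django:latest  # image pushed to your ACR
          ports:
            - containerPort: 5000  # Gunicorn port
          envFrom:
            - configMapRef:
                name: django-config   # ConfigMap from part 2 (placeholder name)
            - secretRef:
                name: django-secrets  # Secret from part 2 (placeholder name)
          env:
            # example of a variable that expands on a value defined in the ConfigMap
            - name: CELERY_BROKER_URL
              value: "redis://$(REDIS_HOST):6379/0"
          volumeMounts:
            - name: media
              mountPath: /app/media  # mount the volume as Django's media folder
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: media-pvc  # PVC created in part 2 (placeholder name)
---
apiVersion: v1
kind: Service
metadata:
  name: django
  namespace: production
spec:
  selector:
    app: django
  ports:
    - port: 80          # Service port used inside the cluster
      targetPort: 5000  # Gunicorn port in the pod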

To deploy the services, run the following command:

kubectl apply -f compose/kubernetes/.

In our example, the workers do the heavy lifting, so it makes sense to scale the workers. Accordingly, I set the replicas in celeryworker.yaml to 3. There are additional sections in the celeryworker and Django manifests that I will address in an upcoming section.

On the first run, the cluster will take some time to pull the images from the ACR. After a while, you should see the following results when running kubectl get pods -n production -o wide:

Deployed services.

Traefik as Ingress Controller

One of the main strengths of Kubernetes is its flexibility: it allows you to choose any framework for your resources. To connect your cluster to the external network, we usually create an Ingress Controller. In AKS, an Ingress Controller can be integrated when creating the cluster by enabling the HTTP application routing add-on; Azure will then create an Nginx Ingress Controller. Nginx is a very reliable reverse proxy. However, it falls short in several aspects:

  • It is difficult to integrate into a Kubernetes cluster.
  • Its configuration is static for each service, which makes it hard to manage in a cluster that changes frequently.
  • It doesn’t support issuing TLS certificates out of the box; it requires another component, such as Cert-Manager, to manage certificates.

Although Azure does all the heavy lifting of integrating Nginx into our cluster, that still doesn’t solve the other shortcomings. That is why I propose using Traefik. Traefik is a modern reverse proxy that was first released in the age of containers. Furthermore, it is easier to integrate because it speaks to Kubernetes natively and doesn’t need an extra wrapper. In addition, Traefik has built-in Let’s Encrypt support and fully supports TCP.

There are three different strategies to integrate Traefik:

  • Use the Kubernetes Ingress resource to define the paths. This requires the Traefik controller to have the correct permissions to watch the Ingress API, and it cannot automatically generate TLS certificates for your services; you have to provide TLS certificates manually or use another service like Cert-Manager [tutorial].
  • Use Traefik’s custom Kubernetes resources, which allow us to specify Ingresses with Traefik-specific configuration and obtain TLS certificates from Let’s Encrypt. This also requires the Traefik controller to have permissions to watch the new custom resources [tutorial].
  • Provide Traefik with the complete configuration from a ConfigMap, without defining any Kubernetes Ingress resource. In the configuration, you can define any middleware and you can obtain TLS certificates.

The first two strategies are more complicated to integrate, but they provide more flexibility. For applications that have few access points whose Ingresses rarely change, as in our case, it is easier and less error-prone to go for the third option.

Without further ado, we first need to create a LoadBalancer Service for the Traefik controller. The Service manifest looks like the following:

Traefik Service.

In a nutshell, we are asking AKS for an external IP address and accepting connections on ports 80, 443 and 5555.
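A minimal sketch of such a LoadBalancer Service, assuming the Traefik pods are labelled app: traefik (the names are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: production
spec:
  type: LoadBalancer  # asks AKS to provision an external IP address
  selector:
    app: traefik
  ports:
    - name: web
      port: 80
      targetPort: 80
    - name: websecure
      port: 443
      targetPort: 443
    - name: flower
      port: 5555
      targetPort: 5555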

To create the service, run the following command:

kubectl create -f ingress/traefik-svc.yaml

To retrieve the external IP address, find traefik when running the following command:

kubectl get services -n production

Then, map the external IP address to a domain. If you don’t already have a domain, you can use Azure’s App Service Domain to create one. It takes some time for the DNS records to propagate. Meanwhile, let’s work on the Traefik configuration. Traefik requires two configuration files:

  • acme.json: required for automatic certificate generation.
  • traefik.yml: contains all the configurations such as entrypoints, middleware, services and routes.

Here is the manifest for both ConfigMaps:

Traefik configuration and acme ConfigMap
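The embedded manifest isn’t reproduced here; a heavily abbreviated sketch of the two ConfigMaps could look like the following. The entry point names, the resolver name and the ACME challenge type are assumptions, the routers, services and middlewares are omitted, and the line numbers mentioned below refer to the original gist.

apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-acme
  namespace: production
data:
  acme.json: ""  # must exist, even if empty, so Traefik can store certificates
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: production
data:
  traefik.yml: |
    entryPoints:
      web:
        address: ":80"
      websecure:
        address: ":443"
      flower:
        address: ":5555"
    certificatesResolvers:
      letsencrypt:
        acme:
          email: <your-email>
          storage: /etc/traefik/acme/acme.json
          tlsChallenge: {}
    # ... routers, services and middlewares for Django and Flower
    # (HTTPS redirect, CSRF headers, etc.) follow here as in the original gist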

First of all, replace <your-URL> with the domain name you mapped to the external IP address, and replace <your-email> with your email. In the manifest, we are creating two ConfigMaps. The first one is acme.json, which is simply an empty string; it is crucial for Traefik to have this file available in order to generate the certificates, even if it is empty. The second ConfigMap is traefik.yml. On line 31, we specify that we are using Let’s Encrypt to generate TLS certificates and store the needed information in the acme.json file. On line 41, the routes to the application are defined; moreover, the services, entrypoints and middlewares are specified. For example, the Django service requires Traefik to redirect any request to HTTPS and to allow CSRF headers. The following commands create Traefik’s ConfigMaps:

kubectl create -f configmaps/traefik-acme.yaml
kubectl create -f configmaps/traefik-config.yaml

Now it is time to create the Traefik controller. There are two recommended ways to deploy the Traefik pod: either as a Kubernetes Deployment or as a DaemonSet. This official article by Traefik explains the difference between them and when to use each. In any case, a Deployment is used in this tutorial; however, it is fairly easy to switch between the two. Here is the Traefik Deployment manifest:

Traefik Deployment

In the manifest, we start the pod by running the traefik command with the arguments specified on line 25. What is important to note is that Traefik looks for its configuration file at /etc/traefik/traefik.yml, so we need to mount the configuration file there.

On line 30, the two ConfigMaps are projected into the same directory and passed to the pod as a single volume. The traefik.yml configuration is placed at the top of that directory, while acme.json is placed in an acme subfolder with access mode 0600. The complete directory is then mounted into the pod at the path specified in volumeMounts.mountPath, as shown on line 23. And that’s it!
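Again, the full manifest lives in the original gist; a rough sketch of the Deployment and the volume projection described above could look like this (the image tag and resource names are placeholders, and the line numbers mentioned above refer to the original gist):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      containers:
        - name: traefik
          image: traefik:v2.4  # placeholder tag
          # Traefik looks for its configuration at /etc/traefik/traefik.yml by default,
          # which is why the projected volume below is mounted at /etc/traefik.
          ports:
            - containerPort: 80
            - containerPort: 443
            - containerPort: 5555
          volumeMounts:
            - name: traefik-config
              mountPath: /etc/traefik
      volumes:
        - name: traefik-config
          projected:
            sources:
              - configMap:
                  name: traefik-config
                  items:
                    - key: traefik.yml
                      path: traefik.yml      # ends up at /etc/traefik/traefik.yml
              - configMap:
                  name: traefik-acme
                  items:
                    - key: acme.json
                      path: acme/acme.json   # ends up at /etc/traefik/acme/acme.json
                      mode: 0600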

To create the Traefik controller, run the following command:

kubectl create -f compose/kubernetes/ingress/traefik-controller.yaml

Once Traefik is up, you can navigate to your domain; you will be redirected to HTTPS and can access the website normally.

Rolling Update and Health Checks

The ability to update your application with zero downtime is a key feature that Kubernetes offers. The update happens progressively. For Deployments, a new ReplicaSet is created that contains the updated version, while the old ReplicaSet still exists. Then, a new pod is created in the new ReplicaSet to replace a pod from the old ReplicaSet. This is done incrementally until all pods in the old ReplicaSet are replaced. The update strategy is controlled by the following settings:

  • maxUnavailable: the maximum number of pods that can be taken away from the current replicas during the update.
  • maxSurge: the maximum number of additional pods that can be added on top of the desired replicas.

In our application, we need to make sure that we have at least three workers up and running at all times, even during updates. In that case, we can set maxUnavailable to 0 and maxSurge to 1. With that, one pod is added at a time, and there will always be 3 pods ready to serve in the deployment. The following figure illustrates what happens at every step:

Rolling Update one pod at a time. Image is taken from here
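In the celeryworker Deployment, this could be expressed roughly as follows (a sketch; the field names are the standard Deployment fields and the values follow the reasoning above):

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never take a pod away from the 3 running workers
      maxSurge: 1        # add at most one extra pod during the update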

As illustrated, once the new version is in the “Ready” state, a pod from the old ReplicaSet is terminated. But isn’t “Ready” different from one service to another? For instance, a web server is ready when it can receive HTTP requests; a database, on the other hand, is ready when it can accept TCP connections on its open port. Therefore, Kubernetes allows us to define what “ready” means for each service.

In Celery, we can check the health of a worker and determine whether it is ready by running the following command:

celery inspect ping -A config.celery_app -d celery@$HOSTNAME

With the above command, we want the pod to warm up for 30 seconds and then run the command every minute until it gets at least one successful result (i.e., the command exits with code 0). With this configuration, ideally, a worker needs 30 seconds to be ready, and within 90 seconds we would have replaced all three workers. Even though this is good enough for our application, it might be too slow for other cases. We can translate the above requirements into Kubernetes as follows:

readinessProbe:
  exec:
    command:
      - bash
      - -c
      - celery inspect ping -A config.celery_app -d celery@$HOSTNAME
  initialDelaySeconds: 30
  periodSeconds: 60
  successThreshold: 1

What about the terminated Celery worker? What happens if the worker is performing a task when Kubernetes decides to terminate it? Shouldn’t Kubernetes wait until the worker finishes its task before shutting it down? To deal with this, Kubernetes sends a SIGTERM signal to the container and waits a period of time (30 seconds by default) for the container to shut down gracefully. If the container doesn’t terminate within the specified period, Kubernetes forces it to shut down with a SIGKILL signal.

Luckily, Celery provides the desired functionality out of the box: it makes sure that the worker finishes all currently executing tasks once it receives SIGTERM. However, we still need to increase the waiting period to a value that lets the worker finish its tasks. We control this by configuring spec.template.spec.terminationGracePeriodSeconds in the deployment’s manifest.
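For example, giving a worker up to ten minutes to finish its current tasks could look like the snippet below; the exact value is an assumption and should be chosen to fit your longest-running task.

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600  # wait up to 10 minutes after SIGTERM before SIGKILL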

Finally, to complete the lifecycle of our app, we need to continuously monitor it to make sure it is running correctly at all times. In Kubernetes, this is done by defining health-check probes that act on containers. Moreover, Kubernetes can take action on containers that fail a health check.

The readiness probe that we used above is one kind of health-check probe. It informs the kubelet agent on the node when a container is “ready” to serve traffic; if it fails, the pod is taken out of service. Another crucial probe is the liveness probe, which periodically checks whether a container is dead or still alive. If the liveness probe fails, the container is killed, and what happens next depends on the restart policy specified in the service’s manifest.

For a web server that has to talk to a database, as is the case for the Django service, we need to separate the readiness and liveness probes. The convention is that if the website is up but the database is down or unreachable, the liveness probe should pass but the readiness probe should fail. The reasoning behind this design is that we don’t want to keep restarting the website because of another service’s fault; instead, we want to take the website out of service until the problem is resolved.

With that, a middleware has been implemented in Django to expose two endpoints, one for liveness and another for readiness. The liveness endpoint checks the status of Django before it reaches the point where it checks the database connection, while the readiness endpoint checks every external connection that Django makes. For more details, check out this article. The liveness and readiness specifications for the Django service are listed below:

restartPolicy: Always
...
livenessProbe:
  httpGet:
    path: /healthy
    port: 5000  # Gunicorn port
  initialDelaySeconds: 30
  periodSeconds: 31
  timeoutSeconds: 2
  failureThreshold: 2
readinessProbe:
  httpGet:
    path: /readiness
    port: 5000  # Gunicorn port
  initialDelaySeconds: 30
  periodSeconds: 37
  timeoutSeconds: 5
  failureThreshold: 2

The Django service is given 30 seconds to start up. Then, every 31 seconds, an HTTP GET request is sent to the /healthy endpoint. If it fails two consecutive times, the container is marked as failed; in return, it is terminated and restarted. The readiness probe works similarly, but with a different endpoint and configuration.

Things To Consider

  • Use Helm to automate the deployment of the Kubernetes manifests and make them more customizable.
  • Specify affinity settings in Traefik’s Deployment to ensure that two Traefik pods don’t end up on the same node.
  • Use Azure Database for PostgreSQL instead of hosting Postgres in your cluster.
  • Use Traefik’s custom Ingress resources instead of specifying the configuration through ConfigMaps.

Next

In the next tutorial, I will discuss how the Kubernetes cluster lifecycle is managed using a Continuous Deployment pipeline in GitHub Actions.

Clean Up

If you followed along with the tutorial, please go ahead and delete the resource group; Azure will then delete every resource in that group. Additionally, delete the service principal that was created along with the resource group.
