Zero downtime deployments of .NET Core Web API on GKE

Vignesh Rajasekaran
Titansoft Engineering Blog
5 min read · Oct 7, 2019

In this post we’ll be talking about the challenges we faced in achieving zero downtime deployments for our .NET Core APIs deployed on Google Kubernetes Engine (GKE).

Photo by Toby Christopher on Unsplash

The architecture of our deployment was fairly straightforward as seen below.

Standard web API deployment on Google Cloud Platform

There are multiple options for ingress controllers. For our initial deployment we decided to use GCE Ingress to keep our setup simple, since that is the approach recommended by Google Cloud.

According to the Kubernetes docs, zero downtime is achieved through a Deployment’s rolling update strategy. The whole process is summarised in the animation below.
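For reference, the rolling update behaviour can be tuned on the Deployment itself. The snippet below is a minimal sketch with illustrative values; maxSurge and maxUnavailable are not settings from our original manifest.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # illustrative: at most one extra pod above the desired count during a rollout
      maxUnavailable: 0  # illustrative: never take a pod away before its replacement is ready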

Since the documentation stated that zero downtime can be achieved, our initial deployment took an extremely naive approach and looked like this.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-api
spec:
  replicas: 3
  selector:
    matchLabels:
      name: product-api
  template:
    metadata:
      labels:
        name: product-api
    spec:
      containers:
      - name: product-api
        image: product-api:1
        env:
        - name: "ASPNETCORE_ENVIRONMENT"
          value: "Prod"
        resources:
          requests:
            cpu: 0.5
            memory: 0.5Gi
          limits:
            cpu: 0.75
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: product-api
spec:
  type: NodePort
  selector:
    name: product-api
  ports:
  - port: 5000
    targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: product-api
spec:
  backend:
    serviceName: product-api
    servicePort: 5000

We didn’t face any issues with this during our tests. However, when deploying services to production we noticed that clients would receive HTTP 502 errors during every rollout. This meant that the Ingress was receiving errors from the pods. Since this happened during a deployment, we suspected that the problem had to come either from the newly created pods or from the terminating pods.

We decided to look into why requests to the newly created pods might fail. At this point we learnt about pod readiness status, which is described in the documentation excerpt below.

Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don’t want to kill the application, but you don’t want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

So the first change we made was to add a readiness probe to our deployments.

        readinessProbe:
          httpGet:
            scheme: HTTP
            path: /api/status
            port: 5000
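The probe also has timing knobs that can be tuned. The sketch below uses illustrative values rather than anything from our actual manifests.

readinessProbe:
  httpGet:
    scheme: HTTP
    path: /api/status
    port: 5000
  initialDelaySeconds: 5   # illustrative: wait before the first probe
  periodSeconds: 5         # illustrative: how often to probe
  failureThreshold: 3      # illustrative: consecutive failures before the pod is marked NotReady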

After adding this, we noticed that the errors decreased but we still had 502s. So we decided to look at the terminating pods. Again, the documentation does an excellent job of summarising this. Below is a sequence diagram which captures the essence of the idea.

The thing to note here is the asynchronous order of steps #1 and #2 in the diagram. Pod termination and removal of the pod from the service endpoints happen concurrently, which I think is due to the event-driven nature of Kubernetes. This post does a good job of explaining that in more detail. The endpoint controller and the kubelet each listen for events such as pod termination and take the necessary action independently, so the events don’t happen in the sequence we initially expected. We found another post describing something similar, and hence suspected this could be happening to us too. So we added a delay to our deployments.

    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 20"]
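One detail to keep in mind with a preStop sleep (not something we set explicitly above) is that it counts against the pod’s terminationGracePeriodSeconds, which defaults to 30 seconds; if the sleep plus the application’s shutdown time exceeds the grace period, the container is killed anyway. A sketch of setting it explicitly, with an illustrative value:

spec:
  terminationGracePeriodSeconds: 60   # illustrative: must comfortably cover the preStop sleep plus app shutdown
  containers:
  - name: product-api
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 20"]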

We did cringe seeing arbitrary numbers in the YAML file but we left that demon for another time. Running the tests after this showed that the results were getting better, but we still did not manage to achieve zero downtime as there were intermittent 502s.

At this point we were left scratching our heads as to what else we should do.

We decided we needed to go deeper to figure out what was going on, and started off by reading about how Services work. Kubernetes docs to the rescue again!

https://kubernetes.io/docs/concepts/services-networking/service/

In the version of GKE that we were using (1.12), Services were implemented in iptables proxy mode. We also read that, for performance reasons, existing connections bypass iptables rule changes. This led us to suspect that connections between the Ingress and the nodes were not being updated when the service endpoints changed.

One fix would have been to disable persistent connections from the Ingress to the nodes, but that wasn’t possible because GCE Ingress is a managed service and the setting isn’t exposed. Even if we could have changed it, doing so would hurt performance, so it was not necessarily a bad thing that we couldn’t.

Luckily, we noticed that GCP offers container-native load balancing, where traffic from the Ingress goes directly to the pods, skipping the node-level service routing. This means that changes to the service endpoints update the load balancer directly.

Regular load balancing vs Container Native Load Balancing

This can be enabled with an annotation on the Service, as below.

apiVersion: v1
kind: Service
metadata:
  name: product-api
  annotations:
    cloud.google.com/neg: '{"ingress": true}'

And voila, zero downtime!

In summary, we had to do the following to achieve zero downtime in our setup (the relevant snippets are pulled together in the sketch after the list):

  1. Readiness probe configuration
  2. Pre-stop sleep to allow service endpoint changes to propagate
  3. Container-native load balancing
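The following is a consolidated sketch of the relevant pieces, not a drop-in replacement for the full definitions above. One thing to be aware of is that container-native load balancing requires a VPC-native cluster, since the load balancer needs to reach the pod IPs directly.

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: product-api
        readinessProbe:        # 1. don't send traffic until the app can serve it
          httpGet:
            scheme: HTTP
            path: /api/status
            port: 5000
        lifecycle:
          preStop:             # 2. give endpoint and load balancer updates time to propagate
            exec:
              command: ["/bin/sh", "-c", "sleep 20"]
---
apiVersion: v1
kind: Service
metadata:
  name: product-api
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # 3. container-native load balancing: traffic goes straight to the pods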

Like I mentioned before, we were not fully happy with using a sleep to wait for endpoint updates to propagate across the nodes. This is definitely a smell and we wanted to remove it. We also felt that GCE Ingress did not have all the features we required from an Ingress/LB, so we decided to look into other Ingress controllers.

At the time of writing, we have already moved on from GCE Ingress to Nginx Ingress. But that’s for another post! :)
