Was zero-downtime just a dream?

Our journey to flawless graceful shutdown

Emre Tanriverdi
Trendyol Tech

--

This story is about Kubernetes & Istio.

If you are not interested in the journey and just came here to learn how we solved our issue, there is a TL;DR at the end of the story.

istio.io

As the Homepage & Recommendation Team, we have lately been working on improving our deployment experience at Trendyol.

Here you can see our journey so far:

Blue-Green or Canary? Why not both?
How did we implement smoke tests for our CI/CD process in Trendyol?

So now we have Canary Deployment with smoke tests and anomaly detection with automatic rollback.

Things are perfect. Everything is smooth.

Is it though?

Recognizing the problem

It all started after a series of performance tests.

All apps were scaled up while we were reproducing the usual high-load traffic. When the tests were over, we downscaled the apps.

Then our Slack alert channel produced this message:

This was unexpected for us because we knew our apps were gracefully shutting down.

Then, to narrow down our investigation, we rolled out a new deployment to see whether this problem occurred there too, or whether it was specific to scaling down.

Okay, the error count was lower, but we knew the issue was there too.

Acknowledging the problem

Initially, we started to talk about this as a problem.
Is it really a problem though?

Here is a hypothetical scenario with my inner-thought persona, John:

Probably no one will bat an eye if you tell them they might see a few errors just because something is being deployed.

But this is not ideal when you are championing a Continuous Deployment mindset.

Instead, the scenario should be like this:

and John is right.

Now we know there is a problem that needs to be fixed.

Deep-diving the problem

We are using Istio for traffic management in Trendyol as Gökhan Karadaş mentioned here.

We realized that we don’t get these errors when we call our project directly via NodePort. So the problem had to be in our virtual service definition.

Solution attempt #1

We realized that although our services are resolved via service discovery, there was no service discovery host in the “hosts” section of our virtual service. In other words, we weren’t using Istio correctly.

So we added it.

We were confident that this would solve our issue.

Turns out, for internal requests you should also add the “mesh” keyword as a gateway; otherwise traffic distribution won’t work correctly.
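As a rough sketch (the names here are illustrative, not our actual config), the virtual service ended up looking something like this, with the service discovery host in “hosts” and “mesh” added to the gateways:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: homepage-vs                        # illustrative name
spec:
  hosts:
    - homepage.example.com                 # external host
    - homepage.team.svc.cluster.local      # service discovery host for internal callers
  gateways:
    - homepage-gateway                     # the ingress gateway
    - mesh                                 # required so in-mesh (internal) traffic is also routed
  http:
    - route:
        - destination:
            host: homepage.team.svc.cluster.local
```

Without the “mesh” entry, the routing rules only apply to traffic coming through the named gateway, so internal service-to-service calls bypass them.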

And now let’s see the results.

This really smoothed the process. Almost a 97% improvement.

Okay, there was something missing in our virtual service and we found it.
But we couldn’t stop here. We wanted zero downtime deployments after all.

At this point, we were hopeless and had started to believe that a zero-downtime deployment would be just a dream for us.

Our gateway

Solution attempt #2

After endless research and many days of checking logs and our rollout.yml file, we realized that the Istio container was dying well before our app’s container.

We were shutting our app down gracefully, waiting for in-flight connections to finish and return to the caller, but Istio was cutting the connection off first and therefore returning a failed response.

We simply thought: “Why not make Istio wait for our app to terminate?”

There were solutions such as getting inside the Istio container, checking whether our app container still had requests coming in and out, and closing the Istio connection once it was done.

But that was too complex and required us to get inside the Istio container, which we didn’t want to do (and didn’t really know how to).

Then we found terminationDrainDuration provided by Istio, which is a config flag you can simply add to your rollout.yml file via annotations.

This delays the Istio container’s shutdown by the configured duration.

We applied that to our rollout.yml and made it consistent with our terminationGracePeriodSeconds that’s configured on our app container.

The value here is simply the amount of time your app needs to complete its final requests.
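As a sketch (the 60s value is illustrative, not our actual number), the relevant part of the rollout.yml looks roughly like this, with the drain duration kept consistent with the grace period:

```yaml
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          terminationDrainDuration: 60s   # Istio sidecar keeps draining connections for 60s
    spec:
      terminationGracePeriodSeconds: 60   # app container gets the same 60s window
```

Keeping the two values aligned means the sidecar stays alive for as long as the app is allowed to finish its final requests.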

Then we deployed it, upscaled & downscaled it.

Interesting, our alert channel is quiet.
Let’s write a query for our error logs:

Voila!

Sleep your app container too

Hours and days have passed and things were good overall.

Then we saw that, even though we mostly got zero errors, we occasionally got 5–10 errors, which seemed random to us.

We were disappointed. Maybe it wasn’t much and wasn’t something to feel distressed about. But we had come all the way down from 1000+ errors and we wanted it all to be over.

After hours of research and debugging, we found the last problem for us.

We should’ve configured our app container to wait for its requests to finish before it terminates itself.

But why?

When your container receives a termination signal, it can terminate itself before all of its in-flight requests are complete. Kubernetes will not check the internals of your app to see whether it is still processing requests.

This is even more of a problem if you are gracefully shutting down your application, since your app will take longer to complete its requests.

Plus, the deletion of a pod and the removal of its reference from the Endpoints object happen in parallel. So a pod can sometimes be deleted before its reference is removed, resulting in requests being sent to a pod that no longer exists, and failing.

Check here for more information: learnk8s.io/graceful-shutdown

What can we do?

Wait a bit before termination. That way we can give the container X seconds to act as if nothing happened, and only then start the kill process.

The X seconds here depends on how long your app needs to complete all of its final requests. It’s up to you to find the sweet spot.

How can we do it?

Just add a preStop hook that sleeps for X seconds (we used 30) before termination, and it’s done.
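A minimal sketch of such a hook on the app container (the container name is illustrative; 30 is the value we used):

```yaml
spec:
  template:
    spec:
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 30"]  # keep serving for 30s before SIGTERM
```

Kubernetes runs the preStop hook before sending SIGTERM to the container, so the pod keeps serving traffic while its endpoint is being removed. Note that terminationGracePeriodSeconds must be large enough to cover both the sleep and the graceful shutdown itself.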

Check here for more information: kubernetes.io

TL;DR

  1. Make sure your virtual service definition is correct.
  2. Make sure the Istio container isn’t cutting off requests while your container is still processing them.
    (When the termination signal is received, your app may wait until its requests are complete, but the Istio container may cut the connection earlier.)
  3. Make sure your container isn’t terminating itself while your app is still processing requests.

Yusuf Kürşad Kaya and I wrote this story to lend a helping hand to those who want to do something similar in their projects.

We hope it was helpful. :)

Thanks to Gökhan Karadaş for his support in the process.

Thank you for reading! ❤️

Thanks to our colleagues in the Homepage & Recommendation Team. 🤟
