Graceful shutdown of fpm and nginx in Kubernetes

Sashi Kumar · Published in Inside Personio · Nov 18, 2020

When it comes to microservices in Kubernetes, the reliability of network calls is very important. With a growing number of microservices and an increasing number of deployments, the chance of request failures during a deployment is high. In this article, we will share how we reduced the number of daily HTTP 5xx errors by 80% by implementing graceful termination in one of the legacy PHP services that we are still maintaining while it is being migrated to Kotlin.

What’s inside the PHP service pod?

The pod of the PHP service is composed of two containers: nginx, which is the reverse-proxy web server, and php-fpm, which is responsible for handling the requests.

How did the errors happen?

At Personio, we keep a close eye on the failures happening on the HTTP requests between microservices. Recently, we faced an increased number of sporadic HTTP failures with response code 502 — Bad Gateway when communicating with a legacy PHP service from one of the Kotlin microservices. Fortunately, the HTTP client which we use across Kotlin microservices was implemented with an exponential backoff retry strategy, so the use-case did not actually break.

We found that:

  1. Requests were failing during a deployment or when HorizontalPodAutoscaler auto-scales pods.
  2. During that time the pods were in the Terminating or Terminated state.

This made us suspect that the pods were not being gracefully terminated.

Theory: Pod deletion in Kubernetes

To fix the issue, it is important to understand how Kubernetes terminates a pod. When a pod is deleted, Kubernetes has to:

  • Remove the IP and port of the corresponding pod from the Endpoint object,
  • Notify kubelet, which terminates the containers associated with the pod,
  • Notify kube-proxy, which removes the IP and port of the corresponding pod from iptables.

When the API Server receives the request to delete the pod, it updates the state of the pod to Terminating. The kubelet is notified of this event, and in turn sends SIGTERM to the corresponding containers associated with the pod.

When the Endpoint object is updated in the control plane, the event is published to kube-proxy, the Ingress Controller, DNS, and the Service Mesh. These components remove the pod IP and port from their internal state and stop routing traffic to the pod. Since each of the individual components is independent and each component might be busy executing other tasks, there is no guarantee on how long it will take to remove the pod IP and port and stop routing traffic to the pod.

The processes of removing the pod from the Endpoint object and sending SIGTERM to containers occur concurrently rather than sequentially. If the pod is terminated after the pod IP is removed from kube-proxy, we don’t have a problem.

However, if the pod is terminated before the pod IP is removed, the terminated pod would still receive traffic during that short period of time. As a result, the clients calling the terminated service would get a 502 error.

Because this behavior is highly inconsistent and depends on how busy the kube-proxy or Ingress Controller are, it has to be handled properly. Kubernetes provides Container Lifecycle Hooks, which the kubelet executes on the corresponding containers when it receives the events. Before the kubelet sends SIGTERM to the container, it executes the preStop lifecycle hook. The hook is blocking and synchronous, so the container receives SIGTERM only after the command in the hook has finished. The total duration of the preStop hook should not exceed the terminationGracePeriodSeconds configured for the pod, which is 30 seconds by default. If the pod takes more than 30 seconds to terminate, a higher value for terminationGracePeriodSeconds should be set in the pod definition YAML.
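As a minimal sketch, the relevant part of a container spec looks like this (the sleep value here is a placeholder; a full example for our pod follows below):

```yaml
# Fragment of a container spec: a preStop hook that delays SIGTERM.
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
# Pod-level: must cover the preStop delay plus the time the process
# needs to shut down after SIGTERM (30 seconds by default).
terminationGracePeriodSeconds: 30
```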

If the preStop hook adds a delay before terminating the container, we get a grace period until the pod IP is removed from the Endpoint object, which in turn updates kube-proxy, the Ingress Controller, DNS, etc.

How to fix the PHP service?

In order to achieve graceful termination of the PHP pod with nginx and php-fpm containers, we need to do two things:

  • Terminate nginx and php-fpm with equal sleep durations, making sure that php-fpm does not terminate before nginx.
  • Terminate php-fpm with SIGQUIT, which php-fpm handles gracefully, in contrast to SIGTERM, which terminates php-fpm immediately.

This can be implemented with a preStop container hook in the definition of a pod like this:

apiVersion: v1
kind: Pod
metadata:
  name: php-pod
spec:
  containers:
    - name: fpm
      image: php-fpm
      ports:
        - name: fastcgi
          # php-fpm's default FastCGI port; the two containers share the
          # pod's network namespace, so they cannot both listen on 80
          containerPort: 9000
      lifecycle:
        preStop:
          exec:
            command:
              - sh
              - '-c'
              - sleep 5 && kill -SIGQUIT 1
    - name: nginx
      image: nginx
      ports:
        - name: http
          containerPort: 80
      lifecycle:
        preStop:
          exec:
            command:
              - sh
              - '-c'
              - sleep 5 && /usr/sbin/nginx -s quit
  terminationGracePeriodSeconds: 30

The sleep of 5 seconds in the preStop lifecycle hook gives a grace period before the containers are terminated and should be set carefully. If the php-fpm sleep is shorter than the nginx sleep, the pod can receive traffic after the php-fpm container has terminated while nginx is still within its sleep duration, and those requests fail with a 5xx error. Setting the sleep for the php-fpm container equal to or higher than that of nginx is safer and ensures that no requests will fail.

However, this alone would not make the pod terminate gracefully, because of the behavior of php-fpm. By default, the php-fpm parent process terminates itself and its child processes immediately as soon as it receives a SIGQUIT signal. Because requests are served only by the child worker processes, any in-flight requests would be cut off, resulting in a server error.

This can be fixed by setting process_control_timeout configuration for php-fpm which tells the child processes to wait for a certain period before executing the signal received from the parent process. The default value for this configuration is 0. Setting a reasonable value in seconds for this configuration would make the php-fpm container terminate gracefully. We set it at 5 seconds.
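For reference, process_control_timeout is a global php-fpm directive (not a per-pool one); the fragment below mirrors the value we chose, though the config file path varies by installation:

```ini
; php-fpm.conf — [global] section
[global]
; Give worker processes up to 5 seconds to finish in-flight requests
; before honoring the signal relayed by the master process.
process_control_timeout = 5
```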

Results

Since we implemented those 2 changes, we almost completely eliminated the 502 errors and our legacy PHP service is much more reliable. We are still improving our infrastructure at Personio and we are evaluating using a service mesh that offers better resiliency patterns in the future.
