A Mysterious Bug with Nginx

Dexter Chang
Jun 10 · 8 min read
Photo by Kevin Ku on Unsplash

During API maintenance I ran into a mysterious API bug, caused by an unexpected client-side timeout when using Nginx. In this article, I will talk about:

* Before we dive in
- The Overview of the Service
- The Workflow of API
* So, What is the bug?
* How to reproduce?
* How to solve it?
- Minimize the Request Time
- Kubernetes Routing
- Event-Driven Queue
- API Authentication
* Why does it happen?
- Why did it time out?
- Why does a timeout cause another server to run the request?
- Can we adjust the timeout?
* Summary
* References

Before we dive in

Before digging into the bug, here’s a simplified overview of our service to let you know how the API works:

  • Azure Function: The function triggers the API every day (cron job).
  • Service (Generate-Mail): The API provider; once the API is called, it starts retrieving users’ information, then generates temporary data for sending user emails later.
  • Azure MySQL: Stores the temporary data.

A more detailed explanation:

- Azure Function: A serverless solution in Azure; it’s like a single runnable function on the cloud.
- Azure Kubernetes: A container-orchestration system that automates deployment, scaling, management, etc.
- Service: Our backend service that provides the API.
- Pod: Like a single VM; each pod is a replica of the service.
- Azure MySQL: The same as MySQL, but run on the cloud.

Let’s take a look at the API workflow:

  1. The function is triggered by a CRON job.
  2. Cold start: a behavior of serverless applications; in short, your Function needs time to start up (e.g. to allocate memory).
  3. API call.
  4. Kubernetes routes the request to an idle pod. The pod executes the request and starts generating the data.
  5. Insert data into the SQL database.
  6. Insert success/fail (error).
  7. Send the response.
  8. If the function receives a response, or the execution time exceeds 2 minutes, finish execution (the 2 minutes is configurable on Azure).
  9. Another service (the email-sending service) uses the temporary data to send the email. (We will skip this part to focus on where the bug comes from.)
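Step 8, the client-side 2-minute cap, can be sketched as a plain HTTP call with a timeout. This is a hypothetical stand-in for the function's actual code (the function name, URL parameter, and `timeoutMs` knob are assumptions):

```typescript
// Hypothetical sketch of the function's API call with the 2-minute cap.
async function callGenerateMail(
  url: string,
  timeoutMs = 120_000, // the 2 minutes configurable on Azure
): Promise<string> {
  try {
    // AbortSignal.timeout aborts the request once the cap is hit.
    const res = await fetch(url, {
      method: "POST",
      signal: AbortSignal.timeout(timeoutMs),
    });
    return await res.text();
  } catch {
    // The function gives up here, but the pod may still be working!
    return "timeout";
  }
}
```

Note the key asymmetry this sketch makes visible: when the cap fires, only the client stops waiting; nothing tells the server to stop generating data.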

So, What is the bug?

One morning, I found that I had received 2 emails from our service, and everyone, including our customers, received 2 emails as well. Let's do some top-down investigation.

  • Email-sending service: It’s a producer-consumer relation; the root cause should be behind the producer side rather than this part.
  • Azure Function log (simplified): The API seems to be triggered correctly (we know the function will time out because the execution time is long).
2021-05-30 22:01:05 Executing 'Functions'
2021-05-30 22:01:05 Queue trigger function
2021-05-30 22:01:06 Run POST API request
2021-05-30 22:03:05 Timeout
2021-05-30 22:03:05 Executed 'Functions' (Failed, Duration=120000ms)
  • Azure Kubernetes log (Pod A): OK, it looks like the API is called correctly.
2021-05-30 22:01:06 Start API request with API key 
2021-05-30 22:02:06 Send 123 emails out
2021-05-30 22:02:07 End API request
  • Azure Kubernetes log (Pod B)
2021-05-30 22:03:06 Start API request with API key 
2021-05-30 22:04:09 Send 123 emails out
2021-05-30 22:04:09 End API request

We know that Pod A was already running the API request, but why was Pod B also running the same request at the same time?

How to reproduce?

To dig into the bug, I tried to reproduce it to see whether it was just a rare case. I called the API with the same parameters, headers, and API key, and it was indeed called twice, in both Pod A and Pod B!

If it’s not a rare case and everything seems to go well, then what could be wrong? We might assume:

  • Some reasons cause the Azure Function to retry the request.
  • Something let the Kubernetes (or more accurately, the Nginx inside the Kubernetes) retry.
  • The API itself has the bug.

Even though the root cause of the bug was hard to tell, it had a huge impact on our clients (no one wants duplicate mails filling their poor mailbox, especially a mailbox already full of spam). The first priority was: fix that bug!

How to solve it?

The problem is clear: Let the API request only happen once, and we have some clues when we tried to reproduce it, so let’s consider different aspects:

Minimize the Request Time

If the bug is caused by the API request taking so long that the client side (Azure Function) times out, then we just need to reduce the execution time. However, the service combines multiple API calls to other internal services, and we still don’t know how quickly the API should respond to the client, so this way is too risky.

Kubernetes Routing

Kubernetes routes the API requests for us in an easy way, but in this case, perhaps it is also a root cause? The workflow and the logs show that if we only had one pod, the API could only be called in that one pod (even if anyone wanted to call it again, Pod A would be busy, so the request would be blocked and the client side told that the server side is busy).

However, this cannot guarantee the high availability of our service: if the API call happens while Pod A is down, no one can handle the request.

Event-Driven Queue

To make sure one task is only handled by one instance, a queue can easily resolve the producer-consumer problem. We can slightly change the workflow by adding a queue between the Azure Function and Kubernetes: when a request is generated, send it to a queue, and let Kubernetes consume the data from it so that only one pod receives the request.
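As a minimal sketch of the idea, here is a hypothetical in-memory queue standing in for a real broker (Azure Storage Queue, Service Bus, etc.); the class and field names are assumptions for illustration:

```typescript
// Hypothetical in-memory queue standing in for a real message broker.
type Job = { id: string; payload: string };

class MailJobQueue {
  private jobs: Job[] = [];
  private seen = new Set<string>(); // remembers job ids to drop retried producers

  enqueue(job: Job): void {
    if (this.seen.has(job.id)) return; // duplicate request: silently dropped
    this.seen.add(job.id);
    this.jobs.push(job);
  }

  dequeue(): Job | undefined {
    return this.jobs.shift(); // exactly one consumer (pod) receives each job
  }
}
```

With this in place, both the original request and any retried copy enqueue the same job id, but only one copy ever reaches a pod.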

Though it is a clean solution, modifying 2 services and adding a new one needs extra care and will take a lot of time.

API Authentication

Back to the beginning: what if we let the request happen twice but prevent one from executing? An expiry key? JWT (JSON Web Token)!

JSON Web Token is a proposed Internet standard for creating data with optional signature and/or optional encryption whose payload holds JSON that asserts some number of claims. The tokens are signed either using a private secret or a public/private key. — Wikipedia

What we need is a standard field of JWT: exp (expiration time). The JWT will expire, and after exp the request will be treated as an invalid request and not be processed.

We only need to change our API key slightly in the Azure Function, and let the server handle requests with JWT (which is already built in).

A detailed explanation of how it works (suppose: request execution time = 150 seconds, function timeout = 120 seconds):
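The original diagram isn't reproduced here, but the exp idea can be sketched in code. This is a minimal stand-in using Node's built-in crypto rather than a real JWT library, and the helper names and secret are hypothetical; the point is only that a retried request arriving after exp fails verification:

```typescript
import { createHmac } from "node:crypto";

const SECRET = "demo-secret"; // assumption: shared between function and service

// Hypothetical helper: sign a minimal HMAC token carrying an `exp` claim.
function sign(payload: object, ttlSeconds: number): string {
  const body = { ...payload, exp: Math.floor(Date.now() / 1000) + ttlSeconds };
  const data = Buffer.from(JSON.stringify(body)).toString("base64url");
  const sig = createHmac("sha256", SECRET).update(data).digest("base64url");
  return `${data}.${sig}`;
}

// Verify the signature and reject expired tokens, mirroring JWT `exp` semantics.
function verify(token: string): object | null {
  const [data, sig] = token.split(".");
  const expected = createHmac("sha256", SECRET).update(data).digest("base64url");
  if (sig !== expected) return null; // tampered token
  const payload = JSON.parse(Buffer.from(data, "base64url").toString());
  if (payload.exp < Math.floor(Date.now() / 1000)) return null; // expired: drop the duplicate
  return payload;
}
```

The first request carries a fresh token and is served; the retried copy of the same request, arriving after exp, verifies to null and is not processed.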

Fortunately, the bug has been fixed, but the haunting debugging process pushes us to find the root cause, right? Since the full system is too big to search, let’s build a cut-down version to find out why it happens.

Why does it happen?

You can find all the code and run it yourself from this GitHub repository. To start, we use 2 containers with Nginx as the load balancer to simulate the same (but simplified) workflow as above:

In docker-compose.yml, we define Nginx and 2 instances of our server (called node1 and node2).
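The file itself isn't shown here; a sketch of what such a compose file might look like (image names and ports are assumptions, not the repo's actual content):

```yaml
version: "3"
services:
  nginx:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - node1
      - node2
  node1:
    build: .   # the Node server defined in app.ts
  node2:
    build: .   # second, identical instance
```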

And in nginx.conf, we give the 2 servers equal weight:
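The full config isn't reproduced here; the relevant part might look like this sketch, where node1 and node2 are the service names from docker-compose.yml:

```nginx
events {}

http {
    # Round-robin between the two instances with equal weight.
    upstream node-app {
        server node1:3000 weight=1;
        server node2:3000 weight=1;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://node-app;
        }
    }
}
```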

In app.ts, we define 2 APIs:

  • We use the root URL (/) to do a health check
  • /sleep will sleep 120 seconds, then send a message to the client
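The repo's app.ts isn't reproduced here; a minimal stand-in using Node's built-in http module (the real file may use a framework) could look like:

```typescript
import { createServer } from "node:http";

const SLEEP_MS = 120_000; // long enough to trip Nginx's default 60 s proxy_read_timeout

// Two endpoints: "/" answers immediately (health check), "/sleep" stalls.
const server = createServer((req, res) => {
  if (req.url === "/sleep") {
    setTimeout(() => res.end("done sleeping"), SLEEP_MS); // hold the connection open
  } else {
    res.end("ok"); // health check
  }
});

// To run standalone: server.listen(3000);
```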

You can see that after a period of time (2 minutes) the client side gets a timeout, and Nginx has also sent a request to each server.

Let’s take a look at the logs:

Here we found something: Nginx got an “upstream timed out” error, which means it thinks this server cannot handle the request, so it turns to another server for help. But in fact, the server is just busy and still handling the original request, and that is why we got duplicated data!

We can break it down into three pieces:

  1. Why did it time out?
  2. Why does a timeout cause another server to run the request?
  3. Can we adjust the timeout?

The timeout prevents the client from waiting for the server endlessly; the client can take it as a signal that the server is not running, or is in trouble and can’t resolve the request.

When using Nginx, a request is re-routed once the current server is unavailable to handle it; that is, the request goes to another server. More specifically, nginx.conf defines a group of our servers with upstream:

upstream node-app {
    server node1:3000 weight=1;
    server node2:3000 weight=1;
}

If an error occurs during communication with a server, the request will be passed to the next server, and so on until all of the functioning servers will be tried. If a successful response could not be obtained from any of the servers, the client will receive the result of the communication with the last server. — Nginx

We can add a proxy_read_timeout to our nginx.conf that is longer than the request time; the API can then keep handling its request and return to the client side properly.

proxy_read_timeout Defines a timeout for reading a response from the proxied server. The timeout is set only between two successive read operations, not for the transmission of the whole response. If the proxied server does not transmit anything within this time, the connection is closed. — Nginx
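Applied to the reproduction setup, the fix is a single extra directive in the location block that proxies to the upstream (the 180 s value is an assumption; it just needs to exceed the 120 s request):

```nginx
location / {
    proxy_pass http://node-app;
    # Wait longer than the slowest request (120 s here) before
    # declaring the upstream dead and retrying on another server.
    proxy_read_timeout 180s;
}
```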

That’s it: we have a workaround, and we finally found the root cause!

Summary

So, next time you use Nginx as a load balancer for your servers, remember to check the timeout for your APIs. If an API takes more than 60 seconds to execute, try to reduce its execution time or add proxy_read_timeout to your Nginx configuration.

Now, I’m going back to fix this in Nginx again……

CodeX

Everything connected with Tech & Code
