EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Lesson Learned from an NLB Connection Timeout

How we solved an issue spotted by internal users of our API

Jian Li
Expedia Group Technology

--


We recently received a ticket from one of our clients reporting that they kept getting timeouts when attempting to connect to our service via the API gateway. It turned out to be a tricky issue, but our on-call engineer was able to identify the root cause and implement a fix. This article explains the problem and the mitigation. We believe a similar issue may affect other services that use a Network Load Balancer behind an API Gateway, so we would like to share our experience to minimize duplicate effort.

What happened

The story begins with a ticket created by one of our clients, reporting that frequent timeouts occurred when calling our API. The following conceptual diagram illustrates how our API is exposed.

Client calls pass through the API Gateway and NLB to reach our service in ECS

As shown above, a client-initiated API call goes to the API Gateway first before reaching the AWS Network Load Balancer and then our service in ECS. The API Gateway is managed by the Platform team and handles authentication and authorization for us. The AWS Network Load Balancer distributes traffic to our ECS tasks. Note that the detailed routing path is more complicated; this illustration is simplified to focus on the problem.

Digging into the root cause

After the ticket was acknowledged, we checked service health and metrics in ECS. Service health checks looked fine. The service was returning 200 HTTP responses. Everything in the service seemed to be running normally. Our client filed a ticket to the API Gateway team as well, so we had a brief sync-up with them. API Gateway logs showed that they received timeouts from our service and returned a 502 to clients. We began to wonder — Is the NLB messing up?

Fortunately, another team had already written a blog post about a similar problem [1], in which the NLB silently closed idle TCP connections, causing timeouts on their Tomcat server. Our monitoring metrics indicated that the same thing was happening to our service: every API call timeout seemed to be accompanied by at least one “Load balancer reset count” and one “Client reset count”. These two values correspond to the AWS load balancer metrics “TCP_ELB_Reset_Count” and “TCP_Client_Reset_Count” [2], which record the number of RST packets sent by the load balancer (LB) and by the LB client, which in our case is the API Gateway.

TCP_Client_Reset_Count
----------------------
The total number of reset (RST) packets sent from a client to a target. These resets are generated by the client and forwarded by the load balancer.
Reporting criteria: Always reported.
Statistics: The most useful statistic is Sum.
Dimensions: LoadBalancer; AvailabilityZone, LoadBalancer

TCP_ELB_Reset_Count
-------------------
The total number of reset (RST) packets generated by the load balancer.
Reporting criteria: Always reported.
Statistics: The most useful statistic is Sum.
Dimensions: LoadBalancer; AvailabilityZone, LoadBalancer

Excerpts from the AWS documentation on Network Load Balancer metrics [2], see https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-cloudwatch-metrics.html
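For anyone who wants to pull these counters programmatically rather than through the CloudWatch console, the following is a minimal sketch using the AWS SDK for Java v2. The load balancer dimension value and the one-hour window are hypothetical placeholders; swap in your own NLB identifier.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class NlbResetCountCheck {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/NetworkELB")                      // NLB metrics live in this namespace
                    .metricName("TCP_ELB_Reset_Count")
                    .dimensions(Dimension.builder()
                            .name("LoadBalancer")
                            .value("net/my-nlb/0123456789abcdef")     // hypothetical NLB dimension value
                            .build())
                    .startTime(Instant.now().minus(1, ChronoUnit.HOURS))
                    .endTime(Instant.now())
                    .period(300)                                      // 5-minute buckets
                    .statistics(Statistic.SUM)                        // Sum is the most useful statistic here
                    .build();
            cloudWatch.getMetricStatistics(request).datapoints().forEach(dp ->
                    System.out.printf("%s -> %.0f resets%n", dp.timestamp(), dp.sum()));
        }
    }
}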

Understanding NLB

At first glance, we thought it could be a bug in the NLB, but after checking the official NLB documentation [3], we realized it was designed to behave this way:

For each TCP request that a client makes through a Network Load Balancer, the state of that connection is tracked. If no data is sent through the connection by either the client or target for longer than the idle timeout, the connection is closed. If a client or a target sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid.

Elastic Load Balancing sets the idle timeout value for TCP flows to 350 seconds. You cannot modify this value. Clients or targets can use TCP keepalive packets to reset the idle timeout.

So whenever a TCP connection has been idle for 350 seconds, the NLB silently closes it without notifying either side, and only sends an RST when data packets are sent over the closed connection. It finally became clear how the clients saw the timeouts! The sequence of events goes as follows.

  1. NLB silently closes the connection when reaching the idle timeout threshold.
  2. The client continues to call the API without knowing the NLB has closed the connection.
  3. The API Gateway sends traffic to a closed connection.
  4. NLB sends back the RST packet to tell the API Gateway to use a new connection.
  5. The API Gateway times out calling on the closed connection and returns 502 Bad Gateway to the client.
  6. Meanwhile, the downstream ECS service thinks everything is fine.
Sequence of events after the NLB silently closes the connection

Solving the NLB twist

Once we identified the root cause, the easiest and most straightforward solution would be to decrease the TCP keepalive value [4] to something smaller than the NLB idle timeout of 350 seconds, so that the connection is refreshed before the NLB silently closes it. Unfortunately, our service lives in a shared AWS account, which means we don’t have permission to change the keepalive property at the OS level.
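For illustration only, here is roughly what keepalive tuning looks like when it can be done per socket instead of at the OS level. This is a sketch, not what we shipped; it assumes Java 11+ on Linux (where jdk.net.ExtendedSocketOptions is available) and uses a hypothetical host name. The idea is simply to send the first probe well before the 350-second NLB idle timeout.

import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class PerSocketKeepalive {
    public static void main(String[] args) throws IOException {
        // Hypothetical upstream host holding a long-lived connection through the NLB
        try (Socket socket = new Socket("service.example.internal", 443)) {
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);   // enable keepalive on this socket
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 300);    // first probe after 300s idle (< 350s)
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 30); // retry probes every 30s
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 3);     // give up after 3 unanswered probes
            // ... use the socket as usual; the kernel now keeps the idle NLB flow alive
        }
    }
}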

Therefore, we adopted the alternative solution from the team that had hit the same problem before. Instead of letting the NLB close the connection, we have our service in ECS proactively close the TCP connection after a smaller idle timeout, so that the API Gateway stops reusing the original connection.

Now, if you are using a Tomcat server, which Spring Boot uses by default, you are in luck: there is a convenient config property, “server.connection-timeout”, for exactly this use case. All you need to do is override that property with a smaller value and the problem is solved! For us, however, this is where the struggle began. Our service uses Reactor Netty, the default server for Spring Boot’s reactive stack, and Reactor Netty doesn’t natively support a similar property. We ended up using IdleStateHandler [5] to implement the server-side idle timeout, which works as:

Triggers an IdleStateEvent when a Channel has not performed read, write, or both operation for a while.

We use a WebServerFactoryCustomizer to customize the connection channel by adding a ChannelInitializer, which adds the IdleStateHandler to the channel pipeline. The code looks like this:

factory.addServerCustomizers(http -> http.tcpConfiguration(
    tcp -> tcp.bootstrap(bs -> bs.childHandler(new ChannelInitializer<>() {
        @Override
        protected void initChannel(Channel c) throws Exception {
            // Close the channel once it has been idle (no reads or writes) for the configured timeout
            c.pipeline().addLast(
                new IdleStateHandler(0, 0, timeout, TimeUnit.MILLISECONDS) {
                    private final AtomicBoolean closed = new AtomicBoolean();

                    @Override
                    protected void channelIdle(ChannelHandlerContext ctx, IdleStateEvent evt) {
                        // Close at most once, even if several idle events fire
                        if (closed.compareAndSet(false, true)) {
                            ctx.close();
                        }
                    }
                });
        }
    }))));
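As a quick local sanity check of the new behavior, a rough probe like the one below can confirm that the server now closes idle connections on its own. The host, port, and 60-second idle threshold are hypothetical, and it assumes a plain (non-TLS) local listener.

import java.net.Socket;

public class IdleCloseCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical local endpoint with a 60s server-side idle timeout configured
        try (Socket socket = new Socket("localhost", 8080)) {
            socket.setSoTimeout(5_000);              // fail fast if the server never closes the connection
            Thread.sleep(61_000);                    // stay idle past the configured threshold
            int b = socket.getInputStream().read();  // returns -1 (or resets) once the server has closed it
            System.out.println(b == -1 ? "server closed the idle connection" : "connection still open");
        }
    }
}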

Things were finally back on track! We ran local tests, pushed the code, and deployed it to the test environment. Then a new problem came up: the ChannelInitializer ended up overriding the SSL configuration! The channel pipeline is supposed to include an SSL handler at the front, but because we overrode the initChannel method, the pipeline was no longer configured from the Spring Boot property file. What should we do? Well, let’s add the SSL handler back and see if that fixes the issue.

// Re-create the SSL handler that our custom ChannelInitializer dropped, and put it at the front of the pipeline
char[] keyStorePassword = keyPassword.toCharArray();
KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
keyStore.load(new FileInputStream(ksFile), keyStorePassword);
SslContextBuilder builder = SslContextBuilder.forServer(
    (PrivateKey) keyStore.getKey(keyAlias, keyStorePassword),
    (X509Certificate) keyStore.getCertificateChain(keyAlias)[0]);
channel.pipeline().addFirst(builder.build().newHandler(channel.alloc()));
// Then add the idle-timeout handler as before
channel.pipeline().addLast(
    new IdleStateHandler(0, 0, timeout, TimeUnit.MILLISECONDS) {
        private final AtomicBoolean closed = new AtomicBoolean();

        @Override
        protected void channelIdle(ChannelHandlerContext ctx, IdleStateEvent evt) {
            if (closed.compareAndSet(false, true)) {
                ctx.close();
            }
        }
    });

Finally! We tested it in the test environment by making hundreds of API calls and no longer saw any timeouts. The load balancer reset count also dropped to 0. The NLB twist was finally resolved! The following figure shows the NLB reset count before and after the fix.

NLB reset count before and after our fix (readings between 0 and 2 before the fix; consistently 0 after)

Lesson learned

When we use a Layer 4 load balancer, we need to pay extra attention to the different timeouts at each layer; focusing only on Layer 7 configurations can lead to unexpected errors. The NLB reset count metrics are easy to overlook when we focus on the health status of our infrastructure. In fact, the issue described in this article would not occur if the NLB were replaced by a Classic Load Balancer (ELB) or an Application Load Balancer (ALB). But making that switch would mean giving up the many benefits the NLB brings, such as higher performance than ELB/ALB, better support for containerized applications, and more, as detailed in the AWS documentation [6]. This exemplifies how we often need to weigh the tradeoffs of different technology options and make the best judgment while merrily standing up complex systems in the distributed and unreliable cloud.

References

[1] Lessons from AWS NLB timeouts. Ning, A. Mar 2019. Medium.com.

[2] CloudWatch metrics for your Network Load Balancer. AWS documentation. https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-cloudwatch-metrics.html. Retrieved on Apr. 7, 2021.

[3] Network Load Balancers Connection idle timeout. AWS documentation. https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout. Retrieved on Apr. 7, 2021.

[4] TCP Keepalive HOWTO. https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html. Retrieved on Apr. 7, 2021.

[5] The IdleStateHandler, Netty documentation. https://netty.io/4.1/api/io/netty/handler/timeout/IdleStateHandler.html. Retrieved on Apr. 7, 2021.

[6] Benefits of migrating from a Classic Load Balancer. AWS documentation. https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html. Retrieved on Apr. 7, 2021.
