AWS Serverless, KnexTimeoutErrors, RDS Proxy, and you.

Samuel Perez
LeaseLock Product & Engineering
8 min read · Dec 7, 2020
I think RDS Proxy can relate…

Motivation

Here at LeaseLock, we’ve invested heavily in serverless technologies. As I’m writing this, we have a nightly workload that is about to kick off a fleet of tens of thousands of AWS Lambdas and thousands of AWS ECS Fargate tasks that will run key business processes over the next 12 hours.

Even though we are using serverless technologies for our processing, our main datastore is still a traditional relational database: a single RDS Postgres instance. Historically, these relational databases were designed with specific assumptions about the access patterns of the clients that would be connecting to them. They were typically designed to serve a few clients — usually a cluster of backend servers — each maintaining a large number of connections in a locally managed connection pool. These connections would be long-lived and would be shared amongst the various threads and processes running within a single box.

But in the serverless world, these assumptions no longer hold. We have a lot of ephemeral processes spinning up, and each one acquires a connection that is short-lived and serves only a handful of requests.

From the Amazon RDS Proxy intro page: https://aws.amazon.com/rds/proxy/

Many applications, including those built on modern serverless architectures, can have a large number of open connections to the database server, and may open and close database connections at a high rate, exhausting database memory and compute resources. Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency and application scalability.

The Problem

As our application has scaled, we’ve started running into more and more issues related to the PSQL connection lifecycle in a serverless environment. The dreaded KnexTimeoutError was popping up in our error logs with increasing frequency. For those who have not had the pleasure: we are a NodeJS shop, and we use the well-known knex package as our ORM. We were seeing these errors because we had hundreds of Lambdas all coming alive and executing at the same time. They would all attempt to create connections to our PSQL instance, and the contention would cause many of those connections to time out.

At first, we tried tweaking the connection pooling settings that the knex library exposes. We were able to get by for a while by simply increasing the maximum number of connections each pool could create. But soon we were inundated with KnexTimeoutErrors again. So many jobs were failing that we were faced with two options: 1) come up with a real, scalable solution for querying our PSQL instance from Lambda and ECS, or 2) move away from using serverless technologies entirely.
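
For reference, that first tweak looked roughly like the sketch below. The connection string and numbers are illustrative placeholders rather than our real values; the pool block is what knex hands down to its underlying pooling library.

// A minimal sketch of the "just raise the ceiling" approach (placeholder values).
// knex forwards the pool block straight through to its pooling library, tarn.js.
const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder for our PSQL endpoint
  pool: {
    min: 2,  // knex's default
    max: 50, // raised from the default of 10 to absorb bursts of concurrent Lambdas
  },
});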

RDS Proxy to the rescue?

Fundamentally, RDS Proxy shifts the pooling and sharing of connections off of our Lambda/Fargate instances and onto a globally managed connection pool. We don’t want our RDS instance to pay the overhead price of opening and closing thousands of short-lived connections per minute. We’d much prefer that price to be paid by an intermediate technology specifically designed to handle that workload.

From a high level, RDS Proxy creates database connections as needed to service client connections from our applications. RDS Proxy keeps those database connections alive and reuses them across multiple client connections whenever possible.

Gotchas and Limitations

  • RDS Proxy is not compatible with read replicas. This was learned the hard way 😔. We’ve set up the proxy for each master instance in our environments.
  • At first, we suffered from a lot of session pinning (AWS docs here). This hampered how much efficiency we got out of each connection. I highly recommend reading that whole document, since it gives a lot of information on connection pooling and connection reuse. The key takeaway is that if you do not resolve session pinning, you’ll have a lot of idle sessions in typical serverless workloads.

In our case, we make use of the Extended Query Protocol throughout our codebase. In plain English, this means that every time we submit a query with a ? in it and send bound parameters to be handled by the db, RDS Proxy detects it and pins our client connection to a database connection for the remainder of the session. Though RDS Proxy was helping us avoid the overhead of creating and destroying connections, since most of our connections were pinned, each serverless connection was almost always mapped one-to-one to an RDS Proxy connection for the lifetime of the serverless process. With so many idle connections, we weren’t realizing the full benefits of RDS Proxy and its connection pooling functionality.
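
As a hedged illustration (the table and variable below are made up), both of the following knex calls go out as parameterized queries, which is exactly the shape of traffic that triggers pinning:

// Hypothetical example: both calls are sent as parameterized queries
// (extended protocol), e.g. select * from "leases" where "id" = $1.
async function findLease(knex, leaseId) {
  const viaBuilder = await knex('leases').where({ id: leaseId });
  const viaRaw = await knex.raw('select * from leases where id = ?', [leaseId]);
  return { viaBuilder, viaRaw };
}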

An experiment: aggressively killing connections.

After some analysis of the problem, we decided to embark on an experiment. What if, on the serverless side, we opened up a new connection to the RDS Proxy instance for each query or transaction we wanted to run, and then killed the connection immediately afterwards? Most of our code is not running SQL queries, but rather doing other work — HTTP calls, computations, etc. All of our SQL queries are stateless and do not store any data in the connection itself. Furthermore, since RDS Proxy is meant to shield the PSQL instance from connection creations and destructions, we should not need to worry about hammering RDS Proxy with aggressively short-lived connections.

In short, tweaking the knex library to create a new connection for each query worked! For some context: before we implemented single-query connections, we would have a steady state of approximately 500 connections open from the RDS Proxy to the RDS backend. Currently, we are hovering around 170 connections, roughly a 3x improvement. We’ve also seen KnexTimeoutErrors disappear entirely. Furthermore, we have not seen a significant degradation in performance on the serverless side.

This solution is not one size fits all. It works for our application given our workload. But, what is for certain is that the way we have to manage connections in serverless versus server applications is changing. If we want to leverage the benefits and conveniences offered by serverless, we need to make sure to change along with it!

Bonus: technical deep dive.

Translation: abandon all hope, ye who enter here.

For those of you who are looking to run a similar experiment, we’re providing more technical detail below on how we achieved aggressive connection killing in knex.js. And remember, you can always deep dive into knex.js and tarn.js by reading the available open source code on GitHub. Hooray for open source Node projects.

Tarn.js: This is the pooling library used by knex.js to manage the open PSQL connections. The pool configuration you pass to knex.js is handed directly to tarn.js. After some trial and error, we landed on the following configuration for our ephemeral environments (Lambda, Fargate):

{
  min: 0,
  max: 10,
  acquireTimeoutMillis: 30000,
  createTimeoutMillis: 1500,
  createRetryIntervalMillis: 500,
  propagateCreateError: false,
}
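
For completeness, here is roughly how that block plugs into a knex instance; the connection string is a placeholder for your RDS Proxy endpoint:

const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // points at the RDS Proxy endpoint, not the RDS instance
  pool: {
    min: 0,
    max: 10,
    acquireTimeoutMillis: 30000,
    createTimeoutMillis: 1500,
    createRetryIntervalMillis: 500,
    propagateCreateError: false,
  },
});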

Each of the configuration options and its specific meaning for tarn is explained below. If you’re reading this, it likely means connections are dropping somewhere, so you are also encouraged to go read the source code of tarn.js here.

min: The minimum number of connections in this connection pool. Once created, the pool will always ensure it has at least this many connections, even if the connections are idle and going unused. Since our goal is to kill all of the connections immediately as they free up, this needs to be set to 0, else the pool will hold onto one connection for the lifetime of the pool.

max: The maximum number of connections in this connection pool. The pool will never have more than this many connections in any combination of free, used, or pending creation states. This number is essentially “how many concurrent requests” we can send at once.

acquireTimeoutMillis: The number of milliseconds that can elapse before the pool throws a timeout error when attempting to acquire a connection. In other words, when you issue a query via knex, under the hood, knex will attempt to acquire a connection for the query. If there are free connections in the pool, tarn will resolve with one of them immediately. If there are no free connections but there is space to create a new one, tarn will call the create function and resolve with the resulting connection once it has successfully been created. Finally, if there are no free connections and no space, tarn will wait for either 1) a connection to free up, or 2) a connection to die, and then it will enter the top of the acquire logic again. If this whole process takes more than acquireTimeoutMillis, tarn will reject the acquisition request with a timeout error, which knex then bubbles up as a KnexTimeoutError.
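
If you want to handle these failures explicitly, one hedged approach is to check the error’s name, since how KnexTimeoutError is exported has varied across knex versions (the table below is made up):

// Recognizing a pool-acquisition timeout when it bubbles up from knex.
async function fetchLease(knex, leaseId) {
  try {
    return await knex('leases').where({ id: leaseId }).first();
  } catch (err) {
    if (err.name === 'KnexTimeoutError') {
      // The pool could not hand us a connection within acquireTimeoutMillis.
      console.error('Timed out acquiring a connection from the pool', err);
    }
    throw err;
  }
}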

createTimeoutMillis: The number of milliseconds that can elapse before the pool gives up on a connection that is being created. If a creation times out, the pool will try to create another connection.

createRetryIntervalMillis: The number of milliseconds that the pool will wait after a connection creation times out before attempting to create the next connection.

propagateCreateError: If true, an acquisition request will reject as soon as the first connection creation fails, with the error from that creation failure. Note that though this defaults to false in tarn, knex sets it to true by default (sneaky!). Therefore, if we actually want tarn.js to retry connections that have taken too long (WE DO!), we need to set this to false.

Finally, we’ve added code to hook into the release lifecycle event of a connection in the tarn pool. Once a connection has been released, it is no longer being used to service a query or a transaction, which is the very first moment we can destroy it. We’ve set idleTimeoutMillis to -1 in the code, which means connections are immediately considered idle and eligible for cleanup (note that tarn.js will complain if you try to set idleTimeoutMillis to -1 via the config; you have to modify the value once the pool has been created). When the release event fires, we call the check function on the tarn pool, which goes through and cleans up any free, idle connections. If a connection has already been picked up again because there is a backlog of queries, it will not be deleted, which is fine, since it’s immediately being put back to work.
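
For the curious, the shape of that hook looks roughly like the sketch below. It assumes knex exposes its tarn pool at knex.client.pool, which is an internal, version-dependent detail, so treat this as a sketch rather than a supported API.

// Rough sketch of aggressively releasing connections via tarn's lifecycle events.
const pool = knex.client.pool; // internal: the tarn.js Pool behind this knex instance

// tarn's config validation rejects idleTimeoutMillis: -1,
// so we overwrite the value after the pool has been created.
pool.idleTimeoutMillis = -1;

// 'release' fires whenever a connection is returned to the pool.
// check() then sweeps any free connections past their idle timeout, which with -1
// means immediately, unless a queued query has already grabbed them.
pool.on('release', () => {
  pool.check();
});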

In conclusion, the way we’ve configured our tarn pools gives us 30 seconds to acquire a connection, and in a normal state where we don’t have a huge backlog of outstanding queries, each acquisition should result in a new connection creation request. For some reason, our connection requests to RDS Proxy occasionally seem to hang, so we’ve configured createTimeoutMillis to abort any hung requests. The next request typically succeeds. In total, we get roughly 15 attempts to establish a connection, which probabilistically should be plenty.

# of attempts = acquireTimeoutMillis / (createTimeoutMillis + createRetryIntervalMillis) = 30000 / (1500 + 500) = 15

Furthermore, we’re ensuring that once a connection has been created, it’s either in use or it’s being deleted.

So far, it seems that we’ve been able to eliminate our connection issues. Yay!

Best of luck out there, and feel free to reach out to us if you have any questions. Also, be sure to check out our Careers page here.
