Troubleshooting AWS Lambda + AWS Aurora Serverless Postgres + Node-Postgres

Alli Leong
Extra Credit - A Tech Blog by Guild
5 min read · Jun 18, 2020

At Guild Education, many of our backend services run on Node.js, AWS Lambda, and AWS Aurora Serverless Postgres databases. These technologies let us deploy applications without managing servers or containers, but at times they are also black boxes that are difficult to troubleshoot. Two types of errors that I recently worked through were:

  1. Lambda invocations terminating with uncaught errors
  2. Database connection terminations upon Aurora Serverless Postgres scaling events

Lambda invocations ending with uncaught errors

AWS Lambda is a service that runs your code on servers managed by AWS. You implement your code as a handler function, and AWS manages all of the infrastructure needed to run that function on demand.
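
For context, a minimal Node.js handler looks something like the sketch below; the event shape and response are illustrative, not our actual service code.

```
// handler.js - a minimal AWS Lambda handler in Node.js (illustrative only)
exports.handler = async (event) => {
  // AWS runs this function on demand and passes the incoming request as `event`
  const name = event.name || 'world';
  return {
    statusCode: 200,
    body: JSON.stringify({ message: `Hello, ${name}` }),
  };
};
```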

While debugging a Lambda function that was using the node-postgres library to execute SQL in Aurora Serverless Postgres, I observed my Lambda invocations terminating unexpectedly with the mysterious log message `Unknown application error occurred`. Despite wrapping my Lambda handler code in try/catch statements, the invocations were ending with an uncaught ‘Connection terminated unexpectedly’ error thrown by the node-postgres library.

I had seen this issue before as a consumer of another service at Guild, and remembered that the problem had been fixed. I went to that service’s codebase to see how it handled its database connections. I also googled the ‘Connection terminated unexpectedly’ error message, which led me to parse through several GitHub issues and source code files of the node-postgres library.

Attach a listener on the client to catch database errors

From these resources, I learned that it is necessary to attach an error listener to node-postgres database clients in order to gracefully handle errors from the database. This is because the node-postgres client is an EventEmitter, and errors coming from the database are emitted back to the client as ‘error’ events. Without this listener, the errors are uncaught and cause the Lambda invocation to exit.
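
Here is a minimal sketch of what that looks like with node-postgres; the connection settings and logging are illustrative.

```
const { Client } = require('pg');

async function getClient() {
  // Connection settings are read from the standard PG* environment variables.
  const client = new Client();

  // The node-postgres client is an EventEmitter. Errors the database raises
  // outside of an in-flight query are emitted as 'error' events; without this
  // listener they become uncaught exceptions and end the Lambda invocation,
  // no matter how much try/catch surrounds the query calls.
  client.on('error', (err) => {
    console.error('Unexpected error on database client', err);
  });

  await client.connect();
  return client;
}
```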

Database connection terminations during Aurora Serverless scaling events

We have several use cases at Guild for making bulk requests to read or update data from a single API endpoint implemented as a Lambda function. One example is when we invoice one of our employer clients for the thousands of employees who have used their education benefits. The invoice generation process calls the `getEmployee()` endpoint of the employment data service once per employee, resulting in thousands of requests to the endpoint. A second use case is when we need to assign education benefits to tens of thousands of users; each of these requests reads data about a user and writes new benefit records to the database in a single transaction.

Multiple Lambdas connect to one Aurora Serverless cluster

In both of these cases, the requests fan out to many Lambdas, which are all connected to one Aurora Serverless cluster. Although Lambdas and Aurora Serverless are designed to automatically scale up and down to accommodate dynamic loads, we have observed cases where the scaling does not behave as expected.

Once I attached the error listener described above, the errors that I caught indicated unexpected database connection terminations. The Postgres logs in CloudWatch show that during some scaling events, the database abruptly disconnects all sessions and shuts down before restarting. I still need to reach out to AWS support to figure out why this is happening.

Log lines show that the database shuts down and restarts

Even without understanding the root cause, I found a few possible solutions to this problem.

When the queries are independent

In one instance of this problem at Guild, the sheer volume of thousands of independent queries against the database was causing the unexpected connection terminations. In that case, the solution was to implement application logic that recognizes when an unexpected database connection termination has occurred and responds by reconnecting to the database and retrying the request that had failed.
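
A sketch of that retry logic, reusing the hypothetical `getClient()` helper from the earlier sketch; the retry count is arbitrary, and a real implementation would inspect the error so that only connection failures are retried.

```
// Run a single, independent query, reconnecting and retrying if the
// connection is terminated out from under us (all names are illustrative).
async function queryWithReconnect(sql, params, attempts = 3) {
  let client = await getClient();
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await client.query(sql, params);
      await client.end();
      return result;
    } catch (err) {
      // In practice, check err.message or err.code so that only connection
      // terminations are retried, not genuine query errors.
      try { await client.end(); } catch (_) { /* session is already gone */ }
      if (attempt === attempts) throw err;
      client = await getClient();
    }
  }
}
```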

When a series of queries belongs to a single transaction

In the case where queries to the database belonged to a single Postgres transaction and were not independent, automatically retrying a failed database request couldn’t work. This was the situation I was in. Whenever a database connection was terminated in the middle of a transaction, I could not simply reconnect to the database and retry the failed request, because the retry would run on a new database session and no longer belong to the original transaction. I explored two solutions to this issue.

First, I tried using the built-in Data API for Aurora Serverless. As a reference page for the Data API describes, ‘The Data API doesn’t require a persistent connection to the DB cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections’. The Data API also provides operations to support transactions. This approach seemed promising.
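
A rough sketch of what a transactional call through the Data API looks like with the AWS SDK for JavaScript; the ARNs, database name, and SQL here are placeholders.

```
const AWS = require('aws-sdk');
const rdsData = new AWS.RDSDataService();

// Placeholder identifiers - supply your own cluster and secret ARNs.
const resourceArn = process.env.CLUSTER_ARN;
const secretArn = process.env.SECRET_ARN;
const database = 'postgres';

async function runInTransaction() {
  // Each call is an HTTPS request; there is no persistent database connection.
  const { transactionId } = await rdsData
    .beginTransaction({ resourceArn, secretArn, database })
    .promise();
  try {
    await rdsData
      .executeStatement({
        resourceArn,
        secretArn,
        database,
        transactionId,
        sql: 'UPDATE benefits SET updated_at = now()', // placeholder SQL
      })
      .promise();
    await rdsData
      .commitTransaction({ resourceArn, secretArn, transactionId })
      .promise();
  } catch (err) {
    await rdsData
      .rollbackTransaction({ resourceArn, secretArn, transactionId })
      .promise();
    throw err;
  }
}
```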

However, because of our use of Postgres advisory locks, which are tied to a database session, the Data API didn’t work for us: the Data API doesn’t give you a persistent session on which to hold those locks.

Instead, I used the second option: retry with a delay. I handled unexpected database connection terminations by waiting 10 seconds before reconnecting to the database, allowing the error from the failed database request to bubble up to the transaction-handling layer. Once reconnected to the database, I retried the transaction as a whole. This generous 10-second buffer gives the database time not only to restart, but also to begin accepting write requests again (when the database first restarts, it comes up in read-only mode and rejects requests to write new records).
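
A sketch of that approach, again reusing the hypothetical `getClient()` helper; `runTransaction` stands in for whatever function issues BEGIN, the queries, and COMMIT against a given client, and the retry count is illustrative.

```
const RECONNECT_DELAY_MS = 10000; // buffer for the restart plus leaving read-only mode

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry the entire transaction when the connection is terminated mid-flight.
async function withTransactionRetry(runTransaction, attempts = 2) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const client = await getClient();
    try {
      return await runTransaction(client);
    } catch (err) {
      // A terminated connection aborts the whole transaction, so rerun it
      // from the top after giving the database time to come back up and
      // start accepting writes again.
      if (attempt === attempts) throw err;
      await sleep(RECONNECT_DELAY_MS);
    } finally {
      try { await client.end(); } catch (_) { /* session is already gone */ }
    }
  }
}
```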

Conclusion

These were some issues that I encountered when implementing a feature that relied on tens of thousands of concurrent database requests. Troubleshooting distributed systems can be painful, and I wanted to share my experience in this stack in the hopes that it can help someone else in this situation.
