Can you use database pooling with serverless?
Below is a copy and paste of an internal investigation I did for database load issues we were hitting at Oppizi. I think there’s a lot of confusion out there about Lambda and database pooling; hopefully this helps others. Please comment if you have experience with this.
The problem: production starts falling over with errors like “sorry, too many clients already” and “remaining connection slots are reserved for non-replication superuser connections”.
The cause: this happens when our database has too many connections open. When it reaches an upper limit it refuses to open any more, and clients start seeing errors like the above. New connections get opened in lots of different ways. To state the obvious, a connection is opened when the database is queried, but inefficient code can open more than it needs, and if connections aren’t closed after use the upper limit will eventually be hit.
Solution idea #1: use connection pooling. This means a finite number of connections can be opened; if all connections are in use, pending queries wait until a connection in the pool becomes free. Pooling also has a slight performance benefit (10~20 milliseconds per query) because connections don’t need to be established again and again once they are in the pool. This does solve the issue of hitting max connections, but it means that if demand exceeds what the pool can handle, new queries must wait for a connection to become available. That’s fine for load spikes, but we could end up back in the same situation if load is continuously more than the pool can handle. That said, it protects us against spikes, so it’s worth investigating.
Can it be done on serverless?
According to the AWS docs it is possible to set up a pool and reuse it if the same container is used.
So trying it out I fired 100 requests against a piece of code that didn’t use a connection pool and then did the same against a piece of code that did use a pool.
The results were exactly the same, except the code that used the pool took a lot longer to resolve.
Why does the pooling one take longer? We’ll get to that, but the point is we still hit the too many connections error at roughly the same point.
So I started digging into Lambda a bit more. Turns out by default it uses a new container for every single one of those requests, which means it would have created 100 pools. I was half expecting some containers to be re-used. The upper Lambda limit is 1000 containers (but you can ask AWS for more), and there’s also an option to limit the number of containers used. So I set it to 1 to see what would happen.
Naively I assumed the same container would have been used for all requests, but it turns out a Lambda container handles requests synchronously: the same container won’t take a new request while it’s currently busy. This pretty much kills any chance of pooling solving the max connections problem. By throttling containers we could avoid hitting that upper limit, but at the same time we’d put an upper limit on the amount of load we could handle with Lambda, which is exactly the same problem pushed somewhere else.
But it may go some way to limiting the problem, giving each individual container a ceiling on connections. My original thought was that pooling might make things worse: once a container dies we can’t know when this happens and close its pool, so we would bleed connections until they are auto-closed by the database.
Is that true? No, not unless you have crazy pool settings. From scouring various sources online, AWS keeps a container ‘hot’ for at least 5 minutes, so that if more work comes in there’s a machine already available to accept it and you won’t get that dreaded cold start delay. That means as long as you set your pool to close idle connections before the 5 minutes are up, you don’t have to worry about this scenario.
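As a rough sketch, the relevant `pg` pool options look like this (the specific numbers are illustrative, not recommendations):

```javascript
// Sketch of pg Pool options: keep the idle timeout well below the
// ~5 minute container lifetime so a dying container can't leak
// connections for long. Values here are assumptions, tune for your load.
const poolConfig = {
  max: 5,                   // connections per container, not per account
  idleTimeoutMillis: 10000, // pg's default; far below 5 minutes
};

const FIVE_MINUTES_MS = 5 * 60 * 1000;
// sanity check: idle connections die long before the container does
const safe = poolConfig.idleTimeoutMillis < FIVE_MINUTES_MS;
```

You’d pass this object to `new Pool(poolConfig)` from the `pg` package.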
Remember that delay we saw earlier, where the same test using pools took on average 10 seconds longer? By default the pg package will close idle pool connections after 10 seconds. That’s no coincidence. When you’re using a pool, Lambda will wait for all connections to close before sending the response to the client, which is a pretty major delay. This is because it waits for the Node event loop to empty. You can turn this behaviour off with the `callbackWaitsForEmptyEventLoop` setting on the context object.
So in short you can use pooling but it’s on a per container basis, it won’t solve your scaling problems.
And just to cover all bases, make sure you declare your pool outside of the handler function, otherwise you’ll be creating a new pool for every request rather than per container.
What else could we do?
Solution idea #2: write a connect backoff. If a connection fails due to too many connections, retry again after a certain amount of time. This would have the same effect a pool has of waiting out a spike in load, with the slight extra overhead of initialising a connection each time.
I’ve spent a couple of hours working on a proof of concept here but didn’t get anywhere fruitful using the pg package. I believe it’s doable but with a bit more work.
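The general shape I had in mind, sketched without the pg specifics (here `connect` is any function taking a Node-style callback; with pg you’d wrap `client.connect`):

```javascript
// Retry a failing connect with exponential backoff: wait, then try
// again with double the delay, up to a retry limit.
function connectWithBackoff(connect, retries, delayMs, cb) {
  connect((err, conn) => {
    if (!err) return cb(null, conn);
    if (retries <= 0) return cb(err);
    // e.g. "too many clients already": back off, then retry
    setTimeout(
      () => connectWithBackoff(connect, retries - 1, delayMs * 2, cb),
      delayMs
    );
  });
}
```

A real version would probably only retry on the specific “too many clients” error rather than every failure.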
Solution idea #3: improve inefficient code. When we get a spike it’s often caused by an inefficient piece of code going nuts and creating lots of connections. We could proactively spend some time looking for problem areas in the code base and rewriting them to be more efficient, and retroactively diagnose what happened when we hit a spike to see if the code can be improved. This is always worth doing, as time spent here may equate to $ saved on the AWS bill.
Solution idea #4: add database read replicas. We can add as many of these as we need; the trade-off is $ cost. But if we get frequent spikes due to reading data, it should be trivial to add a read replica database and query that instead. The work here would involve a small amount of code refactoring so that read queries point at one database and update/delete queries point at another, then simply provisioning a read replica.
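The refactor could be as simple as a router that picks a pool per query. This is a deliberately naive sketch (the pool objects are stand-ins, and real code would route per call site rather than by parsing SQL):

```javascript
// Stand-ins for two pg Pools, one per database.
const primary = { name: 'primary' }; // handles writes
const replica = { name: 'replica' }; // handles reads

// Naive router: SELECTs go to the replica, everything else to the
// primary. Misses CTEs etc., so real code should be explicit per query.
function poolFor(sql) {
  return /^\s*select/i.test(sql) ? replica : primary;
}
```

Usage would be `poolFor(sql).query(sql, ...)` at each call site.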
Solution idea #5: don’t do anything drastic and wait for Amazon Aurora Postgres support, which in theory will complement serverless and auto-scale the database up and down for us. At the time of writing you can sign up to trial the beta version.
My suggested plan of action:
- Add a read-replica database, refactor code to point at it
- Improve inefficient code, allocate some time to look for and fix inefficient areas of code; can be done at the same time as the read-replica work
- Put a process in place for diagnosing production outages and learning from them
- Write a connect backoff
- Wait for Amazon Aurora Postgres so we don’t have to care about infrastructure too much and can focus on the product