A series about scaling reasonable tech defaults into unreasonable situations.
How did we get here?
It’s the same story for every new startup: you want to quickly roll out an API for a proof-of-concept app, and you want to spend as little time and money doing it as possible. While many new options have appeared in recent years to accomplish this well, chances are that in the past you reached for the reasonable default: Rails and a Postgres database.
Most applications never reach a very large scale, so this is largely where the story begins and ends. But, if you’re lucky, one day you may wake up and realize you have more users than you can comfortably comprehend. You need that cutting edge stack you always dreamt about, but right now you have a ton of things to do, an existing dataset, and users that won’t patiently wait for you to turn the lights out and fix things.
This is precisely what happened at Hornet just a year in, and we quickly realized that the best path forward was to make Postgres work as well as possible in the medium term. I am happy to report that, five years later, it is still the very same core database that powers the app. Although many data responsibilities have since been shipped out to other technologies (more on that later), ultimately Postgres still serves millions of concurrent users in some form.
Part 1: Connection pooling
It’s no secret that Ruby isn’t the most resource-efficient language, especially in its earlier years. The good news is that most of this inefficiency for Rails APIs is in code execution itself, and hence CPU time, so a realistic short-term solution is simply to add more servers and processes to serve more requests in parallel. This is precisely the first step we took. Since we hadn’t done any optimization, this required a huge number of application and job processes, which in turn meant that the total number of concurrent connections to Postgres went from just tens to thousands. Unfortunately this hit one of Postgres’ first key weaknesses.
PG and max_connections
Postgres achieves concurrency with a forking multi-process model (much like common Ruby deployments). Each connection from a client is handed a process from the current pool of processes, and if none exists a new process is forked to handle the connection. However, this pool of processes is limited by the max_connections setting. The best value for this setting varies substantially based on setup, available memory and so on, but it mostly comes down to contention on CPU and I/O, and performance typically tops out at a few processes per CPU core. The optimal setting for us was just over 100 max_connections - way less than the several thousand connections we required.
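For reference, this is a single line in postgresql.conf (the value below is illustrative, not a recommendation; a restart is required for it to take effect):

```ini
# postgresql.conf -- illustrative value, tune for your own hardware;
# changing this requires a server restart
max_connections = 100
```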
However, as it turns out, at least half our request time was spent in Ruby code execution (thanks ActiveRecord!), which meant most database connections were sitting idle. This meant that if we moved away from a 1:1 coupling of clients to Postgres connections, we could serve a much larger number of clients by reusing connections. Thus emerged a solution: upstream connection pooling.
There are several connection pooling solutions for Postgres. We ended up selecting pgBouncer, since its asynchronous event notification architecture makes it very efficient and, more importantly, it offers a “transaction” pooling mode. The latter is very important, as it allows a single Postgres connection to be swapped around between clients while still letting transactions appear normal to each client, as if nothing strange is going on. This works particularly well for statements outside transactions (such as a simple User.find(id) from replicas).
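A minimal pgbouncer.ini for this kind of setup might look something like the following (the hostname and pool sizes here are illustrative, not our production values):

```ini
[databases]
; clients connect to "hornet_production" on pgBouncer and are
; proxied through to the real Postgres server
hornet_production = host=10.0.0.10 port=5432 dbname=hornet_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction pooling: a server connection is held only for the
; duration of a transaction, then handed to the next waiting client
pool_mode = transaction
default_pool_size = 50
max_client_conn = 10000
```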
First attempt: Local pgBouncer
Our initial implementation of pgBouncer was to have it run on the same machine as Postgres, as this basically eliminated any network latency between the two. We upped the file handle limits on the machine to accept up to 20k connections:
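On most Linux distributions, that means raising the open-file (nofile) limit for the user running pgBouncer, along these lines (user name and values illustrative):

```
# /etc/security/limits.conf -- each client socket consumes a file handle
pgbouncer  soft  nofile  20000
pgbouncer  hard  nofile  20000
```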
At first this seemed to be working well, but we were soon plagued by intermittent spikes of system CPU time during which Postgres was extremely unresponsive, and this started to cause request backlogs in Rails.
We noticed that the spikes seemed to originate from kswapd. Based on the frequency at which they occurred, it seemed likely that the Linux kernel was performing page defragmentation as part of its Transparent Huge Pages feature. Since pgBouncer also spends a lot of CPU time in the kernel (due to the huge number of events and network send and receive calls it generates), the contention seemed to be causing a complete breakdown of Postgres performance while the dirty pages it generated were waiting to commit. We disabled this feature entirely by running:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Indeed, disabling the defragmentation got rid of the spikes, but the overall performance of the machine and Postgres still seemed degraded.
Second attempt: Dedicated pgBouncer Host
The next obvious solution was to move pgBouncer to run on its own instance.
We placed pgBouncer in the same EC2 placement group as Postgres to get the lowest possible latency. This setup worked much better and alleviated the degraded performance we were seeing on the Postgres instance. However, it seemed to introduce a new odd phenomenon: during peak times, queries were now taking much longer on average than before, but without much load on Postgres. Network latency between the hosts during this time was normal, and the load average looked healthy on the pgBouncer instance. However, one thing did catch our attention: pgBouncer had a 0.5 load average on a 2-core machine and never seemed to exceed this value, which implied that pgBouncer was only making use of a single CPU core and hitting a ceiling there. I presume this is due to its implementation on top of libevent which, while very efficient, does not scale across multiple threads, and hence CPU cores.
Sure enough, running another pgBouncer process on the same machine but on a different port, and connecting some machines to the second process, lowered CPU well below this ceiling and fixed the issue of increased query times.
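One way to do this (file names hypothetical) is a second config file that differs only in its listen port and runtime file paths, started as an additional daemon with pgbouncer -d:

```ini
; pgbouncer-6433.ini -- identical to the main config apart from:
listen_port = 6433
pidfile = /var/run/pgbouncer/pgbouncer-6433.pid
unix_socket_dir = /var/run/pgbouncer-6433
```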
Solution: Multiple pgBouncer Hosts
There were a few flaws in the above implementation, however. The most significant is that it created a single point of failure for all databases: the pgBouncer host. Any network latency on this host would slow down the entire application, and a dead instance would bring down the application entirely. Additionally, as we performed further optimizations and traffic continued to grow, the number of database queries per second eventually exceeded the capacity of even a single pgBouncer host.
For the next implementation, we decided to run multiple pgBouncer hosts and make these as commoditized and easily replaceable as possible (this is the cloud after all!). On startup, we simply auto-configure these to point to the database we want, and then run several in an autoscaling group.
Each pgBouncer host handles max_connections / number_of_pgbouncer_hosts connections to Postgres, so with 2 hosts and 100 max_connections, each host only needed to deal with 50 connections, giving more headroom to each. With this setup, we also provisioned more pgBouncer hosts than we required, so that some of them could fail without impacting the application.
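The sizing arithmetic is straightforward; a hypothetical helper (the method name is illustrative, not from our codebase) with the numbers from above:

```ruby
# Hypothetical sizing helper: split the Postgres connection budget
# evenly across the pgBouncer hosts in front of it.
def per_host_pool_size(max_connections, pgbouncer_hosts)
  max_connections / pgbouncer_hosts
end

per_host_pool_size(100, 2) # => 50
```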
pgBouncer provides some very useful statistics, like the number of clients waiting, queries per second and so on, which are queryable using a standard Postgres connection to pgBouncer. Naturally, being on EC2, we followed the path of least resistance and piped these into Cloudwatch dashboards.
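These statistics live in pgBouncer’s virtual “pgbouncer” admin database; connect to it with any Postgres client (e.g. psql -p 6432 pgbouncer) and run commands such as:

```sql
SHOW STATS;    -- per-database query and traffic counters
SHOW POOLS;    -- active/waiting clients and server connections per pool
SHOW CLIENTS;  -- one row per connected client
```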
Load balancing between pgBouncers
Since we now had several pgBouncers that we needed to point Rails to, we needed a way to distribute connections across all of them. We investigated several proxy solutions (haproxy, nginx, ELB, etc.), but ultimately these suffered from the same problems that pgBouncer did: there’s just no good way to push that many network packets and connections through a single host without hitting limits in the OS, CPU, networking, and just about everything else.
So how were we to load balance the load balancer? It seemed like a case of turtles all the way down. In a stroke of luck, we realized that we did not need to distribute load perfectly between pgBouncer hosts, because every Postgres and pgBouncer instance was running well under capacity, so we simply told the application to shuffle the list of hosts and pick one at random to connect to. Since the Mersenne Twister employed by Ruby’s randomization functions produces a largely uniform distribution, we found that, serendipitously, we tended to get fairly equal distributions of connections across pgBouncer hosts without needing to track any state information.
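The host-picking logic can be as small as this (the constant and method names here are hypothetical, not from our actual codebase):

```ruby
# Hypothetical list of pgBouncer hosts; in reality this came from our
# auto-generated configuration.
PGBOUNCER_HOSTS = %w[pgbouncer-1 pgbouncer-2 pgbouncer-3].freeze

# Pick a host at random; Ruby's PRNG gives a near-uniform spread, so
# connections end up roughly balanced across hosts with no shared state.
def pick_pgbouncer_host(hosts = PGBOUNCER_HOSTS)
  hosts.sample
end
```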
Initially we took care of pgBouncer host failures via a complicated fallback system that used TCP health checks, service discovery and DNS to route Rails away from bad pgBouncer hosts. Despite this being fairly standard practice, it turned out to be really messy because of the delay in detecting failures, the time to propagate records, and the huge amount of configuration management required. Typically, the Rails app servers connected to a bad pgBouncer host would error out several requests or hang while this failover was orchestrated, pushing errors down to our users.
The solution to this was to simply let clients (such as the Rails applications) know which pgBouncers were available and let them do their own intelligent retries until the background replacement of the bad host was done. We extended the activerecord-postgresql-adapter slightly to add the following functionality:
- Allow specifying multiple database hosts in the database configuration
- Automatically round robin between the specified hosts to distribute load
- Recover from dead connections by reconnecting to the first healthy host in the list
- Retry idempotent statements to recover from failures even faster
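A stripped-down sketch of the idea behind the round-robin and failover behavior (this is illustrative, not the gem’s actual implementation; the real adapter hooks into ActiveRecord’s connection handling, and IOError stands in for real connection errors):

```ruby
# Illustrative: rotate through hosts in order, and on a connection
# failure move to the next host until the list is exhausted.
class HostRotation
  def initialize(hosts)
    @hosts = hosts
    @index = 0
  end

  # Round-robin: hand out hosts in order, wrapping around the list.
  def next_host
    host = @hosts[@index % @hosts.size]
    @index += 1
    host
  end

  # Run a block against a host; on failure (IOError here as a stand-in
  # for connection errors), retry against the next host in the rotation.
  def with_failover
    attempts = 0
    begin
      yield next_host
    rescue IOError
      attempts += 1
      retry if attempts < @hosts.size
      raise
    end
  end
end
```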
I’ve extracted this adapter and made it available as a gem (improvements welcome!): activerecord-postgresql-fallback-adapter
While traditional wisdom tends to favor handling upstream failures transparently, over the years we’ve actually become fans of having more intelligent network clients that are aware of the upstream service topology and can take an active part in failover and optimization. Have a look at the Cassandra Driver for a great example of this.
As the last piece of the puzzle, we needed to ensure that bad client connections died quickly and reconnected to healthy hosts as soon as possible. Unfortunately, most Postgres client libraries (like the pg gem, which is based on libpq) do not manage TCP connections or enforce timeouts all that well, so we’ve had to tweak the TCP/IP stack of most clients to drop unstable TCP connections more aggressively. This can be accomplished by lowering net.ipv4.tcp_retries2 and similar settings. I won’t post specific values, as it’s important not to blindly follow kernel tweaks but rather to have a good way to test and verify them in your own environment!
End of part 1
It’s easy to see just how much work can go into something as seemingly simple as connection pooling, but the combination of Postgres’ very limited connection handling and the sheer number of Ruby processes we needed made it essential in our case.
In the next article I’ll share more about replicating and sharding Postgres to scale things up even further.