Cloud hosting providers like Amazon AWS make it easy to build and scale your infrastructure. But if you are in a fast-growing startup you know that balancing performance and cost can be complex and time-consuming.
There are many layers of the stack that can be adjusted, and lots of decisions to make:
- Optimisations in application code.
- Caching, which itself has lots of options (HTTP, page, action, fragment, …).
- Choices about server class (CPU optimised, memory optimised, Intel, AMD, …).
- Reserved server instances vs on-demand.
- Lots of small servers vs fewer big servers, containers, ….
Under the time pressures of a startup, the easiest and most rational decision is often to give Amazon more money and scale out to more and/or bigger servers. But sometimes it makes sense to optimise. In this post I will discuss the motivations and decision-making that led us to switch to Puma, the details of testing and configuring Puma, the cost savings it delivered, and finally an added bonus we got that opened up further potential savings.
FiNC’s backend is predominantly Ruby on Rails microservices. We make use of Docker containers and Amazon ECS. The containers are hosted on a combination of reserved EC2 instances (currently m5.4xlarge) and on-demand instances, which we scale in and out depending on traffic load. We discussed our autoscaling system in a previous article.
Previously all our Rails applications ran Unicorn, which had proved reliable. But as of Rails 5.0 the default HTTP server is Puma. In the Rails world it can be good to follow the herd, but the key motivator for considering a switch to Puma was this metric from our ECS dashboard.
Across all our application servers memory was our most constrained resource.
What is Puma, and how can it help?
Like Unicorn, Puma is an HTTP server for Rack applications. However, a key difference is that Puma is threaded. Each thread runs the same application code and can handle a separate request, while sharing the same memory. This offers the potential to reduce memory usage and achieve higher utilisation of available CPU resources.
When running threads you need to be sure your code, the framework, and any libraries are thread safe. Thread safety is too big a topic to cover in detail here, but a good starting point is this presentation by Aaron Patterson that discusses the introduction of thread safety in Rails 4. Also check out this excellent presentation by Daniel Vartanov on thread safety and the horrors of running threads before Rails was thread safe.
We were fairly confident our code would be OK, but as an extra check we added this Rubocop plugin to our CI, and also ran our tests in parallel, which can help expose issues.
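As a sketch, wiring a thread-safety linter into CI can look like this in the Gemfile (I'm assuming the commonly used rubocop-thread_safety gem here; the exact plugin we used may differ):

```ruby
# Gemfile — thread-safety linting for CI.
# rubocop-thread_safety flags patterns like mutable class-level state
# that become dangerous once requests are served from multiple threads.
group :development, :test do
  gem "rubocop", require: false
  gem "rubocop-thread_safety", require: false
end
```

The plugin is then enabled from `.rubocop.yml` and runs as part of the normal `rubocop` CI step.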
MRI is probably the most common Ruby interpreter and is what we use at FiNC in our Production environment. MRI has a Global Virtual Machine Lock (in older versions called the Global Interpreter Lock, or GIL). The GVL means only one thread is ever executing at any given time, and all threads of a process are constrained to a single CPU.
This might sound like it would severely hamper the benefit of threads, but in most Rails applications requests include a proportion of I/O wait (waiting on database queries, API calls, or file reads and writes). During I/O waits MRI can execute other threads, so multithreaded Puma can still serve more requests than a single process.
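To make the I/O-wait point concrete, here is a small sketch. `sleep` stands in for a blocking database query or API call; like real I/O in MRI, it releases the GVL so other threads can run:

```ruby
require "benchmark"

# Simulate 4 "requests" that each spend 0.2s waiting on I/O.
simulated_io = -> { sleep 0.2 }

# One single-threaded process handles them one after another.
sequential = Benchmark.realtime do
  4.times { simulated_io.call }
end

# Four threads overlap their I/O waits, even under MRI's GVL.
threaded = Benchmark.realtime do
  4.times.map { Thread.new { simulated_io.call } }.each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
# The threaded version finishes in roughly 0.2s rather than 0.8s,
# because a thread waiting on I/O does not hold the lock.
```

If the work were pure CPU (no I/O wait), the two timings would be roughly equal, which is why the share of I/O wait per request matters so much when tuning thread counts.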
There are Ruby interpreters, notably JRuby and Rubinius, without a GVL. They can run threads concurrently across CPUs. But we didn’t want to change too much of our Production infrastructure at once, so experimenting with alternative Rubies will have to wait for another day.
Testing & Tuning
With Puma and our use of containers behind a load balancer, we have a fair number of variables we can control: how many containers we run, how many Puma processes per container, and how many threads per process.
Where to start? The first thing I would suggest is to get a rough idea of which requests make up the majority of the load on the server, and how much of their time is I/O wait. At FiNC we use New Relic, and it provides this nice screen showing the requests that consume the most time:
Four requests consume about 60% of the time our servers spend processing requests, with each individual request spending between 25–40% of its time on database queries and API calls to other microservices.
This suggests we will definitely see some benefits from threads, but we probably don’t want to go crazy and set the thread count too high.
I conducted load tests with 2, 3, 4, 5, and 6 threads. Beyond 4 threads I saw no improvement in performance, just higher memory usage. It seems that for our application 4 threads is optimal. I will provide details about the load testing in a future post.
With the thread count of each Puma process decided, I next wanted to determine how many Puma processes to run in each container.
There are a few factors to consider. The first is that ALB simply round-robins requests to healthy containers. So if a single container gets swamped with lots of slow requests, the ALB will continue sending it requests even though other containers may be better placed to handle them. Low process counts per container increase the risk of an individual container getting swamped.
Puma is much better at distributing requests: it only attempts to process a request when it has idle workers/threads, and the request goes only to an idle thread. There may therefore be performance benefits to containers with high process counts.
But I wanted to limit how much of Production we changed at once. I decided it was best to configure Puma to have a similar memory footprint to the Production Unicorn setup.
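As a sketch, the 4-process, 4-thread setup we eventually settled on looks roughly like this in `config/puma.rb`. The environment-variable names are common Rails conventions, not something from our actual config:

```ruby
# config/puma.rb — illustrative 4-worker, 4-thread configuration.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 4))
threads threads_count, threads_count

# Load the app before forking workers so they share memory via copy-on-write.
preload_app!

port ENV.fetch("PORT", 3000)
environment ENV.fetch("RAILS_ENV", "production")

on_worker_boot do
  # Each forked worker needs its own database connections (Rails convention).
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end
```

Keeping worker and thread counts in environment variables makes it easy to try different combinations per container without an image rebuild.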
To perform the load testing I used the excellent command-line tool Siege. I’ll cover this process in more detail in [PART 3].
My first step was to benchmark our current setup with Unicorn. Our Production containers ran Unicorn clusters of 6 worker processes (Unicorn 6W). So our SRE team configured a Staging environment of two containers, each running Unicorn 6W, and had the two containers behind an Amazon Application Load Balancer (ALB).
First I tested the ability to handle a short spike of high traffic. I ran various numbers of concurrent clients to see the maximum Unicorn could handle for one minute.
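A Siege run for this kind of spike test looks roughly like this (the URL is a placeholder for our Staging endpoint; `-c` sets the number of concurrent clients, `-t` the duration, and `-b` runs in benchmark mode with no delay between requests):

```shell
# Hammer the staging endpoint with 40 concurrent clients for one minute.
siege -b -c 40 -t 1M https://staging.example.com/api/v1/items
```

Siege prints availability, response times, and transaction rate at the end of the run, which is what we compared across configurations.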
The Unicorn 6W setup could sustain 40 concurrent clients for a minute. Any more than 40 concurrent clients, or longer than one minute, and requests would backlog. Once backlogged, the health-check requests from the ALB would time out, and the ALB would deem the containers “unhealthy” and kill them.
For sustained loads (greater than 5 minutes) Unicorn 6W could handle 25 concurrent clients.
During the test each application process averaged 250MB. So each container was about 1.5GB (6 workers × 250MB), giving a total memory footprint of 3GB across the two containers.
I then repeated the tests for various Puma worker counts, checking what Puma could handle for short spikes of very high concurrency and what it could sustain. During the tests I also checked memory usage via New Relic:
On boot the Puma instances used pretty much the same memory as Unicorn, at 250MB. However, once I began load testing, memory usage increased. When testing Puma with 4 worker processes, each with 4 threads (Puma 4W4T), memory usage per process was 330MB.
Puma 4W4T could sustain 65 concurrent clients for a one-minute load test. Higher than 65 would trigger an “unhealthy” event. For sustained loads Puma 4W4T could handle 50 concurrent clients.
The Puma 4W4T setup could handle twice the sustained load of Unicorn 6W. The total memory footprint of Puma 4W4T with two containers was 2.6GB, which was also about 15% less than Unicorn 6W.
After this load testing, and further testing from our QA team, we switched FiNC Mall Production to Puma. Since our Production resources are memory constrained, handling twice the load with a slightly smaller memory footprint means the Production application-server costs of FiNC Mall are more than halved.
The Mall does not get huge traffic, so this only gives a saving of around a hundred dollars per month. But it bodes very well for switching other services. I will report back with more results when we switch one of the more heavily used services.
There was one unexpected benefit of the Puma switch which related to our autoscaling. I’ll cover this in part 2.
- One thing I didn’t touch on in this article is that Puma has better handling of slow clients. Unicorn doesn’t handle such clients, so our Production environment runs Nginx as a reverse proxy in front of each application-server container. We can experiment with removing Nginx and saving additional server resources.
- We should experiment with increasing process counts and using fewer containers.
- Experiment with alternative Rubies.