The Internet, the C10K problem, and the need for concurrency
The Internet is a big place now. From 1996 to 2018, the number of users grew significantly, because more and more businesses rely on digital products to engage their customers. The real estate market has not remained untouched by this wave of digital innovation either. Housing.com, a leader in digital products for the real estate market, needs to keep strengthening its infrastructure: scalable application servers, caching layers, load balancers, and so on. Since that infrastructure is hosted on AWS, we also need to use cloud resources efficiently (maximum CPU usage without hogging RAM) while providing a hassle-free, reliable user experience that supports many concurrent client connections. With that intent in mind, we started evaluating the scalability of our application servers, Unicorn and Puma.
Housing and Service-Oriented Architecture
Housing's backend is powered by multiple web services. Housing hosts around 30 RESTful web services and hundreds of APIs, both public and private. To do its job, a web-service transaction generally has to connect to other web services and to peripheral layers such as databases, caches, and search clusters. So any web transaction spends most of its processing time connecting to other services or waiting for them to respond, which makes these processes inherently IO-bound.
If you look closely at the image above, which shows a time-slice graph of an application server, you will see that half of the time is spent in request queuing and in waiting for Redis, Memcache, Postgres, and external web calls. The rest of the time is spent in Ruby and its middleware doing the actual work.
As for CPU-bound computations, if there are any, we offload them to asynchronous worker queues. The APM insights in the image above show that most of our services spend roughly 50% of their processing time on IO rather than on actual work. So we started benchmarking Unicorn and Puma for IO-intensive workloads.
Unicorn vs Puma
Unicorn and Puma are both HTTP application servers for Ruby, but they use different concurrency models.
Unicorn is a multiprocess, single-threaded HTTP server. It needs one process to handle each client connection, so you need X processes to handle X active connections on a single host.
Puma is a multiprocess, multithreaded HTTP server in which each thread can handle a concurrent client request. So, conceptually, a single process with X threads can handle X concurrent client connections.
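As a rough illustration of how the two models show up in configuration, here is a minimal sketch of a Puma config file. The file path and the specific numbers are assumptions chosen for illustration, not Housing's production settings.

```ruby
# config/puma.rb (hypothetical path and values, for illustration only)

# Number of forked worker processes.
workers 2

# Minimum and maximum number of threads per worker process.
threads 4, 4

# Load the application before forking so workers can share memory pages.
preload_app!

# For comparison, Unicorn has no thread setting. Its config file
# (for example config/unicorn.rb) only sets the process count:
#   worker_processes 8
```

With these values Puma can serve up to 2 x 4 = 8 requests concurrently on one host, the same count that Unicorn would need 8 processes to reach.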
For example, assume a quad-core host running 4 Unicorn processes. We can process only 4 client connections in parallel. Since the workload is mostly IO-driven, these 4 processes would not use the CPU very efficiently; the CPU would mostly be idle. So we increase the number of processes to 8. Now the host can accept 8 concurrent client requests, of which only 4 execute in parallel; while those requests wait for IO, the other 4 can use the CPU, which increases CPU utilization. So for an IO-driven workload on a multiprocess, single-threaded web server (Unicorn), we increase CPU utilization by increasing the number of processes (forking more child processes).
With Puma, we can achieve high CPU utilization by creating 2 processes with 4 threads each (puma X2:4) or 4 processes with 2 threads each (puma X4:2). The right number of processes and threads per process depends entirely on the nature of the APIs themselves. Typically, an API with lots of computation needs more processes and fewer threads per process, while a completely IO-driven API may need only one process with as many threads as CPU utilization allows.
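The effect of overlapping IO waits can be seen in a small standalone script. This is only a sketch: `handle_request`, `run_sequential`, and `run_threaded` are hypothetical names, and `sleep` stands in for a call to a database, cache, or another web service. On MRI, `sleep` and blocking IO release the interpreter lock, so other threads can run during the wait.

```ruby
require "benchmark"

# Hypothetical stand-in for an IO-bound request: the sleep plays the role
# of waiting on a database, cache, or another web service.
def handle_request(io_wait_seconds)
  sleep(io_wait_seconds)
  :done
end

# Handle requests one after another, as a single single-threaded worker would.
def run_sequential(request_count, io_wait_seconds)
  Benchmark.realtime do
    request_count.times { handle_request(io_wait_seconds) }
  end
end

# Handle each request on its own thread, as a single multithreaded worker could.
def run_threaded(request_count, io_wait_seconds)
  Benchmark.realtime do
    threads = Array.new(request_count) do
      Thread.new { handle_request(io_wait_seconds) }
    end
    threads.each(&:join)
  end
end

sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the length of a single wait, because the waits overlap inside one process.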
So, depending on the requirements of an API, puma X4:2 or puma X2:4 can utilize the CPU as efficiently as 8 Unicorn processes when handling 8 concurrent connections. Assume each process takes 100 MB of RAM. Unicorn would then need 800 MB, while Puma would need either 400 MB (4 processes) + 10 MB (threads) = 410 MB for puma X4:2, or 200 MB (2 processes) + 20 MB (threads) = 220 MB for puma X2:4.
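The arithmetic above can be written out as a short script. This is a sketch under the same assumptions as the text: 100 MB per process, with the thread overhead figures (10 MB and 20 MB) taken directly from the example rather than measured. The method name `estimated_memory_mb` is hypothetical.

```ruby
# Assumed memory per worker process, in MB (the figure used in the text).
PER_PROCESS_MB = 100

# Estimate total memory as process memory plus a given thread overhead.
# The thread overhead is an input here, taken from the example above.
def estimated_memory_mb(processes:, thread_overhead_mb: 0)
  processes * PER_PROCESS_MB + thread_overhead_mb
end

unicorn_8 = estimated_memory_mb(processes: 8)
puma_4x2  = estimated_memory_mb(processes: 4, thread_overhead_mb: 10)
puma_2x4  = estimated_memory_mb(processes: 2, thread_overhead_mb: 20)

puts "Unicorn, 8 processes: #{unicorn_8} MB"   # 800 MB
puts "puma X4:2:            #{puma_4x2} MB"    # 410 MB
puts "puma X2:4:            #{puma_2x4} MB"    # 220 MB
```

Real numbers depend on the application, so these estimates are only a starting point for measurement.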
So, conceptually, Puma would provide CPU utilization similar to Unicorn's while consuming roughly 50% or less of the memory Unicorn needs. We therefore concluded that Puma is more scalable than Unicorn for handling concurrent client requests.
We also wanted to benchmark the speed of Puma configurations against Unicorn processes so that our APIs would not become slower after the migration. So we ran a benchmark on varied workloads. The results are below.
Sleeping for 2 seconds
This test simulates requests with constant-time IO work and as little CPU work as possible. As expected, Puma completely outperforms Unicorn.
Sleeping for 2 seconds + Rendering
An ideal IO-bound process with no CPU processing is not realistic. So we simulated HTTP APIs that start by preparing some data for rendering, then sleep for 2 seconds, and finally render the data. This simulation is closer to real-world HTTP endpoints. The performance trends are almost identical to the "sleep" test, but the throughput is a bit lower.
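An endpoint of this shape could look like the following Rack-style sketch. This is not the benchmark code used for the results here; the class name `SleepAndRenderApp`, the payload, and the JSON rendering are all assumptions made for illustration. It relies only on the Rack convention that an app responds to `call(env)` and returns a status, headers, and body.

```ruby
require "json"

# Hypothetical benchmark endpoint: prepare data, wait, then render.
class SleepAndRenderApp
  def initialize(io_wait_seconds: 2)
    @io_wait_seconds = io_wait_seconds
  end

  # Rack-style entry point: returns [status, headers, body].
  def call(_env)
    payload = prepare_data          # CPU work before the wait
    sleep(@io_wait_seconds)         # simulated IO wait
    body = JSON.generate(payload)   # CPU work: rendering the response
    [200, { "content-type" => "application/json" }, [body]]
  end

  private

  # Illustrative data only; a real endpoint would build its own payload.
  def prepare_data
    { items: (1..100).map { |i| { id: i, square: i * i } } }
  end
end
```

In a `config.ru` file, such an app would typically be mounted with `run SleepAndRenderApp.new`, so the same endpoint can be served by either Unicorn or Puma.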
The main advantage of Puma over Unicorn is that a Unicorn process executes everything sequentially, while a Puma process can use one thread to compute the data for rendering while its other threads are sleeping.
So we concluded that Puma outperforms Unicorn for IO-bound and for mixed IO + CPU workloads.
Having completed the benchmarking, we moved some of our Ruby on Rails services to Puma. This required us to ensure that our application code and third-party libraries were thread-safe. We saw an almost 50% reduction in the RAM usage of one of our services, with similar API request time and throughput.
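To give a sense of what thread safety means in practice, here is a minimal sketch of shared state protected by a `Mutex`. The `RequestCounter` class and the `count_with_threads` helper are hypothetical examples, not code from our services.

```ruby
# Hypothetical shared counter that is safe to update from many threads.
class RequestCounter
  def initialize
    @count = 0
    @lock = Mutex.new
  end

  # The read-modify-write is wrapped in a lock so updates are not lost.
  def increment
    @lock.synchronize { @count += 1 }
  end

  def count
    @lock.synchronize { @count }
  end
end

# Increment one shared counter from several threads and return the total.
def count_with_threads(thread_count, increments_per_thread)
  counter = RequestCounter.new
  threads = Array.new(thread_count) do
    Thread.new { increments_per_thread.times { counter.increment } }
  end
  threads.each(&:join)
  counter.count
end

puts count_with_threads(8, 1_000)   # prints 8000
```

Under a threaded server, any object shared between requests (class-level caches, memoized globals, client connections) needs this kind of protection or must be kept per-thread.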