Rails at Scale: Tricks to Eke Performance from a Budget VPS
In a job not too long ago, I had to deal with getting a production Rails application across the finish line for a major non-profit. The app included a tight integration with Salesforce and relied heavily on background jobs. To make matters more fun, the production server (a VPS on Rackspace) had only 4 GB of RAM. That's what's on my laptop.
My job was to build new features, including the Salesforce integration (SOQL for the win!), and to make sure the server could handle traffic without being reduced to a smoking rubble heap like post-Frodo Mount Doom.
To get a handle on the scale issues, I started with diagnostics. How much memory was available at any given point? What was the CPU doing? How were queries performing in PostgreSQL? htop to the rescue!
After getting some benchmark stats on the staging server, I built a new production database server. Even though PostgreSQL is quite efficient and has a relatively small memory footprint, separating the database from the app would free up valuable server resources while being quick and easy. As we learned post-launch, this saved our butts, resource-wise.
A note on resources: Given the budget, I had little option to scale horizontally with the application server. I’d suggested we at least migrate the web application server to a box with 8 GB RAM, but was turned down. Getting a separate DB server was a major victory over managers and budgets.
After migrating the database to its new instance, I ran some benchmarks using ApacheBench (ab) and Blitz.io. From what I could tell, it was going to be tight, but there seemed to be enough headroom on the 4 GB staging server that we could safely deploy the same application to production.
It Should Just Work
DevOps and database administration are a bit of a black art. More often than not, you can't find a good answer to your problem on Stack Overflow or Server Fault, in the man pages, or in GitHub issues. It's witch doctoring, largely because every system is unique.
With DevOps and database tuning, you have so many independent variables that it becomes difficult to control for everything: major and minor OS versions, database versions, web and application servers. This is why I cannot tell you with any degree of certainty how much RAM you need for your Ruby application. It's also why installing a gem like nokogiri can be a two-minute `gem install nokogiri`, or can mean hours of recompiling C libraries from scratch.
All is not lost, though. As with a sick patient at the hospital, you can run tests, monitor, prescribe something, then do it all over again.
In my case, there was an additional difficulty with profiling: none of the Salesforce integration had yet been deployed to production. I was able to test on staging and run some load testing with Blitz.io and ApacheBench, but there was no easy way to stress test the Salesforce work, short of finding hundreds of fake users on Mechanical Turk. So I learned a lot about the application once it was deployed.
From a DevOps perspective, this isn’t satisfying. You want to test with identical scenarios. The staging and production servers are identical, but staging is the virus in a test tube, and production is the virus in a kindergarten.
The Salesforce integration, generally speaking, worked like this: every fifteen minutes a data sync ran. Any records PostgreSQL had that Salesforce lacked were upserted into Salesforce, and vice versa. My worry was that with 2,000, then 10,000 users' worth of data*, BAD THINGS™ would happen that you'd never see with 20 users.
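The diffing half of that sync can be sketched roughly like this. The class and field names here are hypothetical, and a real implementation would talk to Salesforce through a client gem and do the actual upserts, rather than comparing in-memory hashes:

```ruby
# Hypothetical sketch of the bidirectional sync's diffing step.
# Each side is assumed to be a hash of external_id => attributes.
class SalesforceSync
  def initialize(salesforce_records, local_records)
    @remote = salesforce_records
    @local  = local_records
  end

  # Records Salesforce has that PostgreSQL lacks would be upserted locally...
  def missing_locally
    @remote.reject { |id, _| @local.key?(id) }
  end

  # ...and vice versa in the other direction.
  def missing_remotely
    @local.reject { |id, _| @remote.key?(id) }
  end
end

sync = SalesforceSync.new(
  { "SF1" => { name: "Alice" }, "SF2" => { name: "Bob" } },
  { "SF2" => { name: "Bob" },   "SF3" => { name: "Carol" } }
)
sync.missing_locally.keys   # => ["SF1"]
sync.missing_remotely.keys  # => ["SF3"]
```

Even in this toy form you can see the scaling worry: the work grows with the total number of records on both sides, every fifteen minutes, whether or not anything changed.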
But that's what you signed up for, and the brutal economic reality is that every day you have work is a blessing.
The client was champing at the bit to launch, so I warned them: it would probably work, Salesforce could be a problem at peak hours, here's my phone number, and please don't call at 2 AM.
It Didn’t Just Work
They started calling at 5 AM. Excessive memory use was causing 500 errors for our users.
First I made the immediate tweaks:
- Upping the connection pool in Rails.
- Increasing the number of connections on the PostgreSQL side, specifically `max_connections` in postgresql.conf.
- Adding reaping. You can set `reaping_frequency` in Rails' database.yml; give it something like 30, so that every 30 seconds stale or hung connections are returned to Rails' connection pool.
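Concretely, the first and third tweaks live in `config/database.yml`; the values below are illustrative, not the app's actual settings:

```yaml
# config/database.yml (illustrative values)
production:
  adapter: postgresql
  database: myapp_production
  pool: 15                # connections available to each Rails process
  reaping_frequency: 30   # seconds between sweeps for dead connections
```

Keep in mind the pool is per process: with five Unicorn and five Resque workers each holding connections, PostgreSQL's `max_connections` needs to comfortably exceed pool times process count, or you've just moved the bottleneck.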
Each of these helped incrementally, but memory problems persisted. Back to htop.
Post-launch, at peak times, the production web application was using up to 3.1 GB of memory, which brought the dreaded 500 errors. Even though the database only ever ran at about 800 MB under heavy load, moving it to a separate server saved us from a ton of additional 500s.
Your basic, garden-variety Rails app will use about 100 MB of system memory. That's without anything fancy like a queuing system or, in our case, a Salesforce integration. Our system ran Resque for background jobs and the Unicorn HTTP server behind Nginx.
The system's biggest memory consumers were Resque and Unicorn. Resque forks each worker into its own process, so each worker loads its own copy of the Rails app. This is where some vexing race conditions come in.
Unicorn also runs with child processes. When things ran cold, each Resque worker used about 6% of server memory, and each Unicorn worker roughly 5%. Our system had five workers running for each, which is already about 2.2 GB, or 55% of total memory. But within an hour or two we'd see usage climb to 3.1 GB. Was it just the heavy data load, or was there a leak?
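That back-of-the-envelope arithmetic, assuming 4096 MB of total RAM:

```ruby
# Cold-start memory budget for the 4 GB box, from the htop percentages.
total_mb = 4096

resque_workers  = 5
unicorn_workers = 5

resque_pct  = 0.06  # each Resque worker at ~6% of RAM
unicorn_pct = 0.05  # each Unicorn worker at ~5% of RAM

baseline_pct = resque_workers * resque_pct + unicorn_workers * unicorn_pct
baseline_mb  = (total_mb * baseline_pct).round

puts "#{(baseline_pct * 100).round}% of RAM = #{baseline_mb} MB"
# prints: 55% of RAM = 2253 MB
```

So more than half the box was spoken for before a single request or sync job had warmed anything up.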
The problem of increasing memory was isolated to Unicorn. Each Unicorn worker started at around 190 MB of memory, but over time would bloat to well over 300 MB.
There is a gem called unicorn-worker-killer, but rather than adding a gem I went the bash route. I wrote a script that checks the size of each Unicorn worker and sends a `kill -QUIT` to any worker over 280 MB. The bloated, Elvis-like Unicorn would finish its final request, die gracefully, and be reborn as a skinny little Unicorn. This ran every hour on cron and solved our memory problems. The 500s went away.
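A rough reconstruction of that script; only the 280 MB threshold comes from the story above, and the process-matching details are assumptions:

```shell
#!/usr/bin/env bash
# Kill (gracefully) any Unicorn worker whose resident memory exceeds 280 MB.
LIMIT_KB=$((280 * 1024))

# Reads "PID RSS" lines on stdin (RSS in KB, as ps reports it),
# and prints the PIDs of workers over the limit.
bloated_pids() {
  awk -v limit="$LIMIT_KB" '$2 > limit { print $1 }'
}

# The hourly cron job would feed it the real workers, e.g.:
#   ps -eo pid,rss,args | grep '[u]nicorn worker' \
#     | bloated_pids | xargs -r kill -QUIT
```

SIGQUIT is Unicorn's graceful-shutdown signal: the worker finishes its in-flight request and exits, and the Unicorn master forks a fresh replacement.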
Things I would have done, had I the time:
- Better application monitoring. It was a complex Rails application, especially with the Salesforce data syncs, and better monitoring would have helped diagnose which jobs were causing problems.
- Converting from Resque to Sidekiq. One blog post I read suggested that Sidekiq beats Resque on almost every metric except CPU-heavy jobs. Since CPU wasn't our problem, dumping Resque would likely have been a win.
- Using something other than Unicorn. There's Phusion Passenger, or Rainbows! if you're really attached to the Unicorn model.
*Take a look at your User model in a Rails app when you're thinking about scale. I once inherited a users table with 46 columns. That it violated most of the normal forms drove me crazy. Why is your user object carrying around all that data all the time?