Adding Read Replicas in a Production Ruby on Rails App with Zero Downtime

John Gerhardt
Published in Compass True North
Jun 19, 2017 · 4 min read

When query, schema and index optimizations aren’t enough to support the load of a high throughput database system, fundamental changes in how data flows through that system are often necessary.

Over the past few years, our team at Compass has been fortunate enough to tackle numerous database performance issues, ranging from autovacuum tuning to full production database upgrades. Growth as a company doesn’t occur without these types of issues and we’re happy to share our experiences with our peers.

Why Splitting Your Read/Write Queries is Necessary

Load balancing is a fundamental concept in any system design. It’s important to design your data systems to handle the peaks or spikes in usage, as opposed to designing to handle the average load a system experiences.

By splitting queries between your primary and follower databases, you’re able to ensure that users interacting with your web application have a snappy and performant experience, while saving heavy background jobs for a follower.

How We Researched Potential Solutions

There are a few options when it comes to choosing a database adapter that supports primary/follower query balancing. After evaluating Makara and Octopus, we decided to go with Makara for the following reasons:

  1. Thread-safety: we use Sidekiq for our background job processing, so a non-thread-safe database adapter was a deal-breaker.
  2. Community support: Makara is used in large production environments. We read about a few companies’ experiences using it and heard positive things.
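
For reference, the wiring itself is small. A minimal Gemfile sketch (the distribute_reads block helper used later in this post comes from a separate gem; pin versions as appropriate for your app):

# Gemfile
gem "makara"            # primary/replica-aware ActiveRecord adapter
gem "distribute_reads"  # block helper for routing reads through Makara replicas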

Plan a Detailed Rollout Strategy

Given the importance of a highly available database, we had to be very careful in how we rolled out these changes to production. The key takeaways were to:

Implement a Kill Switch

Use a database config that could easily be reverted via ENV["USE_MAKARA"], in case we needed to quickly roll back our changes and fall back to the default postgresql adapter.

# Kill switch: unset USE_MAKARA to fall back to the stock postgresql adapter.
development: &default
  <% if ENV["USE_MAKARA"] %>
  url: postgresql-makara:///
  makara:
    sticky: true
    connections:
      - role: master
        name: primary
        url: <%= ENV["DATABASE_URL"] %>
      - name: replica
        url: <%= ENV["REPLICA_DATABASE_URL"] %>
  <% else %>
  adapter: postgresql
  url: <%= ENV["DATABASE_URL"] %>
  <% end %>

production:
  <<: *default
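
After a deploy, a quick sanity check in a Rails console confirms which adapter the app actually booted with (a hypothetical spot check, not part of any rollout tooling):

# Rails console: inspect the active connection's adapter class.
# With USE_MAKARA unset you should see the stock PostgreSQL adapter;
# with it set, a Makara-wrapped adapter class instead.
puts ActiveRecord::Base.connection.class.name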

We ran into an issue with our first deployment where we needed to immediately roll back. Users’ access to Contactually was unaffected, and we were able to quickly address the problem and re-deploy. You can see some more suggestions in this gist.

Start Slow — Pick a Non-Critical Job

Pick a non-critical background job to introduce distribute_reads on. We picked the job that generates merge suggestions for our users. If the jobs failed for any reason, we could roll back and simply run them again.

distribute_reads do
  # your regular application logic
end

If the Makara adapter is being used, reads will be distributed throughout the pool of replicas. If it’s not, for whatever reason, distribute_reads will gracefully use the default postgresql adapter.
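
As a rough sketch, here’s what that first non-critical job might look like with its reads wrapped. The worker and model names below are illustrative, not our actual code:

class MergeSuggestionsWorker
  include Sidekiq::Worker

  def perform(user_id)
    # Read-heavy work runs against a replica when Makara is active,
    # and transparently falls back to the primary otherwise.
    candidate_pairs = distribute_reads do
      Contact.where(user_id: user_id).duplicate_candidates.to_a
    end

    # Writes always go to the primary.
    candidate_pairs.each { |pair| MergeSuggestion.create!(contacts: pair) }
  end
end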

Monitoring is Key

Ensure you have proper monitoring in place. We use New Relic and Amazon RDS metrics to see the effect of our changes. If you can’t measure the results, you have no way of knowing how effective your changes were.

Example RDS Metrics showing our databases’ health and load.

Effective monitoring helped us determine, within seconds, that our initial rollout had an issue that needed to be fixed.

Warm Your Follower Database’s Cache

If you’re introducing a follower database that has never been used or queried before, it’s important to make sure the database is warmed up before hitting it with lots of complex queries.

Just like a car in cold weather, you wouldn’t want to start your engine and slam on the gas before the engine has had a chance to warm up.

If the follower instances were recently created, initiate simple count queries for your most recently active users to warm both the database and operating system level caches. For more information, check out pg_prewarm.
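
A minimal warm-up sketch, assuming illustrative model and column names and that distribute_reads is already routing reads to the new follower:

# Touch the hot working set with cheap queries so the follower's shared
# buffers and OS page cache are populated before real traffic arrives.
distribute_reads do
  User.where("last_sign_in_at > ?", 30.days.ago).select(:id).find_each do |user|
    Contact.where(user_id: user.id).count
  end
end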

Results — Huge Success!

The results so far have been a staggering improvement in system-wide database performance. We saw a ~30% decrease in time spent in the database for our web-initiated queries and an almost total elimination of spikes in Queue Depth.

If you or your team have implemented a system like this, we’d love to hear your feedback and share what we’ve learned in greater depth!

Compass is looking for experienced software engineers who are passionate about solving complex problems with code. We’ve taken a novel approach to building business software — focus on the end user — and it’s been working! Our users love us. Come help us build a product that makes the real estate process easy and rescue 10,000s of people from the jaws of clunky, outdated software.
