Performant SQL queries in Rails, an odyssey

Stylized image of a fractal. [fractal, digitalartofjax.com CC 2.0]

I am a Rails developer without formal software development training. I started out without having heard of Ruby, Rails, grep or RegEx. I knew zero SQL. But I learned fast and I could solve 99% of the problems I was facing in a semi-usable way. Then I learned big-O notation and became acutely aware that good solutions must be performant in theory, not just on an initially small database.

The problem

It really is very simple. There are Users who have many Posts. We want a query that collects only each user’s last post and orders the results, latest posts first.
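For reference, the setup assumed throughout is roughly the following pair of models (the names are illustrative, not taken from the original app):

```ruby
# Minimal sketch of the assumed schema: a User has many Posts,
# and each Post carries a created_at timestamp.
class User < ApplicationRecord
  has_many :posts
end

class Post < ApplicationRecord
  belongs_to :user
end
```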

Ordering is simple
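In ActiveRecord that part is a one-liner, something like:

```ruby
# Newest posts first.
Post.order(created_at: :desc)
```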

But what of the core query? How to select only the last association? After several brave attempts I was forced to admit that my ActiveRecord and SQL-fu were nowhere near strong enough to get at this cleanly. Enter…

The ugly

I had an understanding of how to solve this problem in Ruby. I had to loop over all users who have posts and get their last post.
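The original snippet is not shown here, but the idea was along these lines (a sketch, not the exact code):

```ruby
# Ruby-side approach: loop over every user who has posts, pick each
# user's most recent post, then sort the collected posts in Ruby.
last_posts = User.joins(:posts).distinct.map do |user|
  user.posts.order(:created_at).last
end

last_posts.sort_by(&:created_at).reverse
```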

This code started to take tens of seconds to execute once users numbered in the thousands, but it worked in production for a while.

The bad

Soon it became clear that the first implementation had two severe downsides. It was very slow at scale, and since it returned an array rather than an ActiveRecord relation, it could not easily be extended with other scopes.

I still did not know how to write an efficient query, but I had an idea of how to refactor the collection so that it returned a query object.

The approach of plucking ids and then feeding them back into where(id: ids) stayed with me for a long time. It got the job done.
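Sketched out, that refactor looked roughly like this (an approximation, not the original gist):

```ruby
# Still loops in Ruby to find each user's last post, but feeding the ids
# back into where(id: ids) yields a chainable ActiveRecord relation.
ids = User.joins(:posts).distinct.map do |user|
  user.posts.order(:created_at).last.id
end

Post.where(id: ids).order(created_at: :desc)
```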

The OK

Eventually it became clear that the loop over users was a no-no, and I was given time and pointers to learn some SQL. I learned that SQL grouping and the aggregate function MAX would serve us well in solving the problem. This achieves the needed result, but with a sub-query.
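One way to express that grouped MAX with a sub-query looks like this (a sketch; the tuple comparison works in PostgreSQL and MySQL, and the exact SQL the original used may have differed):

```ruby
# The sub-query finds each user's newest created_at; the outer query
# fetches the posts matching those (user_id, created_at) pairs.
Post.where(
  "(user_id, created_at) IN (
     SELECT user_id, MAX(created_at) FROM posts GROUP BY user_id
   )"
).order(created_at: :desc)
```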

The WOW

Recently even the OK solution started to falter, and I began to fret that this time I would not be able to get much of a performance gain. Boy, was I wrong.

I wanted, finally, to implement an all-SQL, clean solution that would be the theoretically most performant one. Luckily, my knowledge had greatly grown since last I tackled The Problem and I had a clear idea of how to go about solving it.

First, I had to find an answer to the question

“How to execute arbitrary SQL code in Rails and get back a query object?”

To get at this I prepared a test app that mirrored the common association pattern I had had trouble with. Then I wrote specs for the scopes I needed to work and benchmarks for the different implementations.

There were three competing approaches I had to test:

  1. Fancy subquery approach with .from
  2. Model.find_by_sql(query)
  3. Model.connection.execute(query)

I quickly abandoned the .from approach as it would require a subquery like the solution I already had and focused instead on writing the raw SQL snippet I would feed to find_by_sql and connection.execute.
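For orientation, the two raw-SQL entry points differ mainly in what they return (illustrative calls, not the actual benchmark queries):

```ruby
# find_by_sql instantiates Post models but returns a plain Array, not a Relation.
posts = Post.find_by_sql("SELECT * FROM posts ORDER BY created_at DESC")

# connection.execute returns a driver-level result set: rows, not models.
rows = ActiveRecord::Base.connection.execute("SELECT * FROM posts ORDER BY created_at DESC")
```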

Then it dawned on me that I had previously used raw SQL snippets (primitive ones) in queries as joined tables. I decided to try this technique alongside approaches 2 and 3.
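The joins technique boils down to feeding a derived table into ActiveRecord’s joins, roughly like this (a sketch with illustrative names, not the exact production query):

```ruby
# Join posts against a derived table holding each user's latest created_at.
# The result is still an ActiveRecord::Relation, so further scopes chain onto it.
Post.joins(
  "INNER JOIN (
     SELECT user_id, MAX(created_at) AS last_created_at
     FROM posts
     GROUP BY user_id
   ) latest
   ON posts.user_id = latest.user_id
   AND posts.created_at = latest.last_created_at"
).order(created_at: :desc)
```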

The joins solution is at least 200 times faster than the previous one!

The takeaway

Ruby is slow, SQL is blazing fast; take pains to solve data problems SQL-side:
  1. Use raw SQL and insert it in ActiveRecord’s joins statements for massive performance boosts.
  2. Do your best on problems, and anticipate that as you gain experience, solutions that are orders of magnitude better will present themselves.
  3. Be open to new ideas; test and benchmark competing ideas. Pick the solution that fits best for your use-case.

PS

If you are querying a Rails record’s created_at field, be sure to add an index for it. By default it is not indexed!
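A migration along these lines does the trick (the class name and migration version here are hypothetical):

```ruby
class AddIndexToPostsCreatedAt < ActiveRecord::Migration[5.2]
  def change
    # created_at is not indexed by default; this speeds up ordering and MAX().
    add_index :posts, :created_at
  end
end
```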
