To join or not to join? An act of #includes

Tiago Farias
Goiabada
Published in
7 min readAug 7, 2017

If you’re familiar with an ORM, you’ve probably been bitten time and again by a very common problem when trying to query an object along with its relations. For instance, imagine a very simple relation between entities in a Ruby on Rails project:

Now, what happens when we try to get the user for each book?

If you look at the Rails’ output, you’ll notice something is off as the output shows:

“Something is rotten in the state of my queries”

That, my friends, is the problem of N+1 queries itself. Our initial query (the 1 in N+1) returns a collection of size N, which, by its turn, runs a query for each one of them in the database (the N in N+1).

Well, luckily we only have 3 books in this example. But the performance hit can be very significant. Imagine what could happen if we had millions of them! Depending on how big a collection you have, your computer could even explode! Nope, not really… But your app surely could stop. And I think it would be worse than the former, since in the event of an explosion you can always blame the hardware. Notice, on the other hand, that the N queries are very fast (once they are processed in the database) since they’re matching on a specific id column (an index), therefore improving performance. But don’t be fooled: whenever you can trade N+1 for two or even one query, just pull the trigger. The cost of I/O (especially if the DB is not in the same machine as the app) communicating with the DB is the culprit here.

In this post we’ll examine 3 methods/strategies developed in ActiveRecord (as well) to circumvent the problem of N+1 queries: preload, eager_load and includes.

“There is special providence in #preload”

One way to solve this problem is by having 2 queries: the first one for fetching the association data and the second one would be the query with the final results. Something along the lines of:

And the resulting queries output:

Aah! It is slower than the N+1 way! Not generally, no. In this specific example it was, yes. But that’s only because our seed data consisted of only 3 books. So while preload takes 0.7ms to run 2 queries, N=3 takes (in my machine) just 0.4ms to run. It’s ok, though. Just keep in mind these N+1 queries are using a Postgres index table (id as primary key) and therefore being blazingly fast. For the vast majority of cases 2 queries will beat N+1 all day long.

But there is always the flip side. Well, what do you think would happen if we tried to filter the query just a bit more? Something like:

So ActiveRecord complained because it didn’t find a column “users.name” in the query. Of course, not. preload (the name is a hint) only loads or fetches the data from an association previously in a different query. To use “users.name” in the query, we would need to join the two tables. It’s clear now that preloadis not the definitive answer. For instance, what if I need to access an association within the query? Well, that takes us to the next strategy.

“A join, a join! My kingdom for a join!”

The problem we had in the previous strategy was that we couldn’t access another table’s columns within the query. That happens because preloadalways uses separate queries. Fear not. eager_loadexists for a reason. It loads the data from an association with only one query using left joins to populate associated records. For instance, we could now do:

First of all, Rails’ log screams “LEFT OUTER JOIN” in your face. That’s very unpolite. But ActiveRecord is kind of a know-it-all, it’s just trying to show off on how much SQL it knows. Nevermind about that. But this is actually important to remember: eager_load will always use one query with a left outer join (don’t worry about the outer, it’s the same as a left join).

Secondly, notice that we can access fields from the associated table (users), since ActiveRecord loads both tables into memory (careful about that). That was exactly the problem we had with preload. Now we can run the following and it’s gonna work:

Another interesting thing here is that it’s different from joins . What’s the deal with joinsthen? A few:

1. joins uses an inner join instead of a left outer join.

2. different goal: should be used to filter results from a query, not loading records with its associations. It does NOT eagerly load associations, thus not preventing the N+1 queries.

3. it’s fine to use when you want to filter results from a query without accessing fields from the associated tables, because it does not load or populate associated tables into memory, it only uses them to filter results.

These 3 points also mean that we can use joinsalong with preload, eager_load and (as we’ll see) includes. That is precisely because they have different purposes in life.

Anyway, which one should I go for: eager_load generating only one query with left joins or preload fetching associated data in a separate previous query? You know what? ActiveRecord created this lovely mess for us (and I’m thankful for it ❤), so let’s see how it also solved or at least simplified this dilemma with includes.

“There is nothing either good or bad, but #includes makes it so”

So, what’s the use of includes, you may ask, given that we already mentioned the two eager loading strategies used by ActiveRecord? Prior to Rails 4, includeswas used to delegate the responsibility of the decision to which eager loading strategy to use in each case. Basically, it would spy on your where or order conditions to see if you referenced an associated table there and, if you did, it would delegate to eager_load (as obviously it would need to join tables). Otherwise it would just use preload. Example:

Now, the default case (where condition does not mention associated table):

But since Rails 4, the Rails team has kind of given up on this. In the deprecation warning they mentioned that “doing this without writing a full-blown SQL parser is inherently flawed. Since we don’t want to write an SQL parser, we are removing this functionality”. Plain and simple, includes will work exactly like preloadfrom Rails 5 and above. The former example will even throw an error in Rails 5, as preloadwill tell you that it is not possible to access columns in an associated table that was not joined. There is one caveat, though. If you want includesto join the associated table(s), you need to explicitly tell it by using the referencesmethod. Basically it would become:

In my opinion, prior to Rails 4 includeswas as cool as the other side of the pillow. I understand that it changed because of a very reasonable implementation obstacle, but the fact that referencesexists uniquely for telling it to use eager_load is not really DRY or even clear at all. references cannot be called without includesand includeswithout referencesalways uses preloading strategy! Why not simply call eager_loadinstead of the whole query.includes(:user).references(:user) piece? Why not simply call preload instead of includes(without its personal stalker), which is much more intent revealing? I’ve seen some responses for this, but I don’t know… just not convinced yet. Also, includes has an overhead on top of it to decide which strategy to delegate to. This overhead causes it to be a bit slower than the other 2 methods. Anyway, these Rails folks are really smart. They will surely come up with something here.

Why not simply call `eager_load` instead of the whole `query.includes(:user).references(:user)` piece? Why not simply call `preload` instead of `includes` (without its personal stalker), which is much more intent revealing?

“Sweets to the sweet” — A summary

  • preload, eager_loadand includesare birds of a feather: they are all eager loading strategies.
  • joins is different from the three amigos: Uses inner join to filter queries without loading relations.
  • preload: loads associated tables using always a separate query.
  • eager_load: loads associated tables using always a left join.
  • includes: previously (< Rails 4) was smart (kinda) to find out which strategy was better: eager loading or preloading. Since Rails 4 it uses preloading unless explicitly told to use left joins with references.
  • references: cannot be used without includes, but the inverse can happen and will call preload.

“There are more things in ActiveRecord, Horatio, than are dreamt of in your code”

“To be or not to be?” Prince Hamlet was facing a serious dilemma on whether he should murder his uncle, then step father, King Claudius, or not. Well, I don’t know about that. But as we’ve seen in this post, we shouldn’t be reluctant on killing N+1 queries. That is not really the question you should be asking yourself. The real question here is if you do or do not want to use joins in your queries, and to what extent. Just like in Hamlet, this is hardly a simple question. Hopefully, this post will help you elucidate some of your doubts about the use of ActiveRecord’s eager loading strategies, helping you to make sound decisions when it comes to querying objects’ associations.

--

--