The SQL Alternative To Counter Caches

The counter cache feature has been in Rails since the early days. It is a strategy to efficiently display a count of related records for each item in a list. Using the example in the linked documentation a view without a counter cache might look something like:

<% for author in @authors %>
<%= author.name %> has written
<%= author.books.count %> books.<br>
<% end %>

By changing `author.books.count` to `author.books_count` and using a counter cache in your models you avoid doing a SQL COUNT query for each line. No calculation, not extra SQL, just a simple read of an attribute on the author record when outputting your list.

Maintaining this count is also very efficient. Hooks are set in the Book class which issue a simple SQL UPDATE statement to increment or decrement the counter cache field. It doesn’t even re-sum the total, but just bumping up or down by one. Also that bump is carried out entirely by the DB. Talk about efficient!

The Complexity Costs of Counter Caches

While this is all wonderfully efficient it’s not without complexity costs.

  • It requires a new field to be added to the schema.
  • You must write code to initialize the value for existing data.
  • It uses ActiveRecord callbacks to trigger this bumping of the count. This adds weight to your models even when operating on sections of the app unrelated to displaying the counts (tests for example).
  • It can get out-of-sync due to the simple updating strategy. I’ve seen this happen both due to bugs in Rails (now fixed) as well bugs in applications.
  • It’s fairly simple implementation means it cannot support things like conditional counts, multi-level counter caches, totaling instead of counting, etc. The counter_culture gem does those things, but that adds even more complexity.
  • Depending on the nature of your app, counter caches may be write heavy. If the child models are created/destroy/updated often, we are constantly changing the cache (perhaps even more often than we actually need to read the cache). If your app is susceptible to this you might get more lock contention on your DB than desired. The writes may be adding more load to your DB than you are gaining from the easy reads.

In most cases, counter caches still are the most efficient method, but I have been toying with an alternate method that avoids submitting N+1 queries from Ruby, is fairly efficient (even if not absolutely the most efficient) and doesn’t have many of the costs mentioned above.

The SQL Alternative

An alternative to maintaining cache data is to keep it dynamic but let the DB do the work. To do this we define a scope on our model which adds a sub-query to the selected data:

class Author < ApplicationRecord
  scope :with_counts, -> {
select <<~SQL
authors.*,
(
SELECT COUNT(books.id) FROM books
WHERE author_id = authors.id
) AS books_count
SQL
}
end

With our scope defined we change our controller to use the the scope whenever we want to access the count information:

@authors = Author.with_counts.all

In your view, place the same code as the counter cache implementation (i.e. read from `books_count` on the author object rather than `books.count`).

Advantages

At the core, we are still executing a count for each row, but we are letting the DB do it while pulling the list. Not quite as efficient as a counter cache, but it still executes very quickly. For this slight increase in read complexity we gain:

  • no new field
  • no AR hooks maintaining data
  • always in sync and real-time
  • no writes
  • only a performance cost when actually using the counter (i.e. no callback costs)

While the scope might seem a bit complex it is not terribly so. I think worth the advantages above.

Accessing a Single Record

Using SQL for the counter works great for a list. What about when you want to display just a single record and the count with it? I.E. the user has found an item from the list and wishes to drill down. You want to display the count in the detail record as well.

Your first option is simply to do multiple queries by putting the following in your view:

@author.books.count

Not being in a list we are just talking about one extra query, no big deal. The second option is to use the scope by loading your data a bit different. Instead of the usual:

@author = Author.find params[:id]

You want to use:

@author = Author.with_counts.find_by id: params[:id]

Like with the list we are still doing a count but we are doing it with the author load and getting the DB to do it which is very fast.

More Complex Counts

In addition to the gains of the SQL approach we also can have more complex counts implemented easily. Want to output the total number of pages an author has written (i.e. SUM instead of COUNT)? Just tweak the SQL:

scope :with_counts, -> {
select <<~SQL
authors.*,
(
SELECT SUM(pages) FROM books
WHERE author_id = authors.id
) AS pages_total
SQL
}

Want to maintain a multi-level count? Let’s assume above authors there is a publisher model and we want to know how many books each publisher has produced. Add the following to the publisher model:

scope :with_counts, -> {
select <<~SQL
publishers.*,
(
SELECT COUNT(books.id)
FROM books JOIN authors ON books.author_id = authors.id
WHERE authors.publisher_id = publishers.id
) AS books_count
SQL
}

IMHO, these scopes are preferable to using something like counter_culture as another gem adds more memory usage, complexity in configuration and more callback to be run.

Efficiency

The key goal of the counter cache is efficiency, how much efficiency are we loosing with a dynamic SQL-based scope for the counter? Lets consider three situations:

  • Moderate data sets
  • Data sets at large scale
  • Real world usage

The below script will be used to generate the relevant data:

AUTHOR_COUNT = 100
BOOKS_PER_AUTHOR = (1..15).to_a
ActiveRecord::Base.transaction do
AUTHOR_COUNT.times do
books_count = BOOKS_PER_AUTHOR.sample
author_id = Author.connection.insert <<-SQL
INSERT INTO authors (books_count)
VALUES (#{books_count})
SQL
Book.connection.execute <<-SQL
INSERT INTO books (author_id)
VALUES #{("(#{author_id})," * books_count).chop}
SQL
print '.'
end
end
puts

By changing the constants we can switch between “moderate” and “large scale” data sets. We’’ll start with moderate (100 authors with 1–15 books each).

Counter Cache Performance

First, lets look at how much time it takes to query just the authors models (all that would be needed if using the traditional counter cache). Generally it took around 0.7ms on my machine for the DB to execute the query.

Dynamic Counting via SQL Subquery

The SQL strategy generally took around 1.2ms on my machine. This is a 70% increase, but we are still just talking about an extra 0.5ms. Every ms counts when trying to get to glass in 100ms, but I think for a lot of apps that extra half a millisecond is worth advantages of using a dynamic count.

Large Scale

For a large scale test, we’ll pretend we are Amazon with 500,000 authors each with 1–50 books. In this scenario, our baseline counter cache implementation generally takes around 150–200ms on my machine while the dynamic version takes around 3.7sec. Yikes! Is our dynamic version doomed at large scale? I don’t think so. Let’s move into the real world.

Real World

Most of the time when displaying information to a web page, we don’t output all records in the DB (even at large scale). Information is filtered, paginated, etc. While we may have 500,000 authors with 1–50 books each we might only display 25 of those authors at a time. The SQL LIMIT clause added by pagination for example means the DB only needs to run the subquery on the documents selected.

Paginating like this means even with a large scale database it is only taking around 1ms (compared 0.4ms using the counter cache) and on moderate data sets it is only taking around 0.7ms (compared approx 0.4ms using the counter cache). IMHO, the dynamic SQL version in most situations can operate even at large scale.

Operating on Counts

There is one key area where the SQL approach falls down, operating on a count. For example, show all authors with more than 10 books (i.e. the subquery is in a WHERE clause instead of the SELECT). You can do this via:

scope :published_at_least, ->(min) {
subquery = <<~SQL
SELECT COUNT(books.id) FROM books
WHERE author_id = authors.id
SQL
where "(#{subquery}) >= ?", min
}

Or perhaps you want to sort the authors so that those with the most books are displays first (i.e. the sub-query is in the ORDER clause instead of the SELECT). You can do this via:

scope :by_proliferantness, -> {
subquery = <<~SQL
SELECT COUNT(books.id) FROM books
WHERE author_id = authors.id
SQL
order "(#{subquery}) DESC"
}

Neither of these scale, because even with pagination it must still evaluate the subquery for each record in the DB to determine that order or apply that filter. This means the help we get via pagination is now gone.

If you know your data set will be moderate in quantity, I think the SQL approach is still valid even when operating on counts. But if you know your data set will be large and you want to operate on the counts, you are better off using the traditional counter cache.

If you are not sure (i.e. your data set is small now but you hope to grow) I would start out making it dynamic and when you start to reach the limits of the dynamic approach refactor to use a counter cache.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.