Optimizing batches in Rails for a healthcare directory based on event-sourcing

Etienne Depaulis
Published in Lifen.Engineering
4 min read · Jul 24, 2018


Context

Lifen provides healthcare professionals with an innovative application that makes exchanging medical documents easy and secure.

One of our key components is our health professionals directory, because every day we need the most reliable information about which communication channel we should be using. For this purpose we have a Rails 5.1 app (codename “Nestor”) in charge of importing millions of rows of information on a daily basis from multiple sources.

We’ve adopted an event sourcing pattern in order to track accurately the evolution of our data. The process is quite simple. For every source of information we:

  • import the data
  • identify the differences, which produces Events
  • apply Events to our existing data
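As an illustration, the “identify differences” step can be pictured in plain Ruby (a hypothetical sketch; diff_events and the Hash-based Event shape are not our actual code):

```ruby
# Hypothetical sketch of the "identify differences" step: compare the
# stored attributes with the freshly imported ones and emit one Event
# (here, a plain Hash) per changed attribute.
def diff_events(current, imported)
  (current.keys | imported.keys).each_with_object([]) do |key, events|
    next if current[key] == imported[key]
    events << { attribute: key, from: current[key], to: imported[key] }
  end
end
```

Applying the resulting Events, attribute by attribute, then brings the stored record in line with the imported one.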

We recently ran into a very interesting performance problem: the whole update process from all our data sources has to take less than 24 hours (a logical constraint if we want to update our directory on a daily basis). Therefore, we have a strong focus on performance, and more precisely on the time required to process our Events.

We all know that using each when dealing with ActiveRecord collections can be dangerous, as it loads all events in memory:

collection.each do |event|
  process(event)
end

A common solution is to batch the processing of those events into smaller “sub-relations”. That’s what we did at the beginning of Nestor:

collection.find_each do |event|
  process(event)
end

And to be honest, this pattern works pretty well if we focus only on the memory used by Ruby. In our situation, the problem was the time spent in the database, as the find_each magic is all about calling the DB multiple times with smaller batches.

The problem

All our main resources (Practitioners and Organizations) have one or many Identifiers. An Identifier belongs to a resource and has a system and a value (we use FHIR as our data model; we did not want to reinvent the wheel on this topic!).

We also have an Events table with 12M rows, growing pretty fast (we monitor on average 5,000 modifications per day, as well as occasional event surges). All Events are linked to an Identifier using the system and value keys.

When we process our Events for a specified version we do something like:
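Roughly, it has this shape (a sketch, not our exact code; the process_version wrapper, the process call and the identifier association name are assumptions):

```ruby
# Hypothetical sketch: select the Events of one version, join the active
# Identifiers, and batch through the result with find_each.
def process_version(version)
  Event.where(version: version)
       .joins(:identifier)
       .where(identifiers: { active: true })
       .find_each { |event| process(event) }
end
```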

The problem we encountered arises when the initial Events selection (Event.where(version: 18)) contains lots of elements (let’s say 10,000). With a batch size of 1,000 elements (the Rails default value), it means that we will run an expensive joined and ordered query 10 times:

  • the first one will look like SELECT * FROM events LEFT JOIN identifiers ORDER BY events.id LIMIT 1000;
  • the next one will be SELECT * FROM events LEFT JOIN identifiers WHERE events.id > <last id of the previous batch> ORDER BY events.id LIMIT 1000;
  • … and so on

ActiveRecord has to use the ORDER BY events.id to make sure we don’t process the same event twice, and doing so is quite expensive on a joined table (we were seeing queries that took seconds...).

The solution

To avoid joining tables during the iteration, we need to iterate on the raw Events table alone. Once we get the sub-collection, we can finally apply the join condition (WHERE identifiers.active = 't').

Here is the solution we started using, built as an extension on the in_batches method and replacing find_each:
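In spirit, it looks like this (a sketch, not our exact extension; the identifier association and the active filter are assumptions):

```ruby
# Sketch of with_in_batches: iterate over the raw events table alone,
# then apply the join condition to each id-scoped sub-relation.
def with_in_batches(unjoined_scope, &block)
  unjoined_scope.in_batches do |batch|
    # `batch` is only scoped on a list of primary key ids, so the join
    # touches at most one batch of rows.
    batch.joins(:identifier)
         .where(identifiers: { active: true })
         .each(&block)
  end
end

# e.g. with_in_batches(Event.where(version: 18)) { |event| process(event) }
```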

One edge case we noticed early on was when the joining condition was pretty restrictive (matching less than 10% of the initially selected Events, for example), which led to batching empty collections.

If the joined (and filtered) collection has less than 10% of the elements of the unjoined collection, we use the ActiveRecord default pattern in order to minimize the number of batches.

Otherwise, we switch to the with_in_batches method, which only applies the join and the where on the sub-collection.

To better handle this edge case, we introduced a little variation in our FindInBatches service:
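A sketch of that variation (not the exact production code; here the join is passed in as a callable so the two code paths stay visible):

```ruby
# Sketch of the FindInBatches service. `scope` is the unjoined relation
# (e.g. Event.where(version: 18)) and `join` is a callable applying the
# identifiers join/filter to any relation, e.g.
#   ->(rel) { rel.joins(:identifier).where(identifiers: { active: true }) }
class FindInBatches
  # Purely arbitrary threshold, based on the query times we monitored.
  JOIN_COST_ARBITRARY_COEF = 0.1

  def initialize(scope, join)
    @scope = scope
    @join  = join
  end

  def find(&block)
    if joined_count < unjoined_count * JOIN_COST_ARBITRARY_COEF
      # Restrictive join: few rows survive, so the ActiveRecord default
      # pattern minimizes the number of batches.
      @join.call(@scope).find_each(&block)
    else
      # Otherwise, batch on the raw table and join each sub-collection.
      @scope.in_batches { |batch| @join.call(batch).each(&block) }
    end
  end

  private

  def unjoined_count
    @scope.count
  end

  def joined_count
    @join.call(@scope).count
  end
end
```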

In the find method, we start by counting the number of elements in the joined collection and the number of elements in the unjoined collection.

As its name indicates, JOIN_COST_ARBITRARY_COEF is purely arbitrary, based on the query times we monitored in both situations.

Remark: “You said you want to avoid joining while batching. But in_batches returns an ActiveRecord relation and you apply the join on it, so in the end you have a batching relation with joining, just as you would have using find_each.”

The trick is that in_batches does not yield a “classic” relation. It does yield a relation, but one that is not built on the original query itself: it is scoped on the primary key ids resulting from the batch query. That’s why the final relation is not a batching relation but simply a relation scoped on a list of ids.

You can read its Rails implementation to fully understand it.
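Simplified to pure Ruby, that implementation boils down to a loop like this (a paraphrase, not the actual Rails code):

```ruby
# Simplified paraphrase of ActiveRecord's in_batches: pluck one batch of
# primary key ids at a time, and yield a fresh relation scoped on that
# id list (this is the relation you can then safely join on).
def in_batches_sketch(relation, of: 1000)
  last_id = 0
  loop do
    ids = relation.where("id > ?", last_id).order(:id).limit(of).pluck(:id)
    break if ids.empty?
    yield relation.where(id: ids) # scoped on ids only, no join involved
    last_id = ids.last
  end
end
```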

This was a little overview of our daily tech challenges. If you are interested in knowing more about Lifen and our current open positions, feel free to contact us at contact@lifen.fr.

Thanks a lot to Matthieu Paret for his very practical performance optimisation on this problem.
