Optimizing batches in Rails for a healthcare directory based on event-sourcing
Context
Lifen provides healthcare professionals with an innovative application to facilitate the exchange of medical documents in an easy and secure way.
One of our key components is our health professionals directory, because every day we need access to the most reliable information about which communication channel we should use. For this purpose we have a Rails 5.1 app (codename “Nestor”) in charge of importing millions of rows of information on a daily basis from multiple sources.
We’ve adopted an event sourcing pattern in order to accurately track the evolution of our data. The process is quite simple. For every source of information we:
- import the data
- identify differences which produces Events
- apply Events to our existing data
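The three steps can be sketched with plain Ruby hashes (a simplified model; the key format and attribute names here are illustrative, our real pipeline works on ActiveRecord models):

```ruby
# Simplified sketch of the three steps, using plain hashes keyed by
# identifier instead of our actual models (names are illustrative).

# Step 2: identify differences between current data and a fresh import,
# which produces Events.
def diff(current, imported)
  events = []
  imported.each do |key, attrs|
    if !current.key?(key)
      events << { type: :create, key: key, attrs: attrs }
    elsif current[key] != attrs
      events << { type: :update, key: key, attrs: attrs }
    end
  end
  events
end

# Step 3: apply Events to our existing data.
def apply(current, events)
  events.each { |event| current[event[:key]] = event[:attrs] }
  current
end

current  = { "adeli|123" => { phone: "0102030405" } }
imported = { "adeli|123" => { phone: "0605040302" },
             "adeli|456" => { phone: "0708091011" } }

events = diff(current, imported)  # one :update and one :create event
apply(current, events)
```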
We recently ran into a very interesting performance problem: the whole update process from all our data sources has to take less than 24 hours (a logical constraint if we want to update our directory on a daily basis). Therefore, we have a strong focus on performance, and more precisely on the time required to process our Events.
We all know that using each when dealing with ActiveRecord collections can be dangerous, as it loads all events in memory:
collection.each do |event|
  process(event)
end
A common solution is to batch the processing of those events into smaller “sub-relations”. That’s what we were doing at the beginning of Nestor:
collection.find_each do |event|
  process(event)
end
And to be honest, this pattern works pretty well if we focus only on the memory used by Ruby. In our situation, the problem was the time spent in the database, as the find_each magic is all about calling the DB multiple times with smaller batches.
The problem
All our main resources (Practitioners and Organizations) have one or many Identifiers. An Identifier belongs to a resource, and has a system and a value (we use FHIR as our data model, we did not want to reinvent the wheel on this topic!).
We also have an Events table, 12M rows and growing pretty fast (we monitor on average 5,000 modifications per day, as well as occasional event surges). All Events are linked to an Identifier using the system and value keys.
When we process our Events for a specified version we do something like:
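The original snippet is not shown here, so this is a hedged sketch of that processing step, assuming the relations described above: an Event selection on a version, joined with its active Identifier, and iterated with find_each.

```ruby
# Hedged sketch of the processing step (the actual snippet is not shown);
# model and association names (Event, identifier) follow the text.
def process_events(version)
  Event.where(version: version)
       .joins(:identifier)
       .where(identifiers: { active: true })
       .find_each do |event|
    process(event)
  end
end
```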
The problem we encountered arises when the initial Events selection (Event.where(version: 18)) contains lots of elements (let’s say 10,000). With a batch size of 1,000 elements (Rails’ default value), it means that we will be running an expensive joined and ordered query 10 times:
- the first one will be like
SELECT * FROM events LEFT JOIN identifiers ORDER BY events.id LIMIT 1000;
- the next one will be
SELECT * FROM events LEFT JOIN identifiers WHERE events.id > <last id of the previous batch> ORDER BY events.id LIMIT 1000;
- … and so on
ActiveRecord has to use the ORDER BY events.id to make sure we don’t process the same event twice, and doing so is quite expensive on a joined table (we were experiencing queries taking several seconds…).
The solution
To avoid joining tables during the iteration, we need to iterate on the raw Events table alone. Once we get the sub-collection, we can finally apply the join condition (WHERE identifiers.active = 't').
Here is the solution we started using, an extension built on the in_batches method to replace find_each:
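The code block is missing from this export, so here is a hedged reconstruction of the pattern: iterate with in_batches on the bare Events relation, and apply the joins/where only on each yielded sub-relation. The with_in_batches name comes from the paragraphs below; its signature and body are our best guess.

```ruby
# Hedged reconstruction: batch on the unjoined relation, then join and
# filter each sub-relation. Only the with_in_batches name comes from the
# article; the signature is an assumption.
module FindInBatchesExtension
  def with_in_batches(joins:, where:, batch_size: 1000)
    in_batches(of: batch_size) do |batch|
      # `batch` is only scoped on a list of primary key ids, so the join
      # happens on a small sub-collection, without the costly ORDER BY.
      batch.joins(joins).where(where).each { |record| yield record }
    end
  end
end

# Hypothetical usage:
# Event.where(version: 18)
#      .extending(FindInBatchesExtension)
#      .with_in_batches(joins: :identifier,
#                       where: { identifiers: { active: true } }) do |event|
#   process(event)
# end
```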
One edge case we noticed early on was when the joining condition was pretty restrictive (matching less than 10% of the initially selected Events, for example), which led to batching empty collections.
If the joined (and filtered) collection contains fewer than 10% of the elements of the unjoined collection, we use ActiveRecord’s default pattern in order to minimize the number of batches. Otherwise, we switch to the with_in_batches method, which applies the join and the where only on the sub-collection.
To better manage this edge case, we introduced a little variation in our FindInBatches service:
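The service itself is not shown in this export; here is a hedged sketch of the variation, assuming a find method and the 10% threshold described below. JOIN_COST_ARBITRARY_COEF and the find method come from the article; the structure around them is ours.

```ruby
# Hedged sketch of the FindInBatches variation. JOIN_COST_ARBITRARY_COEF
# and the find method come from the article; everything else is assumed.
class FindInBatches
  # Below this joined/unjoined ratio, batching the unjoined table would
  # mostly yield empty sub-collections, so the default pattern wins.
  JOIN_COST_ARBITRARY_COEF = 0.1

  def initialize(relation, joins:, where:)
    @relation = relation
    @joins = joins
    @where = where
  end

  def find(&block)
    joined = @relation.joins(@joins).where(@where)
    if joined.count < @relation.count * JOIN_COST_ARBITRARY_COEF
      # Restrictive join: use ActiveRecord's default batching on the
      # joined relation to minimize the number of batches.
      joined.find_each(&block)
    else
      # Cheap join: batch on the unjoined table, join each sub-collection.
      @relation.in_batches do |batch|
        batch.joins(@joins).where(@where).each(&block)
      end
    end
  end
end
```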
In the find method, we start by counting the number of elements in the joined collection and the number of elements in the unjoined collection.
As indicated by its name, JOIN_COST_ARBITRARY_COEF is purely arbitrary, based on the query times we monitored in both situations.
Remark: “You said you want to avoid joining while batching. But in_batches returns an ActiveRecord relation and you apply the join on it, so in the end you have a batching relation with joining, as you would have using find_each.”
The trick is that in_batches does not yield a “classic” relation. It yields a relation, but that relation does not reuse the batching query itself: it is built from the primary key ids returned by that query. That’s why the final relation is not a batching relation, but one simply scoped on a list of ids.
You can read its Rails implementation to fully understand it.
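A simplified model of that behaviour, based on our reading of the Rails source (the in-memory table of hashes stands in for the events table):

```ruby
# Simplified model of in_batches: fetch one batch of primary key ids with
# the ordered batching query, then yield a relation scoped ONLY on those
# ids -- the join applied by the caller never touches the batching query.
def in_batches_sketch(table, of:)
  remaining_ids = table.map { |row| row[:id] }.sort
  until remaining_ids.empty?
    batch_ids = remaining_ids.shift(of)
    # Equivalent of `WHERE events.id IN (batch_ids)`:
    yield table.select { |row| batch_ids.include?(row[:id]) }
  end
end
```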
This was a little overview of our daily tech challenges. If you are interested in knowing more about Lifen and our current open positions, feel free to contact us at contact@lifen.fr.
Thanks a lot to Matthieu Paret for his very practical performance optimisation on this problem.