Optimizing our longest nightly job — a story of sessions, complexity, and toilets

This article was co-written with Cyril Chapellier.

Last year, James (the founding father of the Alan claim engine) shared some details about how the engine works, its requirements and design principles. If you haven’t read it yet, you can find it here.

In this article, part of the conclusion reads as follows (emphasis mine):

One thing, however, that we explicitly did not focus on was performance. The rules of health billing in France are so complex that we put a huge emphasis on keeping the logic as readable as possible, often at the expense of performance optimization. As we scale, we start identifying performance bottlenecks and we address them as needed.

Well, guess what; today, the time has finally come to talk about performance.

A tale of nightly commands

To ensure that we’re always up to date regarding legislation (mainly, new types of cares, or a change in the amount the social security reimburses), we have a batch of commands running every night, making the necessary changes in our data.

Since care types do not change regularly, we pre-compute a lot of related data, the most important of which being our specific guarantees (basically the combination of each care type and a member’s coverage rules).

This pre-computation of guarantees is also done every night, to make sure that all our guarantees are always up to date with the care types and the health coverage of members (that could also change, from time to time) every day.

Nightly commands are great: they don’t impact our members as they mostly use our service during the day (so we try to do heavy computations and database access during the night when — mostly — everyone sleeps).

Except when they take too long.

One day, we noticed that one of these commands was in fact still running after more than 20h, and that the next run would indeed be run before the previous one has ended. This was a big problem — not because the data could have been corrupted (it could not thanks to our idempotency requirement), but because it indicated clearly that we had a performance bottleneck, and probably some scalability issues down the line.

We took on a mission to find out how to improve this run time; we learned a lot along the way. Come with us!

So, what’s taking so long anyway?

📝 Before we start digging into the job specifics, please note that our codebase is written in Python, and that we’re using SQLAlchemy as an ORM.

A quick investigation of our logs showed us the breakdown of what this command does:

  • Load various data from configuration files: about ~1 second
  • Fetch all health coverages we need to work with: about 25 minutes
  • Compute guarantees for all health coverages: about 19 hours. More precisely: we have roughly 30k objects to iterate on, and each computation takes about 2 seconds
  • Aggregate display data for our apps: about ~1 second

By focusing on reducing the time taken by a single guarantee computation, we should reap the bulk of the benefits.

Let’s drill some more

We perform this 2-second computation a lot, but where exactly do we spend time, then? Let’s profile the command locally to see. We chose to profile the work on a subset of 10 objects instead of 30 000, to be able to iterate quickly.

To do that, we’re using a combination of SQLTap (a SQL profiling tool), logs and ad-hoc profiling, to measure where the most heavy processing takes place.

Using SnakeViz (a graphical profile visualizer), we can see that a good 50% of the time is spent in a _validate_care_type_mapping function.

First off, that function is called in a loop, M times, for each mapping in the health coverage guarantees:

for mapping in guarantee.mappings:
# ...
_validate_care_type_mapping(mapping)

This leads us to another function, validate_care_type_selector , which is called G times (the average number of guarantees associated to that coverage).

When scoping our experimentation to 10 health coverage objects, validate_care_type_selector is called 546 times. With the full dataset of 30k objects, we would call it roughly ~1.5M times!

Following the profile, we go down a layer and find another nested loop:

care_types_from_groups = set(
flatten_list(group.care_types for group in care_type_groups)
)

# note: group.care_types lazily performs an SQL query,
# as it is a SQLAlchemy "relationship" field

So here, another iteration of N loops (N being the number of care type groups).

And we finally find the line responsible for about 95% of the time taken by the whole job: looking at the SQLTap report, we see that the #1 offender is an SQL query fetching all the care types:

Wait. So we have, for each computation, N×M×G iterations each doing approximately the same SQL query? And we do this for every health coverage, so C = 30×10³ times, so we basically have something close to a quartic (O(n⁴)) complexity method executing an SQL query in the innermost loop!

Reducing the number of SQL queries by caching a whole table

The SQLTap report states that 297 unique parameter sets are supplied for 297 queries. That’s not great, because away goes any possibility of caching and re-using queries results.

However… Could we cache the entire dataset in memory?

SELECT COUNT(*) as num_rows FROM care_type;

num_rows
----------
514

These numbers are so low that yes, we should probably cache all care types objects once, instead of having to always reach out to our database with simple requests.

That will save us about 1,5M SQL queries per run. 🤩

Down the rabbit hole

While we’re at it, we have another method accounting for 30% of the whole runtime (_update_internal_care_type_guarantee_mappings, not detailed in the first flame graph), and it is called once per health coverage object. Most of that time is spent in SQL queries too, so let’s have a look.

There was one specific query taking a lot of time because the underlying table contains a lot of data (several dozens of millions of rows).

It turns out, that query was executed several times because we were iterating over the SQLAlchemy query, and not the query results.

query = Model.query.filter(...)

for row in query:
...

# later on

for row in query:
...

# even later on

for row in query:
...

We thought that the result would be cached somehow, but it turns out that doing this will lazily execute the query three times, needlessly*.* We change the code to iterate over the result of the query instead, and that saved us many more unnecessary SQL queries.

So, let’s look at the results 🥁

We pushed this in production, waited for the next command trigger, and then compared the execution time by looking at the number of logs emitted over time (shorter execution time means less blue lines).

The overall runtime of the job was… exactly the same.

Engineers when they avoid making SQL queries but it doesn’t change anything, circa 2022, colorized

A session isn’t a toilet, you don’t need to flush every time

Clearly, something was off between our local profiles and production. We decided to take a live profile of the job using py-spy to better understand what the core issue was.

What we found was surprising: we were spending 84% of our time flushing our SQLAlchemy session (i.e. sending in-transaction data from the session to the database engine):

Why didn’t we see this locally? 🤔

We mentioned earlier that we were only iterating on a subset of the objects (10 objects instead of more than 30k) when performing local profiles, assuming that we could linearly derive the production profile from it.

But a consequence of having only 10 objects is that the identity map (the SQLAlchemy internal cache of modified / deleted / new objects) was a lot smaller than in production. where it contained the whole set of objects. We were iterating over this set again and again at flush time, with an increasing list of modified objects.

As Mike Bayer (SQLAlchemy author) puts it: “the size of the object list to be reconciled and sorted during flush gets very large — so issue a flush() after every 100 or so entries added to the session.”

With this in mind, we did what we thought would be clever then: batching our flushes, as it seemed to be a good way to consolidate the fixed cost each n iterations.

We took a pretty conservative n = 100, and while this had yielded great results locally, we woke up to find that it hadn’t improved anything. Again.

After a quick session of collective table flipping, we took another live profile and discovered something interesting: the flushes were still there:

But there was a big difference: these flushes were done as part of the SQLAlchemy session autoflush mechanism:

the flush step is nearly always done transparently. Specifically, the flush occurs before any individual SQL statement is issued as a result of a Query.

As we were still selecting quite a few things on each iteration, we were autoflushing the session at each iteration too!

This explained why our flush improvements didn’t seem to have any impact — the manual flushes we were doing were just replaced by automatic flushes, doing exactly the same thing (at the same performance cost).

Can we finally see some results please?

We disabled the session autoflush by executing the subset of the job where we initially observed problems within a Session.no_autoflush block.

This time, the effect was spectacular: we went from a runtime of more than 21h to … 2 hours and 11 minutes:

We were left with a question: why was flushing so expensive in the first place? In our experience, flushing has a linear cost, proportional to the size of the identity map. While the 11x speedup was certainly appreciable, we didn’t really understand the numbers. Something just didn’t add up.

We ran a simple experiment: we profiled the following piece of code, for 10, 100 and 1000 items.

for obj in Model.query.limit(num_items):
obj.updated_at = datetime.utcnow()
current_session.add(obj)
current_session.flush()

Looking at the profile for the num_items=1000 case, we saw that we were spending a lot of time in postgresql_audit, a third-party SQLAlchemy extension allowing us to keep an audit log of the changes made to most of our models.

The more we increased num_items, the more time we spent in that extension! We compiled out results in the following table.

It became obvious that we were paying a supra-linear price for our session flushes, due to this particular extension. This behavior combined with session autoflushes was a perfect recipe for incremental slowdowns: the more data we operate on, the slower the flushes over time.

Armed with that knowledge, we decided to go further, and try to keep the amount of objects stored in the identity map to a minimum, in order to reduce the cost of our (now batched) flushes.

The quest for a smaller guarantee table

The main table we were working on had more than 30M lines, and this number had been continuously growing, at a much quicker pace than our overall health coverage growth.

Digging a bit deeper on the business side, the table is in fact a Cartesian product table between various models, that had been growing very much since we started creating a lot of different custom health plans to cater for the need of our members.

However, not all the plans that we created were used — which in turn meant that many of the rows were associated with coverages that had never been used. We should be able to exclude a good chunk of the health coverages from being worked on in the first place!

We reached out to the team responsible for the creation of health plans, and found out that about 94% of the objects (roughly 28k health coverages) we had referenced in this table were in fact never used. We thus didn’t need to spend time computing associated data for them.

We tweaked our initial basis query to filter them out. With this patch deployed, the job took … 8 minutes.

So. What did we learn?

Flushing has a non-trivial cost, and it’s easy to miss when flushes occur

We have discovered that flushes are costly when you’re operating on many objects, and they can happen at select time if your session has autoflush enabled.

That means that if you start flushing manually, you probably want to wrap your work in a with session.no_autoflush block, otherwise your carefully placed session.flush() calls might be completely useless (because everything would be flushed elsewhere). You could even disable the autoflush session mechanism altogether.

While the usual cost of flushing is linear with the number of objects contained in the SQLAlchemy identity map, in our case, the cost is supra-linear. Addressing that issue should induce a sizeable speedup throughout our system!

It is easy to run the same query multiple times with iterations

We found out that an SQLAlchemy query object (returned by current_session.query(...) is lazily executed, which means that if you do something like this:

query_result = session.query(Model).filter(...) # returns a BaseQuery[Model]
...
mapping1 = {something for row in query_result} # first iteration
...
mapping2 = {somethingelse for row in query_result} # second iteration

The query is executed twice! If you want to do this efficiently (especially if the query is expensive), you absolutely need to collect the query result, e.g. via the all() method (or by casting the query to a list, which is essentially the same).

Optimizing SQL queries can only go so far without a strong understanding of the business logic

We spent a lot of time optimizing at the periphery of the performance black hole that was caused by the slow flushes and slow queries on the big table containing all health coverages.

One of the biggest performance boost we saw was due to an optimization in the business logic, reducing the amount of data we needed to work on in the first place.

Technical savviness and business understanding have to work hand in hand when it comes to improving the performance of a complex system.

In the coming weeks and months, we’re going to apply the same profiling techniques to speed up the claim engine as a whole. Given our size, not only does it have to be auditable and fault-tolerant, but it now also has to be performant as well.

Thanks a lot for reading, and don’t forget, we’re hiring 🎉

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store