Claim Engine: Saving 60k$/year by avoiding unnecessary recomputation

💼 Executive Summary

  • the Claim Engine is at the core of Alan, responsible for analysing care acts of our members, matching them with their coverage policy, and deciding how much to reimburse
  • it also performs arbitrary recomputation of reimbursements for existing care acts, which is very costly
  • we worked to safely reduce the need for these arbitrary recomputations to almost zero
  • we estimated that triggered + arbitrary recomputations cost around 100k$/year (and growing)

👉 we’re getting there, we’re already saving 60k$/year (and it’s only the start!)

💁🏼 Preliminary

What is Alan’s Claim Engine? We have a great article about it on our blog.

One of the Claim Engine’s main goals is to compute the amount to reimburse members after they have received care acts.

This computation is usually triggered by an event: the member uploaded a document, we received a décompte (statement) from Noémie, someone from support unblocked something, etc.

The “Catch-All recomputation job” 🪝

On top of these natural triggers, there is an automated process which recomputes a random portion of all reimbursements. This job is called the “Catch-All recomputation job”.

It is arbitrary because every night, the Claim Engine recomputes a portion of all members’ reimbursements, even though nothing has triggered it.
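
To make the mechanism concrete, here is a minimal sketch of what such a nightly job could look like. Everything in it (names, sampling ratio) is hypothetical and only meant to illustrate the idea:

```python
import random

# Minimal sketch of the nightly Catch-All job (hypothetical names, not Alan's
# actual code): pick a random slice of members and recompute their
# reimbursements, even though nothing triggered it.

CATCH_ALL_SAMPLE_RATIO = 0.05  # illustrative: recompute ~5% of members per night


def recompute_reimbursements(member_id: int) -> None:
    """Placeholder for the (expensive) Claim Engine recomputation."""
    print(f"recomputing reimbursements for member {member_id}")


def run_catch_all_recomputation(all_member_ids: list[int]) -> None:
    sample_size = int(len(all_member_ids) * CATCH_ALL_SAMPLE_RATIO)
    for member_id in random.sample(all_member_ids, sample_size):
        recompute_reimbursements(member_id)


run_catch_all_recomputation(list(range(1, 1001)))
```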

Why does the catch-all process exist?

Primarily because some events require recomputing reimbursements, but we do not detect them, so nothing triggers the recomputation. Examples:

  • someone or something changed a start date on a contract, so we need to recompute the reimbursement for all related members.
  • our products are evolving, and it’s difficult to make sure all these changes trigger a recomputation when needed; today this relies on the engineers implementing them.

Recomputing everything arbitrarily was a great decision to be sure we “catch all cases”: it was simple, easy to grasp and implement, and cheap.

Here it’s important to understand that we take our mission very seriously: when a minor change happens in the present, we make sure to reapply the rules retroactively, potentially reimbursing members and fixing mistakes even in the past. Alan is probably the only health insurer to do that!

Why is it a problem? 🔥

Recomputing all reimbursements was OK when the number of members was low, as well as the number of events and triggers. However, this approach didn’t scale well with growth.

Recomputing reimbursements accounts for around 65% of our job processing machine time

This has a big cost, but it also puts a lot of pressure on our infrastructure (Queuing System, Database) and decreases our Delight Effect through ripple effects, as it delays other jobs needed to serve members. It was also a direct or indirect cause, or an aggravating factor, of past incidents.

How much does it cost? 💸

A preliminary (and seminal) internal analysis about scaling the Claim Engine had already collected data, so we were able to compute a “back of the envelope” cost estimation:

  • the Catch-All recomputation is what generates most of the run_from_care_act_id jobs
  • looking at the list of asynchronous Jobs, they consume 4k CPU days out of a total of 6.1k CPU days
  • Jobs run on our Cloud provider, and we estimated the annual cost of this “catch-all recomputation job” at around 98.4k$ (a quick arithmetic check is sketched below)
  • 👉 Reimbursement recomputation costs around 100k$/year
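
For transparency, that back-of-the-envelope arithmetic fits in a few lines (illustrative only; the per-CPU-day price is derived from the figures above, not from an actual cloud quote):

```python
# Back-of-the-envelope check of the figures above.
recompute_cpu_days = 4_000      # CPU days consumed by recomputation jobs
total_cpu_days = 6_100          # total CPU days across all asynchronous Jobs
annual_recompute_cost = 98_400  # estimated yearly cost of these jobs, in $

share_of_machine_time = recompute_cpu_days / total_cpu_days
cost_per_cpu_day = annual_recompute_cost / recompute_cpu_days

print(f"{share_of_machine_time:.0%} of job machine time")  # ~66%, matching the ~65% above
print(f"~${cost_per_cpu_day:.1f} per CPU day")
```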

🤑 Saving 100k$/year by killing this job!

A note on the cost: it’s 100k$/year for our current number of members. The cost grows linearly with the number of members, but also with the number of care events and with the features added to the engine, so we believe the overall cost grows faster than linearly. Hence the incentive to kill the job now.

But before pulling the plug, we want to prove that removing it won’t have any negative impact.

Indeed, maybe we are missing important “triggers” and the recomputation is not run in many cases, so we can’t live without the Catch-All job? That would mean missed reimbursements, out-of-date information in the app for members, a big increase in Care / Ops solicitations, and it could end up costing more than it saves 😟

The plan: kill the job with confidence using the “useful recomputation” concept

We worked on a plan to make sure we could unplug the Catch-All job without impact. We coined a concept of usefulness: if an arbitrary Catch-All run for a member doesn’t change the overall status of that member’s account, it is considered useless; if it induces a change, it is considered useful.

The plan is to prove that Catch-All recomputation is useless 99% of the time. To do that, we ran the following steps:

  1. add logging and monitoring on the Queuing System which is running the Catch-All job
  2. find a way to measure the “usefulness” of the Catch-All runs, to confirm they are mostly useless.
  3. extract legitimate recomputations out of the arbitrary Catch-All and put them in dedicated processes
  4. drastically reduce Catch-All recomputation, keep it at extremely low volume and sampling, to catch any future drift in usefulness.

⚙️ Executing the plan

Step 1: fixing missing triggers in the code

The idea is to fix the code where we should react to a trigger and update a computation but don’t (these are the “missing triggers”).

This had been done in the first half of 2023 by a group of engineers dedicated to the Claim Engine.
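
As an illustration, a typical “missing trigger” fix looks like the sketch below, using the contract start date example from earlier. All names are hypothetical, not Alan’s real code:

```python
import datetime

# Hypothetical sketch of a "missing trigger" fix: the write path that mutates
# data now enqueues the recomputation itself, instead of relying on the
# nightly Catch-All job to eventually pick up the change.


def enqueue_recomputation(member_id: int) -> None:
    print(f"enqueue reimbursement recomputation for member {member_id}")


def update_contract_start_date(
    contract_id: int,
    new_start_date: datetime.date,
    covered_member_ids: list[int],
) -> None:
    # ... persist the new start date on the contract here ...
    # The fix: explicitly trigger the recomputation for all related members.
    for member_id in covered_member_ids:
        enqueue_recomputation(member_id)


update_contract_start_date(101, datetime.date(2023, 1, 1), covered_member_ids=[1, 2, 3])
```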

Step 2: adding logging and monitoring

The idea is to have a better view of our Queuing System, which runs the Catch-All job.

This was done by the end of 2023, enabling the team to know what’s happening, whether a job was started by the Catch-All recomputation or not, etc.
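
A minimal sketch of the kind of instrumentation this enables, assuming (hypothetically) that each recomputation is tagged with its origin when it is enqueued:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("claim_engine.jobs")

# Hypothetical instrumentation sketch: every recomputation carries its origin,
# so dashboards can split Catch-All runs from event-triggered runs.


def enqueue_recomputation(member_id: int, origin: str) -> None:
    # origin could be e.g. "catch_all", "document_uploaded", "noemie_decompte"
    logger.info("enqueue recomputation (member=%s, origin=%s)", member_id, origin)
    # ... the actual call to the queuing system would go here ...


enqueue_recomputation(42, origin="catch_all")
```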

Step 3: measuring the usefulness

We needed a way to measure the “usefulness” of the Catch-All runs, to confirm they are mostly useless.

This was harder than we thought. The reimbursement computation can in fact update a lot of information (and also send emails): numerous side effects that have to be accounted for.

What if we had a synthetic view of a member’s status, with all their care acts, contracts, policies, personal details, etc.? We could compare this view before and after running the Catch-All job. If there were any difference, it would mean that the Catch-All run was useful.

Luckily, we happen to have exactly that! 🙂 One of our internal tools displays a synthetic view, using a data structure called IPIC.

The Insurance Profile Information Cache (IPIC) is a data structure generated at the end of a Catch-All run for each member. It is used to quickly display information about a member, in our internal tools. It contains a list of care events, how much was reimbursed, etc…

Comparing it before and after the run gives a good enough estimate of usefulness (a minimal sketch follows the list below):

  • if the IPIC before the Catch-All run is identical to the new IPIC generated at the end of the run, then the run was useless
  • if the two IPICs (before and after) are different, it means the Catch-All run was useful.
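
In code, the comparison boils down to a structural equality check between two snapshots. The model below is a hypothetical, heavily simplified stand-in for the real IPIC:

```python
from dataclasses import dataclass, field

# Hypothetical, heavily simplified stand-in for the real IPIC structure.


@dataclass(frozen=True)
class CareEventSummary:
    care_event_id: int
    reimbursed_cents: int
    status: str


@dataclass(frozen=True)
class IpicSnapshot:
    member_id: int
    care_events: tuple[CareEventSummary, ...] = field(default_factory=tuple)


def catch_all_run_was_useful(before: IpicSnapshot, after: IpicSnapshot) -> bool:
    """A Catch-All run is 'useful' if and only if it changed the member's view."""
    return before != after


before = IpicSnapshot(1, (CareEventSummary(10, 2500, "reimbursed"),))
after = IpicSnapshot(1, (CareEventSummary(10, 2500, "reimbursed"),))
print(catch_all_run_was_useful(before, after))  # False: this run was useless
```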

With this methodology, at the end of Q3, we could graph the ratio of useful/useless Catch-All runs.

We measured that, on average, only 3% of arbitrary Catch-All runs were useful. This convinced us that we could unplug the Catch-All process without negative impact.

Step 4: extracting legitimate recomputations from Catch-All process

Some of the arbitrary recomputations are, however, very useful: they finish computing reimbursements based on information coming from our Tiers Payant provider. We extracted these special recomputations into their own dedicated jobs. There were around 600 of them per day.

Step 5: drastically reducing the Catch-All job, adding sampling

We reduced Catch-All recomputations by a factor of 16, and no negative impact was visible. We then reduced it further, and finally kept it at a minimal sampling size — while still remaining statistically representative. We added alerting on this metric to detect dubious behavior, like a sudden increase, which would indicate that we’re missing a new recomputation trigger somewhere.
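
A sketch of what this end state could look like, with purely illustrative values for the sampling ratio and the alert threshold:

```python
import random

# Hypothetical sketch of the end state: the Catch-All job keeps running on a
# tiny random sample of members, and we alert if the share of "useful" runs
# drifts upward, which would suggest a new missing recomputation trigger.

CATCH_ALL_SAMPLE_RATIO = 0.003       # illustrative value, not the real one
USEFUL_RATIO_ALERT_THRESHOLD = 0.05  # illustrative value, not the real one


def should_run_catch_all_for(member_id: int) -> bool:
    # Keep only a small random sample of members each night.
    return random.random() < CATCH_ALL_SAMPLE_RATIO


def check_usefulness_drift(useful_runs: int, total_runs: int) -> None:
    # In production this would feed a monitoring dashboard and an alert.
    if total_runs and useful_runs / total_runs > USEFUL_RATIO_ALERT_THRESHOLD:
        print("ALERT: Catch-All usefulness ratio is drifting up; "
              "a recomputation trigger is probably missing somewhere")
```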

🏝 Where are we now?

After reducing the number of Catch-All jobs, they accounted for only 1.42k worker days out of 6.23k. We did an approximate computation of what we saved, and it turned out:

👉 we’re now saving 60k$/year
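
As a rough check of that number, we can redo the arithmetic under the assumption (ours, for illustration) that the cost per worker day stayed comparable to the earlier estimate and that the before/after figures cover the same family of recomputation jobs:

```python
# Rough sanity check of the savings (illustrative arithmetic only).
cost_per_worker_day = 98_400 / 4_000   # ≈ $24.6 per day, from the earlier estimate
worker_days_before = 4_000             # recomputation worker days before
worker_days_after = 1_420              # recomputation worker days now

yearly_savings = (worker_days_before - worker_days_after) * cost_per_worker_day
print(f"≈ ${yearly_savings / 1_000:.0f}k saved per year")  # ≈ $63k, i.e. ~60k$/year
```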

What are the next steps to get us to saving 100k$/year? The metrics tell us that other reimbursement recomputation jobs (not part of the Catch-All job) are also useless most of the time (up to 84% of the time). However, we think it’s rather complex to save costs on these jobs, and they don’t amount to enough money to be worth it for now (we’ll revisit next year). There are better untapped sources of cost reduction: we’re currently exploring optimizing our jobs management system, as we suspect it’s wasting a lot of time doing bookkeeping. As always, it’s important to find the right balance to avoid spending more than we save.

We’re confident we can continue reducing cost while increasing reliability and delight! 🚀
