Alan’s Claim Engine: Technical Design Principles
Last year, we at Alan migrated all our French members to our shiny new in-house claim management system. It was a major operational and technical undertaking, with a lot of knowledge gained along the way. This post will cover the main technical design principles that guided the development of the new Claim Engine for France.
What does the Claim Engine do?
As your health insurer, Alan’s principal role is to reimburse your medical expenses. This is what we call Claim Management. At its core, it consists of:
- receiving information about your claims, through document uploads from members or directly from Sécurité Sociale
- extracting the line items (aka the Cares) from these sources
- computing reimbursement amounts for each Care
- sending the money along to the member (we do instant bank transfers!)
There are a few complications I won’t get into, principally ensuring that we don’t reimburse the same Care more than once. Alan’s Claim Engine is the system that performs all those tasks, allowing our members to be reimbursed for their medical expenses with unmatched speed and accuracy. We’ve invested a lot of energy in making it as automated and error-proof as possible. While it sounds straightforward, there are lots of pitfalls, so as we embarked on this project, we laid out the main requirements, which led us to define the technical design principles for the system.
- Automation and real-time: the end goal for our Claim Engine is to be 100% automated and reimburse you in a matter of seconds.
- Allow manual actions: any step in the process can be reviewed and overridden by a human operator.
- Handle mistakes and corrections: while the happy path is very linear, there will always be a fraction of claims that need to be amended. Sometimes it’s because we receive a new document for it, or our algorithm produced a wrong result, or even a human operator made a mistake. Handling mistakes should be easy. They should be fixed in one place (where the mistake occurred), without having to worry about the downstream changes.
- Auditable: we need to have a full history of what happened and who did what. We also need to easily know whether any claims are stuck in a non-terminal state. There are several related legal and regulatory requirements which we won’t go into because they don’t have major technical implications.
- Fault tolerance: things will go wrong. Servers crash, exceptions go unhandled, emails fail to send, payments fail, etc. We need high confidence that the system will recover on its own. Whatever happens, the state of the database should be correct and consistent.
- Handling concurrency: we will have multiple processes (or humans) trying to do related tasks at the same time. And since we are moving money, this exposes us to reimbursing more than we should. The system needs to handle concurrency well, even if it’s infrequent.
Design Principle 1: Clearly separated steps
What is it?
The Claim Management flow is broken down into discrete steps, with clearly defined inputs and outputs. The output of one step is fed into the next step, all the way from the Source (document upload) to the Payment. To keep it simple, we’ll summarize the steps as:
- Extraction: Source -> List of Cares
- Coverage computation: List of Cares -> Amounts covered
- Payment: Amounts covered -> Bank transfers.
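The pipeline above can be sketched as typed functions over small dataclasses. Everything here is illustrative: the names and the flat 70% coverage rule are assumptions for the sketch, not Alan’s actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Care:
    code: str
    amount_cents: int


@dataclass(frozen=True)
class Coverage:
    care: Care
    covered_cents: int


def extract(document: bytes) -> list[Care]:
    """Extraction: Source -> List of Cares (parsing logic elided)."""
    raise NotImplementedError


def compute_coverage(cares: list[Care]) -> list[Coverage]:
    """Coverage computation: List of Cares -> Amounts covered."""
    # Hypothetical flat 70% coverage rule, for illustration only.
    return [Coverage(care=c, covered_cents=round(c.amount_cents * 0.7)) for c in cares]


def pay(coverages: list[Coverage]) -> None:
    """Payment: Amounts covered -> Bank transfers (gateway call elided)."""
    raise NotImplementedError
```

Each function consumes exactly the output of the previous one, which is what makes the steps independently automatable and overridable.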
Why it matters
It makes automation (Req 1) a lot easier to achieve. We can automate the easy steps, while leaving the hard steps — principally document parsing — to humans. Also, by having a small and finite number of steps, we can develop operator tools for each step, so that operators can take over from the automated engine (Req 2). And since each step takes as input the output of the previous step, we just need to fix mistakes in the step where they occurred. All downstream effects are handled automatically. There is no need to fix anything downstream manually. This helps a lot with handling corrections (Req 3).
Moreover, it helps with auditing and tracking (Req 4): the result of each step is saved in the DB, along with who created it (whether human or robot). It also helps maintain state consistency (Req 5). Indeed, the result of each step is applied either in its entirety or not at all (using DB transactions). So while there are thousands of lines of code to get you from document to payment, you can only be at the boundary between two steps. This drastically reduces the number of states the DB can be in. As a result, it’s also easy to restart after a failure: just identify the non-terminal states and perform the next step. And we can easily spot old non-terminal states that are not advancing, which is usually the sign of a problem. Finally, the individual steps provide easy boundaries for enforcing mutual exclusion to handle concurrency (Req 6).
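Spotting old non-terminal states then amounts to a simple scan. A minimal sketch, assuming each claim row carries a `state` and an `updated_at` timestamp (both hypothetical field names):

```python
from datetime import datetime, timedelta

# Hypothetical terminal states for the sketch.
TERMINAL_STATES = {"paid", "rejected"}


def stuck_claims(claims: list[dict], now: datetime,
                 max_age: timedelta = timedelta(hours=1)) -> list[dict]:
    """Non-terminal claims that haven't advanced recently: usually a sign of a problem."""
    return [
        c for c in claims
        if c["state"] not in TERMINAL_STATES and now - c["updated_at"] > max_age
    ]
```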
How we do it
Each step follows the same pattern. It has a `compute` method that takes the input and produces an in-memory output, usually a Python dataclass. This method does not make any DB changes. Then there is a `save` method, which takes the output of `compute` and performs the following:
- acquire the relevant locks and open a DB transaction
- validate the output of `compute`. It’s very important to do this after locking and reloading your state from the DB, so that validation is done against a view of the database that hasn’t since been modified by another process
- insert records into the DB
- commit the DB transaction and release the locks

`save` methods are the only parts of the code that have any real effects: primarily DB changes, but also sending emails or triggering bank transfers. The remaining 95% of the code is entirely stateless. This helps tremendously in ensuring a consistent state.
The automation bot will run `compute` and pass the results to `save`, while a human operator, through their interface, will bypass `compute` and supply inputs directly to the `save` method. Both paths will trigger the next step.
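The pattern can be sketched as follows. The in-memory list standing in for the DB, the single lock, and the 70% rule are all toy assumptions, kept just small enough to show the `compute`/`save` split:

```python
import threading
from contextlib import contextmanager
from dataclasses import dataclass

# Toy stand-ins for the real DB and lock manager, for illustration only.
DB: list[dict] = []
_lock = threading.Lock()


@contextmanager
def locked_transaction():
    """Acquire the lock, stage writes, and apply them all-or-nothing on success."""
    with _lock:
        staged: list[dict] = []
        yield staged
        DB.extend(staged)  # "commit": only reached if no exception was raised


@dataclass(frozen=True)
class CoverageOutput:
    care_id: str
    covered_cents: int


def compute(care_id: str, amount_cents: int) -> CoverageOutput:
    """Pure step: produces an in-memory output, makes no DB changes."""
    return CoverageOutput(care_id, round(amount_cents * 0.7))  # hypothetical 70% rule


def save(output: CoverageOutput) -> None:
    """The only code with real effects: lock, validate against fresh state, insert."""
    with locked_transaction() as staged:
        # Validate after locking, against a view no other process can be modifying.
        if any(row["care_id"] == output.care_id for row in DB):
            raise ValueError(f"coverage for {output.care_id} already saved")
        staged.append({"care_id": output.care_id,
                       "covered_cents": output.covered_cents})
```

Because `save` validates inside the lock and stages its writes atomically, a concurrent or repeated call either fails validation cleanly or commits in full, never half-way.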
Design Principle 2: Immutable, append-only state
What is it?
We don’t delete or update any records. We don’t lose any history of what has happened. Updated step results are saved with version numbers, the latest being the valid one. These versioned records turn our linear flow (remember: Source -> List of Cares -> Coverage Computation -> Payment) into a tree where the “live” branch is the one stemming from the latest versions.
Why it matters
It’s obviously helpful for auditing (Req 4): each record has an actor and a timestamp. Having only inserts also makes consistency (Req 5) and concurrency (Req 6) easier to manage, while versioned records let us handle mistakes and corrections (Req 3) without losing history.
How we do it
We simply add a version number to each record. We have a uniqueness constraint on this version to provide simple concurrency control when explicit locking is not needed. The “liveness” of any node of the tree can be derived from its position on the tree: you’re “alive” if you’re the latest child of your parent and your parent is “alive”. For performance reasons, we actually compute and store liveness on each record, making it one of the few mutable columns in our models.
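The derived form of liveness can be sketched as a recursive walk up the tree. The record shape and field names here are ours, not the actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    id: int
    parent_id: Optional[int]  # None for the root of the tree (the Source)
    version: int              # a DB uniqueness constraint would cover (parent_id, version)


def is_live(record: Record, records: list[Record]) -> bool:
    """You are live iff you are the latest child of your parent and your parent is live."""
    if record.parent_id is None:
        return True  # the root is always live
    siblings = [r for r in records if r.parent_id == record.parent_id]
    if record.version != max(s.version for s in siblings):
        return False  # superseded by a newer version
    parent = next(r for r in records if r.id == record.parent_id)
    return is_live(parent, records)
```

In production this walk would be too slow to run on every read, which is exactly why the post stores the computed liveness on each record instead.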
Design Principle 3: Safe to rerun
What is it?
Many times, we will run the Claim Engine on the same Care more than once. Either because multiple processes are running in parallel, or because we’re fixing some issue, or because an operator clicked on the wrong button, etc. It’s key that this be entirely safe to do. In a loose sense, we can call this “idempotency”: there should be no unwanted effects from running the Claim Engine multiple times. That means no duplicate emails, operator tasks or — especially — payments. But of course, if coverage amounts have changed, we should apply the new amounts.
Why it matters
It gives our operators (Req 2) and customer support agents confidence: they can trigger reruns without worrying about adverse consequences, because there are none: either nothing happens, or something is made more correct. The same goes for engineers when we reprocess Cares in batches: we don’t have to be too careful in selecting which Cares to reprocess, since it’s safe. This makes the system more resilient to mistakes (Req 3), because any bug or manual mistake can be fixed by simply reprocessing a sufficiently large batch of Cares. In fact, we still periodically reprocess every Care in the system to verify that we’re still covering the right amounts, and we monitor the output. This gives us great confidence in the system overall.
How we do it
We record all side effects (emails, payments) in the DB so that we know exactly what has happened and we avoid duplicates. Before sending API calls to email or payment services, it is essential to record — and commit — the intent in the database first. That way, if we fail to record an API response, we can, on the next run, ask the external service whether the API call succeeded. The external service needs to support these queries, which usually means supporting client-generated request IDs.
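A sketch of that intent-first pattern, assuming an in-memory dict standing in for the committed intent table and a hypothetical `gateway` object exposing `was_executed` and `transfer` (names invented for the example):

```python
import uuid

# Toy intent table; in production this row is committed before the API call.
payment_intents: dict[str, dict] = {}


def send_payment(care_id: str, amount_cents: int, gateway) -> None:
    intent = payment_intents.get(care_id)
    if intent is None:
        # Record (and, in production, commit) the intent with a
        # client-generated request ID *before* calling the external service.
        intent = {"request_id": str(uuid.uuid4()), "status": "pending"}
        payment_intents[care_id] = intent
    if intent["status"] == "sent":
        return  # rerun: nothing to do
    # After a crash we may hold an intent but no recorded response: ask the
    # gateway whether our request ID was already executed before retrying.
    if not gateway.was_executed(intent["request_id"]):
        gateway.transfer(intent["request_id"], amount_cents)
    intent["status"] = "sent"
```

Reusing the same request ID across retries is what lets the external service deduplicate on its side.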
We then leverage the immutability from Principle 2 to compare new results to old results, and apply only the differences. For example, a difference in covered amounts will lead to the delta being paid — or withheld from future payments, if negative.
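The delta logic reduces to a tiny pure function (amounts in cents; the function name and the three-outcome encoding are ours, for illustration):

```python
def settle(previous_paid_cents: int, new_covered_cents: int) -> tuple[str, int]:
    """Compare the new result to what was already paid and act on the delta only."""
    delta = new_covered_cents - previous_paid_cents
    if delta > 0:
        return ("pay", delta)    # transfer the missing amount to the member
    if delta < 0:
        return ("hold", -delta)  # withhold from future payments
    return ("noop", 0)           # identical rerun: no effect at all
```

The `noop` branch is what makes blanket reprocessing safe: a rerun that reaches the same result changes nothing.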
What about scalability?
We’re getting to the end of this post, and you may have noticed that we haven’t mentioned scalability yet. It’s obviously a very important topic: Alan’s member base is expanding rapidly, and the medical history of existing members keeps growing too. So we can’t afford to build a system that is unable to grow with our members. Fortunately, the domain of health data lends itself well to scaling: your health data rarely interacts with another member’s, so the processing can usually be parallelized, and the data sharded. We just need to watch out for the few non-linear routines we have — for example when we deduplicate Cares. We’re currently investing in monitoring all key components closely to spot bottlenecks quickly.
One thing, however, that we explicitly did not focus on was performance. The rules of health billing in France are so complex that we put a huge emphasis on keeping the logic as readable as possible, often at the expense of performance optimization. As we scale, we identify performance bottlenecks and address them as needed.
These principles have allowed us to develop, maintain and extend Alan’s Claim Engine over the past 2 years. We aim to automate more and more of it, which means handling more and more complex edge cases. Yet as long as we stick to the guiding design principles, we know that we can add complexity without jeopardizing the consistency of the system and that, when mistakes are made, they can be fixed quickly with minimal impact on our members.
If we’ve sparked your interest, we’re always hiring!