Building debug tooling for engineers and support teams alike

Published in Alan Product and Technical Blog · 5 min read · Mar 6, 2024

A collection of tools. Credits: Şahin Sezer Dinçer via Unsplash

At Alan, our core mission is to simplify the lives of our members. However, simplifying the external experience often means embracing complexity internally. Our tiered support organisation ensures that even the most complex issues find their way to our engineers for investigation. Given the intricate nature of our business domain, developing effective debug tooling is not just a convenience — it’s a necessity.

What’s complex here?

We talked previously about our Claim Engine. Put simply, this is the component responsible for taking in various incoming sources of data, processing them, and ultimately triggering the appropriate reimbursement to our members.

There are various elements of complexity here:

we deal with multiple sources of data to decide how we trigger a reimbursement; in particular, we need to know whether other actors have already reimbursed part of the care (Sécurité Sociale, Tiers Payant, the member themselves, and sometimes other complementary health insurers they may have)
we handle retroactive edits of our members’ situations: a member’s personal situation can change over time, so we need to take that into account when it affects their guarantees, even if we only learn about it 6 months after the fact
we consolidate reimbursements, because who likes having 17 different reimbursements under 3€ land in their bank account on the same day?
reimbursement rules can change over time, because our offers evolve, but also because of regulatory changes
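
To make the consolidation point concrete, here is a minimal sketch of grouping small reimbursements into one payment per member per day. The function name, the tuple layout and the sample amounts are assumptions made for illustration; they are not taken from the Claim Engine itself.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def consolidate(reimbursements):
    """Sum (member_id, day, amount) tuples into one payment per member per day."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for member_id, day, amount in reimbursements:
        totals[(member_id, day)] += amount
    return dict(totals)


# Hypothetical pending reimbursements for illustration only.
today = date(2024, 3, 6)
pending = [
    ("member-1", today, Decimal("2.50")),
    ("member-1", today, Decimal("1.20")),
    ("member-1", today, Decimal("0.80")),
    ("member-2", today, Decimal("12.00")),
]
payments = consolidate(pending)
```

In this sketch, the three small lines for `member-1` collapse into a single payment, while `member-2` keeps its own. A real implementation would also have to deal with the retroactive edits and rule changes listed above.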

The list could go on but you get the idea. When something breaks, you’d better be equipped with the best tooling to handle that complexity.

Everybody looks at the same tools

Our support escalation process is designed to handle everything from the simplest to the most complex cases. We designed our internal tooling so that all the people involved look at the same screens, share the same view of the world, and build a common understanding of the case at hand.

It always starts with our first level of support: the Care team. They’re the front line, and generally the only people our members will talk to. If a case cannot be resolved at their level, it is escalated to our Operations team, and then to Engineering for bugs or cases that are impossible to resolve without code.

The entry point for any investigation is our back-office, where we display a high-level, tailored view of every care event in a member’s history:

The “Care Events” tab of our main back-office tool

Advanced tooling for advanced cases

When the generalist back office is not enough, we expose more advanced tools to understand and manipulate the data. Here you cross the boundary of how the system is technically designed. Such tooling requires some training and knowledge to be used effectively, as it’s closer to how the data is organised in our database.

One example of this is what we call the “claims graph”:

The graph of objects behind a simple dental care reimbursement

This tool visualizes the complex web of objects behind claims processing, shown here for a rather simple dental care case. It is accessible not only to Engineering, but also to our Care team (those specialised in claims) and our Operations team. Seeing how the objects relate to each other helps everyone understand what happened during claims handling, which speeds up problem-solving and decision-making.

It supports going back in time, identifying which reimbursement decisions were made in the past, and getting more context on each step of the process to analyse what happened precisely.
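
As a rough illustration of what "going back in time" on such a graph can mean, here is a minimal sketch. The `Node` fields, the node kinds and the sample data are hypothetical and chosen for the example; the post does not describe the actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Node:
    """One object in the graph, e.g. a care event or a reimbursement decision."""
    node_id: str
    kind: str
    created_on: date
    parents: tuple = ()  # ids of the objects this one was derived from


def graph_as_of(nodes, as_of):
    """Return only the nodes that already existed on the given date."""
    return {n.node_id: n for n in nodes if n.created_on <= as_of}


def lineage(graph, node_id):
    """Follow parent links to list everything a node was derived from."""
    seen, queue = [], list(graph[node_id].parents)
    while queue:
        current = queue.pop(0)
        if current in graph and current not in seen:
            seen.append(current)
            queue.extend(graph[current].parents)
    return seen


# Hypothetical objects behind a dental care reimbursement.
nodes = [
    Node("care-1", "dental_care", date(2024, 1, 10)),
    Node("ss-1", "securite_sociale_payment", date(2024, 1, 15), ("care-1",)),
    Node("dec-1", "reimbursement_decision", date(2024, 1, 16), ("care-1", "ss-1")),
    Node("dec-2", "reimbursement_decision", date(2024, 2, 20), ("dec-1",)),
]

january_view = graph_as_of(nodes, date(2024, 1, 31))
history_of_dec_2 = lineage(graph_as_of(nodes, date(2024, 3, 1)), "dec-2")
```

Here `january_view` leaves out the later decision `dec-2`, and `history_of_dec_2` lists the earlier decision and the inputs it relied on.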

Bridging the gap of debugging

Yet sometimes this isn’t enough. Engineers need to be involved. There might be a really tricky edge case that needs an investigation, or simply a bug that needs to be confirmed and solved.

For such cases, we cross the boundary of implementation. Typically, we take the objects involved in the case under investigation and run the code again with debugging options on. That approach is easy on developers’ laptops, but not on production systems, where we avoid running untrusted code and don’t take any risk with the data.

This is why we’ve built a tool for engineers to “re-run the processing code” and directly get the associated execution logs in our back-office:

The tool also exposes tracing information, so the engineer can investigate which functions were called in the process and with which arguments, and ultimately follow the code execution in their IDE to understand where the problem happens:

This magic is made possible by the excellent Hunter code tracing library for Python. We decorate the code so it emits code traces when it runs, store them in a temporary stream, and return them to the reprocessing tool, along with the generated logs and some metadata.
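
The flow described above (emit traces, collect them in a temporary buffer, return them with logs and metadata) can be sketched as follows. This is a standard-library stand-in built on `sys.settrace`, not Hunter and not our actual tool; Hunter offers much richer filtering and output. The logger name, the toy processing functions and the shape of the returned report are all assumptions made for the example.

```python
import io
import logging
import sys
import time

logger = logging.getLogger("claim_engine_demo")  # hypothetical logger name


def compute_reimbursement(billed, already_paid, coverage_rate):
    """Toy processing step; not the real reimbursement rules."""
    remaining = billed - already_paid
    logger.debug("remaining after other payers: %s", remaining)
    return round(remaining * coverage_rate, 2)


def process_claim(claim):
    """Toy entry point that a re-run tool would call for a given case."""
    return compute_reimbursement(claim["billed"], claim["already_paid"], claim["rate"])


def rerun_with_traces(func, *args, **kwargs):
    """Run func once; return its result with captured traces, logs and metadata."""
    traces, log_stream = [], io.StringIO()
    source_file = func.__code__.co_filename

    def tracer(frame, event, arg):
        # Record only calls to functions defined in the same file as func.
        if event == "call" and frame.f_code.co_filename == source_file:
            call_args = ", ".join(f"{k}={v!r}" for k, v in frame.f_locals.items())
            traces.append(f"{frame.f_code.co_name}({call_args})")
        return None  # no per-line tracing inside the frame

    handler = logging.StreamHandler(log_stream)
    previous_level, previous_tracer = logger.level, sys.gettrace()
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)  # debug logging for this run only
    started = time.perf_counter()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always restore the previous tracer and logger configuration.
        sys.settrace(previous_tracer)
        logger.setLevel(previous_level)
        logger.removeHandler(handler)
    return {
        "result": result,
        "traces": traces,
        "logs": log_stream.getvalue().splitlines(),
        "metadata": {
            "entry_point": func.__name__,
            "duration_s": time.perf_counter() - started,
        },
    }


report = rerun_with_traces(
    process_claim, {"billed": 100.0, "already_paid": 60.0, "rate": 0.5}
)
```

The returned `report` bundles the recorded calls with their arguments and the debug log lines, which is the kind of payload a back-office page could render. Restoring the tracer and log level in `finally` keeps the extra instrumentation scoped to that single run.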

The benefits are numerous:

you structurally run the right version of the code, on the real data, so you don’t lose time replicating the production setup for debugging tasks of medium difficulty
the tool is generic enough that it can be applied to multiple stages of our claim engine; it could even be generalised to other parts of our back-office if needed
the approach is far safer than connecting to the production database with a SQL or Python shell, where you would be just one command away from erasing or mutating data incorrectly
it’s usable by any engineer, even new to the team, as they don’t have to guess what code to run with which arguments; everything is safe and already wired!

Conclusion

Overall our approach to tooling reflects our layered escalation process, and it has served us very well in the past years. We build generic enough yet powerful tools that help Care, Operations and Engineering resolve our members’ issues together, efficiently.

One nice side effect of fully sharing such tooling is that it makes it rather natural to envision evolutions, and to build simplified and more specialised tooling as we scale. When a very frequent use case is served well by an Engineering-focused debug tool, we can discuss it with Operations and Care on a concrete basis and evolve the tool into a simplified, more focused version, so that this kind of case can be handled one level earlier in the escalation ladder.

Of course, we often just eliminate the problem altogether with product evolutions. The best resolution tool is the one you don’t need because the resolution is baked into the product 😉