How we replaced a core system by testing in prod.

Nadège Michel
Alan Product and Technical Blog
7 min read · Sep 18, 2023

Co-authored by Julie Rossi

Photo by Laura Ockel on Unsplash

At Alan, we love being pragmatic and keeping things simple. In the beginning, our members were in the simplest of situations, so our Member Management system was simple as well.

As we grew, we had more members, and more products, offers, and services. This greatly increased the complexity of the member situations we had to manage.

Members can be covered by their company, by an individual contract, or as the partner of an employee, or they can decline their company coverage. They have access to different services. They leave Alan and can come back later. They can have two jobs, and thus access to two different health coverages…

The Member Management system evolved over time to match new needs. While it still worked, it was a bit of a mess: each team adding a new feature would bolt something onto the system, and for a long time there was no unified vision. We ended up with a house of cards held together by band-aids, trying to handle every edge case.

We accumulated tech debt. Adding new features to the system became more and more costly, it was too easy to make a mistake, and each mistake was in turn expensive: handling incidents takes a lot of time and erodes our members’ trust. So we decided it was time to rewrite the whole system.

The project goal was clear: remake this system while:

  • Matching the current behavior exactly, for both data modification and “side effects”, like emails sent.
  • Not producing incidents. We don’t want to make a mistake that would, for instance, end thousands of health coverages on the wrong date. That would mean a lot of disappointed members, and our Customer Support, Engineering, and Product teams spending valuable time managing the consequences.

The current system had an extensive test suite, so as long as the tests still passed with the new system, we’d be good. Right?

No. Because reality will always be more complex than what we could simulate in a test suite.

So how can we ensure our new system will behave as it should, in reality, before releasing it? We test it in prod, with a dry run mechanism.

How

We decided to implement a solution to dry run the new code in parallel with the old one and compare the results. This would allow us to make sure the new code behaved similarly to the old one for our most frequent cases. Once the new code had been dry run successfully against enough different cases, we could enable it in prod.

Now the question was: how do we compare results?

We’ve split the output into two categories: “data changes” and “side effects”.

Side effects are external or asynchronous tasks, for instance, settling a contract when we terminate it or sending an email when we cover a new employee.

Data changes are simple to “dry run”: since we use a transactional database, we can simply not commit the changes.
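As a minimal sketch, assuming a SQLAlchemy session (dry_run_transaction is our naming for illustration, not a library API), the dry run can happen inside a savepoint that is always rolled back:

from contextlib import contextmanager

@contextmanager
def dry_run_transaction(session):
    # Open a savepoint, let the dry-run code do its writes,
    # then always roll them back so nothing is ever committed.
    savepoint = session.begin_nested()
    try:
        yield
    finally:
        savepoint.rollback()

Everything the new stack writes inside this block disappears at the end, while staying queryable from within the same transaction, which matters for the comparison described below.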

But for side effects, it’s less trivial as you cannot undo sending an email.

We made the assumption that if:

  • the data changes are the same
  • we trigger the same side effect function with the same parameters

Then the side effect would be the same (e.g. same email sent).
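Putting it together, the dry-run flow looked roughly like the sketch below. All the names here (run_current_stack, run_new_stack, collect_data_changes, diff_data_changes, diff_side_effects, and the start/collected helpers around the recorder) are hypothetical stand-ins for our real helpers, detailed in the next sections.

def compare_stacks(instruction, session):
    # 1. Run the current stack for real and record what it did.
    SideEffectRecorder.start(dry_run=False)
    run_current_stack(instruction)
    current_changes = collect_data_changes(session)
    current_side_effects = SideEffectRecorder.collected()

    # 2. Dry run the new stack: record everything, then roll it all back.
    SideEffectRecorder.start(dry_run=True)
    with dry_run_transaction(session):
        run_new_stack(instruction)
        new_changes = collect_data_changes(session)
    new_side_effects = SideEffectRecorder.collected()

    # 3. Return the differences found in each category of output.
    return (
        diff_data_changes(current_changes, new_changes),
        diff_side_effects(current_side_effects, new_side_effects),
    )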

How we compared data changes

For legal and compliance reasons, we track all the changes made to our database. In practice, there is a table holding data-change information: which actor changed what.

Thanks to this mechanism we could retrieve what “the code being dry-run” modified in the current transaction, before rolling it back. We could also record what’s modified by the current system. We then compared the two sets of changes.
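As a sketch of what that retrieval can look like with SQLAlchemy: AuditLog, its columns, and current_transaction_id are hypothetical names standing in for our real audit mechanism.

def collect_data_changes(session):
    # Flush so pending writes (and their audit rows) are visible inside
    # the current, still-uncommitted transaction.
    session.flush()
    rows = (
        session.query(AuditLog)
        .filter(AuditLog.transaction_id == current_transaction_id(session))
        .all()
    )
    # Normalize into {(table, row_id): {attribute: new_value}} for comparison.
    return {(row.table_name, row.row_id): row.changed_columns for row in rows}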

But it doesn’t always make sense to compare every attribute of an object.

For instance:

  • Our tables usually have created_at and updated_at columns. Their values should not be compared.

Solution: We built a list of attributes that are ignored during comparison.

  • IDs of newly created objects won’t be the same. This also caused issues when comparing foreign key values.

Solution: We built a mapping between the IDs of objects created by the new stack and those created by the old one, then treated them as pointing to the same object.

And of course, every new rule we added to the comparison logic had an exception, because of business rules.

Example: JournalEvents is a feature we’re currently deprecating. We first decided to ignore all changes in the corresponding table. Later on, we learned our Data team was still relying on it for a specific event, so we made an exception to not ignore that one.
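A stripped-down version of that comparison logic might look like the sketch below; the rule tables and attribute names are illustrative, not our real ones.

# Attributes ignored during comparison ("*" means every attribute of the table).
IGNORED_ATTRIBUTES = {
    "*": {"created_at", "updated_at"},
    "journal_event": {"*"},
}
# Exceptions to the rules above: attributes compared even if a rule ignores them.
KEPT_ANYWAY = {("journal_event", "event_type")}

def is_ignored(table, attribute):
    if (table, attribute) in KEPT_ANYWAY:
        return False
    ignored = IGNORED_ATTRIBUTES.get(table, set()) | IGNORED_ATTRIBUTES["*"]
    return "*" in ignored or attribute in ignored

def diff_data_changes(current, new, id_mapping=None):
    # current / new: {(table, row_id): {attribute: new_value}}.
    # id_mapping maps IDs created by the new stack to the old stack's IDs
    # (foreign key values would need the same remapping, omitted here).
    id_mapping = id_mapping or {}
    remapped_new = {
        (table, id_mapping.get(row_id, row_id)): values
        for (table, row_id), values in new.items()
    }
    diffs = []
    for key in current.keys() | remapped_new.keys():
        table = key[0]
        old_values = {a: v for a, v in current.get(key, {}).items() if not is_ignored(table, a)}
        new_values = {a: v for a, v in remapped_new.get(key, {}).items() if not is_ignored(table, a)}
        if old_values != new_values:
            diffs.append((key, old_values, new_values))
    return diffs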

The comparison code ended up rather complex with:

  • the base framework for retrieving the data changes
  • the rules to ignore some changes
  • the exceptions to those rules.

As this code was temporary, only being there for the duration of the migration, we were a bit lax about code quality.

How we compared side effects

Our goal was to be as unintrusive as possible. We didn’t want to update the business code for the sole purpose of comparing the two versions.

To do so, we leveraged Python’s decorators. They allow adding some behavior to a function with no impact on the code, by wrapping it.

Decorators are function wrappers: they let us automatically run some code before a function, each time that function is called.

@SideEffectRecorder.record
def send_termination_email(user_id):
    ...

Each time the side effect is called, record adds the call to the list of side effects. In “dry run” mode, it stops there and does not run the wrapped function.

from functools import wraps
from typing import Callable, Dict, List, Tuple

class SideEffectRecorder:
    _is_dry_run: bool = False
    _side_effects: List[Tuple[str, Tuple, Dict]] = []

    @classmethod
    def record(cls, function: Callable) -> Callable:
        @wraps(function)
        def wrapper(*args, **kwargs):
            # Always record the call: the function's name and its arguments.
            cls._side_effects.append((function.__name__, args, kwargs))
            if not cls._is_dry_run:
                # Outside dry run, actually execute the side effect.
                return function(*args, **kwargs)
            return None

        return wrapper

This way, we built the list of side effects called by each version of the code and compared them.
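The comparison itself is then just a list comparison. A minimal sketch, assuming the two stacks may schedule the same side effects in a different order (whether ordering matters is a choice to make per use case):

from collections import Counter

def diff_side_effects(current, new):
    # current / new: lists of (function_name, args, kwargs) recorded by
    # SideEffectRecorder. Compared as multisets.
    def normalize(calls):
        return Counter((name, repr(args), repr(kwargs)) for name, args, kwargs in calls)

    current_counts, new_counts = normalize(current), normalize(new)
    missing_in_new = list((current_counts - new_counts).elements())
    extra_in_new = list((new_counts - current_counts).elements())
    return missing_in_new, extra_in_new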

With both side effects and data changes comparison, we had a good tool, able to spot unwanted differences between our two stacks.

So what do we do when we have a difference?

First, what’s in a “diff”? Enough context to help debug the problem: an object modified in one stack and not the other, or modified in both but differently.

An example, for data change diff:

[User Lifecycle Revamp] For instruction RemoveEmployee, employment is updated differently between the two stacks.
current_stack_instance_changes
employment 132250341 [update {'end_date': '2023-02-28', 'onboarding_status': 'started'}]
new_stack_instance_changes
employment 132250341 [update {'end_date': '2023-02-28'}]

We can see that the current stack doesn’t only update the end_date on the employment object: it also resets the onboarding status. Why? Well, we’d have to dig into the code to understand, and then fix the new stack to handle this piece of logic as well.

Unless, while digging and discussing with your product team, you realize this status shouldn’t actually be reset in this situation! In that case, you recognize that you don’t always want to match the current stack’s behavior exactly.

An example, for side effects diff:

[User Lifecycle Revamp] For instruction RemoveEmployee, send_termination_email is run in the current stack but not in the new stack.
current_stack_side_effect
{'user_id': 123, 'end_date': '2023-02-28', 'termination_reason': 'resignation'}

We learned it’s a powerful TDD tool…

The initial goal for this comparison framework was to avoid regressions. There are so many edge cases that we knew our tests were not covering them all.

Using it proved more powerful than that. It allowed us to understand the current system without going through all the convoluted code, especially as that very complexity was the reason we wanted to refactor it in the first place.

The existing automated test suite covered the most common cases. Before dry running in prod, we added the comparison of the two stacks inside those tests. It helped us build the first version of the features in TDD.
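Concretely, such a test could look like the sketch below, reusing the hypothetical compare_stacks helper from earlier; the db_session and employee fixtures and the RemoveEmployee constructor arguments are placeholders.

from datetime import date

def test_remove_employee_matches_current_stack(db_session, employee):
    instruction = RemoveEmployee(user_id=employee.id, end_date=date(2023, 2, 28))

    # Run both stacks (the new one in dry run) and collect the differences.
    data_diffs, side_effect_diffs = compare_stacks(instruction, db_session)

    # The new stack must produce exactly the same data changes and side effects.
    assert data_diffs == []
    assert side_effect_diffs == []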

We didn’t need to understand how each feature worked in detail. We could instead write a naive version and then use the tests to see where we were wrong.

We could be confident the new code would replicate the behavior of the existing one: we weren’t only relying on a few assertions in the tests but really comparing all the changes.

The next step was to push this first version to prod, in “dry run” only. This highlighted the differences coming from all the edge cases that still needed to be implemented correctly. Being able to ship a first version fast and get feedback on the missing edge cases is an efficient way to iterate.

Running this comparison for a few days or weeks covered many possible cases. After gradually fixing the issues, we were confident we could enable the new code.

… and we improved the product along the way

At first, we thought the differences raised would allow us to fix bugs introduced by the new code. But it turned out to be more complicated than that.

It revealed several bugs in the existing code.

We sometimes decided to fix the issue in the current stack. Sometimes we just ignored the difference. We knew we could ship this part of the new stack¹ soon, and fixing would have been too time-consuming.

We also discovered that some behaviors of our product weren’t intentional. They were the result of an implementation detail. This led to discussions with our product team to make intentional decisions.

What began as a purely technical migration evolved into bug fixes, the culling of old deprecated features, and new product decisions!

[1] feature flags for the win!
