What we learned from a large refactoring

Chaïmaa Kadaoui
Alan Product and Technical Blog
7 min read · Jan 12, 2022
Photo by Mathias Jensen on Unsplash

At some point in a product's lifetime comes the moment when software engineers have to break old assumptions and replace them with updated ones. At Alan, we are undertaking a large refactoring to update one of the oldest assumptions in our code base. As we near its end, here are the learnings we gathered along the way on how to achieve this kind of change.

Phases of a refactoring aiming at updating the data model

A bit of history

Alan is an insurance company: it provides health services to companies. In practice, this relationship between Alan and a company is represented by a Contract. Since Alan’s beginning in 2016, the product has been built on a core assumption: a company can have only a single contract at a given time. Back then, all our customers fit this hypothesis.

At the end of 2019, we started having a few companies that needed two contracts: they wanted to provide different insurance contracts to different subsets of employees. Since we didn’t have enough volume to justify a lengthy refactoring, we kept the single-contract relationship and handled these rare cases with a hack: creating two companies to represent a single “real” one.

Starting in 2021, the number of companies with two contracts greatly increased, and the technical debt from having two companies represent the same single real-life company became hard to maintain. At the same time, our sales team signed a deal with a large company that needed not two but eighteen different contracts. It was time to challenge our past hypothesis and move to a multiple-contract world.

Thus our journey began

An iterative process

Allowing companies to have multiple contracts was prioritised and placed on Alan’s product roadmap. A team was created for this purpose, and I joined it right after its creation.

First things first, we aligned on how to update our data model to move from a one-to-many relationship between Contract and Company to a many-to-many one. The relationship was already one-to-many (rather than one-to-one) because a company could have past, non-active contracts.

We introduced a new model, ContractPopulation, to link companies and contracts. In a short time, we managed to implement it and run it live in parallel with the old code: “multi-contract” support was enabled.
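To make the change concrete, here is a minimal sketch of the resulting data model, assuming a SQLAlchemy-style ORM; the table and column names, and the legacy foreign key kept alive during the migration, are illustrative simplifications rather than our actual schema.

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Company(Base):
    __tablename__ = "company"
    id = Column(Integer, primary_key=True)
    # Legacy one-to-many link: all of a company's (active and past) contracts.
    contracts = relationship("Contract", back_populates="company")
    # New paradigm: contracts are reached through ContractPopulation rows.
    contract_populations = relationship("ContractPopulation", back_populates="company")


class Contract(Base):
    __tablename__ = "contract"
    id = Column(Integer, primary_key=True)
    # Legacy foreign key, kept while the old code paths are being migrated.
    company_id = Column(Integer, ForeignKey("company.id"), nullable=True)
    company = relationship("Company", back_populates="contracts")
    contract_populations = relationship("ContractPopulation", back_populates="contract")


class ContractPopulation(Base):
    """Association model linking a subset of a company's employees to a contract."""
    __tablename__ = "contract_population"
    id = Column(Integer, primary_key=True)
    company_id = Column(Integer, ForeignKey("company.id"), nullable=False)
    contract_id = Column(Integer, ForeignKey("contract.id"), nullable=False)
    company = relationship("Company", back_populates="contract_populations")
    contract = relationship("Contract", back_populates="contract_populations")
```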

All we had left to do was to migrate all the code relying on the old paradigm. A refactor of this scale isn’t an easy task, for two kinds of reasons: impact and organisation.

Impact

Challenging the assumption in the data model was the easy part. The impact of this refactor is much larger and touches several dimensions:

  • In our codebases: the single-contract paradigm is deeply anchored. For example, in the backend, one of the legacy properties to deprecate was called 567 times, and the code built on top of it is itself called in many more places, and so on…
  • In the product: many features were introduced in our app, all built on the assumption that companies had a single contract. That assumption let us simplify the UX and shaped the design of the app.
  • In our mental model: the assumption is rooted in the minds of all Alaners (software engineers, product managers, designers, …): we were all used to thinking “single contract”, and our knowledge base was written on this assumption too.

Organization

Changing a paradigm takes time. Since we started this mission at the beginning of 2021, new features have kept appearing, most of them relying on a company’s contract. It was essentially a race between our migration and all the engineers working in the same part of the stack (more than 40 people).

Also, we chose to dedicate a team to this refactoring. The goal of having a single team working on the migration was to:

  • Centralise knowledge and decisions.
  • Move towards a uniform solution.
  • Make staffing easier.

However, our code base is large, and each area at Alan is owned by people who are specialists in that particular part of the codebase. Because of this, having a single team handle the updates brought additional challenges: each time we tackled an area, we needed some time to ramp up, so we could learn how multi-contract would impact it and avoid breaking anything. It also required some back and forth with knowledgeable people to validate our changes.

A journey full of learnings

We are nearing the end of this large refactoring: it took us a bit more than a year! Along the way, we gathered some learnings on how to handle a migration of this scale.

Involve impacted people early on

You should communicate as early as possible: to make everyone aware of the paradigm switch and to avoid introducing regressions. Two tips:

  1. Repetition is key: do not hesitate to over-communicate.
  2. Target an audience: people feel more involved when they are addressed directly.

You should include people from all disciplines involved in product development: not only engineers but also product managers and designers, so they take the change into account when working on new features.

For software engineers, as Emma said in her article on another type of refactoring, communication works best when coupled with comprehensive documentation on how to start using the new tools, and with linters to help them fall in the pit of success. For example, we added a new linting rule to prevent engineers from using a deprecated property. Its error message explains how to fix the issue and links to our documentation for more context.

Example of a linter
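The actual rule lives in our linter configuration, but the idea can be sketched with a small standalone check; the deprecated property name, the message and the documentation pointer below are purely illustrative.

```python
import ast
import sys

DEPRECATED_ATTRIBUTE = "current_contract"  # hypothetical legacy property
ERROR_MESSAGE = (
    "`{attr}` assumes a single contract per company; use contract populations "
    "instead. See the multi-contract migration guide in our documentation."
)


class DeprecatedAttributeChecker(ast.NodeVisitor):
    """Flags every access to the deprecated single-contract property."""

    def __init__(self, filename):
        self.filename = filename
        self.errors = []

    def visit_Attribute(self, node):
        if node.attr == DEPRECATED_ATTRIBUTE:
            self.errors.append(
                f"{self.filename}:{node.lineno}: " + ERROR_MESSAGE.format(attr=node.attr)
            )
        self.generic_visit(node)


def check(filename):
    with open(filename) as source:
        tree = ast.parse(source.read(), filename=filename)
    checker = DeprecatedAttributeChecker(filename)
    checker.visit(tree)
    return checker.errors


if __name__ == "__main__":
    errors = [error for name in sys.argv[1:] for error in check(name)]
    print("\n".join(errors))
    sys.exit(1 if errors else 0)
```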

Leverage monitoring tools

First, tools for observability: it is important to have a clear vision of where you are going. They cover various use cases:

  • To monitor the progress of your refactor, by observing the volume of calls to deprecated patterns.
  • To estimate usage of each part of the codebase so you can prioritise deprecations to fix.
  • To compare new implementations against old ones before doing the switch.

We use Datadog and built monitors and dashboards for each use case. For the last one, we used a Python library called laboratory, which allowed us to observe the behaviour of both implementations (the old one and the new one) and fix any inconsistency we found.
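For the first two use cases, the main ingredient is to emit a metric every time a deprecated code path runs, so the dashboards have something to graph. Here is a minimal sketch using the Datadog StatsD client; the metric name, tag and property are illustrative, not our actual instrumentation.

```python
import functools

from datadog import statsd


def count_deprecated_call(property_name):
    """Decorator emitting a StatsD counter each time a legacy accessor is used."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Each call shows up on a Datadog dashboard, so we can watch usage
            # of the deprecated pattern trend towards zero.
            statsd.increment(
                "contract_migration.deprecated_call",
                tags=[f"property:{property_name}"],
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Company:
    @property
    @count_deprecated_call("current_contract")
    def current_contract(self):
        # Legacy single-contract accessor, kept alive during the migration.
        ...
```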

Monitoring calls of patterns to deprecate
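For the comparison use case, laboratory runs the old and new code paths side by side, returns the old (control) result, and records whether the two observations match. Here is a minimal sketch, reusing the models from the earlier data-model sketch; the function names are illustrative, and mismatch reporting can be wired up by overriding Experiment.publish.

```python
import laboratory


def legacy_get_contracts(company):
    # Old code path: follows the legacy one-to-many link.
    return list(company.contracts)


def get_contracts_via_populations(company):
    # New code path: goes through ContractPopulation.
    return [population.contract for population in company.contract_populations]


def get_company_contracts(company):
    experiment = laboratory.Experiment()
    with experiment.control() as observation:
        observation.record(legacy_get_contracts(company))
    with experiment.candidate() as observation:
        observation.record(get_contracts_via_populations(company))
    # conduct() runs both blocks, compares the recorded observations and
    # returns the control value, so callers keep the old behaviour while we
    # collect evidence that the new implementation is equivalent.
    return experiment.conduct()
```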

Second, tools to catch regressions and any new usage of deprecated patterns. In our case, we had set up a few daily routines:

  • Reviewing pull requests impacting our area before they are merged — thanks to GitHub code owners.
  • Checking data consistency with SQL queries — using Metabase and its Pulse feature (see the sketch after this list).
  • Following errors and fixing them — with Sentry.
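In Metabase, these consistency checks are plain SQL; here is the same idea expressed against the models sketched earlier, assuming that simplified schema: it looks for contracts whose legacy foreign key has no matching ContractPopulation row.

```python
from sqlalchemy import select
from sqlalchemy.orm import Session

# Contract and ContractPopulation refer to the earlier data-model sketch.


def contracts_missing_a_population(session: Session):
    """Contracts whose legacy company_id has no matching ContractPopulation row."""
    query = (
        select(Contract.id)
        .outerjoin(
            ContractPopulation,
            (ContractPopulation.contract_id == Contract.id)
            & (ContractPopulation.company_id == Contract.company_id),
        )
        .where(ContractPopulation.id.is_(None))
    )
    return session.scalars(query).all()
```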

Define a clear strategy upfront

You should settle on a plan that defines your path throughout the project. You can use a top-down approach: you start by listing the features you need to change and use them to scope your work. Or, the other way around, a bottom-up approach: you start from the parts of the code you want to deprecate and move upwards. Both strategies have their pros and cons.

Two possible strategies

A feature-driven strategy helps you end up with a known, well-defined scope. Also, non-technical teammates (product managers, designers, …) can lead and contribute to the scoping. However, it’s easy to miss a spot (not all features of the product are documented). Plus, it can lead to inconsistencies: by tackling the code related to each feature in isolation, you may make local choices that are harder to reuse for other features.

On the other hand, a code-driven strategy is a global one, so it helps you reach a more consistent solution in the end. It’s also helpful for discovering hidden specs and constraints early on. Yet, when you’re deep in the code, it can be hard to evaluate the impact on features, and it’s easy to introduce bugs, especially when changing core properties that are used everywhere (that’s why monitoring tools are handy).

We chose to combine both. We split the work into two groups:

  • One to handle features and related changes.
  • A second one to create a unified API and make sure nothing is broken.

Looking ahead

During this year, we learned a lot:

  • How to optimise collaboration between engineers.
  • How to make sure everyone falls in the pit of success.
  • How to construct an optimal solution when so many features are involved.

We are nearing the end of our mission, yet we won’t have migrated 100% of our codebase. Hence, besides carrying out the migration, the hidden goal of our team is to make sure the new paradigm is understood by all and to leave behind the right tools for anyone to fix any missed spot.

Some questions remain open: was having one team handle the whole migration the best setup, versus spreading the load across teams of experts? And more generally, when is the right time to pause feature delivery and focus on paying down technical debt?
