The Big Refactor

Kev Jackson
Published in THG Tech Blog
12 min read · Jul 1, 2020


https://xkcd.com/844/

Refactoring, made famous as a technical term by Martin Fowler in the 1990s, has not usually been considered a particularly pleasant task, especially when working with legacy codebases.

There are many books on how to approach refactoring and working with legacy code, and they all contain valuable nuggets of information. However, each project is unique and there is no ‘one size fits all’ approach when you decide that your project’s code needs a bit (or a lot) of a spring clean.

The decision

The driving force for this particular refactor was the previous difficulty that the WMS team had experienced when attempting to split apart a monorepo.

The project had initially started with separate repositories for each of the components; however, a decision was made to combine these into a monorepo to speed up feature development and simplify the build process. This decision wasn’t taken lightly. At the time, the tech leads understood that it was a trade-off, particularly around the coupling between components that could now occur more easily.

Over time the team adapted to working with a monorepo and the trade-offs were forgotten. Indeed, any suggestion of splitting the monorepo back into separate component-based repositories was met with protests:

  • “With a single repository, I can work across multiple components quickly to deliver this feature”
  • “Searching for code usages (in the IDE) is simple within a monorepo, how will that work across multiple repos?”

In short, the team had adapted to the monorepo and adopted development practices that took advantage of its structure. This meant initial resistance to splitting the repository was quite high.

Around two years ago, however, a strategic decision was made to migrate the system from public cloud hosting to hybrid cloud hosting (a mix of public and private cloud technologies). This uncovered engineering challenges around the technologies in play in the WMS system, and around how certain components were coupled to both third-party libraries and other components in a way that prevented a trivial “lift-and-shift” approach.

Analysis

With the decision made to start splitting apart the large monorepo, the next step was to determine where to start. A period of analysis was conducted and the following questions were considered:

  • Which section(s) of the code could be sensibly sliced out of the repository and into a separate git repository?
  • Which section(s) of the code, if moved into a separate repository, would provide the most benefit in terms of accelerating the move for the rest of the code?
  • Which section(s) of the code would be trivial to move as a proof of concept?
  • What are the follow-on consequences of moving the chosen pieces of code?

In the end (as in most things) there was a trade-off to be had between the triviality of moving a chunk of code and the value of performing the move or refactor in the first place.

The team decided that the biggest value to be gained was refactoring a module of common API classes, interfaces and Java beans. Moving this into a separate repository would improve the readability of the code, allow better versioning of components, improve build times and act as a proof of concept for separating out the rest of the code.
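
For readers unfamiliar with this kind of module, here is a minimal, purely hypothetical sketch of the sort of code it contains. None of these names come from the real voyager-java-api; they simply illustrate the behaviour-free beans and service interfaces that every consumer ended up compiling against, whether it needed them or not.

```java
// Hypothetical sketch of the kind of code that lives in a shared API module.
// Package and class names are invented for illustration only.
package com.example.wms.api.picking;

import java.io.Serializable;

/** A typical transport bean: state only, no behaviour, shared by many components. */
public class PickInstruction implements Serializable {
    private static final long serialVersionUID = 1L;

    private String toteId;
    private String sku;
    private int quantity;

    public String getToteId() { return toteId; }
    public void setToteId(String toteId) { this.toteId = toteId; }

    public String getSku() { return sku; }
    public void setSku(String sku) { this.sku = sku; }

    public int getQuantity() { return quantity; }
    public void setQuantity(int quantity) { this.quantity = quantity; }
}

/** A typical shared service interface, implemented elsewhere in the monorepo. */
interface PickInstructionService {
    PickInstruction nextInstruction(String pickerId);
}
```

Because dozens of such packages shipped as one artefact, a consumer that only needed one small corner of the module still depended on all of it.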

A staged approach

Having decided which section of the codebase was to be under the surgeon’s knife, the next steps were planned out:

  1. Clean up build files to remove redundant dependencies (clearing the decks so to speak).
  2. Take the section of Java code and split it down such that each package becomes a separate micro-library, retaining the code inside the mono-repo but reducing the amount of code that each of the clients is actually consuming from the interfaces and classes library.
  3. Move the newly separated micro-libraries into separate repositories with their own versioning and build pipelines etc. Promoting them to first-class citizens within the WMS code.
  4. Adjust the consuming code to import the versioned micro-libraries instead of the source code in the mono-repo.
  5. Remove the redundant source code from the mono-repo (this is where the massive improvement in build times is realised).
  6. Celebrate 🍺 (before moving onto the next refactor of course!)

With an approach sketched out it was time to turn to the code under the microscope: voyager-java-api

Opportunities & Challenges

When undertaking a refactor of something so fundamental to a large enterprise system, there are many opportunities that present themselves along the way:

  • “We should take advantage of the refactoring to upgrade to the latest versions of our dependencies”
  • “We can tidy-up the consumers of this code as we’re already adjusting it”
  • “We can delete/rewrite these tests as they’re inefficient”

Any decision to go beyond the core goal of splitting the large module into smaller modules had to be carefully considered, and in almost all cases the choice was not to do any further cleanup than was necessary to complete the core refactor and ensure that the code still behaved correctly (you did start this process with good tests, didn’t you?).

One of the very few additional changes we made as part of this was removing commons-lang and replacing it with commons-lang3 across the entire project. This small change was deemed appropriate given that it fell under the scope of ‘removing redundant dependencies’.
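
For most call sites the swap was little more than an import change. A hedged illustration follows; the class below is invented, but StringUtils.isBlank genuinely exists in both libraries.

```java
// Migrating from commons-lang to commons-lang3 is mostly an import change,
// although a handful of APIs did move or disappear between the two versions.
// import org.apache.commons.lang.StringUtils;  // old: commons-lang 2.x
import org.apache.commons.lang3.StringUtils;    // new: commons-lang3

public class SkuValidator {  // illustrative class, not from the real codebase
    public static boolean isValidSku(String sku) {
        // isBlank behaves the same way in both versions
        return !StringUtils.isBlank(sku) && sku.length() <= 32;
    }
}
```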

“Round and round we go”

Circular dependencies between classes in two packages

On the other hand, one of the biggest challenges uncovered during this exercise was that some classes in separate packages within the module formed circular dependencies when split into separate modules. The resolution was one of two approaches:

1. Move the troublesome class, creating a single direction of dependency

Move one of the offending classes across to the other package, making one package a dependency of the other.

2. Create a sub-package to hold the dependency that both other packages rely on

Or, make a sub-package that contains the dependencies that both consuming packages rely on.
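
To make the two options concrete, here is a minimal sketch with invented package and class names (the real packages in voyager-java-api were of course different):

```java
// The problem: two packages that reference each other cannot become two modules.
//
//   package com.example.wms.api.orders;
//   public class Order {
//       private com.example.wms.api.shipping.Address deliveryAddress;  // orders -> shipping
//   }
//
//   package com.example.wms.api.shipping;
//   public class Shipment {
//       private com.example.wms.api.orders.Order order;                // shipping -> orders
//   }
//
// Option 1: move Address into the orders package, so only shipping -> orders remains.
// Option 2 (shown below): pull the shared type into a common sub-package; orders no
// longer depends on shipping, so the graph (orders -> common, shipping -> common,
// shipping -> orders) has no cycle.
package com.example.wms.api.common;

public class Address {
    private final String line1;
    private final String postcode;

    public Address(String line1, String postcode) {
        this.line1 = line1;
        this.postcode = postcode;
    }

    public String getLine1() { return line1; }
    public String getPostcode() { return postcode; }
}
```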

The starting position including some ‘questionable’ package names

After a lot of mechanical moving of packages and classes (some automated refactoring within the IDE, although I found that more often than not this would miss some classes or over-zealously rename others), we achieved the following output:

The new list of “micro-libraries” — before moving out of monorepo

Consequences of a large-scale change

This isn’t the first huge refactor I’ve undertaken within this codebase or others. On each previous occasion, there was a major challenge in keeping a long-lived branch (containing the refactored code) up to date with the mainline branch — in our case master.

In the past we’ve had middling success with regular merges from master into the refactoring branch. Sometimes this is a smooth process, but more often than not there are conflicts. The difficulty of merging increases with the rate of change of the underlying code being refactored and with how long the branch remains open.

As a precaution I set a soft deadline to try to complete the initial step within a month. (I missed this by a week in terms of getting the changes into production, but I’m pretty happy with completing the work within 20% of the original estimate as I’m usually terrible at estimating).

I was also fairly confident that the code I would be refactoring was stable and not subject to much change by the rest of the team, since we had previously conducted some analysis on which modules suffered the most churn.

Still, when it came to opening the pull request, it was startling how much had actually changed:

hmmm

Needless to say the rest of the team were ‘supportive’ of the idea of the change itself, but somewhat dismayed at the prospect of the code review…

To make this more palatable to the team, I had split the work into discrete units which could be cherry-picked into master directly if needed. However, after about a third of the modules had been refactored out, the rest were so inter-related and entwined that there was no option but to move them all in one giant set of refactorings.

Luckily we had the prerequisite tests (unit, integration and functional) to provide a very large and very much needed safety net for the entire operation.

Gotchas

Overall the process went about as smoothly as could be imagined for such a core change in a large scale system. After the initial move and fix of compiler errors, the suite of regression tests caught almost all the issues introduced with the refactor — all except one.

At the start of the process I had attempted to use the built-in refactoring tooling in IntelliJ IDEA to move and rename the code from one module to another. However, after it decided that names in strings in README and other text files were also relevant, I chose to stop trusting the automated refactoring and slog through the move manually. Unfortunately, I hadn’t cancelled one automated rename correctly, leaving some template files with changed strings. These were lost in the giant heap of code and subsequently caused a failure when the code was deployed to the staging environment (one of the more obvious dangers of such a large pull request).
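
The reason this class of mistake slips past both the compiler and the unit tests is worth spelling out: class names embedded in templates or other text files are only resolved at runtime. The sketch below is illustrative only (the real template machinery in the WMS is not shown), but it captures the failure mode.

```java
// Illustrative only: a class referenced by name from a template or config file
// is resolved at runtime, so a botched rename inside a text file compiles cleanly
// and only fails when the template is actually rendered.
public final class TemplateBinding {

    public static Object loadHelper(String classNameFromTemplate) throws Exception {
        // e.g. the template contains the string "com.example.wms.web.PickSummaryHelper";
        // rename the class (or half-rename the string) and this line is the first
        // place anything notices -- typically in a deployed environment.
        return Class.forName(classNameFromTemplate)
                    .getDeclaredConstructor()
                    .newInstance();
    }
}
```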

Another issue was the rename of a core class that was also bound to the data model (something I personally detest and think should be excised from all projects). This led to integration tests working correctly when provided with an empty database, but failing when working with a pre-seeded database. After discussing the change with the owning team, we came to a resolution.
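
As a hedged illustration of how this kind of coupling produces exactly that symptom, consider a JPA-style mapping where the table name is derived from the class name. The names below are invented and the real WMS mapping may use a different technology entirely.

```java
// Hypothetical example of a class name leaking into the data model.
// With no explicit @Table annotation, the table name is derived from the entity name,
// so renaming the class silently changes where the data is expected to live.
import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
public class PickTask {   // renamed from, say, "Task" during a refactor
    @Id
    private Long id;

    private String status;

    protected PickTask() { }          // no-arg constructor required by JPA

    public Long getId() { return id; }
    public String getStatus() { return status; }
}

// Against an empty database (with schema auto-creation enabled for tests) the schema is
// generated fresh for "PickTask" and everything passes; against a pre-seeded database the
// existing "Task" data is never found -- the "works empty, fails pre-seeded" behaviour
// described above.
```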

So that was it; more than 40k changes and only two bugs slipped through our testing net to get to the staging environment. One was caused by human error when relying on an over-zealous automated refactoring tool (my bad), the other arguably an issue of tightly coupling a domain model to a data model.

(or so we thought)…

Results & Thoughts

After a chunk of work like this, I like to take a step back and see what worked, what didn’t, and what I can do to improve.

First the good stuff

The fact that I was able to undertake such a structural refactor of a mission-critical system at all was due to the number of regression tests present. Without these, the risks associated with the change would have outweighed any benefits.

The process of refactoring this code started off fairly simply, and the initial idea of small, discrete commits that could be cherry-picked back into master as whole units of work was something I was happy with and would have liked to use for the entire process.

The review process worked as well as could be expected with such a large-scale change. Separate parts of the pull request were reviewed by subject matter experts and the group consensus was that the change was fundamentally good.

A virtual meeting was held where the changes were discussed amongst the team and questions asked about specific parts of the refactor and the risks involved. Getting the wider dev team involved to understand the risks and how the change would impact their work was a good decision as it meant that when the code was merged there were no surprises for the team.

Now the problems

Obviously a structural change of such magnitude (despite the triviality of the changes in any single area) is bound to carry risks.

In this case the risks were associated with two main areas — human error when using the automated refactoring in IDEA and state in flight during deployment of new code.

The first issue was quickly discovered on the staging environment where certain templates wouldn’t load as injected classes were renamed as part of the refactor. This rename was unintentional and was caused by the “refactor all occurrences” mechanism which I had mistakenly accepted when I should have limited it to class references only.

The second issue is something we were all cognisant of, as it was raised in the virtual meeting: is the change going to impact any of the state in the caches? Despite asking the right question and trying to check for any impact on data at rest and data in flight, we couldn’t see anything that would cause issues, and our staging environment had shown that the changes were good.

A gradual rollout

Despite our confidence in the changes, we still decided that the safest path was to deploy the new version only to a smaller installation, where we could monitor the behaviour closely and build confidence around the questions about data in the caches. This small installation showed no ill effects from the changes, and after a week of running the latest version we were convinced it was safe to proceed to the other warehouses.

Each warehouse software deployment proceeded as planned and team confidence in the code was high — until the final (and riskiest) warehouse. This last warehouse had a subtle difference: a concept of a “liverack”, a gravity-assisted roller shelf fed from an automated storage and retrieval system — no other warehouse had quite the same setup. This one area of code had a change that involved renaming a class. Crucially, this code is never exercised in any other warehouse, leading to a false sense of security over its correctness.

Needless to say, this was found by operations staff immediately after the release to the final warehouse. Unfortunately, as this wasn’t a bug in the code so much as an issue with the way that data in one particular cache behaved, rolling back to the previous version of the code wasn’t an effective solution to the problem.

Luckily, we had already built a tool to migrate data between caches to facilitate moving entire warehouse systems from public cloud to private cloud, so copying one cache within the same deployment environment was relatively quick to do.

After determining what was causing the issue, the fix was effected via the tooling and we finally had this changed code running across all warehouses.

Process wins

  • Keeping the whole team engaged with the decisions during the refactoring.
  • Ensuring that the refactored branch didn’t drift too far from master (and not maintaining a branch longer than needed).
  • Ensuring the tests in place were adequate to give confidence in the process before starting.
  • Gradual rollout and monitoring.

Room for improvement

  • Despite the amount of testing, one issue was found in staging and one fairly large bug found its way to production — we need to invest some time in safety nets around the behaviour of the caches.
  • Breaking the refactor into more steps would have enabled a more gradual rollout of the code.
  • Smaller PRs are obviously easier to code review — although in this case the entwined nature of the starting code meant that this was always going to be tricky to manage well.

Next steps

The astute amongst you will have noticed that, at the point the code was deployed across all warehouses, we were only partly finished.

With any large-scale change there is the inevitable tidy-up and this one is no exception. We still need to move the new modules out of the main git repo into their own repo. We also need to adjust the main repository code to consume the jar files that are the build artefacts from this new repository.

These final steps are the least risky parts of this particular change and now the path has been cleared, it’s simple enough to parallelise this work across the entire team over the next few weeks.

We’re recruiting

Find out about the exciting opportunities at THG here: thg.com/careers
