Wrangling a big ball of mud

Rudy
StashAway Engineering
11 min read · Dec 27, 2021


Back when StashAway first got started, the tech team was broadly split into either “frontend” (consumer tech) or “backend” (trading). Consumer tech was in charge of anything directly facing end users: the mobile app, web app, onboarding, account lifecycle and so on. Trading handled all things financial, such as placing trade orders, rebalancing, and generating statements.

We have come a long way since then. This is a story about the evolution of a codebase, and how shifting priorities from start-up to scale-up shape the emergent properties of code. So, go grab a cuppa and enjoy! ☕️

November 2016

The first lines of code are committed to the “app-server”. This is the monolith that a handful of devs will be working on furiously in an attempt to build Singapore’s first MAS-licensed robo-advisory platform.

The pressure is high, timelines are tight, keyboards are clacking for 16 hours a day and the features are piling in. The devs working on the code are a tight-knit, colocated team maintaining high-bandwidth communication and fast feedback loops. The web (React) app is being developed simultaneously with equal fervour, and all requests are designed to hit the app-server, which contains every feature end users need to interact with in the pilot.

The top priority is time to market, so that the common Singaporean can invest passively via dollar cost averaging instead of paying an arm and a leg in management fees to banks. 💸

July 2017

StashAway officially launches in Singapore as the first MAS-licensed robo-advisor.

Since then, we have raised over 52 million USD in funding (latest round was series D), achieved over 1 billion USD in assets under management (Jan 2021), launched in a total of 5 regions and the engineering team size has grown from a handful in Singapore to a globally distributed team of over 55 developers.

What does any of this have to do with the evolution of a codebase, you ask? It is imperative to understand the historical context within which code was authored. That context dictates the trade-offs made within the constraints of the time and, as a result, the emergent properties of the codebase. More on this later.

Fast forward to Feb 2021, the “app-server”

With over 50 contributors, close to 6.5k commits and almost 2,000 files of code, the app-server, our trusty consumer-facing workhorse, was experiencing some growing pains.

We conducted a (socially distanced) tech leadership offsite in Feb 2021 and zeroed in on these challenges. Starting in March 2021, we got to work addressing these gaps. A small team of devs was performing surgery on the codebase while other squads continued to add to it. This required careful planning, but we managed to emerge with only minor scuffs.

The challenges

The challenges were primarily around:

  1. Scaling
  2. Code structure
  3. Enforcing strong architectural boundaries
  4. Historical data modelling
  5. Owning multiple domains

Scaling

Scale here does not refer to traditional resource consumption like CPU and memory. The scale here is a function of a rapidly growing engineering team and how we all had to work with the app-server in some way.

Despite having µservices, teams were still coupled with the large historical monolith

As we grew, new squads strove to work in a decoupled manner by building their own services, but could not fully step away from contributing to the app-server since a lot of core business capability lived there. This caused challenges for several reasons:

  • with multiple squads contributing to the same service, when there is a breakage in production, it’s hard to quickly diagnose which changes broke the service and which squad needs to take action. This fractured the ownership model and could have easily spiralled into “everyone owns it, hence nobody owns it”
  • one of the major benefits of a µservices architecture is independent deployability, which could be hampered in situations where changes cross squad lines and those changes are coupled to the app-server

Code structure

When a tight-knit group of programmers is rushing to hit “release”, there is less concern about the structure of the codebase: every team member knows where to find things, and restructuring would only take time, slow the team down, and possibly even introduce regressions.

Hence, back in Feb 2021, the code was packaged by layer: the top-level directories represented the layers of the software, and each subfolder was a domain within that layer.

Package by layer
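
To illustrate (with hypothetical directory names, not our actual tree), a package-by-layer layout looks something like this:

```
src/
├── controllers/
│   ├── deposits/
│   ├── portfolio/
│   └── workplace/
├── services/
│   ├── deposits/
│   ├── portfolio/
│   └── workplace/
└── repositories/
    ├── deposits/
    ├── portfolio/
    └── workplace/
```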

This becomes tricky to navigate as the team size grows:

  • e.g. a squad owning the “workplace” domain has to operate across the entire breadth of the codebase’s file system to make changes across layers
  • a new dev joining the org has no clear view of what the codebase does and which domains it encompasses. Uncle Bob calls this trait “Screaming Architecture”
  • even if we wanted to consider slicing individual services out, the progressive way of making such a change would involve first isolating the domains to reduce the blast radius. To paraphrase Kent Beck, “make the change easy, then make the easy change”

Enforcing strong architectural boundaries

This topic probably warrants an entire series of blog posts by itself, but at a high level it refers to practices and developer discipline such as:

  • drawing clear architectural boundaries in code between layers
  • enforcing principles like inversion of control (dependency rule), loose coupling, high cohesion, information hiding (encapsulation) and separation of concerns (single responsibility principle); see the sketch after this list for the dependency rule in action
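
To make the dependency rule concrete, here is a minimal TypeScript sketch with entirely hypothetical names: the domain layer owns the abstraction, and infrastructure code depends on the domain rather than the other way around.

```typescript
// Hypothetical names for illustration; not code from the app-server.

// domain/portfolio.ts: pure business logic, no framework or DB imports.
export interface Portfolio {
  id: string;
  holdings: { symbol: string; units: number }[];
}

// The domain owns this abstraction (inversion of control).
export interface PortfolioRepository {
  findById(id: string): Promise<Portfolio | null>;
}

export class RebalanceService {
  constructor(private readonly repo: PortfolioRepository) {}

  async totalUnits(portfolioId: string): Promise<number> {
    const portfolio = await this.repo.findById(portfolioId);
    if (!portfolio) throw new Error(`Portfolio ${portfolioId} not found`);
    return portfolio.holdings.reduce((sum, h) => sum + h.units, 0);
  }
}

// infrastructure/inMemoryPortfolioRepository.ts: implements the domain's
// interface, so the dependency arrow points inward, never outward.
export class InMemoryPortfolioRepository implements PortfolioRepository {
  constructor(private readonly rows: Map<string, Portfolio>) {}

  async findById(id: string): Promise<Portfolio | null> {
    return this.rows.get(id) ?? null;
  }
}
```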

It is a topic with plenty of great blog posts and must-read books dedicated to it, so I will not attempt to explain these principles here for fear of the gross injustice that may ensue. Just know that we are standing on the shoulders of giants, and this information is readily available for developers wanting to level up.

The challenge is essentially self-explanatory: the evolving codebase chose speed over some of these principles, and we arrived at the proverbial “big ball of mud”. As Michael Stahl elegantly put it:

When you have a “big ball of mud,” you reach for the banana, and get the entire gorilla. — Michael Stahl

Historical data modelling

In an attempt to run as fast as we could in the early days, certain decisions and trade-offs were made around using an ORM, which allowed us to go to market sooner. But this also introduced tight coupling between our core domain entities and the database. Today, we have new features in the pipeline that were previously blocked because of this tight coupling. To introduce more agility, we were going to have to build new entities in the core of the system that could evolve independently of the shape of the data stored in the DB.

Owning multiple domains

The obvious elephant in the room: the app-server owns too many domains. If the squad ownership model is to take on specific domains within the app-server, then we are probably better off letting those domains run independently, without squads having to worry about conflicting changes mingling with their code.

This of course comes with trade-offs (we love talking about trade-offs internally): what is today a method call within the service call stack becomes a network call, which brings with it fun things like handling failure modes, latencies, retries etc. It also raises the question of whether the effort for this transition is commensurate with the value it will generate.
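
As an illustration of what that means in practice, here is a sketch of the kind of retry wrapper a caller suddenly needs once an in-process method call becomes a network call. The service URL and helper name are made up for this post:

```typescript
// Illustrative only: a naive retry helper with exponential backoff.
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 3,
  backoffMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err; // remember the failure and back off before retrying
      await new Promise((resolve) => setTimeout(resolve, backoffMs * 2 ** i));
    }
  }
  throw lastError;
}

// What used to be a method call like `depositsService.getBalance(userId)`
// now has to contend with failure modes, latencies and retries:
// const balance = await withRetry(() =>
//   fetch(`https://deposits.example.internal/balance/${userId}`)
//     .then((res) => res.json()),
// );
```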

The Solutions

Code structure

This was the first hurdle we decided to tackle: refactor the codebase and restructure the folders from “package by layer” to “package by domain”.

Transition from “package by layer” to “package by domain”

This approach is what Sam Newman calls the “modular monolith”. This way, the top-level directories “scream” the behaviour and core domains, while the subfolders are the individual layers that pertain specifically to each domain.
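
Using the same hypothetical directories as before, the restructured tree flips the hierarchy so each domain owns its own layers:

```
src/
├── deposits/
│   ├── controllers/
│   ├── services/
│   └── repositories/
├── portfolio/
│   ├── controllers/
│   ├── services/
│   └── repositories/
└── workplace/
    ├── controllers/
    ├── services/
    └── repositories/
```

Now a squad owning “workplace” works almost entirely within a single top-level directory.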

This approach would also mitigate some of the pain of multiple squads having to work on the app-server by isolating the changes they need to make within their domain of ownership. Since day 0, the team has always maintained high test coverage regardless of speed, so we had high confidence that the changes would not cause regressions. 🎉

Enforcing strong architectural boundaries

Besides the modularisation effort, we started encouraging developers to follow some architectural guidelines when adding new code to the repo.

To facilitate that, using the strangler fig pattern and feature toggles, we rewrote one core domain using the desired architectural principles. This demonstrates that it is possible to follow clean coding principles regardless of domain, while also giving other devs a template and reference implementation of the new code conventions we expect. In addition, we paired with devs working on new code to help immerse them in the new architecture, both to gather feedback from them and to start building the flywheel effect so they could do the same with other new devs.
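
A minimal sketch of how such a toggle might route traffic during a strangler fig rewrite; the toggle client, flag name and domain are all hypothetical:

```typescript
// Hypothetical interfaces; the real rollout was more gradual than this.
interface Toggles {
  isEnabled(flag: string, userId: string): boolean;
}

interface StatementGenerator {
  generate(userId: string): Promise<string>;
}

// Strangler fig: the rewritten implementation handles toggled-on users,
// while the legacy code keeps serving everyone else until it is strangled.
function makeStatementGenerator(
  toggles: Toggles,
  legacy: StatementGenerator,
  rewritten: StatementGenerator,
): StatementGenerator {
  return {
    generate(userId: string): Promise<string> {
      return toggles.isEnabled('new-statements', userId)
        ? rewritten.generate(userId)
        : legacy.generate(userId);
    },
  };
}
```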

We are even working on enforcing these architectural boundaries using fitness functions in the CI pipelines so that if devs break these boundaries in new code, the build will fail! 💥

Architectural boundaries and dependency rule being enforced
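
To give a flavour of what such a fitness function could look like (a toy version, not our actual CI check), the sketch below fails the build whenever a file in one domain reaches into a sibling domain via a relative import:

```typescript
// Toy fitness function: scan each domain's files for relative imports that
// reach into a sibling domain, and exit non-zero so CI fails the build.
import * as fs from 'fs';
import * as path from 'path';

function walk(dir: string): string[] {
  return fs.readdirSync(dir).flatMap((entry) => {
    const full = path.join(dir, entry);
    return fs.statSync(full).isDirectory() ? walk(full) : [full];
  });
}

const SRC = 'src';
const domains = fs
  .readdirSync(SRC)
  .filter((d) => fs.statSync(path.join(SRC, d)).isDirectory());

let violations = 0;
for (const domain of domains) {
  for (const file of walk(path.join(SRC, domain))) {
    const source = fs.readFileSync(file, 'utf8');
    for (const other of domains.filter((d) => d !== domain)) {
      if (new RegExp(`from ['"]\\.\\./${other}/`).test(source)) {
        console.error(`${file} imports from the "${other}" domain`);
        violations += 1;
      }
    }
  }
}

if (violations > 0) process.exit(1); // 💥 break the build
```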

Historical data modelling

To overcome the tight coupling between the core entities and the database caused by the ORM, the new reference architecture stopped using the ORM. We now have full flexibility and freedom to evolve the core business entities independently of the database modelling. New features are now unlocked at the small cost of having to write and maintain a thin repository layer, which fetches exactly the data needed to fulfil the evolving shape of the core business entities.
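
A sketch of what such a thin repository layer can look like, with hypothetical table and entity names: the row shape and the business entity are mapped by hand, so each can evolve on its own.

```typescript
// Hypothetical names for illustration.

// Shape of the data as stored in the DB.
interface CustomerRow {
  id: string;
  first_name: string;
  last_name: string;
  kyc_status: string;
}

// Core business entity, free to diverge from the row shape.
class Customer {
  constructor(
    readonly id: string,
    readonly fullName: string,
    readonly isVerified: boolean,
  ) {}
}

class CustomerRepository {
  constructor(
    private readonly query: (sql: string, params: unknown[]) => Promise<CustomerRow[]>,
  ) {}

  async findById(id: string): Promise<Customer | null> {
    // Fetch exactly the columns the entity needs; no ORM in the middle.
    const rows = await this.query(
      'SELECT id, first_name, last_name, kyc_status FROM customers WHERE id = $1',
      [id],
    );
    if (rows.length === 0) return null;
    const row = rows[0];
    return new Customer(
      row.id,
      `${row.first_name} ${row.last_name}`,
      row.kyc_status === 'VERIFIED',
    );
  }
}
```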

Owning multiple domains

The costs associated with this are partially mitigated by the modularisation effort. As for actually splitting some of these core domains out to individual services, we are going to revisit the topic internally in 2022.

Fixing broken windows

When working with a codebase that has some age, it is very common to find evidence of bit rot. The basal cost of a software system is very real and, if not kept in check, can run up a serious amount of debt over time. Some examples:

  • unused old libraries, which we cleaned up via tree shaking
  • outdated libraries, which we patched and updated to set a new baseline
  • test reports that were generated per region but, despite good test coverage, were never aggregated into a coverage report for the entire codebase
  • missing lightweight architectural decision records (we now use adr-tools)
  • slow pipelines (we improved their speed by ~2x for faster feedback)
  • comments like “need to clean this up” which were committed 3 years ago (LeBlanc’s Law: “Later equals never”)

The Pragmatic Programmer refers to these symptoms via the broken window theory. I like to remind developers I work with, from time to time, of the Boy Scout rule:

Always leave the campground cleaner than you found it

Improve DX

In addition, we decided to modernise the codebase a little by introducing TypeScript for new code, while leaving the vast majority of the older ES6 code untouched for now.

We even automated the entire process of bootstrapping a new developer’s machine via a single script, using tools like asdf, homebrew and direnv, to name a few.

A lot of these changes were done along the way via housekeeping tickets sprinkled into each iteration, and we even dedicated an entire sprint to custodianship of the repo for heavier lifts like upgrading libraries with breaking changes.

A balancing act

For all the changes mentioned above, we constantly red team each other’s decisions, both to avoid working in an echo chamber and to validate that the effort we put in will have proportional returns in value.

After all, on one extreme, you could argue that this is battle tested code working in production for years and hence does not need fixing. 🤨

A rising tide lifts all boats

One advantage of having a service like the app-server, with its large circle of influence over the dev squads, is that we can leverage the changes being made to drive education campaigns. We have used the new template architecture to run sharing sessions and influence other squads to adopt patterns of thinking about layers in software and the interactions between those layers.

Reflection

Somewhere along the way, we seem to have drifted past the sweet spot of running fast while maintaining good design in the app-server. We have now taken adequate steps to correct that.

Is this bulk cost later in the cycle better or worse than having absorbed it incrementally along the way? It’s impossible to know.

What I do know is that upfront decisions about trade-offs were carefully calibrated to leave options on the table for the future. A great example of this is the emphasis on maintaining a good testing culture within the engineering department.

Yet another critical factor is the technical leadership’s ability to articulate the value of such work and their willingness to empower the team by prioritising it instead of just paying lip service to the cause. By deliberately assigning headcount and agency to improve the situation, the medium-to-long-term outcome is a safe and trust-driven culture within the engineering department.

So instead of dwelling in the past about what could have been, look at strategies and techniques that can be applied now onward to fix broken windows and lead by example. 💪

2022 and beyond!

In 2022, we hope to revisit the topic of splitting out some specific domains from the app-server and the exciting challenges that would entail.

As an aside, the trading side of the aisle (which I only briefly mentioned at the top of the article) has some even more exciting work in the pipeline.

We are always looking for talented engineers to join us on our journey and we even offer fully remote positions! Check out our careers page. 😉
