Data Driven (Technical) Debt Reduction

Remitly
Feb 3, 2022

Author: Ryan Burrows

In software development, we often use the term technical debt as a metaphor for shortcuts taken to get the product more quickly to a point where it can begin solving real problems. Like financial debt, it can be a valuable tool: borrow some development effort from the future to start earning today, with a plan to repay that effort (plus interest!) at a later time. Most developers I’ve worked with share a similar understanding of the concept, but the important, yet often unanswered, question is: when should it be paid back? Our focus at Remitly is on creating peace of mind for our customers; although we may take on technical debt to act more quickly on their behalf, how do we ensure these tradeoffs don’t ultimately come back to undermine that goal?

In the past ten years, Remitly has grown from a handful of developers working on a single codebase to dozens of squads, each responsible for building and maintaining their own portion of the product. We purposefully keep our squads small and flexible so that they can be organized to meet the ever-evolving needs of our customers. We also know that software systems tend to mirror the communication structure of the organization that builds them (Conway’s Law). Goals or focus may shift as the business grows, at which point squads may be reorganized or split, system ownership may change, and knowledge of existing technical debt or planned designs may be lost in the process. In these scenarios, a data-driven approach enables teams to reflect and prioritize paying down this debt based on accurate, recent data, not only historical context.

At the beginning of this year, our team was established to address this topic for some of Remitly’s critical but aging monolithic systems. After preliminary analysis, we identified dozens of distinct business domains represented in these systems and set out to start decoupling the most impactful ones. There are many ways to use data to assess technical debt. In line with our value of customer centricity and our focus on creating peace of mind for our customers, we built a risk-based model around negative customer impact: both how likely it is to occur and how much of it has actually occurred.

Measuring the risk of technical debt

To decide where to focus our efforts, we first needed to figure out which of these domains to target. We identified two distinct categories to model in our data:

  1. Coupling: Highly coupled systems are more error prone, as changes or performance issues in one area can inadvertently impact others, perhaps in surprising ways.
  2. Ownership: With so many domains represented, some hadn’t been a recent focus for any squad. With this loss of focus comes an increased risk of bugs being inadvertently introduced.

To measure the coupling of our system, we looked at how frequently each file changes and at the correlation between changes to that file and changes to others. Adam Tornhill, author of the excellent Software Design X-Rays, has put together a project called code-maat that we were able to use to extract this information from our projects’ git histories (see specifically the “sum of coupling” metric). We also used code-maat to identify which files have the most distinct authors, using that as a proxy for which files lack clear ownership. We took both of these metrics, along with how frequently each file changes, and came up with a number representing the risk of errors in a given file.
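To make these three per-file metrics concrete, here is a minimal Python sketch. It is not code-maat itself and not our actual scoring: it assumes the git history has already been parsed into a list of commits (the Commit type is hypothetical), it approximates sum of coupling as the number of other files changed in the same commit, summed over all commits touching the file, and the final product in risk_score is only a placeholder for however the metrics are weighted in practice.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Commit(NamedTuple):
    """Hypothetical, already-parsed view of one git commit."""
    author: str
    files: FrozenSet[str]


def file_metrics(commits: Iterable[Commit]) -> Dict[str, Dict[str, int]]:
    """Compute revisions, an approximate sum of coupling, and author count per file."""
    revisions = defaultdict(int)
    coupling = defaultdict(int)
    authors = defaultdict(set)
    for commit in commits:
        for path in commit.files:
            revisions[path] += 1
            # Every other file in the same commit counts as one co-change.
            coupling[path] += len(commit.files) - 1
            authors[path].add(commit.author)
    return {
        path: {
            "revisions": revisions[path],
            "sum_of_coupling": coupling[path],
            "authors": len(authors[path]),
        }
        for path in revisions
    }


def risk_score(metrics: Dict[str, int]) -> int:
    """Placeholder combination (an assumption): a plain product of the metrics."""
    return metrics["revisions"] * metrics["sum_of_coupling"] * metrics["authors"]
```

A file that changes often, alongside many other files, at the hands of many different people ends up with the highest score, which matches the intuition behind the coupling and ownership categories above.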

A visualization of the coupling between some files (the real files have sensible names)

We took this concept and combined it with the size of the file, the complexity of the file (based on indentation), and the business domains that it was most related to, then derived a metric to represent the risk reduction potential of code in each of the domains we had identified earlier.
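Indentation-based complexity is simple enough to sketch in a few lines of Python. The version below is an illustration under stated assumptions rather than the exact measure we used: it treats four spaces (or one tab) as one logical level of indentation, ignores blank lines, and reports the total, mean, and maximum depth for a file. The function name and the spaces_per_level parameter are hypothetical.

```python
from typing import Dict


def indentation_complexity(source: str, spaces_per_level: int = 4) -> Dict[str, float]:
    """Use indentation depth as a language-agnostic proxy for code complexity."""
    depths = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no structural information
        # Expand tabs so that one leading tab counts as one logical level.
        expanded = line.expandtabs(spaces_per_level)
        leading_spaces = len(expanded) - len(expanded.lstrip(" "))
        depths.append(leading_spaces // spaces_per_level)
    if not depths:
        return {"lines": 0, "total": 0, "mean": 0.0, "max": 0}
    return {
        "lines": len(depths),
        "total": sum(depths),
        "mean": sum(depths) / len(depths),
        "max": max(depths),
    }
```

The appeal of this measure is that it needs no parser, so the same function can be run over every file in a mixed-language codebase.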

To understand how the code affected the empirical reliability of our systems, we wanted to know how impactful outages in these domains would be. Remitly has maintained a COE process and kept track of these types of events. To estimate the impact of an outage in each domain, we reviewed the past year of these reports and made a table mapping the domain where each error occurred to the number of customers affected by that error, how long it took us to find and resolve it, and the cost to the business as a result. We combined all of these, per domain, to get a sense of magnitude for how significant an error in each domain would be.
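The table described above can be sketched as a small aggregation in Python. Everything here is an assumption made for illustration: the Incident fields, the choice to sum each dimension per domain, and the way the three dimensions are folded into one magnitude (each is scaled by the largest value seen across domains, then the three are averaged). The real combination may weight these dimensions differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Incident:
    """Hypothetical row extracted from one incident report."""
    domain: str
    customers_affected: int
    hours_to_resolve: float
    business_cost: float


def impact_by_domain(incidents: Iterable[Incident]) -> Dict[str, Tuple[float, float, float]]:
    """Sum customers affected, hours to resolve, and cost for each domain."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for incident in incidents:
        row = totals[incident.domain]
        row[0] += incident.customers_affected
        row[1] += incident.hours_to_resolve
        row[2] += incident.business_cost
    return {domain: tuple(row) for domain, row in totals.items()}


def impact_magnitude(totals: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """Scale each dimension to [0, 1] by its maximum, then average (an assumption)."""
    if not totals:
        return {}
    # Fall back to 1.0 when a dimension is zero everywhere, to avoid dividing by zero.
    maxima = [max(row[i] for row in totals.values()) or 1.0 for i in range(3)]
    return {
        domain: sum(row[i] / maxima[i] for i in range(3)) / 3
        for domain, row in totals.items()
    }
```

Scaling before averaging keeps a dimension measured in large units, such as cost, from drowning out one measured in small units, such as hours.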

Once we had these two numbers we were able to combine them (along with a few other factors, such as whether other squads were already working on decoupling this domain) to prioritize which areas we should focus on first.
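As a rough illustration of this last step, the Python sketch below multiplies the two numbers and discounts domains that another squad is already decoupling. The multiplication, the DomainScore type, and the in_progress_discount value of 0.5 are all assumptions chosen for the example, not the formula we actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DomainScore:
    """Hypothetical per-domain inputs to the prioritization."""
    name: str
    risk_reduction: float      # code-based risk reduction potential
    impact_magnitude: float    # incident-based magnitude of an error
    already_being_decoupled: bool = False


def priority(score: DomainScore, in_progress_discount: float = 0.5) -> float:
    """Combine the two numbers; discount work another squad has already started."""
    base = score.risk_reduction * score.impact_magnitude
    if score.already_being_decoupled:
        return base * in_progress_discount
    return base


def prioritize(scores: Iterable[DomainScore], in_progress_discount: float = 0.5) -> List[DomainScore]:
    """Return domains ordered from highest to lowest priority."""
    return sorted(
        scores,
        key=lambda s: priority(s, in_progress_discount),
        reverse=True,
    )
```

Multiplying rather than adding means a domain needs both risky code and a history of significant incidents to rank highly; a domain that scores near zero on either dimension drops down the list.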

Results

After completing this analysis, we were able to publish a prioritization for decoupling these domains, along with our rationale. The resulting list was both surprising and validating: of the top five domains identified, three would likely have been at the top of our list based on instinct alone, whereas the other two were non-obvious.

Does this kind of work sound interesting to you? Come work with us: our team is hiring, and so are many others at Remitly!
