You don’t need to stop shipping features to fix technical debt

Published in Humans of Xero, Apr 16, 2020

(This post is also published as a presentation: video here / slides here)

If you’ve ever worked at a high-growth software engineering company, you might have heard a conversation about technical debt that goes like this:

Person A: “We could release so much faster if we just did X…”
Person B: “That makes sense — so why haven’t you done it?”
Person A: “Well, with all this feature work, we just don’t have time.”

Or comments like this:

“We really need to do this. We should just stop all feature work until we fix this.”

I’ve been on both sides of the fence in these situations. When I’ve been on the receiving end (Person B above), the case for fixing the problem has almost always been well justified with sound logic, yet it doesn’t always get the “buy-in” it needs!

So what’s the problem here? Why is it so hard to find time, or to convince people to allow us to start paying down tech debt, when it’s so obvious and logical that it’s the right thing to do?

Are we defining Technical Debt correctly?

This vintage buzzword “Technical Debt” (or “Tech Debt” for short) is commonly used as a catch-all term to describe things we wish were better about the software we built or are working on. The software usually isn’t in the desired state because of a shortcut or trade-off made at an earlier point in its development. Some people make comparisons to monetary debt here, but this is software we’re talking about, so I’ll refrain.

In almost every case, things that get labelled Technical Debt are non-functional changes to software. Things like refactoring code to eliminate globals in favour of dependency injection, fixing (or adding) tests, improving the docs, or even getting that test suite to run in CI or finally automating your deployments.

Because these things are not functional changes and they don’t change the behaviour of your software, the improvements will likely not be visible to your customers. So if your customers won’t see any changes from the hours/days/weeks of work you want to spend on this Technical Debt, what value are you delivering? And why should your time be invested into this?

Communicating value to decision makers

Communicating the business value, or value to the customer, of technical problems is the key to getting time allocated to solving them. A common blocker some people have here is a fixed mindset (“we would like to, but we just aren’t allowed to fix this”). Remember that in any organisation, people make decisions: decisions are about trade-offs, and decisions can be influenced, overturned, and changed. More importantly, organisations evolve: the way decisions were made yesterday might not be the way they get made tomorrow!

Recommendation: One of my favourite pieces on the topic of decision making is “Making the Right Decision Every Time” by Ben Fathi.

If communicating the value is the key, how should you communicate the value of paying off tech debt? There’s no one-size-fits-all answer to this question — but my goal for this article is to outline a framework to help product/engineering teams classify issues, allocate time, and have clarity around why each bucket of time is appropriately sized based on the value it will deliver.

I hope that this framework will help you align with people/“stakeholders”, allow you to more effectively communicate value, and allow you to finally solve that Technical Debt you’ve been wanting to address!

The easiest way to solve Technical Debt

“The easiest way to solve a problem is to deny it exists.” — Isaac Asimov

Before you attempt to use the framework I’ll present, for every Technical Debt issue you have, ask yourself: is there really a problem here? It’s likely not worth paying down unless:

It improves the speed or quality of output e.g. reworking that build configuration will cut our release times in half, allowing us to deploy twice as often.
The cost of delay is urgent e.g. if we don’t fix this right now, it will directly impact X customers/Y% profit/Z% revenue.

If it doesn’t fit into one of the two categories above, hopefully you can just ignore it, and if not — we’ll get to that.

An Engineering Allocation Framework

At each level of planning at your organisation, for whatever your planning cadence/duration might be, you should be able to classify or allocate all work into one of these four buckets:

Functional Improvements: deliver functional changes which your customers see or use. Often a Product Manager/Owner will be heavily involved in how this bucket should be prioritised, based on research and input from customers. Example: shiny new features.
Program Work: part of a larger body of work that is driven outside of your team/happening within your organisation, often coordinated by a Program Manager who has flagged your team as a dependency for that program of work. Example: platform-wide localisation/internationalisation efforts.
Delivery Performance: increases delivery speed or quality of output in a quantifiable way. Example: automating your deployment process, so you can get code from your trunk/master branch into production faster/more frequently.
Unplanned Work: when the cost of delay is urgent, you prioritise this work and fix the issues immediately/alongside everything else that you have going on, rather than deferring them to your backlog/a future cycle. Includes bugs, customer support requests, production incidents, etc. Example: production outage affecting 20% of customers.

By classifying your work into one of those four categories/buckets, communicating the value of the work should become easier, and making time to work on your tech debt issues should be easier.

Note that the above classifications should apply to your entire backlog — not just technical debt issues — because allocating time to technical debt issues means not allocating time to your other work. Also note that technical debt will likely be classified into either Delivery Performance or Unplanned Work.

Hierarchies of planning

Above, I said “at each level” for a reason; being able to communicate with a consistent vocabulary across your organisational hierarchy will help you influence decisions to invest in solving Technical Debt issues.

Many tech companies use a model where each squad/team/“two pizza team” is often responsible for managing a backlog which is influenced by an organisational hierarchy. If that’s how your organisation works, having a hierarchy of decision makers understand this framework will help align you on terminology, logic, and justification.

Unclassifiable technical debt issues

Like any model or framework, there are edge cases that don’t fit into its structure. At first glance, you might not be able to ignore (“not do”) an issue, and you may not be able to classify it into Delivery Performance or Unplanned Work. Please don’t give up at your first glance — push yourself to find a way to justify issues into one of these by working through the details of the issues just a little further (or with a peer/SME). If after a second try it still doesn’t fit, that’s ok.

“The code is hard to test” is an example of one of those tricky scenarios, where the issue is hard to ignore but the justification for classifying it is unclear. In these cases, I’d suggest the following approach:

Make small, incremental improvements to your code — don’t try to ask for time to work on a “big bang” project to solve it.
Every Pull Request (PR) should improve your code, improve your tests, and improve test coverage (or at the very least, not regress!). Quality gates in your CI system could even enforce rules such as not letting test coverage drop by more than X%, and requiring a minimum Y% coverage.
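
As a rough illustration of the quality gate idea, here is a minimal Python sketch of the kind of check a CI step could run. The function name `coverage_gate` and the default thresholds (the “X” and “Y” above) are my own placeholders, not values from any particular CI tool; it assumes your pipeline can supply the previous and current coverage percentages.

```python
def coverage_gate(previous: float, current: float,
                  max_drop: float = 1.0, minimum: float = 70.0) -> list[str]:
    """Return a list of gate failures; an empty list means the PR passes.

    All values are percentages (0-100). The default thresholds are
    illustrative assumptions, not recommendations.
    """
    failures = []
    # Rule 1: coverage must not drop by more than `max_drop` points.
    if previous - current > max_drop:
        failures.append(
            f"coverage dropped by {previous - current:.1f} points "
            f"(allowed: {max_drop:.1f})"
        )
    # Rule 2: coverage must stay at or above the agreed minimum.
    if current < minimum:
        failures.append(
            f"coverage {current:.1f}% is below the minimum {minimum:.1f}%"
        )
    return failures
```

A CI step could call this with the coverage of the target branch and of the PR, print any failures, and exit with a non-zero status if the list is not empty.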

If that doesn’t seem like a tractable approach, you may need to find a way to redefine the problem such that it’s broken into smaller chunks — and maybe some of those chunks can be ignored, some can be classified, and some can be worked into the pay-as-you-go (PAYG) approach mentioned above.

Allocating Time

Once you’ve classified work into one of the four buckets, the next step is to “slice the pie” or allocate estimated time for each bucket.

An example breakdown of these four buckets would be:

40% Functional Improvements
10% Program Work
20% Delivery Performance
30% Unplanned Work

Determining the right allocation percentages

Here’s an approach I’ve used successfully: start by nominating numbers, reviewing with stakeholders, then adjusting where appropriate.

You cannot successfully allocate percentages without involving stakeholders, largely because those percentages are negotiable and every adjustment involves a decision, and hence a trade-off.

In what order should you nominate numbers?

The first number to establish is Unplanned Work; this shouldn’t be something you decide, but rather it should be anchored in data you have which allows you to extrapolate what you’re expecting for the forward-looking time period. For example, if you’re planning quarterly, you might look at the time spent in the last quarter, and (particularly for seasonally-affected businesses) the time spent for the same quarter in the year prior. If stakeholders try to negotiate this number down, walk them through how you established the number and help them understand the reality of your situation.
The second number to establish is Delivery Performance. Why? Based on the framing above, we want to start with an ambitious but well-grounded number. Everyone should be motivated to reduce Unplanned Work, and Delivery Performance is where that can happen (more on that soon). If your Unplanned Work is “too large” (each business will have a different level of tolerance) consider increasing your Delivery Performance allocation until that number becomes more acceptable. If you adjust this number down, make sure you communicate that trade-off with the relevant stakeholders.
Recommendation: the Google SRE Book topic on Embracing Risk has some great thinking in the area of identifying a risk tolerance within your business — this might be helpful when trying to establish how much time to allocate to your Delivery Performance bucket. While using an Error Budget might be a highly effective option for some teams or organisations, agreeing to a system for managing Error Budgets and implementing/measuring SLOs is often a challenge in itself. The allocation framework I’m presenting in this article has some overlap, but in some ways might offer an alternate path for allocating time to address some quality/reliability issues that would otherwise be addressed using Error Budgets.
The third number to establish is Program Work. These asks are often coming from other teams or parts of your business, some are more negotiable than others. Try to size this bucket based on what you believe must get done for the next duration.
The last number is Functional Improvements; by this point, you’re done sizing because you’re working with the remaining percentage. You may have heard the term Value Streams, where a team’s activities align on delivering what its customers ask for. This bucket is usually the main one which directly solves for the things your customers want to “see”. If the size of this bucket is too small for the value you believe you need to deliver, or too small to meet the expectations of your customers in terms of new feature delivery, we’ll cover the lever for getting more time for this bucket shortly.
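
The ordering above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `size_buckets` is hypothetical, and averaging past periods is just one simple way to extrapolate Unplanned Work from historical data. Delivery Performance and Program Work are passed in as the numbers you have nominated, and Functional Improvements is whatever remains.

```python
def size_buckets(unplanned_history: list[float],
                 delivery_performance: float,
                 program_work: float) -> dict[str, float]:
    """Size the four buckets as percentages of team time.

    unplanned_history: share of time (in %) spent on Unplanned Work in
    comparable past periods, e.g. last quarter and the same quarter a
    year earlier.
    """
    # 1. Unplanned Work is anchored in data rather than chosen.
    #    A plain average is an assumption; use whatever extrapolation fits.
    unplanned = sum(unplanned_history) / len(unplanned_history)
    # 2 and 3. Delivery Performance and Program Work are nominated numbers.
    # 4. Functional Improvements gets the remaining percentage.
    functional = 100.0 - unplanned - delivery_performance - program_work
    if functional < 0:
        raise ValueError("buckets exceed 100%; revisit the nominated numbers")
    return {
        "Functional Improvements": functional,
        "Program Work": program_work,
        "Delivery Performance": delivery_performance,
        "Unplanned Work": unplanned,
    }
```

For instance, with 28% and 32% of time spent on Unplanned Work in two comparable past quarters, 20% nominated for Delivery Performance and 10% for Program Work, this yields the 40/10/20/30 example breakdown shown earlier.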

Once you’ve done the initial sizing of each bucket, you’ll want to consider how that could be tweaked/adjusted based on some discussions with stakeholders. There are two “levers” within this framework to consider:

Delivery Performance work should reduce Unplanned Work
Program Work impacts Functional Improvement Work

Reducing Unplanned Work

Ideally any effort classified as Delivery Performance will reduce the amount of Unplanned Work in the next planning period or cadence of work.

Recommendation: Check out the Accelerate book by Dr. Nicole Forsgren, Jez Humble, and Gene Kim — an awesome guide to measuring and thinking about improving software delivery performance, with an excellent definition of Software Delivery Performance and 4 key metrics of Lead Time, Deployment Frequency, Mean Time To Restore (MTTR) and Change Fail Percentage.

Delivery Performance can impact Unplanned Work on two fronts:

Delivery Performance work should reduce the MTTR (Mean Time To Restore). Even if the quantity of bugs or issues in the Unplanned Work bucket doesn’t reduce immediately, the time it takes to solve them should — thus reducing the overall time required to be allocated to the Unplanned Work bucket.
Delivery Performance work should improve quality of output, and therefore reduce the Change Failure Rate. Ideally we want to see the number of issues contributing to our Unplanned Work bucket decline over time, and improving the quality of output (in terms of accuracy, reliability, performance, security, etc) should do exactly that.

Depending on the context, sometimes it’ll make sense to prioritise reducing MTTR over reducing Change Failure Rate. For example, if restoring service takes days, and the number of failures is small yet constant, you might deliver more value to your customers by first reducing MTTR so that future issues are resolved more quickly, and only then reducing the number of failures they see, because the length of an outage or bug might have a higher impact than the quantity. The aim is to use these two variables, but to frame the decision in terms of value for the customer.
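
If you want to put numbers on these two variables, here is a minimal Python sketch. The function names and the input shapes (a list of start/restore timestamps, and simple deployment counts) are assumptions for illustration; your incident and deployment tooling will determine what data you actually have.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) across incidents."""
    total = sum(
        (restored - started for started, restored in incidents),
        timedelta(),  # start value so that sum() adds timedeltas
    )
    return total / len(incidents)


def change_fail_percentage(deployments: int, failed_deployments: int) -> float:
    """Share of deployments (in %) that caused a failure needing a fix."""
    return 100.0 * failed_deployments / deployments
```

Tracking both values per planning period gives you a way to check whether your Delivery Performance work is actually moving them, and which of the two is hurting customers more.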

The impact of Program Work

I consider the balance of Program Work and Functional Improvements an allocation lever. By working with Program stakeholders sometimes you are able to defer or reassign work that your team is being asked to do — so long as the reasoning for that decision is well articulated, and those stakeholders agree to make that trade-off.

Programs are often large and complex bodies of work that consist of multiple projects, teams and dependencies. It’s important to balance things so you’re anticipating blockers and avoid being a blocker on other teams. It is very easy for a Program Manager or driver to ask your team to complete dependent work early — but from your perspective, pushing that work back one week/month/quarter could mean a world of difference to how you can allocate time, and ultimately the value your team can deliver to your customers.

Given that Unplanned Work is somewhat fixed, and the lever for reducing that is largely Delivery Performance — when trying to find more time to allocate to Functional Improvements, I’d suggest working with Program stakeholders to allow you to reduce your allocation to the Program Work bucket. You should be able to achieve this by explaining the bucket allocation you have, and why you think pushing back on their ask is the best thing for your customers right now.

Because of the way this lever is structured, the trade-off they’re making is literally shipping value to the customer specific to your product instead of working on a larger initiative (which, in itself, may or may not deliver more visible value to the customer).

Maximising the Allocated Time

Once your team has finished classifying its backlog into the four buckets and has allocated a percentage of time to each, how can you actually execute on that?

Each team is going to be different — factors like team size, experience levels, tenure/subject matter expertise, etc. all affect decisions of who works on what.

However, there are some things you should consider when dividing the time allocated to your team up between individuals (or pairs, in the case of pair programming environments):

Everyone in the team should own delivering value to the customer.
Everyone in the team should own the quality of output. This includes the accuracy, reliability, performance, security, etc.
Letting a SME continue to focus on one area just hides/defers risk and costs — e.g. lack of business continuity if they’re away/sick/unavailable, cost of knowledge transfer for when they leave, lack of knowledge transfer if they don’t off-board properly/fully, etc.

Sometimes you might accept deferred risk in exchange for getting things across the line quickly. That’s ok, but you should make sure your team/stakeholders are aware of and aligned on that decision.

Measurement

If you do consider adopting this framework, I’d really encourage you to think about how you can measure your success before you start.

For example, if you’re using an issue tracker like Jira, you might consider adding a field or label that lets you assign issues to one of the four buckets.

Ahead of your next cycle/sprint, you should be able to quickly report on what your intended breakdown is — and after that cycle ends, you should be able to analyse how you did. One question that often gets asked here is whether to track the total number of issues, or the sum of estimations (such as time or effort) — that really depends on how your team works and what you think will offer a better return on effort. Don’t aim for a perfect system — just choose something you believe will provide enough utility to give you visibility into how you’re tracking.

Labelling and tracking how you’re progressing on each bucket is especially useful for Unplanned Work — and for that you might look at it on a more regular cadence than once-per-cycle to understand if it’s impacting your cycle significantly more or less than expected.
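
As a rough sketch of that reporting, the Python below computes the percentage breakdown per bucket from a list of exported issues, either by issue count or by summed estimate. The `bucket` and `estimate` field names are hypothetical; they stand in for whatever label or custom field you add to your tracker, and this does not use any real Jira API.

```python
from collections import defaultdict


def bucket_breakdown(issues: list[dict], by_estimate: bool = False) -> dict[str, float]:
    """Percentage of work per bucket, by issue count or by summed estimate.

    Each issue is a dict with a "bucket" key and, when by_estimate is True,
    an "estimate" key (points, hours, or whatever unit your team uses).
    """
    totals: dict[str, float] = defaultdict(float)
    for issue in issues:
        totals[issue["bucket"]] += issue["estimate"] if by_estimate else 1
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing to report on yet
    return {
        bucket: round(100 * amount / grand_total, 1)
        for bucket, amount in totals.items()
    }
```

Running this before a cycle on the planned issues, and again afterwards on the completed ones, gives you the intended and actual breakdowns to compare.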

Continuous Improvement

I’d like to wrap this article with two final thoughts…

First, organisations/teams/backlogs all evolve, and so should your approach to planning and time allocation. Using a continuous improvement mindset, making measurement/data-driven decisions a part of the way you work, and running rituals like planning/sprint retrospectives will help you to evolve your time allocation for the better.

Second, in the spirit of continuous improvement: if you found this article helpful or a hindrance, feel free to reach out — I’m keen to hear what is or isn’t working for you. Thanks for reading!