The Tech Debt Playbook

Published in Geek Culture · 10 min read · Jan 26, 2021

All software teams have technical debt — parts of the code that weren’t created with today’s challenges in mind, or were written poorly, or were expedient hacks that are now problematic. Having tech debt isn’t necessarily a bad thing; if people spent all their time making code perfect, nothing would ever get done. But too much accumulated debt makes it slower to deliver new features and is a source of bugs and quality issues.

What should an engineering leader do when a team is mired in tech debt? This post lays out the general playbook I’ve followed. Many people have faced similar hurdles, and there are established best practices and accumulated wisdom to draw from.

The bad news is that addressing tech debt is a long, slow slog. People outside of engineering will struggle to see the value of things like “refactoring” or “data isolation.” They will need to be convinced to allocate time to things other than new features which add direct customer value.

The biggest challenges to addressing tech debt will be cultural, not technical. Engineering teams need to accept more mature processes, organizations need to commit to a shift from operating reactively to deliberate planning, and everyone must truly get behind addressing tech debt as an investment in future team capacity.

Rally leadership around tech debt

Tech debt manifests as engineering projects that take longer to ship than expected and quality being a problem. This almost always goes hand-in-hand with frequent interruptions from urgent bugs, operational problems, and customer support escalations. There may be an impression that engineering is in disarray and isn’t doing a good job writing software. But the root problem is that the system has grown too complex and haphazard to be manageable. It will require dedicated effort to untangle and clean up.

Growing teams may face an additional organizational challenge: a culture where everything is a last-minute scramble. Sales may drive product development by demanding urgent features to meet commitments. The CEO may expect to set the engineering team’s focus on a near-daily basis. The mindset that all tasks are urgent and immediate encourages quick-and-dirty hacking to meet tight deadlines. Heroically responding to sudden changes is what gets rewarded. Plans tend to be sketchy at best, and process is undervalued and ignored. While any organization must have the ability to confront urgent changes, the question is whether operating reactively ought to be the norm.

You will need to convince your leadership that the engineering machine is in danger of grinding to a halt. The team wants to be productive but is constrained by the accumulated weight of maintenance costs and edge cases from past decisions. If people want to make this better, they must commit to these things:

Acknowledge that tech debt is serious and real. We will need to invest considerable resources to make our systems more manageable. These tasks must be prioritized alongside product features, and sometimes we will choose tech debt over feature work.
We want work to be deliberately planned. Last-minute scrambles do happen, but we want to avoid operating that way. As part of this, sales teams and the CEO will support a process for planning work and let go of directly controlling day-to-day engineering assignments.

Establish best engineering practices

Engineering teams should adopt best practices for shipping high-quality code and communicating effectively. This level of baseline maturity is a prerequisite for tackling tech debt initiatives so people don’t get lost in the weeds, and also prevents unsustainable hacks or errors from getting into the codebase. Set standards like:

All engineering work should be recorded in an issue-tracking system like Jira. Other departments should use this system for requesting engineering work. Engineers working on even modest tasks should track that work so there is visibility.
Tasks are prioritized by the engineering team. Usually the product manager and the engineering manager define the relative priority and sequence of detailed tasks, consulting with stakeholders regularly about the higher-level priorities and major initiatives. It is bad when engineers “go rogue” and choose to work on projects that aren’t a team priority. And it’s even worse when someone outside the team “swoops in” to insist that a particular task be done without working with the team to prioritize the work in context and ensure it’s well thought through.
All code changes require a peer code review.
Commit messages and pull request descriptions should be thoughtfully written to give context and describe the change. A code reviewer should reject descriptions that are cryptic or one-liners.
All pull requests must include automated tests, or the comments should address why tests didn’t make sense.
Beyond automated tests, developers are responsible for doing a pass at QA to double-check a feature end-to-end.
Projects have a definition of “done.” The developer is responsible not just for the code but any data migrations, rollout, etc.
Developers should only work on one task at a time. Limiting work in progress is proven to help teams maintain better focus and more reliably deliver work.

Metrics

Choose essential metrics that show progress and help planning. Start with:

Cycle time: how much time elapses from when a developer starts work on a task, to when it is done? Finding ways to reduce cycle time aligns well with changes that improve engineering as a whole — easier deployments, smaller task size, and of course less technical debt bogging down development.
Regression rate: what percent of bugs are caused by recent code changes? A high regression rate either means developers are being sloppy or that a part of the code is dangerously complex or tricky to work with.
Sprint velocity: a count of stories (or points) finished in a sprint. This is used to predict the team’s capacity for the next sprint, and to see if capacity is increasing over time as tech debt is cleaned up. Note that velocity is a relative measure with no inherent meaning; it is not a way to score how good a team or person is, and it’s meaningless to compare two teams’ velocities because their planning and nature of work are different.
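As a rough illustration, the three metrics above can be computed from exported issue-tracker data. This is only a sketch: the `Task` and `Bug` record shapes are hypothetical, and using the median for cycle time is my assumption (it is less skewed by one long-running task than the mean).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Task:
    """A finished task exported from the issue tracker (hypothetical schema)."""
    started: datetime  # when a developer started work
    done: datetime     # when the task met the definition of "done"
    points: int = 1    # story points; leave at 1 to count stories instead


@dataclass
class Bug:
    """A reported bug; is_regression marks bugs traced to a recent code change."""
    is_regression: bool


def median_cycle_time_days(tasks):
    """Median elapsed days from starting work on a task to finishing it."""
    return median((t.done - t.started).total_seconds() / 86400 for t in tasks)


def regression_rate(bugs):
    """Percent of bugs that were caused by recent code changes."""
    if not bugs:
        return 0.0
    return 100.0 * sum(b.is_regression for b in bugs) / len(bugs)


def sprint_velocity(tasks):
    """Total points (or stories) finished in the sprint."""
    return sum(t.points for t in tasks)
```

Tracking these per sprint and watching the trend is more useful than any single reading, since velocity in particular has no meaning outside the team that produced it.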

Build a list of tech debt projects

Maintain a living document that lists all your possible tech debt projects. For each project, include a brief description, the benefits, and a rough sense of effort. When there are incidents or regressions caused by unaddressed tech debt, add a note to the corresponding item to bolster support for fixing this.

Building this list will be surprisingly easy. Some of your engineers probably already have notes along these lines. Survey your team, review your backlog, and you’ll start to see patterns. Sharing this list with your team will help validate that their concerns are being heard.

Set expectations that engineering will be gradually chipping away at tech debt for months or even years. Be very judicious in how you prioritize work; start with smaller projects that yield easy wins to establish trust.
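If the list lives somewhere structured rather than in a plain document, one entry might look like the sketch below. The field names and the "easy wins first" ordering rule are my own assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TechDebtItem:
    """One entry in the tech debt list (hypothetical fields)."""
    name: str
    description: str
    benefits: str
    effort_weeks: float  # rough sense of effort, in engineer-weeks
    incident_notes: List[str] = field(default_factory=list)

    def add_incident(self, note):
        """Record an incident or regression caused by this unaddressed debt."""
        self.incident_notes.append(note)


def easy_wins_first(items):
    """Order items by smallest effort, breaking ties by most incident notes."""
    return sorted(items, key=lambda i: (i.effort_weeks, -len(i.incident_notes)))
```

Sorting by effort alone is deliberately crude. In practice you would weigh the benefits too, but a visible ordering gives stakeholders something concrete to react to.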

Tech debt projects are notoriously difficult to scope and to track progress against. You don’t want an engineer to dig into a massive re-write and have them disappear for two months; instead, you want there to be a plan with discrete milestones. Define a “pre-project” phase for an engineer to investigate a problem area and write up a proposal that will be used as the basis for estimating work and managing the project.

Commit capacity to tech debt

Work with your boss and product manager to budget what percent of team time will go into each of these areas:

Ongoing support: how much engineering time to dedicate to bugs and operations. This is analogous to the interest payment on your tech debt.
Tech debt initiatives: capacity dedicated to clean-up. This is paying down the principal on tech debt in order to free up future team capacity.
Features: adding new customer value to the product.

When a team deep in tech debt does this accounting realistically, they may realize that their support overhead alone is consuming the lion’s share of their capacity. Unless they invest almost all their remaining capacity in paying down their debt, they will drown. Explaining the reality of the situation in terms of a “capacity budget” may help manage uncomfortable conversations with stakeholders about why new feature work will be limited.

Start the clean-up with DevOps-ing

The most effective starting point for cleaning up tech debt often isn’t the code itself, but infrastructure improvements around deploying and monitoring code. These have an outsize impact on cycle time. Worthwhile goals are:

Deployments are automated (“push a button”) and reliable.
Monitoring is in place to know immediately when something goes wrong, and to quickly diagnose the root cause.
There is automation to regularly run unit tests and end-to-end tests. Even running a small handful of tests once a day is better than nothing.
Database schema changes and data migrations follow a well-controlled process, rather than people making ad-hoc changes.
You have some form of configuration management, with settings checked into source control.

You don’t need a super-advanced setup to get started. If your tests aren’t in good shape, don’t block your DevOps efforts waiting for teams to add test coverage; go ahead and focus on getting deployments and monitoring in good shape.

On-call rotation and bug rotation

Interrupting engineers hurts productivity. But someone needs to be available to fix urgent issues and answer questions.

The standard practice is to introduce an on-call rotation. Each week, a different engineer on the team is responsible for fielding all interruptions and urgent fixes. The goal is that the rest of the team can focus on planned work. Because the team is on the hook for the code it writes, there is real skin in the game for preventing problems. The sprint should be planned without any expectation that the on-call will have time for “normal” project work — that would be a happy bonus.

Many teams struggle with a long backlog of bugs and customer support requests. Inevitably, important features are prioritized over bugs, and the backlog only gets longer. Teams need a way to allocate protected capacity for working on bugs. One solution is to ask the on-call to work on bugs during their downtime. If that’s not sufficient, teams may consider setting up a separate bug rotation so someone is 100% dedicated to the bug backlog each sprint.
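A weekly rotation like this is easy to generate. Here is a minimal sketch of a round-robin schedule, assuming you want the bug rotation offset from on-call so nobody holds both roles in the same week. The function name and the offset rule are illustrative assumptions.

```python
def rotation_schedule(engineers, weeks, bug_offset=None):
    """Return (week number, on-call engineer, bug-rotation engineer) tuples.

    Both roles cycle through the team in order. The bug rotation starts
    bug_offset positions ahead so the two roles never land on one person.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers to fill both roles")
    if bug_offset is None:
        bug_offset = n // 2  # spread the two roles as far apart as possible
    if bug_offset % n == 0:
        raise ValueError("bug_offset would put one engineer in both roles")

    schedule = []
    for week in range(weeks):
        on_call = engineers[week % n]
        bug_duty = engineers[(week + bug_offset) % n]
        schedule.append((week + 1, on_call, bug_duty))
    return schedule
```

With a seven-person team, each engineer is on call roughly once every seven weeks and on bug duty about as often, which keeps the load predictable when planning sprints.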

Putting it together — an example

As an engineering leader, you will need to spend a lot of your time selling people on the plan to address tech debt. You’ll need to remind people that ongoing support and tech debt are real costs, and part of engineering’s job. Reinforce that we as a company have decided tech debt matters, so clean-up tasks aren’t an indulgent engineering romp.

In addition to a more detailed quarterly roadmap that explains the rationale for projects and their estimates, I highly recommend sharing a distilled explainer of where engineering effort will be going, broken into three tracks of work: support, tech debt, and features. Share this regularly and repeatedly, especially at all-hands and team meetings. Here’s a made-up example:

We have 7 engineers. For Q1:

Two engineers a week are on support:

One engineer a week is on-call
One engineer a week is on bug rotation

These internal initiatives will make our system more stable and lay a foundation for upcoming projects: refactoring our scheduling system, moving software deployments from Elastic Beanstalk to ECS, and making user permissions a standalone service. We estimate this work to be about 40 engineer-weeks, or 3 dedicated engineers a week for the quarter.

This gives us capacity for about 2 engineers a week on features. We expect to roll out a partner integration with Sesame Co., add support for Outlook calendars, and provide customers with an internal dashboard showing their top performers and under-performers by market.
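The arithmetic behind this example is simple enough to script, which helps when the numbers change each quarter. The sketch below assumes a 13-week quarter and rounds the tech debt estimate to whole engineers. The function and parameter names are mine.

```python
def capacity_budget(team_size, support_engineers, tech_debt_engineer_weeks,
                    weeks_in_quarter=13):
    """Split a team's quarter into support, tech debt, and feature tracks.

    Returns the number of engineers per week on each track.
    """
    # Convert the tech debt estimate into dedicated engineers for the quarter.
    tech_debt_engineers = round(tech_debt_engineer_weeks / weeks_in_quarter)
    # Whatever is left after support and tech debt goes to features.
    feature_engineers = team_size - support_engineers - tech_debt_engineers
    if feature_engineers < 0:
        raise ValueError("support and tech debt exceed the team's capacity")
    return {
        "support": support_engineers,
        "tech_debt": tech_debt_engineers,
        "features": feature_engineers,
    }


# The made-up Q1 example: 7 engineers, 2 on support, 40 engineer-weeks of tech debt.
budget = capacity_budget(team_size=7, support_engineers=2,
                         tech_debt_engineer_weeks=40)
for track, engineers in budget.items():
    print(f"{track}: {engineers} engineers a week")
```

For the example above this yields 2 engineers on support, 3 on tech debt, and 2 on features, matching the breakdown in the text.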

Don’t let the approach become the purpose

I’ve seen cases where a team addressing tech debt adopts an unnecessarily extreme technical position because they took their approach as a mantra and let it become the goal. For example:

Let’s break the monolith into manageable chunks → everything should be a microservice.

The attention on “microservices” obscures the benefit we are seeking, which is to separate code and data into isolated components so a developer can make changes quickly and without unintended consequences. Perhaps some data must stay tightly coupled, in which case separating out a component as a larger service is appropriate. If a team bangs on the “micro” part of the “microservices” drum too loudly, that may lead to poor architecture decisions like having multiple codebases that update the same data set.

Another example:

We need to add unit tests → there must be 100% code coverage.

The statement that “parts of the code should have better tests” is very different from “every part of the code must have tests.” There may be old areas of the code which just aren’t worth testing; or highly stateful workflows where maintaining tests is more like updating a complicated simulator; or 3rd party integrations without test environments that make automation a challenge. While a leader may be tempted to frame a goal in the most ambitious terms (“everything must have tests!”) with the tacit understanding that people should apply discretion, that’s not a great idea because engineers do take things literally and it’s very easy for an initiative to spin out of control.

Keep tech debt projects on track and focused on delivering value. It’s important that the organization sees regular progress and that projects are set up for success by having clearly defined objectives set ahead of time. And finally, make sure to recognize and reward engineers who work on code clean-up, because it is just as important as feature work.

Other Resources

Accelerate is a must-read book for any technical leader. It’s especially useful for choosing which metrics to track and to justify investing in DevOps.
Martin Fowler has a great post about the types of tech debt, and whether design flaws or sloppy code count as debt. I acknowledge I’ve been a bit loose in my definition. And I didn’t even get into product debt, where the feature requirements have stacked to the point of staggering complexity.
Steve Rabin has a very thoughtful post explaining what tech debt is and how to address it.