Technical Air Pollution: The Reality of Technical Debt

Deficiencies that initially reduce your velocity might eventually kill your product

Ofer Karp
The Startup
12 min read · Oct 28, 2020


“To the Power Plant Run-Off Stream, please, I need to collect a Three-Eyed Fish for my Science Class” (Lisa Simpson)

Metaphors are great. Especially when we as developers need to explain our geeky world to people who don’t write code.

Technical Debt is a metaphor coined by Ward Cunningham, co-author of the Manifesto for Agile Software Development and a guru in the fields of object-oriented programming and design patterns. In this video, he explains the tech debt metaphor:

The explanation I gave to my boss, and this was financial software, was a financial analogy I called “the debt metaphor”. And that said that if we failed to make our program align with what we then understood to be the proper way to think about our financial objects, then we were gonna continually stumble over that disagreement and that would slow us down which was like paying interest on a loan.

The debt metaphor is widely used. I would be very surprised if there is a single engineering manager who isn’t familiar with it. We can all relate to the notions of financial debt and we all hate paying interest.

Personally, I don’t think debt is the best metaphor for the effect that tech deficiencies have on our product. In a way, I wish Cunningham hadn’t been working for a financial institution when he coined the metaphor, and instead had been building software for some kind of environmental organization. Why? Because I think a better metaphor would be “Technical Air Pollution”. Let me explain:

Air Pollution 101 — The vicious cycle

In our context, the main takeaway from this video is the vicious cycle:

  • Air pollution contributes to climate change
  • Climate change creates higher temperatures
  • Higher temperatures intensify air pollution

In a similar way, in the engineering world:

  • Tech gaps contribute to velocity reduction
  • Velocity reduction creates pressure to take shortcuts
  • Taking shortcuts intensifies tech gaps

Which tech gaps are we talking about?

Let’s take one step back from the metaphor and make sure we are aligned on the essence of these tech gaps. Most people think about code that was written in a “cowboy style”, but in reality, this is only one of the many types of tech gaps that slow us down. To make them possible to identify, measure, and track, I group tech gaps into 10 buckets:

  1. Dev Environment: How much time does a developer need to spend before she can start contributing code? Things like: setting up the IDE, environment variables, local server deployment, getting access to the AWS/GCP console, Jira, GitHub, Jenkins, etc.
  2. Infrastructure: Is the service we are developing making use of the most suitable infra? This can include the compute, network, and storage aspects, and the usual areas to look for gaps are high availability and scalability.
  3. Architecture: Is the structure of the system serving you best when it comes to handling the complexity of the problem your product is supposed to be solving? This includes using the right programming language, platforms, architecture model (e.g. microservices vs. monolith), frameworks, etc.
  4. Code: How easy is it to modify or add functionality to existing code? This includes things like separation into modules, use of design patterns, naming conventions, and principles like SOLID, DRY, KISS, YAGNI, etc.
  5. CI: How much time and effort does it take for a change to be integrated and deployed into a fully operational environment? This mainly has to do with the quality and performance of your automated pipeline, but also with the way changes are promoted from the dev environment to test, to staging, etc.
  6. Testing: How much time and effort does it take to verify that a change can be integrated and promoted to the next phase in the CI/CD pipeline? This is mostly about the coverage and robustness of your test automation suite.
  7. Deployment: How reliable and controlled is the pipeline that deploys a change to production? This includes the ability to trigger a deployment, get visibility into its progress, and, when necessary, perform a gradual rollout.
  8. Observability: How complex is it to get a full picture of the health status of the application? This is mostly about collecting and correlating metrics, logs, and traces from all the services and infrastructure components.
  9. Troubleshooting: How much time and effort does it take to handle a production issue? This includes processes and tools for root cause analysis, as well as the ability to efficiently create and deploy a fix.
  10. Knowledge: Are there areas in the system that only one member of the team can modify or extend? This is mostly about documentation, code reviews, and a knowledge matrix (a rough version of which can be derived from version-control history, as sketched below).
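
To make the last bucket a bit more concrete, here is a minimal sketch (my own illustration in Python; nothing in this bucket prescribes a specific tool) that derives a rough knowledge matrix from git history and flags the areas only a single person has ever touched. Directory granularity and the "commits touched" heuristic are my assumptions; a real knowledge matrix would also account for code reviews and documentation.

```python
# A rough knowledge matrix derived from git history (illustration only).
# We count, per author, how many commits touched each top-level directory,
# then flag the directories with a bus factor of one.
import subprocess
from collections import defaultdict

# author -> directory -> number of commits touching that directory
matrix = defaultdict(lambda: defaultdict(int))

# '--name-only' lists the files of each commit; the custom format line
# '@@<author>' lets us tell authors apart from file paths while parsing.
log = subprocess.run(
    ["git", "log", "--name-only", "--pretty=format:@@%an"],
    capture_output=True, text=True, check=True,
).stdout

author = None
for line in log.splitlines():
    if line.startswith("@@"):
        author = line[2:]
    elif line.strip() and author:
        top_dir = line.split("/")[0]
        matrix[author][top_dir] += 1

# Flag areas that only one author has ever touched: the bus-factor-one areas.
owners = defaultdict(set)
for author, dirs in matrix.items():
    for d in dirs:
        owners[d].add(author)

for d, people in sorted(owners.items()):
    if len(people) == 1:
        print(f"{d}: only {next(iter(people))} has touched this area")
```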

What are the effects of tech gaps?

Now that we have listed the different types of tech gaps, the next phase is to understand the effects they might have. There are some effects that are specific to different teams and different products, but in general, I find that the top 5 effect categories are:

  1. Productivity: How much value can the team create for the business in a given period of time? Tech gaps with a high effect on the productivity category mean that there is high overhead around the actual tasks that need to get done, and as a result, the team is effectively slower than its potential.
  2. Quality: How close is a feature that is being delivered by the team to the requirements that were defined? Tech gaps with a high effect on the quality category simply mean that there are too many defects in the product which naturally impacts customer satisfaction.
  3. Ownership: What is the level of independence that a developer has when it comes to being able to implement and operate a new feature? Tech gaps with a high effect on the ownership category mean that people don’t feel empowered and hence don’t hold themselves accountable for the quality of the features they developed.
  4. Stress: How many urgent and unplanned tasks are developers handling? Tech gaps with a high effect on the stress category mean that people are frequently forced to switch context from one task to another, and are constantly working under high pressure and uncertainty.
  5. Talent: How difficult is it to retain your key employees and attract top talent to join your org? Tech gaps with a high effect on the talent category mean that you have high employee attrition and that you are struggling to recruit talented people.

“Tech debt? We simply implemented the Adapter design pattern”

How to rate tech gaps & their effects?

So we have 10 tech gaps buckets and 5 effect categories. The next step is to analyze our team and our product. The idea is to answer 2 questions:

  1. Which tech gaps have the highest impact level per effect category?
  2. Which effect categories is our team suffering from the most?

The result of the analysis should be a rating of the impact (on a scale of 0 to 10) that each of the tech gap buckets has on each of the effect categories. I find that a radar chart is a good way to visualize the results (a small plotting sketch follows the examples below). This is how the radar looked for an engineering org that I was leading not long ago:

Let’s use a few examples:

  • Effect of the gaps we have in CI on ownership: 5
  • Effect of the gaps we have in observability on stress: 9
  • Effect of the gaps we have in testing on productivity: 10

Thanks to the rating, we can also understand that:

  • Observability is the bucket that has the most impact on ownership
  • Architecture is the bucket that has the most impact on talent
  • Testing is the bucket that has the most impact on productivity
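
For readers who want to reproduce this kind of chart, here is a minimal sketch in Python with matplotlib (my choice of tooling, not something the analysis depends on). The ratings are placeholder values, except for the handful of numbers quoted in the examples above.

```python
# A minimal sketch of plotting a tech-gap radar chart with matplotlib.
# The bucket names are the 10 buckets from this article; the ratings are
# placeholders, apart from the few values quoted in the examples above.
import numpy as np
import matplotlib.pyplot as plt

buckets = ["Dev Environment", "Infrastructure", "Architecture", "Code", "CI",
           "Testing", "Deployment", "Observability", "Troubleshooting", "Knowledge"]

# impact (0 to 10) of each bucket on each effect category
ratings = {
    "Productivity": [4, 3, 6, 5, 7, 10, 4, 6, 8, 5],
    "Quality":      [2, 4, 5, 6, 3, 7, 5, 4, 3, 2],
    "Ownership":    [2, 3, 4, 3, 5, 6, 4, 7, 5, 6],
    "Stress":       [1, 5, 4, 3, 4, 7, 6, 9, 8, 3],
    "Talent":       [3, 4, 8, 6, 2, 5, 2, 4, 3, 5],
}

# a radar chart is a line plot on a polar axis; repeat the first point
# at the end of every series so the polygon closes
angles = np.linspace(0, 2 * np.pi, len(buckets), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for category, values in ratings.items():
    closed = values + values[:1]
    ax.plot(angles, closed, label=category)
    ax.fill(angles, closed, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(buckets, fontsize=8)
ax.set_ylim(0, 10)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```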

How to identify & measure tech gaps?

But how do you actually perform this analysis and come up with this chart? Let me start by sharing the wrong way of doing it. Why? Because this is one of those areas where following our instincts might lead us to make the wrong decision and fail. Don’t ask how I know that. So, here is what we don’t want to do: schedule a meeting named “Eliminating our technical debt” and invite the 15 people who are considered the “technical leaders” in the org (usually the most vocal ones). I can promise you that such a meeting will end with the conclusion that the only way to manage the existing tech gaps is to rewrite the entire product.

A better option is to take a data-driven approach. Open your project management tool (e.g. Jira) and create a list of the major projects that your team is currently working on, or has worked on in the last 6 months. For every project, look at the overall investment (story points, developer days, or whatever allocation unit you are using) and ask yourself: should completing such a project really take that much effort and that many resources?

The projects for which your answer was “this looks like a huge effort for such a project” are the ones you want to drill into. Go over the different tasks to find out where and by whom most of the time was spent. Only then should you schedule meetings, and these should ideally be 1:1 meetings with the developers who were assigned to those long tasks. In the meetings themselves, I find that techniques borrowed from the UX world work great. Specifically, I try to apply methods from the domain of user interviews to these discussions with developers about the tech gaps that slow them down.
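
As a starting point for the data-driven pass, here is a minimal sketch against the Jira Cloud REST API. It assumes a team-managed project where issues roll up to a parent epic and story points live in a custom field; the instance URL, credentials, JQL, project key, and the custom field id are all placeholders you would need to adapt to your own instance.

```python
# Aggregate effort per epic from Jira, to spot the initiatives that look
# suspiciously expensive. Sketch only: credentials, project key, and the
# story-points field id are placeholders (the field id varies per instance).
import requests
from collections import defaultdict

JIRA_URL = "https://your-org.atlassian.net"        # placeholder
AUTH = ("you@your-org.com", "YOUR_API_TOKEN")      # placeholder credentials
STORY_POINTS_FIELD = "customfield_10016"           # check your own instance

def fetch_issues(jql):
    """Page through all issues matching the JQL query."""
    issues, start_at = [], 0
    while True:
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": jql,
                    "fields": f"{STORY_POINTS_FIELD},assignee,parent",
                    "startAt": start_at, "maxResults": 100},
            auth=AUTH,
        )
        resp.raise_for_status()
        data = resp.json()
        batch = data["issues"]
        issues.extend(batch)
        start_at += len(batch)
        if not batch or start_at >= data["total"]:
            return issues

# Issues resolved in the last 6 months, grouped by their parent epic.
# Note: 'parent' works for team-managed projects; company-managed projects
# may expose the epic link through a different custom field.
issues = fetch_issues("project = MYPROJ AND resolved >= -26w")
effort_per_epic = defaultdict(float)
for issue in issues:
    fields = issue["fields"]
    parent = (fields.get("parent") or {}).get("key", "no-epic")
    effort_per_epic[parent] += fields.get(STORY_POINTS_FIELD) or 0

# The most expensive initiatives float to the top: these are the candidates
# for the "should this really have cost so much?" question.
for epic, points in sorted(effort_per_epic.items(), key=lambda kv: -kv[1]):
    print(f"{epic}: {points:.0f} story points")
```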

How to manage existing technical gaps?

Thanks to the 1:1 discussions we were able to complete the analysis. We now have our own version of the tech gaps radar chart. The next step is to decide which of the effects we want to tackle and use the chart to determine which of the buckets we need to improve. For example, if we want to fight against loss of productivity, the chart shows us that we need to improve on testing (10) and troubleshooting (8).

In terms of the actual actions that need to be taken in order to reduce tech gaps, this is very specific to each bucket (e.g. how to build a good automated test suite is completely different from how to close knowledge gaps), and each bucket deserves an article of its own. But from a management perspective, there are several methods that are worth mentioning:

  • Technical backlog: Just like the way we manage new features in a product backlog, we need to manage the initiatives that are related to closing tech gaps in a technical backlog. It is important that we provide the same level of visibility and apply the same tracking methodologies to items coming from both backlogs.
  • The boy scout rule: The Boy Scouts have a rule: “Always leave the campground cleaner than you found it”. The same should apply to every code change performed by any team member. The best way to take this from theory to reality is to have this included in your code review checklist.
  • Cleanup capacity: One of the methods for allocating resources to closing tech gaps. This is the “social” method, and the way it usually works is that at the beginning of each sprint you define the allocation ratio between product and tech, say 80% product and 20% tech. Each team assigns people to tasks according to this ratio, and you simply use the 2 separate backlogs (product & technical) to plan and manage both agendas (a short illustration follows this list).
  • Cleanup sprint: A more aggressive method for allocating resources to closing tech gaps. This is a “stop the world” maneuver where you allocate the entire org to handle several items from the technical backlog in parallel, which means that until these tech gaps are closed there is zero progress on the product backlog.
  • Cleanup task force: Another method for allocating resources to closing tech gaps. This is the “ninja” method and the way it works is to create a dedicated task force (usually 4–6 engineers) that will be completely focused on closing a specific tech gap.
  • Major rewrites: When conventional weapons don’t seem to make an impact, we always have the “let’s nuke them” option. This is of course very risky. The most reasonable way of doing it is to break the system down into smaller components/services, and each time re-implement the one service that is in the worst shape in terms of tech gaps.

“Nothing will survive, not even bacteria” (Dan Truman)
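
To show how the cleanup-capacity ratio translates into sprint planning, here is a tiny sketch. The 80/20 split is the example ratio from the bullet above; the team size and sprint length are made-up numbers.

```python
# Split a sprint's developer-days between the product and technical backlogs
# by a fixed ratio (illustration only; all numbers are placeholders).
team_size = 8            # developers on the team
sprint_days = 10         # working days in a two-week sprint
tech_ratio = 0.20        # the agreed product/tech split (80% / 20%)

capacity = team_size * sprint_days          # 80 developer-days in total
tech_days = round(capacity * tech_ratio)    # 16 days for the technical backlog
product_days = capacity - tech_days         # 64 days for the product backlog

print(f"Product backlog: {product_days} dev-days, technical backlog: {tech_days} dev-days")
```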

How to control the creation of new technical gaps?

Considering how hard it is to close the existing tech gaps, one can argue that we should strive to completely avoid the creation of new tech gaps. But is it a realistic strategy? I don’t think so. There are situations in which deliberately creating tech gaps is the right way to go. As engineering managers, we need to know how to identify these situations, and how to create tech gaps in a way that will allow our team to close them later with a reasonable effort.

Personally, my rule of thumb is to distinguish between projects that modify existing features and projects that create new ones. With existing features, I will be very reluctant to take shortcuts and create gaps. I will review the design and verify that on each of the 10 tech gap buckets mentioned above, the project being planned is going to “leave the campground cleaner than you found it”.

Boy Scout Oath

On the other hand, when it comes to building new products and features, I am much more willing to accept a design decision that is not optimal from a tech perspective if it truly enables us to bring the new product to market faster. In these cases, my trick is to push the team to build the new feature in a way that makes everyone aware of the tradeoff we faced and the decision we took. What does this mean in reality? It mainly has to do with isolation. It can start with simply storing the code of the new feature in a separate repo (even if the team usually uses a monorepo), and it can go all the way to implementing the new service in a different programming language, using a different cloud provider to host it, etc. Beyond being a nice opportunity for the team to try out new technologies, we are also “marking the boundaries” of the area in which we intentionally accepted tech gaps. Later on, if the new feature proves to be successful from a business perspective, we will know exactly what needs to be fixed so that tech gaps don’t crawl into the core of our product.

“If something goes wrong at the plant, blame the guy who can’t speak English” (Homer Simpson)

Conclusion

In every product we build, there will always be gaps between the desired state and the actual state of the system. In some cases, we are deliberately creating these tech gaps along the lifecycle of the product because we need to optimize for another factor, which is usually the speed of delivery.

Unlike financial debt, which is visible and carries interest proportional to the size of the loan we took, air pollution can be invisible, and its effects depend not only on the amount of pollution but also on the size of our population and the size of our economy.

When tech gaps aren’t managed, the effects on our product and on our org would be similar to the effects that air pollution has on our world. As our team and business grow, the physical health (productivity) and mental health (stress) of more people (developers) will be impacted. Eventually, the yield of our fields (velocity) and even our climate (product quality) will also take a hit.

Ofer Karp
The Startup

Building software that people use and love. Father. Runner. EVP Engineering at WalkMe