How to deal with tech debt at the scale of a super app
At Flo, every team is product-oriented, even platform teams. Focusing on the product is very important, but it has to be balanced against our development speed, which can decrease over time, leaving us less room to experiment or deliver value.
In this article, we want to share how we decided to work with technical debt and how Evolutionary architecture and SRE help us balance innovation and quality in mobile development.
We have an agreement that each team can dedicate up to 20% of their capacity to technical improvements.
It’s not about teams doing something wrong during a sprint and then making up for it within that quota. New functionality must meet our quality standards.
We also distinguish between two types of debt: product debt (conscious compromises) and technical debt (outdated technologies, engineering mistakes, incorrect technical solutions).
If we need to do something faster to validate an idea or react to a changing environment, it becomes a product debt because it was a conscious compromise. And product debt is not part of this quota.
So it’s a quota for innovation that helps our teams keep moving at the same speed or even faster. We can meet the quota either by utilizing new technologies or addressing existing technical debt. It’s essential to form the quota strategically to balance these two aspects.
How to form the technical quota
As managers, we don’t want to create a bottleneck by dictating the quota. Quite the opposite — we would like to give teams the freedom to make this choice without us. But at the same time, we must achieve the expected level of quality.
Mobile developers showed significant interest here. Perhaps this was due to many limitations in mobile development, since a big app is a monolith with a complex release cycle.
We had a lot of meetings with them, trying to find a solution. And we came up with some ideas.
What if our tech quota is an error budget
Most mobile startups use statistics similar to the chart below, but few use SRE practices such as Error Budget and SLO to work with it systematically.
An error budget is the maximum time a technical system can fail without consequences. It isn’t just a constraint — it’s also an opportunity for development teams to innovate and take risks within acceptable limits. Teams can spend the budget however they like as long as the product meets its SLOs; if not, they have to fix the situation first.
SLOs (Service Level Objectives) are objectives that the team must achieve to meet agreements, depending on the team’s maturity and product needs.
You don’t need many of them: a handful of system-wide objectives that are clear to everyone is enough. In our case, they’re the crash rate, the size of the app, app start time, and battery usage.
For example, in the case of crash rate, we believe that we are mature enough to achieve a 99.9% crash-free sessions rate, and we have a direct correlation between the number of such errors and the rating in app stores.
If we consider the quota as an error budget, then:
- If we meet our SLO, it means that teams can spend their technical quota in any way they want.
- If we operate below the defined SLO (exceed the error budget), all experiments and additional initiatives of the responsible team are frozen until we meet our SLO again.
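The rule above can be sketched in a few lines. This is a hypothetical illustration, not Flo’s actual tooling: the 99.9% crash-free target comes from the article, while the session counts and the function name `error_budget_status` are made up.

```python
# Hypothetical sketch: checking a crash-rate SLO against an error budget.
# Only the 99.9% target is from the article; all numbers below are invented.

SLO_CRASH_FREE = 0.999  # target share of crash-free sessions

def error_budget_status(total_sessions: int, crashed_sessions: int) -> dict:
    """Return how much of the error budget is spent for a period."""
    allowed_crashes = total_sessions * (1 - SLO_CRASH_FREE)
    crash_free_rate = 1 - crashed_sessions / total_sessions
    return {
        "crash_free_rate": crash_free_rate,
        "budget_spent": crashed_sessions / allowed_crashes,  # > 1.0 means frozen
        "slo_met": crash_free_rate >= SLO_CRASH_FREE,
    }

status = error_budget_status(total_sessions=2_000_000, crashed_sessions=1_500)
# 1,500 crashes against an allowed 2,000: 75% of the budget is spent, SLO met
```

When `budget_spent` goes above 1.0, the responsible team’s experiments are frozen until the rate recovers.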
How do you find a responsible team? The process relies on domain ownership:
- We don’t have code without ownership.
- The service registry document is the single source of truth for the team-to-domain mapping.
- The ownership over the domain belongs to the team, not individuals.
- Engineering managers are responsible for keeping the team-to-domain mapping up-to-date.
In general, this approach works well. However, it still does not influence the quality of the codebase.
Churn vs. Complexity, or how to find pain points in your codebase
At the same time, we started to use another approach that helps us find the codebase’s pain points using metrics.
It’s a simple technique that I highly recommend to everyone who deals with monoliths such as big mobile applications.
And it’s easy to implement. All you need is stats about your code complexity from tools like SonarQube, stats about code churn from your version control system, and simple visualization. That’s it.
Each point in the chart represents a class. The churn is the number of changes to a class over the past six months.
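The churn half of the data is trivial to extract from git history. Here’s a minimal sketch under stated assumptions: the six-month window matches the article, while the function names and the join with a SonarQube complexity export are illustrative only.

```python
# Minimal sketch of collecting per-file churn from git history.
# Complexity per class would come from a tool like SonarQube; only the
# churn half is shown here. Function names are assumptions.
import subprocess
from collections import Counter

def count_churn(name_only_log: str) -> Counter:
    """Parse `git log --name-only --pretty=format:` output into per-file counts."""
    return Counter(line for line in name_only_log.splitlines() if line.strip())

def churn_per_file(repo_path: str, since: str = "6 months ago") -> Counter:
    """Count how many commits touched each file since `since`."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return count_churn(log)

# Joining churn with a per-class complexity score gives the scatter plot:
# churn = churn_per_file(".")
# points = [(churn[f], complexity[f]) for f in complexity]  # one point per class
```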
The way one should interpret this chart is by splitting it up into four quadrants:
- The bottom left and the bottom right are not interesting because the complexity is low.
- The top left contains complex code that doesn’t change often. We could improve it, but it’s not very beneficial because there aren’t many new use cases there.
- The top right quadrant is the opposite: it’s our main area of focus because both churn and complexity are high there.
The main idea is that we need to focus not just on complex code but on code that changes frequently. And this investment helps improve our cycle time, which is one of the leading indicators of process improvement.
In our case, engineers decide what they will be working on every quarter, and they’ve started planning improvements for classes from the top right quadrant.
Evolutionary architecture and fitness functions
In a nutshell, evolutionary architecture is an approach of incremental change guided by fitness functions, which act as a kind of navigator that steers the system toward proper modularity.
And we are inspired by this idea.
However, defining fitness functions is a challenging exercise.
We see a lot of examples that are good for health checks: unit test coverage, static code checks, number of crashes, and even complex architecture checks with the help of special libs like ArchUnit.
And, of course, we use those too.
But with a lot of metrics, it’s easy to get confused. That’s why we were looking for a system-wide function that would force us to make incremental improvements.
Elaborating on the idea of Churn vs. Complexity, we tried to formalize the evaluation into a formula. And our engineers came up with the idea to take unit test coverage into account. As a result, we introduced the technical debt function, where good test coverage could reduce the overall score:
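The article doesn’t spell out the exact formula, so what follows is one plausible shape of such a function, not Flo’s actual implementation: churn and complexity drive the score up, and test coverage scales it down.

```python
# Hypothetical tech debt function: the exact formula is not given in the
# article, so this is one plausible shape, not the real implementation.

def debt_score(churn: int, complexity: float, coverage: float) -> float:
    """Higher is worse; full test coverage (1.0) removes most of the penalty."""
    return churn * complexity * (1.0 - coverage)

# A complex, frequently changed class with no tests dominates the ranking:
debt_score(churn=40, complexity=25.0, coverage=0.0)   # 1000.0
debt_score(churn=40, complexity=25.0, coverage=0.8)   # 200.0
```

The multiplicative form makes the tradeoff explicit: you can reduce the score either by refactoring (lower complexity) or by covering the class with tests.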
You have to be careful about covering complex code with tests. Most likely, the tests will be complex too: fragile, with many mocks, and hard to maintain. So it’s always better to combine them with proper refactoring that makes the code testable.
Of course, that’s not the only thing to look at. But it does show the system’s complexity at any given time, which allows us to react immediately. And this is a crucial aspect of evolutionary architecture.
How to combine SRE and Evolutionary architecture into one system
In SRE, the error budget is based on uptime, and teams monitor it periodically.
In our case, it’s based on our tech debt function.
If a team has classes where the score is higher than the allowed level, it means they’ve exceeded their innovation budget and need to fix this.
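The freeze rule reduces to a threshold check over a team’s classes. This is a hypothetical sketch: the threshold value, the class names, and the scores below are invented for illustration.

```python
# Hypothetical sketch of the freeze rule: if any owned class scores above
# the allowed level, the team's innovation budget is considered spent.
# The threshold and all scores below are made-up illustration values.

ALLOWED_SCORE = 500.0

def budget_exceeded(class_scores: dict[str, float],
                    threshold: float = ALLOWED_SCORE) -> list[str]:
    """Return the classes a team has to fix before starting new experiments."""
    return sorted(name for name, score in class_scores.items() if score > threshold)

offenders = budget_exceeded({"OnboardingFlow": 820.0, "SymptomLogger": 310.0})
# -> ["OnboardingFlow"]: experiments are frozen until it's brought back down
```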
This approach allows us to reduce tech debt and influence cycle time. It’s also crucial that we do it without micromanagement, because we have a system that allows us to innovate but keep our bar at the needed level of quality.
Engineering OKRs & Mobile community
One more important thing that helps us work together is the community of practice.
We have only one iOS and one Android developer per team, which by itself leads to weak collaboration between mobile developers.
That’s why the community of practice is essential for us.
Eventually, we started using the same planning process as product teams, with OKRs for goals and alignment with engineering strategy and ICE scoring for prioritization. This helps mobile developers move in the same direction even though they’re from different teams.
While working on technical improvements, it is impossible to make every decision just by formal agreements — you need a dialogue. Using the following techniques allows us to do this constructively.
Request for Comments (RFC) and Architectural Decision Records (ADR) are discussion techniques built around written artifacts.
RFC is for an initial discussion of a big challenge (often beyond one team) where everyone can add their point of view.
ADR is better when the range of problems is narrowed down and we need to make a decision based on possible options. Lightweight ADRs are also handy for understanding the history and context of a change.
These artifacts can be part of both product and technical quotas, depending on the problem. Our product owners know that they help discover complex solutions and reduce risks during implementation.
Tech Radar is a tool to inspire teams to pick the right technologies and techniques. The radar itself is not as important as the process of filling it: a shared understanding of tradeoffs and of the assessment rings (ADOPT, TRIAL, ASSESS, HOLD) helps with alignment.
It’s a purely technical initiative that’s done periodically.
Event Storming is a workshop format for collaborative exploration of complex domains, rooted in domain-driven design. We also use it to validate team responsibilities.
Having product owners in those workshops is even more critical than having engineers there. The results of these sessions often lead to organizational changes, which in turn affect the architecture (Conway’s law at play once again).
Over time, the complexity of the system increases. We need to make continuous technical improvements to maintain our speed. That’s why we have a quota for them.
We treat our technical quota as an error budget, using SLO and fitness functions to meet our quality goals in balance with innovations.
It’s essential to look around and borrow what works in other domains to build a system based on high-level principles — one that leaves room for different tactics in different situations.
The solid technical background of engineering managers allows them to be aware of technical challenges and help with changes.
Give freedom to engineers and motivate them to participate in the process. A lot of the ideas that you read in this article came from them.