5-Minute DevOps: Operations is the Point
I had a discussion recently about “feature teams,” why they are bad for quality, and how they are yet another example of the bad ideas pushed by SAFe that create a minefield for people to navigate while trying to find the good ideas SAFe copied into their framework. According to Scaled Agile Framework, a feature team focuses on delivering a complete end-to-end feature from the user’s perspective. In contrast, a “component team” focuses on delivering specific business capabilities; a team focused on tax calculation or inventory management, for example.
At first glance, this seems fine from the perspective of optimizing resource utilization. We can assign teams to the stream of features on the backlog and keep everyone busy. However, this ignores the impact of the communication problems between teams, how merge conflicts will be resolved, the cognitive load of understanding everything about the system, etc. Their diagram above downplays the complexity of any application large enough to need more than one team contributing to the full stack, and the impact of three, four, or more teams trying to coordinate work in silos across the entire application.
I’ve personally worked that way in the past and have had to help coordinate the code integration. This pattern also ignores everything we’ve known for more than 50 years about the impact of communication on application architecture and the positive impact of “you build it, you run it” on quality and stability. This brings us to one other thing this pattern ignores: maximizing utilization isn’t the goal. Delivering value is the goal. Delivering value means we can deliver, get rapid feedback, and adjust. It also means we react quickly and correct failure. Those two capabilities are tightly coupled and require better organizational design.
Someone asked me, “To help lower the zone of resistance for those that rely on this, what’s the preferable alternative?”
The alternative is to clearly state the business goals and then design the organization to follow the path of least resistance to get those outcomes. Start at the end and work backward.
Goal: Improve our Mean Time to Repair (MTTR)
A primary design concern of a Formula One race car is to minimize the time for a pit stop. They also practice the teamwork required to leverage those design elements most effectively. As a result, pit stops average less than 3 seconds in F1. We have a similar challenge in software. For software to be useful, people need to rely on it, so when things go wrong, we need to be able to recover quickly. To keep MTTR low, we need the system to be easy to fix. We need the tools, knowledge, behaviors, and application architecture that enable that. We also need the organizational structure that enables those. Once we engineer the ability to respond quickly, we can leverage that ability to deliver new changes and respond quickly to feedback on new features.
Recovering quickly from failure requires a safe, fast, and reliable way to deliver changes in an emergency. Automating that process from end to end ensures we do not introduce variance from mistakes made while executing manual steps under stress. We also need to ensure that our emergency fix process is always ready. The worst time to discover that it’s not is during an emergency. We do that by using our emergency process to deliver every change. Not only does that help us continuously improve our ability to respond quickly, but having a single workflow for all changes also removes yet another source of defect-generating variance. It also continuously reduces the cost of change as we keep optimizing the pipeline for operational support.
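To make the MTTR goal measurable, here is a minimal sketch of the arithmetic, assuming MTTR is taken as the average time from the start of an incident to its resolution. The function name and the sample incident data are hypothetical and only illustrate the calculation.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over (started, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the duration of each incident, starting from a zero timedelta.
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: one 30-minute outage and one 90-minute outage.
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),
]

print(mean_time_to_repair(incidents))  # 1:00:00
```

Tracking this number over time is one way to see whether changes to the pipeline, the architecture, or the team structure are actually making the system easier to fix.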
Design for Operations
Our delivery pipeline is feature zero. The next concern is observability. That should be baked in from the first delivery, not something that we worry about at some later date after we are “done.” The application should alert us when failure is occurring or about to occur. Hopefully, we’ve also designed for possible external impacts and can degrade features gracefully as dependencies fail (I never trust any hardware or software dependency). It should be easy to identify where and what to fix because the code is easy to understand. Modular design and clean architecture keep the complexity low and make the system easier to fix. We also need a team supporting the application that understands it.
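As a rough illustration of degrading gracefully when a dependency fails, the sketch below renders a page without an optional feature and logs a warning that monitoring could alert on. Everything named here (the `storefront` logger, `fetch_recommendations`, `product_page`) is hypothetical, and the failing dependency is simulated.

```python
import logging

logger = logging.getLogger("storefront")  # hypothetical application logger


def fetch_recommendations(user_id):
    """Stand-in for an external dependency; here it always times out."""
    raise TimeoutError("recommendation service timed out")


def product_page(user_id, fetch=fetch_recommendations):
    """Build the page, dropping recommendations if the dependency fails."""
    page = {"user": user_id, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch(user_id)
    except (TimeoutError, ConnectionError) as exc:
        # Emit a signal that alerting can pick up, then keep serving the
        # core page instead of failing the whole request.
        logger.warning("recommendations unavailable, degrading: %s", exc)
        page["degraded"] = True
    return page


print(product_page("user-1")["degraded"])  # True
```

The core feature stays available, and the warning gives the owning team a signal about the failing dependency rather than leaving users to report it.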
Build a Support Organization
We need teams that can quickly identify what is broken and where, to minimize the time required for discovery during an incident. To do that, a team needs to understand the problem the application is solving, how it solves it, and the architecture and culture of the code. For small applications, this is not too difficult. A single team can support the application and become very familiar with it. For larger applications, we leverage strategic Domain-Driven Design.
We deliberately align the support teams to specific business capabilities, so they can become domain experts and experts in the code that implements the domains they support. Keeping Conway’s Law in mind, we make sure that the same team supports closely related components, and components we want to keep loosely coupled we put on different teams. This lowers the cognitive load and enables the team to address production issues quickly while maintaining the desired architecture.
Evolving the Application
For the application to remain useful, it should be easy to evolve as we learn more and the users’ needs change. We also need to grow a team that can quickly evolve the system without degrading our stability. We don’t want teams randomly contributing new features to the application, or it will degrade the architecture and increase tech debt. This has been demonstrated for decades. Conway’s Law is always in effect. To quickly evolve the application without degrading operations, a team must become domain experts and experts in the code that implements the domains they are upgrading. Notice that’s the same requirement that enables the support team to fix incidents quickly.
Why two teams?
That is a good question. Why do we want one team adding new features to the application and another making defect fixes? We don’t. That causes communication issues, hides valuable quality feedback from the development team, and inevitably increases tech debt, making the application less easy to change. Also, consider the difference between a “defect” and a “new feature.” The only reason to add a new feature is that the application is not behaving or performing as needed. That’s exactly the same definition as a defect. The difference between a defect and a feature is only the priority. If two teams are “fixing” the application independently, we have two sources of truth for quality and communication problems that degrade quality feedback. This is the same problem we have with feature teams.
The team with the best view of the system's stability and the most experience operating it is also the most qualified to add new features. They know the system best, understand the problem space, and care the most about the delivered outcomes because they hear about the problems first. So, the obvious choice is to eliminate the development team and give the team supporting the business capability the ownership for upgrading it.
What about Site Reliability Engineers?
Shouldn’t we have an SRE group handle support?
SREs don’t own application support. SREs work with product teams to monitor and provide feedback so the team can make their applications more reliable. They are the first people to know when there is a problem, but they do not own the resolution. If the team’s application becomes so unstable that it requires too much attention from the SRE team, it is returned to the team for stabilization. They are not victims.
But I don’t have access to production!
It’s worth mentioning that the people who most need feedback from production don’t always have access. It terrifies me to build something without the ability to get feedback on stability, functionality, etc. I’ve had to do that before. It always ends in tears. If you don’t have access to production, you need to find a way to mitigate that risk. Do whatever you can to get feedback on “are we building the right thing?” and “is it being built the right way?” Can we get an environment that matches production? Is there a way to get feedback from intended users? Every line of code changed but not validated in a production-like environment is a risk and a potential defect. Keep an eye on it.
Optimizing for the Wrong Things
Many organizations are designed around Management Experience (ManEx) or the desire to maximize utilization or output, but none of those goals solves users’ problems effectively.
A previous organization I worked for decided to “improve developer productivity” by removing the “distraction” of production support from the development teams. In the original process, the call center would call the dev teams when the playbooks those teams had provided didn’t resolve the problem. In the new “improved” process, support was an entirely different organization under a different reporting structure, responsible for Tier 1 and 2 support, with the development teams as Tier 3. This degraded the user experience.
The immediate impact was that instead of a development team getting called within 15 minutes of an incident, we were called about 90 minutes into the problem. We then had to get up to speed on the context of the issue before we could help resolve it in the middle of the night. This extended downtime and meant we were generally dealing with hostile users. We also lost the ability to train new team members on support using lower-impact problems because we no longer saw the smaller issues. Instead, the support organization would write a script to patch the system or even deploy a fix to the code. This hid the low-level issues from the development teams. Without that quality feedback, small defects grew into big defects because development continued, blind to the low-level problems the new code was being built on. We solved this by reorganizing into product teams from feature teams, switching to a CD workflow, and making changes faster than the support org could absorb. This meant that support defaulted back to us, and we received much better quality feedback and could respond to incidents and evolve the system much faster.
Pushback from SAFe
When I posted the SAFe recommendation for feature teams as an example of one of the many anti-patterns pushed by SAFe, a “Certified SAFe Practice Consultant” responded.
I’d argue the feature factory pattern is at least as bad with DDD-based business service (component) teams that only deliver specific service outputs and are disconnected from end-to-end business outcomes, as these are only achievable upon successful E2E integration of each relevant service and no single [component] is responsible for that larger vision/outcome.
They might argue that, but they provided no evidence. It may seem reasonable to them because they also commented on ensuring people had enough to work on. That’s not our goal. Operational stability and rapid evolution are our goals.
I’ve worked for years using both patterns. In every large system I’ve worked in that was organized around capabilities, every team knew how their component fit into the larger whole because of both constant high-level roadmap discussions and the domain diagrams that showed the interaction between components. Teams talked to each other to coordinate contract changes in a way that enabled asynchronous delivery and ensured interoperability. Teams understood how to handle data to prevent inappropriate mutation. There were no capability overlaps that created multiple sources of truth for flow or information. Coordination was easy because teams only needed to coordinate with the teams they shared interfaces with, not the entire system. Quality was high because each team owned its capabilities from birth to death. This is exactly how Amazon and Netflix work as well. Neither uses SAFe. Both are agile at scale because they manage scale with engineering, not process.
Design for Outcomes
“You build it; you run it” with deliberately designed component teams delivers higher quality than Jenga Driven Development with feature teams. It’s important that people understand what that means, though. That doesn’t mean the team owns all of the problems of infrastructure and delivery; that’s what good self-service platforms are for. Tools like Zarf can help reduce the effort and cognitive load of keeping the environment healthy by creating an immutable and repeatable system configuration.
Working backward from operations places our focus where it belongs: stable value delivery. The people closest to those using the product are the people best qualified to know how to improve the product. We build teams that own one or more product capabilities, own how they deliver those, and own the consequences of their decisions so they can quickly evolve and repair what they own. We design teams for value ownership, not output, and ensure they have clean communication paths for quality feedback.