Management Area 3: Operations
Is the built thing going to keep running?
(Part Three of The Seven Areas Of Software Management)
When I entered the industry at the turn of the millennium, it was for a company doing “boxed software” drops twice a year, for which there was little operational burden — staffing a “tier 3 support” email rotation to triage field issues, and the occasional ask to stop feature development and do a hot fix for a customer issue. Thus when I journeyed across the world to Seattle to join Amazon.com in 2006, I was surprised to be told “here’s your pager and you will be on-call 24/7 every 6 weeks”.
In the time since, much of the software industry has moved to this model, coupling agile planning, software as a service, continuous delivery, and development teams owning operations. These practices embrace the idea that, since software has no physical aspect, you can succeed in building a great product by shipping something minimal quickly and then continuously improving it as you learn exactly where improvement is needed. The challenge is that modifications take time to develop, so if what you ship is under-specified for how customers start using it, you can find yourself in operational hell: your development team spends all its time on hands-on toil keeping the system running, with no time to develop the reliability improvements that would remove the need for it. Preventing this is especially difficult because shipping new features to win customers is far more visible to the organization than the work to ensure the system will operate reliably after they are won.
As a result, a number of practices have come about to balance features with reliability, with devops and Site Reliability Engineering (SRE) being predominant. They are well worth reading about, although I find SRE has muddied the waters by blurring questions of organizational structure into questions of practice. And while org structure is important, it doesn’t help where most development managers find themselves today:
- Your team develops software
- Your team owns tier 1 support for all or most issues
- Your team is expected to monitor the deployed software and fix issues ahead of customer impact
- Your team is made up of software engineers, none of whom are inherently motivated to do most of this
- The business wants you to ship features as fast as you can, and expects you personally to do the balancing with reliability to make sure that it doesn’t come crashing down.
Understanding Your Position
A lot like Engineering, most of the tactics in Operations are process. The basics are:
- Telemetry and Production Debugging (e.g., logs, metrics and APM)
- Synthetic/Active Monitoring
- Capacity Planning and Scaling
- Change Management and Deployments (e.g., integration testing, phased rollouts, feature flags, service discovery, canaries)
- Post mortems
Again the questions are:
- What is being done by your teams?
- What are your organization’s policies?
- What are industry best practices?
- What is your engineer engagement and happiness with processes?
- How is your team’s performance perceived by partners and customers?
Whether you are new to a team, or have a team you have been with for a long time, one thing to be wary of is normalization of deviance: practices that look substandard to an outsider, but inside the team are seen as acceptable because “it hasn’t caused problems yet”. In Operations, this often holds only because the deep experts who built the system are also the ones operating it, which leaves you set up to be in a bad place when they leave:
1. System quickly built to handle an acute business or operational issue, with almost no investment in operational best practices. Starts showing success, everyone celebrates, gets promoted, etc.
2. System starts having operational issues, but the deep experts who built it know just where to quickly place the duct tape for any specific issue, so operational load seems fine.
3. System scale of usage and breadth of use cases keeps growing.
4. Toil load has scaled linearly and is now in a bad place. Deep experts burn out doing too much toil and leave for greener pastures.
5. Team brings in new engineers, they make mistakes in placing the duct tape, now customers and the business start feeling impact.
6. Business finally realizes the team is in a bad place.
7. System won’t stop growing, so team is now in such a place it takes a couple of years to get on top of the reliability work that should have been done back at step 1, costing the business growth opportunities, the team its good and deep engineers, and you personally a lot of grey hair.
So when you ask these questions about process, make sure you are really understanding risk. For teams in a poor place tactically, these are the types of goals to take:
- Goal: Establish an on-call rotation across your team
- Goal: Have your team deliver a well received operability review for X
- Goal: All systems have a CI pipeline
- Goal: All systems have a canary deployment mechanism
- Goal: All systems have basic active/synthetic monitoring
- Goal: All post mortems reviewed within N days
- Goal: All post mortem action items closed out within N days
Getting to a Strategy
My favorite question when I interview a manager who owns 24x7 operations is “how do you sleep at night?” Beyond the obvious, this is trying to get at how they handle this area strategically. Especially under pressure to ship features, it’s difficult to rely on “because my processes are good”, as that is a static sample, whereas the tradeoff of reliability vs features is ongoing, so you need a way of getting the balance right to get ahead of future issues. For that, I have found the best practice is five mechanisms/metrics:
- Pager Load
- Toil Load
- Risk Backlog
- Availability and Performance SLOs
- Weekly Operational Reviews
Pager Load
The purpose of paging a developer on-call is to get an urgent operational issue in front of a developer who can resolve it. Thus it maps well to customer pain. But of course being paged is a pain for the developer too: even in working hours it distracts from what they were focussed on, and on the weekends or at 2AM it is much worse.
The nice thing about pager load is it’s easily measured. The difficult thing is that not all pages are the same amount of pain: an event causing 7 pages is not the same as 7 events, and 7 customer impacting events over 7 nights at 2AM are not the same as 7 false alarms during office hours you were warned about. Still, having argued in the past for mechanisms that account for those differences, I now find it is simplest to ignore special cases, focus on the raw count per week, and have a high bar. For a 24x7 development team rotation of about 6 people, that bar is:
- < 2 a week — team is healthy, no work needed
- Between 2 and 5 a week — team is healthy but walking the edge, should be spending some development time removing causes of pages to stop growth of count
- > 5 a week — Team is unhealthy. Resources need to be moved off features to get load back under 5.
These thresholds come from my last VP at Amazon, who worked with HR to correlate pager load with attrition rates, and they map well to my seven years of experience being a primary on various pager rotations. Because duplicate and expected pages pollute your signal, this simple measure means you will need to invest work in “deduplicating” pages and/or eliminating “it only fires when we deploy so it’s okay” type pages, but to me that is worth the cost vs a more complicated metric that can’t be compared across teams. So with that in mind, the goal to take is:
- Goal: Average less than N page events a week (N is 5 when you are taking the goal and the team is in a bad state, N is 2 when the team is in an okay state).
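As a rough sketch, the weekly thresholds above can be turned into a small check that runs against raw page counts. The function name, the return labels, and the choice to average over whatever weeks you pass in are assumptions for illustration, not part of any existing tool.

```python
from statistics import mean

# Thresholds taken from the text: under 2 pages a week is healthy,
# 2 to 5 is walking the edge, over 5 is unhealthy.
HEALTHY_BELOW = 2
UNHEALTHY_ABOVE = 5


def pager_health(weekly_page_counts):
    """Classify a rotation from raw page counts, one entry per week.

    Hypothetical helper: the name and labels are illustrative only.
    """
    if not weekly_page_counts:
        raise ValueError("need at least one week of page counts")
    average = mean(weekly_page_counts)
    if average < HEALTHY_BELOW:
        return "healthy"
    if average <= UNHEALTHY_ABOVE:
        return "walking the edge"
    return "unhealthy"


if __name__ == "__main__":
    # Raw counts only: no weighting for time of day or false alarms.
    print(pager_health([1, 0, 2, 1]))
    print(pager_health([3, 4, 2, 5]))
    print(pager_health([7, 6, 9, 8]))
```

Keeping the input to a plain list of counts mirrors the point of the metric: it stays simple enough to compare across teams.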
Toil Load
The concept of toil in devops owes its definition and popularization to Google SRE. Toil is not inherently impactful to customers, but it is painful to most engineers, because most people who become software engineers do so as the job satisfies two motivations:
- Creating software that improves the world (i.e. building)
- The sense of flow that comes from contiguous time writing challenging code (i.e. coding)
Toil gives no such motivation. Engineers will do it because it’s the right thing for their team, or the right thing for the business and they want to get paid. But that is a more basic level of motivation than what other roles usually offer, and so high toil causes high turnover, which in turn causes loss of knowledge, and so operational errors, and so customer pain.
So how to measure toil? For lightly operationally loaded teams, the easiest option is to have a policy that the only person who touches production (even for deployments) is your on-call, at which point you can use on-call time as a proxy for toil. That works well for about a third of teams. If the load is too high for that, then having two rotations (a support on-call and a deployment on-call) can be enough. That works well for about another third. In both cases the nice thing about this state is that everyone not on-call should be able to ignore the production system and focus on engineering.
Where teams struggle with those approaches is when they have high load and are “siloed” — most team members can only go deep on a subset of systems. So when you have an operational issue, you need engagement of people who aren’t on-call. That is a bad state for their flow and something you need to resolve, but it takes time to build tooling and do cross-skilling. So in that situation, you need to use your Execution mechanism to estimate toil. If the toil work is mostly non-urgent then you can integrate it into your Scrum process and use the estimates from that. If the toil is more urgent, you probably need to move to Kanban, which while non-ideal for software development, is a good way of handling a high-interrupt reality. In either case I have found a line manager can build a pretty good estimate of toil load just through that process.
One area to cover with toil is: what is the right amount? While some would say zero, the reality is systems are complex, support cases happen, and automation isn’t free. Google documents that its SRE teams’ goal is to keep toil below 50%, but that is too high for a team doing both features and reliability; rather you want to target about 25%, which in a six person team is the on-call fully focussed on support plus half a head doing things like deployments or region builds. With that in mind, potential goals are:
- Goal: Establish measure for toil. Drive toil down to 25% total time by EOY.
- Goal: Establish processes and tools so that the on-call is able to fully handle 90% of high and low priority issues without escalating to other team members
- Goal: Average less than N low priority system monitor issues a week
- Goal: Deliver runbooks and tools to support team so they are able to escalate less than N customer support issues to your team a week.
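To make the 25% target concrete, here is a minimal sketch of estimating toil from the work items your Scrum or Kanban process already tracks. The `(hours, is_toil)` tuple shape and the 40 hour week are assumptions for illustration; substitute whatever your tracker exports.

```python
def toil_fraction(work_items):
    """Return the share of tracked time that was toil.

    work_items: iterable of (hours, is_toil) tuples. This shape is a
    hypothetical stand-in for an export from your planning tool.
    """
    items = list(work_items)
    total_hours = sum(hours for hours, _ in items)
    if total_hours <= 0:
        raise ValueError("no tracked hours to measure")
    toil_hours = sum(hours for hours, is_toil in items if is_toil)
    return toil_hours / total_hours


if __name__ == "__main__":
    # The six person team from the text, assuming a 40 hour week:
    # the on-call fully on support, half a head on deployments or
    # region builds, and the remaining 4.5 heads on engineering.
    week = [
        (40, True),    # on-call, fully focussed on support
        (20, True),    # half a head on deployments / region builds
        (180, False),  # everyone else, feature and reliability work
    ]
    print(f"toil: {toil_fraction(week):.0%}")
```

The worked example is just the arithmetic from the paragraph above: 1.5 heads out of 6 is 25%.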
Risk Backlog
A lot of people use toil as a proxy for all technical debt, but that misses the risk that has built up in systems under ever increasing scale, which is not causing issues today but could cause very impactful issues in the future. Focussing only on toil means you spend too much time urgently band-aiding as things break, when you should be re-engineering the system to get ahead of the next multiple of scale. In EC2 VPC, when we measured, we found our engineers’ time divided equally between toil, automation to reduce toil, re-engineering systems to scale and get ahead of systemic risk, and delivering new features.
Now the first question is how to get to a list of risks. I have found that part isn’t hard; as much as normalized deviance is real, if you ask your engineers “where are all the risky parts of our system?”, they will have no trouble answering, in fact the opposite. That is actually where the normalized deviance comes from: the list looks so long it’s easy to give up hope of impactful change. And so your challenge is to come up with the subset to take action on. Here it turns out there’s a whole industry of IT risk management that has built frameworks; I particularly like the Impact and Likelihood matrix. The main thing I add is a “level of effort” column, as you will find varying amounts of effort to handle items at the same severity level, and so you can use effort to guide prioritization. Another thing is to include both security and operational risks; indeed this is your best way of normalizing the importance of the two. With all that in mind, the goal is:
- Goal: Build list of the 10 highest severity operational and security risks, with a funded plan to address in the next 12 months.
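One way the risk backlog could be sketched in code is below. It assumes 1 to 5 scales for impact and likelihood and scores severity as their product, which is a common convention for such matrices but not something the text prescribes; the `Risk` class, `prioritize` function, and sample risks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    impact: int          # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int      # assumed scale: 1 (rare) to 5 (almost certain)
    effort_weeks: float  # the added "level of effort" column

    @property
    def severity(self) -> int:
        # Assumption: severity is the product of impact and likelihood.
        return self.impact * self.likelihood


def prioritize(risks, limit=10):
    """Highest severity first; within a severity, lowest effort first."""
    ranked = sorted(risks, key=lambda r: (-r.severity, r.effort_weeks))
    return ranked[:limit]


if __name__ == "__main__":
    # Operational and security risks share one list, as the text suggests.
    backlog = [
        Risk("single-AZ database", impact=5, likelihood=3, effort_weeks=8),
        Risk("expired-cert paging gap", impact=4, likelihood=2, effort_weeks=1),
        Risk("unpatched bastion host", impact=5, likelihood=3, effort_weeks=2),
        Risk("manual region build", impact=2, likelihood=4, effort_weeks=6),
    ]
    for risk in prioritize(backlog):
        print(risk.severity, risk.effort_weeks, risk.name)
```

Sorting on effort only as a tie-breaker keeps severity in charge of the ordering while still surfacing the cheap wins at each level.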
Availability and Performance SLOs
One concern as you move towards automation and limits to get on top of operational load is that you have made it hard to understand what customers require from your system, i.e. you have broken the “build as you learn” model. That is a fair but misplaced criticism: yes, engineers getting woken up because your system is breaking is feedback, but that doesn’t mean it’s efficient feedback. Service Level Objectives (SLOs) and Error Budgets are formalizations from Google SRE that address this. Now this is one where I think you need to be careful about their rhetoric, as it relies heavily on the notion that for these to be effective, you need the structure of an SRE org holding a Feature organization accountable for its SLOs, whereas many of us experienced in other org structures know that is not the case at all.
The idea of metric driven organizations is old. The challenge is making metrics have teeth in management prioritization decisions. Now a naive engineer will say “management needs to work out what they want”, whereas an engineer-turned-manager knows “management” is 10+ different stakeholders with situationally distinct authority and influence, all with different qualitative opinions on what is right for the business. What a good metric does is capture business impact in a way all those stakeholders agree is reasonable, and can thus reasonably be bound by. For that, there are a few best practices:
- Fewer SLOs is better. If your stakeholders have to drop something they want every time one of your ten SLOs goes red, they are going to question why all ten matter. Ideally you have just one for availability, and one for performance. Now internally, you should have more, particularly around tracking deployment regressions. Just don’t expect those to have teeth with stakeholders.
- Ideal SLOs capture all support cases where your system is root cause. That is, when you have upset customers and argue “but we met our SLO”, even just internally, you have weakened the authority of your SLO.
- Simplicity is ideal, but not authoritative. Simplicity aids ease of understanding across a broad audience. But if simplicity also means you are not capturing customer pain, then you should be fine going for a more complex SLO.
- Durability is key. If you ask a stakeholder to sign off on a tradeoff based on an SLO today, and 6 months later you say that SLO doesn’t matter, then you have weakened your authority in having SLOs be binding. The best SLOs are ones that last for years, being slowly “ratcheted up” in compliance rate.
An important part of making SLOs effective is getting on top of internal violations before asking for external tradeoffs to be made. That is, if you are behind on your availability SLO, for example because your engineers didn’t sufficiently test for regressions, don’t expect you can ask to drop a feature so you can do a rewrite. This is especially important when an engineering team thinks their only way out of operational hell is a total rewrite, and they want their management to push back on all stakeholder asks on the existing system, including reliability asks, to make that happen. While this may seem reasonable in terms of optimizing resources, from a stakeholder’s perspective, why should they trust you to have an all new system operating well when you are demonstrably failing to do that with the current one?
As teams behind a feature grow, SLOs become the most powerful way to keep an organization focussed on customer experience. In my experience Amazon was happy to devote ~10% of engineering headcount to building systems to measure SLOs. A particular callout was David Richardson, the best manager I ever worked with, who took over EBS after their outage in 2011, and whose main mechanism for balancing reliability vs features was a non-trivial performance SLO backed by a ~10 person team of engineers and data scientists who measured compliance both positively and synthetically. Those resources may seem hard to justify, but unless you are running a stateless request-reply service where you can just use nines of availability and response time, you need to invest in good SLOs for them to have an impact. So with that in mind, suggested goals are:
- Goal: Establish an availability SLO and hit it.
- Goal: Establish a performance SLO and hit it.
- Goal: Ratchet up your current SLOs and hit them.
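For the simple request-reply case, the relationship between an availability SLO and its error budget can be sketched as follows. The function names and the sample numbers are assumptions for illustration; a non-trivial SLO of the kind described above would need a far richer definition of a "good" event.

```python
def availability(good_events, total_events):
    """Fraction of events that met the SLO's definition of good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Share of the error budget still unspent, as a fraction.

    The budget is the bad events the target permits: total * (1 - target).
    1.0 means untouched, 0.0 means exactly spent, negative means the
    SLO has been violated for this window.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = total_events * (1 - slo_target)
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target.
    good, total, target = 999_600, 1_000_000, 0.999
    print(f"availability: {availability(good, total):.4%}")
    print(f"budget left:  {error_budget_remaining(good, total, target):.0%}")
```

Ratcheting up an SLO then amounts to raising `slo_target` over time, which shrinks the allowed failures for the same traffic.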
Weekly Operational Reviews
Everything until now has been about how to establish good goals, but has not addressed much of the real challenge: how to achieve the goals — i.e., how to make the right calls on reliability vs features, including getting buy in from your stakeholders to let you act on those calls?
This is best described through anecdote. Andy Jassy, CEO of AWS, used to get on stage at every quarterly AWS all-hands and open with “Operations is our number one priority”. Now if you observed Andy from afar, this seemed like hypocrisy. Andy’s at heart a product guy; what excited him was new products, new features and new business. And yet, when EC2 VPC started struggling under scaling in 2014, Andy asked for a monthly review with EC2 VPC’s management, with us writing a 6 pager to cover both our performance against our operational goals, and short and long term plans to resolve. And Andy did not just attend the meeting: he was deeply engaged, with non-trivial questions every meeting, and willing to make tradeoff calls on the spot if need be (we would sometimes use the opportunity to pose such questions). And so this was a key lesson in management leadership: Andy spending his time and focus holding us accountable for our performance was all that was needed to empower us to make the right calls. I have a saying for this: “People follow their leader’s focus”.
The simplest mechanism in this space is a weekly operations review, covering performance against goals, post mortems, and action items. That was standard AWS practice, indeed Charlie Bell, head of AWS Engineering, turned up to a 2 hour one and was engaged and asked deep questions. I will be honest though, my thinking at the time was that these tended toward a combination of bureaucracy and bad theatre — too many people playing up their “chops” to the senior audience. So these are the best practices I came to:
- 10 people ideally, 20 people at most.
- As opposed to bureaucratically following an agenda based on goal compliance, content should be week to week tailored to what is pertinent, actionable, and broadly interesting to all attendees. (This likely means you need to ask someone to spend 4–8 hours a week to prepare an agenda for each meeting).
- Senior leadership who own/influence reliability vs features calls must attend and actively engage, in a way that lets engineers see they understand the issues and so are accountable.
- Meetings should telescope up the org, with more senior leadership and a smaller emphasis on deep details vs a heavier emphasis on broad lessons and trade offs as it moves up.
- Ideally 30 minutes.
And with that the goals for any manager are:
- Goal: Establish a weekly operational review meeting.
- Goal: Score 80% on agreement on “Our management understands our operational challenges” in a yearly survey of engineering.