Tracking your improvement — “metrics”
This is a draft excerpt from the third edition of my Cloud Native Journey booklet. Tell me what you think, and check out the other excerpts if curious for more on how organizations are improving their software capabilities.
Tracking the health of your overall innovation machine can be both overly simplified and overly complex. What you want to measure is how well you’re doing at software development and delivery as it relates to improving your organization’s goals. You’ll use these metrics to track how your organization is doing at any given time and, when things go wrong, get a sense of what needs to be fixed. As ever with management, you can look at this as a part of putting a small batch process in place: coming up with theories for how to solve your problems and verifying if the theory worked in practice or not.
All that monitoring
In IT most of the metrics you encounter are not actually business oriented and instead tell you about the health of your various IT systems and processes: how many nodes are left in a cluster, how much network traffic customers are bringing in, how many open bugs development has, or how many open tickets the help desk is dealing with on average.
All of these metrics can be valuable, just as all of them can be worthless in any given context. Most of these technical metrics, coupled with ample logs, are needed to diagnose problems as they come and go. In recent years, there’ve been many advances in end-to-end tracing thanks to tools like Zipkin and Spring Sleuth. Log management is well into its newest wave of improvements, and monitoring and IT management analytics are just ascending another cycle of innovation — they call it “observability” now, that way you know it’s different this time!
Instead of looking at all of these technical metrics, I want to look at a few common metrics that come up over and over in organizations that are improving their software capabilities.
Six common cloud native metrics
Some metrics consistently come up when measuring cloud native organizations:
Lead time is how long it takes to go from an idea to running code in production, it measures how long your small batch loop takes. It includes everything in-between from specifying the idea, writing the code and testing it, passing any governance and compliance needs, planning for deployment and management, and then getting it up and running in production.
If your lead time is consistent enough, you have a grip on IT’s capability to help the business by creating and deploying new software and features. Being this machine for innovation through software is, as you’ll hopefully recall, the whole point of all this cloud native, agile, DevOps, and digital transformation stuff.
As such, you want to monitoring your lead closely. Ideally, it should be a week. Some organizations go longer, up to two weeks, and some are even shorter, like daily. Target and then track an interval that makes sense for you. If you see your lead time growing, then you should stop everything, find the bottlenecks and fix them. If the bottlenecks can’t be fixed, then you probably need to do less each release.
Velocity shows how many features are typically deployed each week. Whether you call features “stories,” “story points,” “requirements,” or whatever else, you want to measure how many of them the team can complete each week; I’ll use the term “story.” Velocity tells you three things:
- Your progress to improving and ongoing performance — at first, you want to find out what your team’s velocity is. They will need to “calibrate” on what they’re capable of doing each week. Once you establish this base line, if it goes down something is going wrong and you can investigate.
- How much the team can deliver each week — once you know how many features your team can deliver each week, you can more reliability plan your road-maps. If a team can only deliver, for example, 3 stories each week, asking them to deliver 20 stories in a month is absurd. They’re simply not capable of doing that. Ideally, this means your estimates are no longer, well, always wrong.
- If the the scope of features is getting too big or too small — if previously, reliability performing team’s velocity starts to drop, it means that they’re scoping their stories incorrectly: they’re taking on too much work, or someone is forcing them to. On the other hand, if the team is suddenly able to deliver more stories each week or finds themselves with lots of extra time each week, it means they should take on more stories each week.
There are numerous ways to first calibrate on the number of stories a team can deliver each week and managing that process at first is very important. As they calibrate, your teams will, no doubt, get it wrong for many releases, which is to be expected (and one of the motivations in picking small projects at first instead of big, important ones). Other reports like burn down charts can help illustrate how the team’s velocity is getting closer to delivering across major releases (or in each release) and help you monitor any deviation from what’s normal.
In general, you want your software to be as responsive as possible. That is, you want it to be fast. We often think of speed in this case, how fast is the software running and how fast can it respond to requests? Latency is a slightly different way of thinking about speed, namely, how long does a request take end-to-end to process, returning back to the user.
Latency is different than the raw “speed” of the network. For example, a fast network will send a static file very quickly, but if the request requires connecting to a database to create and then retrieve a custom view of last week’s Austrian sales, it will take awhile and, thus, the latency will be much longer than downloaded an already made file.
From a user’s perspective, latency is important because an application that takes 3 minutes to respond versus 3 milliseconds might as well be “unavailable.” As such, latency is often the best way to measure if your software is working.
Measuring latency can be tricky….or really simple. Because it spans the entire transaction, you often need to rely on patching together a full view — or “trace” — of any given user transaction. This can be done by looking at locks, doing real or synthetic user-centric tracing, and using any number of application performance monitoring (APM) tools. Ideally, the platform you’re using will automatically monitor all user requests and also put together catalog all of the sub-processes and sub-sub-processes that make up the entire request. That way, you can start to figure why things are so slow.
Often, your systems and software will tell when there’s an error: an exception is thrown in the application layer because the email service is missing, an authentication service is unreachable so the user can’t login, a disk is failing to write data. Tracking and monitoring these errors is, obviously, a good idea. Some of them will range from “is smoke coming out of the box?” to more obtuse ones like servicing being unreachable because DNS is misconfigured. Oftentimes, errors are roll-ups of other problems: when a web server fails, returning a 500 response code, it means something went wrong, but doesn’t the error doesn’t usually tell you what happened.
Error rates also occur before production, while the software is being developed and tested. You can look at failed tests as error rates, as well as broken builds and failed compliance audits.
Fixing errors in development can be easier and more straight forward, whereas triaging and sorting through errors in production is an art. What’s important to track with errors is not just that one happened, but the rate at which they happen, perhaps errors per second. You’ll have to figure out an acceptable level of errors because there will be many of them. What you do about all these errors will be driven by your service targets. These targets may be foisted on you in the form of heritage Service Level Agreements or you might have been lucky enough to negotiate some sane targets.
Chances are, a certain rate of errors will be acceptable (have you ever noticed that sometimes, you just need to reload a web-page?) Each part of your stack will throw off and generate different errors: some are meaningless (perhaps they should be more warnings or even just informative notices, e.g., “you’re using an older framework that might be deprecated sometime in the next 30 years) and others could be too costly, or even impossible to fix (“1% of user’s audio uploads fail because their upload latency and bandwidth is too slow”). And some errors may be important above all else: if an email server is just losing emails every 5 minutes…something is terribly wrong.
Generally, errors are collected from logs, but you could also poll the service in question and it might send alerts to your monitoring systems, be that an IT management system or just your phone.
If you can accept the reality that things will go wrong with software, how quickly you can fix those problems becomes a key metric. It’s bad when an error happens, but it’s really bad if it takes you a long time to fix it.
Tracking mean-time-to-repair is an ongoing measurement of how quickly you can recovering from errors. As with most metrics, this gives you a target to improve towards and then allows you to make sure you’re not getting worse.
If you’re following cloud native practices and using a good platform, you can usually shrink your MTTR with the ability to roll back changes. If a release turns out to be bad (an error), you can back it out quickly, removing the problem. This doesn’t mean you should blithely roll out bad releases, of course.
Measuring MTTR might require tracking support tickets and otherwise manually tracking the time between incident detection and fix. As you automate remediations, you might be able to easily capture those rates. As with most of these metrics, what becomes important in the long term is tracking changes to your acceptable MTTR and figuring out why the negative changes are happening.
Everyone wants to measure cost, and there are many costs to measure. In addition to the time spent developing software and the money spent on infrastructure, there are ratios you’ll want to track like number of applications to platform operators. Typically, these kinds of ratios give you a quick sense of how efficiently IT runs. If each application takes one operator, something is probably missing from your platform and process. T-Mobile, for example, manages 11,000 containers in production with just 8 platform operators.
There are also less direct costs like opportunity and value lost due to waiting on slow release cycles. For example, the US Air Force calculated that is saved $391M by modernizing it’s software methodology. The point is that you need to obviously track the cost of what you’re doing, but you also need to track the costs of doing nothing, which might be much higher.
Of course, none of the metrics so far has measured the most valuable, but difficult metric: value delivered. How do you measure your software’s contribution to your organization’s goals? Measuring how the process and tools you use contributes to those goals is usually harder. This is the dicey plain of correlation versus causation.
Somehow, you need to come up with a scheme that shows and tracks how all this cloud native stuff you’re spending time and money on is helping the business grow. You want to measure value delivered over time to:
- Prove that you’re valuable and should keep living and get more funding,
- Figure out when you’re failing to deliver so that you can fix it
There are a few prototypes of linking cloud native activities to business value delivered. Let’s look at a few examples:
- As described in the case study above, when the IRS replaced call centers with poor availability with software, IT delivered clear business value. Latency and error rates decreased dramatically (with phone banks, only 37% of calls made it through) and the design improvements they discovered led to increased usage of the software, pulling people away from the phones. And, then, the results are clear: by the Fall of 2017, the this application had collected $440m in back taxes.
- Sometimes, delivering “value” means satisfying operational metrics rather than contributing dollars. This isn’t the best of all situations to be in, but if you’re told, for example, that in the next two years 60% of applications need to be “on the cloud,” then you know the business value you’re supposed to deliver on. In such cases, simply tracking the replatforming of applications to a cloud platform will probably suffice.
- Running existing businesses more efficiently is a popular goal, especially for large organizations. In this case, the value you deliver with cloud native will usually be speeding up businesses processes, removing wasted time and effort, and increasing quality. Duke Energy’s lineworker case is a good example, here. Duke gave lineworkers a better, highly tuned application that queue and coordinate their work in the field. The software increased lineworker’s productivity and reduced waste, directly creating business value in efficiencies.
- The US Air Force’s tanker scheduling case study is another good example here: by adapting a cloud native, software model they were able to ship the first version in 120 days and started saving $100,000’s in fuel costs each week. Additionally, the USAF computed the cost of delay — using the old methods that took longer — at $391M, a handy financial metric to consider.
- And, then, of course, there comes raw competition. This most easily manifests itself as time-to-market, either to match competitors or get new features out before them. Liberty Mutual’s ability to enter the Australian motorcycle market, from scratch, in six months is a good example. Others, like Comcast demonstrate competing with major disruptors like Netflix.
It’s easy to get very nuanced and detailed when you’re mapping IT to business value. You need to keep things as simple as possible, or, put another way, only as complex as needed. As with the example above, clearly link your cloud native efforts to straight forward business goals. Simply “delivering on our commitment to innovation” isn’t going to cut it. If you’re suffering under vague strategic goals, make them more concrete before you start using them to measure yourself. On the other end, just lowering costs might be a bad goal to shoot for. I talk with many organizations who used outsourcing to deliver on the strategic goal of lowering costs and now find themselves incapable of creating software at the pace their business needs to compete.
Fleshing out metrics
I’ve provided a simplistic start at metrics above. Each layer of your organization will want to add more detail to get better telemetry on itself. Creating a comprehensive, umbrella metrics system is impossible, but there are many good templates to start with.
Pivotal has been developing a cloud native centric template of metrics, divided into 5 categories:
These metrics cover platform operations, product, and business metrics. Not all organizations will want to use all of the metrics, and there’s usually some that are missing. But, this 5 S’s template is a good place to start.
If you prefer to go down rabbit holes rather than shelter under umbrellas, there are more specialized metric frameworks to start with. Platform operators should probably start by learning how the Google SRE team measures and manages Google, while developers could start by looking at TK( need some good resource ).
Whatever the case, make sure the metrics you choose are
- targeting the end goal of putting a small batch process in place to create better software,
- reporting on your ongoing improvement towards that goal, and,
- alerting you that you’re slipping and need to fix something…or find a new job.
This is a draft excerpt from the third edition of my Cloud Native Journey booklet. Tell me what you think, and check out the other excerpts if curious for more on how organizations are improving their software capabilities. Also, if you’re really interested, I have a talk on this topic.