Booking.com Engineering

Software engineering at Booking.com

How (Not) to Implement DORA Metrics

Egor Savochkin
11 min read · Mar 27, 2024


A step-by-step guide. Use it at your own risk, dude.

[Cartoon: an executive meeting, people around a table, graphs of declining sales on the wall. "What if we don't change at all and something magical just happens?"]

Disclaimer: Very often teams try to improve with the best intentions. Yet most of those efforts are wasted unless teams follow a data-driven approach. The problem is that, due to the complex, non-linear nature of the systems we build, approaches such as HiPPO (the Highest Paid Person's Opinion), "who shouts the loudest", "expert opinion", or "gut feeling" usually fail.

This post represents the author’s experience with the improvements done wrong in several companies and in many teams over several years. All mentioned situations are based on real events, yet are a composite sketch and mostly an exaggeration.

Let me set the scene: Your leadership decided to change. For some reason they decided to start with you.

They put a strange word “DORA” on your objectives. They want you to double or even triple the software delivery performance of the team. What is it? What will your strategy be? Below there are several (wrong) tips on how to survive this. But first…

The ‘truth’ about DORA

DevOps Research and Assessment (DORA) is an ongoing research program. Its mission is to tell us how we should develop our software right and what numbers we should show to our managers to make them happy.

What the people behind DORA do not tell you is that they are hidden proponents of Extreme Programming and Continuous Delivery. They wanted to promote these practices but did not know how.

Have you ever tried explaining to a sceptical manager why you should invest in test automation or tackle technical debt? It is even more difficult to explain why things like Test-driven development (TDD), pair programming, and deployment automation work.

Initially, they tried writing books and blogging to promote clean code, refactoring, and so on. Everybody reads them, but nobody really uses them. Then they found a way: the DORA model. If you start using it, you will soon end up doing all of this stuff.

In order to make the research look solid, they applied statistical methods to find correlations between the different capabilities people use and so-called organisational performance. They insist that if you improve those capabilities, you will also improve delivery performance and organisational performance. Progress is supposed to be measured by the following metrics:

  • Deployment frequency (DF): How often does your organisation deploy code to production?
  • Lead time for changes (LTFC): How long does it take to go from code committed to code running in production?
  • Change failure rate (CFR): What percentage of changes to production result in degraded service and need remediation?
  • Time to restore (TTR): How long does it generally take to restore service when a service incident or a defect that impacts users occurs?
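As a toy illustration, the four metrics above can be computed from plain deployment and incident records. The record shapes, field meanings, and numbers in this sketch are all invented for the example:

```python
from datetime import datetime

# Hypothetical deployment records: (commit_time, deploy_time, caused_failure)
deployments = [
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 15), False),
    (datetime(2024, 3, 2, 10), datetime(2024, 3, 3, 11), True),
    (datetime(2024, 3, 4, 8), datetime(2024, 3, 4, 12), False),
    (datetime(2024, 3, 5, 9), datetime(2024, 3, 5, 17), False),
]
# Hypothetical incidents: (started, restored)
incidents = [(datetime(2024, 3, 3, 11), datetime(2024, 3, 3, 13))]

days_observed = 5
hours = lambda delta: delta.total_seconds() / 3600

df = len(deployments) / days_observed                    # deployments per day
ltfc = sum(hours(d - c) for c, d, _ in deployments) / len(deployments)
cfr = sum(failed for *_, failed in deployments) / len(deployments)
ttr = sum(hours(r - s) for s, r in incidents) / len(incidents)

print(f"DF={df}/day LTFC={ltfc}h CFR={cfr:.0%} TTR={ttr}h")
```

The absolute values matter less than the trend: track them per team over time rather than comparing teams against each other.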

Do you feel that it is not obvious how the deployment frequency of your services affects the stock prices of your company? Yeah… you are not alone. But it is math, dude. Just relax and enjoy. After all, you do not need to understand how planes fly without flapping their wings, right?

“Productivity is the act of bringing a company closer to its goal. Every action that brings a company closer to its goal is productive. Every action that does not bring a company closer to its goal is not productive.” [The Goal]

Somehow, the DORA metrics work. For an engineering organisation, they give a good sense of the goal. Without a goal, all our efforts are useless.

So, now that you understand a little more of what your leadership is up to, how do you go about avoiding implementing DORA metrics? I mean, it’s not like they’re that useful, are they?

Here is my guide on how not to implement DORA metrics.

Tip #1. Say you do not have time

As an engineering manager, one of your most important responsibilities is the performance of your team. If performance stays very low over a long period of time, unfortunately, no one is to blame but you. That is why improvement initiatives should be your primary focus in the long run.

Yet, in the short term, there will of course be critical tasks. Find one: an incident, a task that can be attributed to a company must-do, etc. Be very positive and supportive, but firm. Say that you support metrics, DORA, dolphins, pandas... whatever. Unfortunately, you and your team do not have time right now.

The good thing is that if you keep doing this, you will accumulate more and more burning tasks. Soon you will have no trouble finding them.

Chances are that the leadership will leave you in peace and forget about you for this year. They can focus on other teams, after all. Otherwise, you lost this round.

Tip #2. Wait until all the tooling is ready

DORA throughput metrics (DF and LTFC) are quite straightforward and easy to collect. There are plenty of tools on the market for this. Chances are that even the coffee machine on your floor can collect all the metrics for you. Of course, you may need to do something to make it happen.

The DF is available in your deployment tool, such as Harness. The MR/commit data is in a tool such as GitLab. You need to either integrate them or extract the data and join it manually in a good old Excel sheet.

You can even do it manually. In fact, it is even better to start collecting DORA metrics without any tooling: this way you will gain much more understanding and hands-on experience.

But you can still try this trick. If there is no out-of-the-box tooling available, then suggest waiting for it. After all, you have a lot of other urgent tasks to do.

Tip #3. Never improve your code

Every professional software engineer should be able to write 100% correct code on the first try. Right?

Never go the extra mile to make the code simpler, even if you cannot recognise it a month later. Remember, any attempt to make the code simpler will lead to a shorter LTFC the next time somebody tries to change it.

A good idea is to always ask your product manager for permission to improve your code. How are they supposed to know whether the code needs work to stay maintainable? Isn't that part of your professional responsibilities as a software engineer? This will probably confuse them a lot and give you an excuse to avoid the work.

Tip #4. Push for large releases

Being able to deploy to production on demand several times per day requires a different mindset.

One approach is to implement a feature in one go. Take a feature, develop it within a couple of weeks, push a huge MR of 10K lines for code review, wait for somebody brave to look at it, get the LGTM, deploy to the QA environment, perform the necessary tests, fix the defects and finally push to production.

The DORA way assumes a different way of working: a developer should deliver the task in several small batches instead. That means many quick iterations: code a change within a day or so, push a small MR, have it reviewed, deploy it both to QA and to production (hidden behind a feature flag or via branch by abstraction), and test it.

When the feature is ready, switch it on in the QA environment (or even on production) and test it end-to-end. When we are happy about the quality of the feature we enable it for our customers by flipping the feature flag. The DORA metrics favour this “continuous delivery” approach as the DF and LTFC are much better in this case.
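A feature flag can be as simple as a dictionary lookup. The sketch below is a minimal illustration; the flag name, storage, and helper functions are invented for the example, and real setups usually use a dedicated flag service or config store:

```python
# Code ships to production "dark" and is enabled per environment later.
FLAGS = {"new_checkout_flow": {"qa": True, "production": False}}

def is_enabled(flag: str, environment: str) -> bool:
    """Unknown flags and environments default to off."""
    return FLAGS.get(flag, {}).get(environment, False)

def checkout(environment: str) -> str:
    if is_enabled("new_checkout_flow", environment):
        return "new flow"   # deployed everywhere, visible only where enabled
    return "old flow"       # default path keeps serving customers
```

Releasing the feature to customers then means flipping the `production` entry, with no new deployment required, which is exactly why DF and LTFC improve.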

[Cartoon: a dad holding his son. "Where is your dad?" "Haha, here!" "Where is your mom?" "Here!" "Very good..." "WHERE'S THE RELEASE?"]

Say that large releases are much more efficient. Block investments in automation and in engineering culture. Soon you will find yourself releasing every quarter and practising waterfall.

So, always push for large releases. Say that it is much more efficient for you to sit and develop the whole epic in one go and then move it into the QA stage at once. Pretend that if you split the feature into many smaller stories this will take twice as long.

Remember that “continuous delivery” requires a high level of automation and a great engineering culture such as unit testing, refactoring, Boy Scout rule, etc. So block it as much as possible.

Tip #5. Always blame the human factor

W. Edwards Deming once said: "a bad system will beat a good person every time" [Walt88 Ch4]. The majority of opportunities to improve lie in how the system works: rules, policies, agreements, habits, etc.

“A bad system will beat a good person every time” [E.Deming]

The data is usually available if we dare to analyse our past work. A straightforward start is to analyse the LTFC statistics and find the most constraining factors. Most likely you will see that your deployments take too much time; hence, you can implement deployment automation to improve it.

Very often, the bottleneck is your asynchronous code reviews. Another good idea is to analyse the cycle time of your tasks (the time from when the team starts development until deployment to production). You do not even need to analyse all the tasks: considering only the outliers every month will give you plenty of food for thought.
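Outlier analysis needs nothing more than a percentile over recorded cycle times. The numbers and the nearest-rank helper below are illustrative only:

```python
import math

# Hypothetical cycle times (in days) for the team's recently shipped tasks.
cycle_times = [1.5, 2.0, 2.5, 3.0, 3.0, 4.0, 4.5, 5.0, 21.0, 35.0]

def percentile(values, p):
    """Nearest-rank percentile; no external dependencies needed."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

p85 = percentile(cycle_times, 85)
outliers = [t for t in cycle_times if t > p85]
print(f"85th percentile: {p85} days; outliers worth a closer look: {outliers}")
```

Discussing just the one or two tasks beyond the 85th percentile each month (why did that task take 35 days?) usually surfaces the systemic blockers.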

Please find below a few recommendations on how we can exploit this.

First, never do retrospectives. Or make them as useless as possible: schedule an hour every other week, do no analysis beforehand (LTFC stats, outlier task analysis, etc.), and never follow up.

Second, avoid root cause analysis and data-driven decisions. Pick whatever is most obvious without digging deeper. For any problem you can always find "a well-known solution — neat, plausible, and wrong" [H.L. Mencken]. If you had a production incident, suggest implementing an end-to-end test for it. If you have painful releases, decide to make them less frequent.

“There is always a well-known solution to every human problem — neat, plausible, and wrong” [H.L. Mencken].

Third, always try to bring any root cause analysis down to a human factor. Somebody made it happen, right?

Tip #6. Promote end-to-end tests only strategy

Find any incident involving a regression defect. Start a root cause analysis. Get everybody to agree that if you had had an end-to-end test covering this exact scenario, you would have caught the error before it even reached production. A solid argument that is hard to beat, so it usually works. It is actually correct.

“Good ideas often fail in practice, and in the world of testing, one pervasive good idea that often fails in practice is a testing strategy built around end-to-end tests.” [Wack]

The trick is that end-to-end tests are damn slow and fragile. So, if you try to cover all possible scenarios with them, then soon you will bring your team to its knees. Every deployment will take hours. Everyone will try to batch them or avoid them at any cost. Your DF and LTFC will degrade. But this is exactly what we need, isn’t it?

[Cartoon: "Not so fast, Michael! It says you should write the test case before you can execute it."]

There are a few caveats though.

First, make sure the tests run on top of a shared QA environment. The more teams using it, the better. This way you will make your tests unstable without even trying too hard.

Second, include as many different services as possible in the end-to-end tests, preferably from different teams or, even better, different tracks. Even though each team surely works very well and rarely breaks tests, involving many teams will add to the fragility.

Third, reject any attempt to substitute higher-level tests with lower-level ones, especially unit tests. Imagine you have two components. One accepts a user request via its external API, does some processing and calls the second component. The second component processes the request and returns the response.

Suppose that there are 10 possible execution paths in each of these components. That means that to test the whole system you need to write 10x10=100 tests. You could substitute them with 10 unit tests for the first component, 10 unit tests for the second one, and a couple of "integration" tests. But this would reduce the number of tests, cut their execution time, and improve stability! So avoid this.
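To make the counting argument concrete, here is a toy two-component system; the validator, processor, and their behaviour are invented for the example. Each component's paths get their own unit tests, and one thin test checks only the wiring between them:

```python
# Component 1: validates an incoming request.
def validate(request: dict) -> bool:
    return isinstance(request.get("amount"), int) and request["amount"] > 0

# Component 2: processes a validated request.
def process(request: dict) -> str:
    return f"charged {request['amount']}"

# Wiring: the only thing an integration test still needs to cover.
def handle(request: dict) -> str:
    if not validate(request):
        return "rejected"
    return process(request)

# Unit tests exercise each component's paths in isolation...
assert validate({"amount": 10}) is True
assert validate({"amount": -1}) is False
assert validate({}) is False
assert process({"amount": 10}) == "charged 10"

# ...and a single integration test confirms the components are connected.
assert handle({"amount": 10}) == "charged 10"
assert handle({"amount": -1}) == "rejected"
```

With 10 paths per component this is the difference between 10+10+2 fast, stable tests and 100 slow end-to-end ones covering the same logic.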

Tip #7. Push for 100% reliability

Quality means different things. Properties like availability, latency, and durability are referred to as reliability metrics. The SRE book says it does not make sense to try to achieve 100% reliability. First, past some point users do not care, because the reliability is already good enough for them. Second, extreme reliability comes at a cost: each extra "nine" takes exponentially more effort.
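The cost of each extra "nine" is easy to see by converting availability targets into a monthly error budget. A back-of-the-envelope sketch (assuming a 30-day month):

```python
# Allowed downtime per 30-day month at each availability target:
# each extra "nine" divides the error budget by ten.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999, 0.99999):
    budget = MINUTES_PER_MONTH * (1 - target)
    print(f"{target:.3%} availability -> {budget:.2f} min/month of downtime allowed")
```

At 99.99% you have about four minutes of budget per month, which is less than a single leisurely manual rollback; that is the point where extra nines start demanding serious engineering investment.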

Interfere with any attempt to adopt a risk-based approach to reliability. Do not define SLOs from the customer's perspective at all, or make them as strict as possible. Always aim at 99.99+%.

Add a distributed cache for better latency even if you have relaxed latency requirements. Insist on canarying in both QA and production environments, with manual verification at each step. Pretend that you are doing this to increase availability (never mind that you are not gathering any stats). This will make the system harder to change, and people will start batching deployments to make them less frequent.

Tip #8. Ignore good old books

Robert C. Martin thinks that any professional should take full responsibility for their own career. You should know the basics of the profession as well as the latest advancements. Your employer pays you for productive hours of work, not for the time you spend learning the basics. Yet many employers do care about this and try to help people grow as much as possible. This is a favour, not an obligation [CleanCoder].

So, be as passive as possible. They hired you, right? Now they should take care of your learning path. Instead of reading and learning about new technologies, methods, or concepts, just sit back and relax until you meet them at work. The narrower your knowledge, the better.

[Cartoon] Keep the work-life balance.

Especially avoid the following books:

And there you have it! A guide on how best to not implement DORA metrics.

Of course, if you choose to ignore my tips, or do the exact opposite of them, then you might find yourself on the way to mastering software engineering excellence with DORA metrics. Good luck!

Special thanks to the team and to everyone who contributed to this article. Your help is greatly appreciated!

Interested in working with us? Check out our careers page.
