DEVOPS AS A STEP TO GROW

Delivering value, not just faster!

How big companies release early and often to create a continuous stream of results based on control and observability

Guilherme Magalhães
What Really Matters…

--

Photo by Matthew Henry on Unsplash

Tell me quickly: which do you prefer, to deliver code faster, or to deliver results and real value to all of your users?

Most of the time, when we talk about DevOps, we are unconsciously pursuing practices to accelerate our software delivery, but we focus so much on this set of practices that sometimes we forget what really matters.

This article is a proposal to take a more complete perspective on what we need to achieve in order to deliver the most focused solutions for our customers. Let’s use DevOps as a necessary step toward reaching the top as an awesome company that is customer-centric and completely data-driven.

Is it possible to draw a line between a culture of continuous improvement, fast delivery and fast recovery, and how these relate to successful enterprises?

It is possible, and it’s not as difficult as you might imagine. What is difficult, in almost every case, is to see and understand the big picture. And who’s to blame for this?

Blame is unnecessary and counterproductive, and in a stream of delivered value there is never one person to blame, or even a group of people. More often, failed assumptions come from a poor strategy for how the company can reach its goals. With a poor strategic vision, it is impossible to see what really matters when we talk about delivering results.

We have to face this misconception about how DevOps can enable a transformational path within the company. Once we understand the stream of value delivered, and understand that users are what really matters, we’ll be able to see its true value, not just a piece of technical tooling, culture evangelization or lightweight architecture.

DevOps practices are not a silver bullet for change in a company. It’s better to understand them as a trigger that drives the transformation in the right way. And how can that be done?

Through the right mindset and metrics. These are the terms we are interested in when we look for a picture of where we are and where we want to be, over short and long cycles.

The right mindset is the glue that keeps it all together: if people lack the right mindset to change and the current organizational practices are flawed, DevOps will simply magnify those flaws.

Metrics are only pointers, and they can point in many directions. They should be used as a guide through the often blind process of making choices about the user experience.

It is essential to understand that the intrinsic value of these practices lies in working with metrics: defining them, cultivating them and reaching them with the right mindset. This can be done in many ways and in different contexts. There is no recipe for DevOps, and there never will be. When DevOps fails, it fails not as a practice but because of a misconception about what DevOps is.

Who dares to say what DevOps is? DevOps is a set of practices that aims to reduce time to market (TTM) by improving collaboration between Dev, Ops, Biz and, if possible, more teams.

Presentation About DevOps — It’s all about data by Guilherme Magalhães

All the practices are related to optimization. In the end, we want less friction between teams, working on an automated stream of packaged and delivered software, able to solve any problem within a culture of collaboration, trust and ownership.

To better understand the DevOps proposal, let’s break down some basic metrics that we intend to improve with its practices:

TTM (Time to Market)

Delivering new features quickly and consistently is great: users are happy with new possibilities, IT is happy that its work was done and delivered in an agile way, Ops is happy that publishing new code no longer requires a special event, and Biz is even happier because there is something new in the air, probably more revenue and certainly more profit! Phew, how much good stuff we have here.

Companies like Amazon deploy as often as once every 11 seconds. TTM is an important metric, and as a metric, it needs to be evaluated so we can choose a path to follow.

MTTR (Mean Time to Repair)

Mistakes will always happen; this is inevitable, and often it is even good. Only through mistakes can we learn and avoid repeating them later on. But we can’t just salute the mistake and let the user handle it.

We need a quick way to discover the problem, to isolate it and, above all, to make a quick decision that brings a solution to the user.

MTTR is about reducing the maximum time to repair a problem, whatever it is. According to the DORA 2018 Report, Elite performers have an MTTR of less than 1 hour, while Low performers have an MTTR between 1 week and 1 month.

Change Failure Rate

The change failure rate measures how often deployments to production fail and require an immediate remedy (in particular, rollbacks).

To obtain this information, you track each deployment and record whether it was successful. Then you track the proportion of production deployments that failed, over time.

According to the DORA 2018 Report, Elite performers have a change failure rate between 0–15% and Low performers have a rate from 46–60%.
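As a sketch, tracking this metric can be as simple as recording each production deployment and whether it needed an immediate remedy. The `Deployment` record and the example history below are hypothetical, just to illustrate the calculation:

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    version: str
    failed: bool  # True if it required an immediate remedy (e.g. a rollback)

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of production deployments that failed."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.failed)
    return 100.0 * failures / len(deployments)

history = [
    Deployment("v1.0", failed=False),
    Deployment("v1.1", failed=True),   # rolled back
    Deployment("v1.2", failed=False),
    Deployment("v1.3", failed=False),
]
print(f"Change failure rate: {change_failure_rate(history):.0f}%")  # → 25%
```

A real pipeline would append a row to this history on every deploy, automatically.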

Those were some of the key metrics, and they are all related to optimization. Only through optimization will we get there.

Companies are worried about optimization, but optimization by itself is not enough.

The average age of a company listed on the Fortune 500 has fallen from almost 60 years in the 1950s to less than 20 years today. That is a very short lifespan, and many companies fall along the way.

However, some companies manage to survive and stand out from the others. What do these companies have in common that lets them survive and remain strong?

This is a question that covers many concepts and several proposals. In the book «The Innovator’s Dilemma», Clayton Christensen discusses how hard it is to succeed with disruptive technology.

What most companies manage to achieve is optimization and innovation in the products of their already established markets.

The DevOps market is growing really fast: in 2017, spending exceeded USD 2.90 billion. Sounds great, right? And by 2023, spending is expected to reach USD 10.31 billion, an incredible increase. 🤩🤩🤩

We spend so much time, money and effort on optimization and innovation in what we already know we have to do. What if we could learn the things we don’t know that we don’t know? Probably, in every sector, industry or business there is a huge amount of hidden knowledge that we can only reach by trying a new approach to fast learning and discovery.

“True optimization is the revolutionary contribution of modern research to decision processes.”, George Dantzig

In an ideal world, CI/CD promises to shorten the feedback loop between production and the development process, allowing developers to optimize performance without long waits or context switching. This is not surprising, since most digital transformation projects focus on optimizing CI/CD tooling.

In practice, the DevOps approach still has some shortcomings when it comes to testing. Testing focuses on automated tests — technical checks of developers’ output (the code) — rather than checking for improvements in business outcomes (conversion, revenue) or impact (market share, profitability, customer satisfaction).

Relying only on automated testing in CI/CD leaves exploratory testing behind. As a result, the net effect of many CI/CD efforts is that business-oriented engineering is cut out, because only the low-hanging fruit gets automated.

Or even worse: Testing and Acceptance are skipped entirely, letting developers write code and push it into Production based on developer-led testing alone, cutting out the checks and balances QA would provide.

DTAP (Development, Testing, Acceptance, Production) — or at least having separate environments for testing (T and A) — is an anti-pattern in an Agile environment.

Features pile up in an environment until the team feels ready to deploy to the next phase. The bigger the pile, the trickier this becomes: the deployment takes longer, all the configuration and the whole test suite have to be applied to the new environment, and the chance of bugs and errors grows as this process drags on. A vicious cycle tends to form as teams postpone deployment in favor of adding more new features.

DTAP pipelines encourage the creation of queues. Agile software development, on the other hand, encourages the removal of queues. Generally, queues are the source of most waste-related problems.

So should we stop thinking about DevOps and go back to Waterfall, as a planned and structured way to achieve the company’s global goals, which normally don’t vary that much?

Oh No Deal With It - GIF By DreamWorks Animation

A lack of DevOps practices can expose a team to the following age-old problems:

  • Longer cycle times
  • More bugs and issues
  • Longer problem-solving processes
  • Unstable operations
  • Poor team communication
  • Harder to manage, so much harder, oh gosh

So, what is left beyond sadness and loneliness?

Progressive Delivery, or Continuous Releasing, is what the giants of the web are doing to maintain systems that are incredible, resilient, reliable and full of delightful new features for us.

Progressive Delivery is about controlling a release and observing the impact of that release at the user level. It is necessary to see the whole as one great stream for delivering value to the end user.

Progressive Delivery allows teams to validate software and business outcomes by giving them control over what is released to users. That control comes through setting release policies. These policies govern the conditions under which a new version is released to users, and in what steps.

Conditions can be as simple as a percentage of traffic, or as complex as a selection of users based on location, device type, or login status, as well as business conditions such as “has ordered Product X before.”

Using DevOps practices to support the consolidation of the company’s global metrics is the way forward. And these metrics, of course, will need the process of delivering quickly to compose this broader view.

LinkedIn, Netflix and Facebook are releasing code faster and faster and, more importantly, they are experimenting while they deliver software. As they watch and learn, they are able to understand which experiment is the best candidate to become the right approach for the assumption or solution they’ve crafted.

There are three pillars for what we call progressive delivery.

Presentation About DevOps — It's all about data by Guilherme Magalhães

Manage

Manage is the ability to control the exposure of our features.

Which users will have early access? How do we roll back when something fails or goes wrong? How can we fail without losing confidence or users?

A deploy has no impact on the user, because it no longer changes everything on every commit; everything is managed so that we have control over every aspect of the software lifecycle.

Deploys happen more frequently and with less risk to the business and to end users, since we can be sure of how the feature is operating for each user, and we can recover by switching back from the new feature to the last version.

Feature Flag

By Justin Baker

The “feature flag” pattern allows you to activate and deactivate features directly in production, without deploying a new version.

The mechanism is very simple: you just guard the feature’s code with a condition:

Feature Flag by example
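A minimal sketch of this pattern in Python, assuming a hypothetical in-memory flag store in place of a real configuration file or database (the checkout flows are illustrative stand-ins for real features):

```python
# Hypothetical in-memory flag store; a real system would read a
# configuration file or query a database instead.
FLAGS = {"new_checkout": True}

def is_enabled(feature: str) -> bool:
    """Return True if the feature is currently active."""
    return FLAGS.get(feature, False)

def legacy_checkout_flow(cart: list) -> str:
    return f"legacy checkout for {len(cart)} items"

def new_checkout_flow(cart: list) -> str:
    return f"new checkout for {len(cart)} items"

def checkout(cart: list) -> str:
    # The feature's execution is conditioned on the flag.
    if is_enabled("new_checkout"):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)
```

Flipping `FLAGS["new_checkout"]` to `False` instantly restores the previous behavior, with no new deployment.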

The implementation of the `is_enabled` method will, for example, check a configuration file or consult a database to find out whether the feature is active.

An administration dashboard usually helps in the process of hot-swapping the status of the different “flags”.

A natural evolution of feature flipping is the ability to turn features on and off for different populations:

  • A group of “guinea pig” users: who will give feedback on the new features;
  • Other users: who will use the previous version until the feature is enabled for everyone.

The code will look like the one shown below:

Advanced Feature Flag by example
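A minimal sketch of this per-population check, with hypothetical flag and group names:

```python
# Hypothetical flag table: each flag records which user groups may see it.
FLAGS = {"new_search": {"enabled": True, "allowed_groups": {"guinea_pigs"}}}
USER_GROUPS = {"alice": "guinea_pigs", "bob": "everyone"}

def is_enabled_for(feature: str, user: str) -> bool:
    """True only if the flag is on AND the user belongs to an allowed group."""
    flag = FLAGS.get(feature)
    if flag is None or not flag["enabled"]:
        return False
    return USER_GROUPS.get(user) in flag["allowed_groups"]
```

Here alice, a “guinea pig”, sees the new search while bob keeps the previous version until the feature is opened to everyone.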

This mechanism allows you, for example, to test the performance of a new implementation by comparing the results of different populations. Outcome measurements can help identify which implementations are most efficient.

In other words, feature flipping is the ideal tool for performing A/B tests.

Monitor

Monitoring is the safety net that makes all this experimentation possible.

Some features are more critical than others for the business. When there are higher loads on the technical infrastructure (Black Friday for an e-commerce site for example), it makes sense to favor some over others.

Unfortunately, applications rarely have a built-in way to turn off, say, the “summary graphs query feature”… unless that feature is behind a feature flag.

We have already said that it is important to measure and to have metrics. When you have measurements, it is very easy to define when to turn a feature off, for example when “the average response time has been above 10 seconds for 3 minutes”.

This allows us to progressively degrade the application’s functionality in order to preserve the user experience for the essential features.
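A simplified sketch of such a rule (the window and threshold values mirror the 10-seconds-for-3-minutes example; a real system would feed this from monitoring, and would likely require the average to stay high for the whole window before switching, rather than disabling on the first bad average):

```python
from collections import deque

WINDOW_SECONDS = 180        # the "3 minutes" from the rule above
THRESHOLD_SECONDS = 10.0    # the "10 seconds" average response time

class KillSwitch:
    """Disables a non-essential feature when the average response time
    over a sliding window exceeds the threshold."""

    def __init__(self):
        self.samples = deque()  # (timestamp, response_time) pairs
        self.enabled = True

    def record(self, now: float, response_time: float) -> None:
        self.samples.append((now, response_time))
        # Evict samples that fell out of the window.
        while self.samples and now - self.samples[0][0] > WINDOW_SECONDS:
            self.samples.popleft()
        avg = sum(rt for _, rt in self.samples) / len(self.samples)
        if avg > THRESHOLD_SECONDS:
            self.enabled = False  # degrade gracefully: turn the feature off
```

The `enabled` field would back the `is_enabled` check from the feature-flag section, closing the loop between metrics and release control.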

Fault tolerance

This idea of “fault tolerance” is similar to the “circuit breaker” pattern — which Michael Nygard describes in his book «Release It!» — used to turn off a feature when a service does not respond.

If something obviously fails, the circuit breaker will be there to recover from the problem instead of blowing up in the user’s face. This kind of recovery usually takes about one second 🥳.

Now tell me, who will notice a one-second failure and the system’s recovery?
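A toy sketch of the pattern: after a number of consecutive failures the circuit “opens” and calls go straight to a fallback, then after a short cooldown the service is tried again. The threshold and cooldown values are illustrative; real implementations, like those Nygard describes, handle half-open states and timeouts more carefully.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    retry after a cooldown (the quick, near-invisible recovery)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 1.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # fail fast: don't hit the broken service
            self.opened_at = None      # cooldown over: try the service again
            self.failures = 0
        try:
            result = func()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            return fallback()
```

While the circuit is open, users get the fallback immediately instead of waiting on a dead service.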

And as you may have noticed, all failures are reported to the team that enabled the new initiative for the users. The team works on the fix, and next time the wheel spins again.

Severe degradation will automatically abort the experiment.

What we are seeing here is a boost in our agility to deliver features and to recover from any problem, and most importantly, we can understand the behavior of our feature as its usage grows.

As soon as we enable the feature, we can watch how our users use it, how the infrastructure is performing and how security is handling its concerns, detecting unforeseen issues before they degrade our whole system, simply by monitoring and limiting how users gain access.

Each new feature sits in its own box, and this box is wrapped so it can be analyzed as a safe experiment, isolated from breaking the previous system behavior.

Experiment

Confirm the impact of new initiatives with statistical rigor before declaring something “done” and moving on to the next.

Here we look to measure, in a controlled manner, the impact changes have on user behavior. The measurement is driven by a clear and well-defined objective: the experiment needs to state exactly how the hypothesis is going to solve a problem or improve some part of the user experience. This is an essential part of an experiment.

The quicker we manage to validate new ideas, the less time we waste on things that don’t work and the more time we have to work on things that make a real difference.

This is the only way to overcome the HiPPO (highest paid person’s opinion). It means there will be a new way to make choices, and it won’t come from the manager or director who thinks that simply releasing software as often as possible is the shot to take.

That which cannot be measured cannot be improved; without measurement, it is all opinion.

Data will be the navigator that guides us along an assertive path, one that is better understood and carries far greater guarantees of success than someone’s assumptions.

"With 5+ experiments that can support it, a customer theory then becomes a key insight." Brooks Bell, Founder and CEO of Brooks Bell

Experiments without automation tend to fail over time. Nobody will be available to do the regular analysis every time an experiment runs: in small companies, people will be busy with other tasks, and in big companies there will be so many experiments that even a team of 50 people couldn’t handle them.

Even when all the experiments are automated, a practice of evaluating them and giving feedback is essential. Check what the target of the experiment was and what analysis should be done. This practice must be rock solid within the company.

The output data of an experiment must be available to diverse teams, not just data scientists or other specialists. Biz, Dev, Ops, analysts and everyone else participating in this chain must understand the data and be ready to give an opinion or feedback on the experiment itself. With decentralized and understandable data, the decision now belongs to everyone.

A/B Testing

A/B testing is a product development method for testing a given feature’s effectiveness. You can use it to test, for example, an e-mail marketing campaign, a home page, an advertising insert or a payment method.

This test strategy lets you validate several variants of a single variable: the subject line of an e-mail, or the contents of a web page.

Like any test designed to measure performance, A/B Testing can only be carried out in an environment capable of measuring an action’s success.

Take the example of a subject line in an e-mail. The test must measure how many times the e-mail was opened to determine which contents were most compelling. For web pages, you look at click-through rates; for payments, conversion rates.

Where do you set your cursor between micro-optimization and major overhaul?

It all depends on where you are on the learning curve. If you’re in the client exploration phase, A/B testing can completely change the version being tested.

For example, you can set up two home pages with different marketing messages, different layouts and graphics, to see user reactions to both.

If you are farther along in your project, where a 1% variation in a conversion goal makes a difference, variations can be more subtle (size, color, placement, etc.).

How will you define the various sub-sets?

There is no magic recipe, but there is one fundamental rule: the segmentation criteria must have no influence on the experiment’s results (A/B testing = a single variable).

You can use a very basic attribute, such as subscription date or alphabetical order, as long as it does not affect the results.
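One common way to honor this rule is to assign users to variants with a deterministic hash of a neutral identifier, so the split is stable and uncorrelated with behavior. A sketch, with hypothetical experiment and user names:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("A", "B")) -> str:
    """Stable, behavior-independent assignment: hash the (experiment, user)
    pair and map it onto one of the variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

The same user always lands in the same variant for a given experiment, and different experiments get independent splits because the experiment name is part of the hash input.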

How do you know when you have enough responses to generalize the results of the experiment?

It all depends on how much traffic you are able to generate, how complex your experiment is, and the difference in performance across your various samples.

In other words, if traffic is low and results are very similar, the test will have to run for longer.
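One rough way to answer this question is a two-proportion z-test on the conversion counts: a |z| above 1.96 corresponds to roughly 95% confidence that the difference is real. A sketch (the traffic numbers are made up):

```python
from math import sqrt

def z_score(conversions_a: int, n_a: int,
            conversions_b: int, n_b: int) -> float:
    """Two-proportion z-test comparing the conversion rates of A and B."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err

# 5% vs 6% conversion with 10,000 users per variant
z = z_score(500, 10_000, 600, 10_000)
```

Here z comes out around 3.1, well above 1.96, so this difference would be significant; with only 1,000 users per variant the same rates give z around 0.98, meaning the test must keep running.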

Progressive release extends the promise of continuous delivery to improvements in business outcomes by inserting a customer-centric approach at the end of the release pipeline, where it can now be carried out safely in production.

This is the marriage of values that we seek, to deliver fast software, to have short feedback loops and thereby understand different types of metrics.

Metrics that measure user behavior are what bring the most value in the end, because they have an impact on the company’s global metrics.

There is a hard and long path to becoming a data-driven company, but if you think that this is the holy grail you’ve been looking for, you’re mistaken. There is one more step toward heaven, and it comes with Artificial Intelligence and automating many of these choices with AI. Amazing, but a subject for another round.

Last words…

Ultimately, the practices, values, tools and architectures for implementing a “DevOps project” are not the factors that will keep the company healthy and eventually put its foot on the Fortune 500 list.

Focusing on the customer with a broader view, understanding their needs, their demands and their experience, is a much more relevant path toward this goal.

DevOps is, as a matter of fact, the best weapon to operationalize this path: to deliver faster, to learn faster and to act even faster.

Being fast is essential, but it's not enough to be fast, it's necessary to move forward.
