DevOps Culture

Maria Valcam
Published in Quiqup Engineering
Aug 22, 2018

Before you start reading this blog post, I want you to consider this question: “Why do you need DevOps?” It may help to first answer “What is DevOps?”

If you think DevOps are just the magicians that handle your infrastructure and services, you should continue reading this blog post to learn what “DevOps” really means.

Why DevOps

DevOps was born because Devs wanted to ship their features fast to lead in the marketplace and make customers love their product. Ops and QA, on the other hand, prioritized stability above everything else, and new features mean instability. This created a big conflict within companies.

We needed to work together towards the same goal. Companies began to study how to achieve this and they copied some of the Lean principles to eliminate bottlenecks and enhance productivity. From there, concepts like Agile, continuous delivery and DevOps were born.

Note: Lean is a management philosophy derived from the Toyota Production System, which helped Toyota grow from a small company into one of the world’s largest automakers.

So what is DevOps?

In the State of DevOps report 2017, DevOps is described as:

DevOps is an understood set of practices and cultural values that has been proven to help organizations of all sizes improve their software release cycles, software quality, security, and ability to get rapid feedback on product development.

The same report also compares high-performing companies with lower-performing ones. It shows that high performers increase speed (that is, they reduce lead times) while also improving stability.

Keep in mind that DevOps is not a team; it is a culture, so engaged leadership is essential for successful DevOps transformations.

DevOps Principles

As DevOps, our main goal is to help make software delivery fast and reliable. Gene Kim, John Willis, Jez Humble and Patrick Debois defined 3 ways to reach this goal in their book The DevOps Handbook.

1 — Flow of work: This principle imagines a continuous release of features to our customers. Our goal here is to reduce our lead time. Practices that help reduce lead times are:

  • Eliminate constraints: All the work in the workflow should be identified. In general, this includes: design (queue, analysis, work and approvals), development (queue, estimation, development, tests and approvals), QA (queue, automated tests and manual tests) and deployment (queue, ops work, approvals, deployment, verification). Once we have the whole workflow outlined, bottlenecks can be determined and reduced (or, better yet, eliminated).
  • Continuous delivery: Makes deployments part of the team’s daily work. It is usually achieved by using continuous integration, reduced batch size, easy rollback etc. Its main objective is to make releases less scary, so that we can deliver frequently and get quick feedback on what users care about.
  • Reduce Waste: This includes decreasing the incidence of partially completed work, extra processes with no value, unnecessary features, task switching, blocked tickets, motion, bugs (the longer it takes to find a bug, the more expensive it becomes to fix it), manual work (or toil) and heroics.
  • Improve our daily work: By accumulating problems and technical debt, we can end up just performing workarounds. Mike Orzen observed that “Even more important than daily work is the improvement of daily work.” At least 20% of all development and operations cycles should be invested in refactoring, automating work and non-functional requirements (NFRs).
  • Integrate designated Ops and QAs into dev teams: This way you end up with independent teams that don’t need to open tickets to other teams to complete a feature. The structure of the teams is highly important. According to Conway’s Law, “organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations”.
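
The "eliminate constraints" practice above can be sketched in a few lines of Python. This is only an illustration, not a method taken from The DevOps Handbook: the `Stage` class, the helper functions and all the hour values are hypothetical, and only the stage names come from the workflow described above. The idea is to record how long a feature waits and is worked on in each stage, then look at which stage contributes most to the lead time.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    queue_hours: float  # time a ticket waits before anyone picks it up
    work_hours: float   # time actually spent working on the ticket

    @property
    def total_hours(self) -> float:
        return self.queue_hours + self.work_hours


def lead_time(stages: list[Stage]) -> float:
    """Total hours from entering the first stage to leaving the last one."""
    return sum(stage.total_hours for stage in stages)


def bottleneck(stages: list[Stage]) -> Stage:
    """The stage that contributes the most hours to the lead time."""
    return max(stages, key=lambda stage: stage.total_hours)


def flow_efficiency(stages: list[Stage]) -> float:
    """Share of the lead time spent working rather than waiting in a queue."""
    return sum(stage.work_hours for stage in stages) / lead_time(stages)


# Hypothetical numbers for a single feature, for illustration only.
workflow = [
    Stage("design", queue_hours=16, work_hours=8),
    Stage("development", queue_hours=8, work_hours=24),
    Stage("QA", queue_hours=40, work_hours=8),
    Stage("deployment", queue_hours=24, work_hours=2),
]

print(f"lead time: {lead_time(workflow)} hours")
print(f"bottleneck: {bottleneck(workflow).name}")
print(f"flow efficiency: {flow_efficiency(workflow):.0%}")
```

With these made-up numbers, most of the time in the slowest stage is spent waiting in a queue rather than working, which is the kind of constraint this principle asks us to reduce first.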

2 — Feedback: In order to build a resilient system, we should increase the information flow from as many areas as possible (sooner, faster and cheaper).

  • Create telemetry and analyse it: We should collect events and metrics from our system at the business, application, infrastructure, client software and deployment pipeline levels. This telemetry should be analysed, and alerts should be raised when a metric deviates from its expected pattern.
  • Test releases and probe hypotheses: Enable different deployment strategies (canary deployments, A/B testing and blue-green deployments).
  • Continuous integration: Each integration is verified by an automated build that should detect errors as quickly as possible. This gives devs quick feedback so they can fix bugs as early as possible. Devs should be able to run unit tests, acceptance tests and integration tests, with test coverage visible to everyone. We could also run performance tests by executing the same integration test multiple times in parallel. Static code analysis for security or clean code can be automated as well.
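
As a rough sketch of what "alert when telemetry does not follow its pattern" can mean, the Python snippet below flags a value that sits far from the recent mean. It is one simple approach among many and assumes the metric is roughly stable over time. The function name, the threshold of 3 standard deviations and the sample data are all assumptions made for illustration; a monitoring tool would normally do this for you.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """True if `latest` is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a pattern
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # The metric has been perfectly flat, so any change is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical business-level metric: orders placed per minute.
orders_per_minute = [52, 48, 50, 51, 49, 53, 47, 50]

print(is_anomalous(orders_per_minute, 51))  # close to the usual values
print(is_anomalous(orders_per_minute, 5))   # sudden drop, worth an alert
```

The same check can be applied at any of the levels listed above, for example to request latency at the application level or to build duration at the deployment pipeline level.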

3 — Continual Learning and Improvement: Focuses on organizational knowledge and aims to teach people how to think, because changing behaviour creates culture. To achieve this we need:

  • Safety culture: This type of culture makes it easier to give your boss bad news without fear, so that everyone can focus on what caused a problem instead of who caused it. The boss should also be able to share bad news: workers are problem solvers, but they can only help if they are told about the problem. We should also hold blameless postmortems to encourage learning rather than punishment, avoiding non-solutions like “be more careful”. Google uses something called an “error budget” that allows a certain level of errors to happen, so teams can test hypotheses and learn from their failures until the budget is spent.
  • Test resilience: One of the main examples of this is Netflix’s Chaos Monkey, which ensures resilience by testing it. Netflix’s idea is that practice makes perfect, so the only way to become better at handling failure is by breaking things.
  • Convert local discoveries into global improvements: Achieved by adding telemetry or creating documentation for them. Postmortems should be available for everyone to read and learn from.
  • Plan training: This could be done within the company, where a team teaches new concepts or skills to other people, or through access to courses and conferences, or by meeting weekly to read and discuss a book.
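
To make the error budget idea more concrete, here is a minimal Python sketch of the arithmetic. It is a simplified reading of the concept, not Google's implementation: the function names, the 99.9% availability target and the request counts are assumptions chosen for illustration.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the period without missing the target.

    `slo` is the availability target as a fraction, e.g. 0.999 for 99.9%.
    """
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Failures we can still afford; a negative value means the budget is overspent."""
    return error_budget(slo, total_requests) - failed_requests


def can_take_risks(slo: float, total_requests: int, failed_requests: int) -> bool:
    """While budget remains, the team can keep shipping and experimenting."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Hypothetical month: a 99.9% target over one million requests.
print(error_budget(0.999, 1_000_000))
print(budget_remaining(0.999, 1_000_000, failed_requests=400))
print(can_take_risks(0.999, 1_000_000, failed_requests=1200))
```

Once the budget is spent, the usual reading is that the team shifts its focus from new features to stability until the budget recovers.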

Conclusion

At Quiqup, we meet some of these good practices: we have lightning talks, automated pipelines, blog posts, Datadog and Tableau. But we still have work to do when it comes to resilience testing, autonomous teams, A/B testing, reducing manual testing and so on.

It’s totally fine, and completely normal, not to be perfect, but we should always work towards improving our practices and culture. Different teams and leaders promote some of these practices, and the DevOps team has tried to meet some of them, but a company needs to work together as a whole to meet these objectives!

> Help us become better as a company! We are looking for a Lead DevOps Engineer who can ship all these ideas within Quiqup. Follow this link to learn more: https://www.linkedin.com/jobs/view/devops-engineer-at-quiqup-829932209

Originally published at stories.quiqup.com.

Maria Valcam
Quiqup Engineering

Engineer with an MBA. I am interested in Business, Diversity and Engineering.