[RELIABILITY] How to build a top-performing operations team?

Daniel Moldovan
Published in DevOps Dudes
5 min read · Aug 10, 2021
Image showing Bear Grylls pointing with his finger. Text at the top of the image states “when the operations team is overloaded so you make everyone part of your operations team”. Text at the bottom of the image states “Improvise. Adapt. Overcome.”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLA) for your clients.

And now you need to operate it. To ensure that when things break, issues are fixed as quickly as possible. To ensure that your service is properly scaled for expected traffic patterns. Basically, to ensure the service is up and provides the best client experience.

You start looking into how to build a great team to operate your product. You know about operations. You have a team of system administrators maintaining your current IT infrastructure: code repositories, LDAP servers, etc. You read about Site Reliability Engineering. You read about DevOps.

But what is everyone really talking about? Is there really any difference between what operations teams used to do and what reliability engineering teams do? Basically, what should your operations team do?

Strategy 1: Build an infrastructure-oriented team concerned with infrastructure health

  • Ensure the team is concerned only with keeping the infrastructure up and running. The relationship between infrastructure and business should be indirect and unspecified. If the infrastructure is up, then everything is assumed to be ok. Ignore the application running on top of the infrastructure. Wonder why clients complain to you about a bad experience when the infrastructure is running within parameters.
  • Consider each piece of infrastructure a pet. Each piece of infrastructure is precious and needs a gentle human touch. Infrastructure configuration is like brain surgery. One host might need the OS kernel recompiled with certain flags. Another might be from a vendor known for running the CPUs too hot, needing extra care when assigning load to it. Wonder why you are so slow in scaling infrastructure and cannot easily react to client demands.
  • As each piece of infrastructure is special, you should add a lot of low-level alerts. Alert when CPU/RAM/disk usage is high. Alert when any machine is running at a high temperature. Alert when the secondary switch on the rack fails. Alert when the rack power supply loses redundancy. Wonder why, as your product’s scale grows, your on-call team is using all its time acknowledging infrastructure alerts.
  • Ensure your team is focused on fighting everything that hurts the infrastructure. And what usually hurts infrastructure? The product code of course. Hear a lot of: “engineering can’t write code that does not have memory leaks”, “engineering can’t write code that does not fill up storage”, “engineering can’t write code that balances load properly”. If only engineering would stop writing code, the infrastructure would be fine. Wonder why operations and engineering do not seem to get along.
  • Because software is usually what hurts the infrastructure, make sure the team understands that software is the enemy. This means automation is also an enemy. Automation and software cannot be trusted. Manual intervention is the only thing that can be trusted. Humans should have full control and knowledge over how things should be done. Wonder why it is so hard to scale operations as your product grows.

Strategy 2: Build a reliability-oriented team concerned with client experience

  • Ensure the team understands that its job is not to keep the infrastructure up. It is to ensure the client experience is always within specifications. Your team might manage a client-facing service. But it might also manage an internal platform or product without direct connection to clients. It does not matter. The team’s job is to understand the business and how everything ultimately serves the client. Wonder how all the teams are focused on providing the best client experience.
  • Ensure your team works to: (i) reduce how often problems occur (incidence rate), (ii) reduce time to detect problems, (iii) reduce time to recover from problems, and (iv) limit the impact of problems (blast radius). This means the team must understand the business and how things running on top of the infrastructure are used by clients. The team must understand how client usage patterns affect the infrastructure. How software performance and infrastructure relate. Wonder at the high client satisfaction you are noticing.
  • Achieve reliability through dedicated development environments. Ensure development environments can be created on the fly by developers. So new features can be developed and tested in isolation without impacting others. Ensure development environments can be scaled on demand. So engineers can performance test code before it gets into production. Wonder at how much experimentation is possible during development.
  • Achieve reliability through CI/CD. Ensure continuous integration mechanisms are deployed. So that new code is continuously validated against tests to find bugs in the added code and in the communication with other services. Ensure any new feature is validated through a battery of unit, integration, and end-to-end tests. To decrease the chance of bugs surfacing in production. Ensure continuous delivery mechanisms are deployed. To ensure new code gets into production as soon as everything is tested and validated. To ensure new features get to be used by clients as soon as possible. To ensure changes released to production are small, thus limiting the blast radius of any potential issue. To enable easier and faster rollback. Wonder at your high uptime and high client satisfaction.
  • Achieve reliability through monitoring, logging, and alerting. Ensure the code has logging added and configured to the correct levels. So when things go bad, there is sufficient information to quickly detect the problem. To enable debugging production issues or client complaints. Ensure the code is instrumented with enough metrics to provide visibility into the product’s behavior. Ensure that metric statistics such as percentiles are available, to quickly determine when the product is behaving abnormally. Ensure alerts are defined so you get notified when client experience is impacted. When an alert fires, there should be a 90% chance that there is ongoing or imminent client impact. Wonder how fast you can detect and trace problems to their root cause.
  • Achieve reliability through standard operating procedures. A traditional operations team might be able to function as independent individuals, each fixing infrastructure pieces in their own way. A team focused on client experience usually can’t. Maintaining client experience requires consistency. Ensure standard operating procedures are documented. Ensure all team members have been onboarded on these procedures and that they follow the same operating procedures at all times. So that any on-call team member can recover things quickly and correctly. So that the client experience is kept consistent across problems and incidents. So that client experience does not depend on the availability of a few highly trained individuals. Wonder how fast you recover from problems.
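
To make the point about percentiles and client-impact alerts more concrete, here is a minimal Python sketch. It is an illustration, not a recommended implementation: the function names `percentile` and `should_alert`, the 99th percentile default, and the latency target are all assumptions made for this example. The percentile uses the simple nearest-rank method; a real monitoring system would typically compute this for you.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def should_alert(latencies_ms: Sequence[float],
                 target_ms: float,
                 pct: float = 99.0) -> bool:
    """Hypothetical alert rule: fire only when the chosen percentile of
    client-facing latency exceeds the target, i.e. when clients feel it."""
    return percentile(latencies_ms, pct) > target_ms


# 98 fast requests and 2 very slow ones within one evaluation window.
window = [100.0] * 98 + [900.0, 950.0]
mean_ms = sum(window) / len(window)
p99_ms = percentile(window, 99)
```

In this made-up window the mean latency is 116.5 ms, which looks healthy against a 500 ms target, while the 99th percentile is 900 ms. An alert on the percentile catches the clients having a bad experience; an alert on the average, or on host CPU, would not.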

So. Can one team do all of the things in Strategy 2? What team would it be? What skills would it need?

I do not view reliability as a job code. And finding people who can do all of the above is almost impossible anyway. It is easier to see reliability as a mindset. A “hat” that anyone in engineering and operations teams can pick up when needed. So, present the reliability hat to your team. Let team members pick it up when suitable. Have engineering work one sprint on monitoring. Have a system administrator work one sprint instrumenting application code with metrics. Have both engineering and operations work on standard operating procedures. Rotate the hat among teams and members. Make everybody involved in your product a member of your top-performing reliability-oriented team. Get a strong team. Get a strong product. Get strong client experience.


Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.