SRE, noun. See also: confidence, trust.
“Move fast and break things”. A lot of engineers have learned this motto by heart and go by it in their everyday work. Unfortunately, not as many engineers (want to) remember the full version:
Move fast and break things, but always know how to fix them.
Building infrastructure services is not an easy task. These services are going to be the foundation for tens or even hundreds of other services in our companies. A foundation means confidence and trust: confidence in the platform and the core services it provides. One can’t build a product on a shaky foundation and groundless expectations that something is going to work unconditionally just because it is expected to. We all build our systems on top of the hard work of other engineers, be it Linux or Cassandra or Elasticsearch or Docker or whatever else. But I’m sure it’s no secret that this work is by no means going to magically suit our needs and use cases. Take Graphite, for example. It’s a great piece of technology used in many different environments by a great many businesses around the world, but at some scale it just no longer works.
Truth is, nothing will work out of the box after a certain scale. That’s the harsh reality. Our companies are growing so fast that quite soon they’ll outgrow the common expectations the majority of open-source systems were tuned for by default. Cassandra will exhaust I/O capacity and stall, Elasticsearch will pin the CPUs, HTTP-based RPC will saturate the network and we’ll start seeing packet drops on top-of-rack switches, even inside datacenters. Once the storage layer reaches a certain capacity, disks will fail every hour, we’ll see memory randomly flipping bits because of cosmic rays (I’m not kidding) and an extremely rare data race will be synonymous with a segfault. This will happen, and we can’t do anything about it.
With the scale our companies are going to reach soon, one in a billion chance means next Thursday.
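To put a rough number on that, here is a back-of-the-envelope sketch. The traffic figure of 2,000 requests per second is purely an assumption for illustration, and the function name is made up; the only thing borrowed from the text is the one-in-a-billion chance. It treats every request as an independent trial, so the mean wait until the first occurrence is 1 / (probability × rate).

```python
SECONDS_PER_DAY = 86_400


def expected_days_until_event(probability_per_request: float,
                              requests_per_second: float) -> float:
    """Mean waiting time, in days, until a rare per-request event first occurs.

    Assumes independent requests arriving at a constant rate, so the event
    shows up on average once every 1 / probability requests.
    """
    if not 0 < probability_per_request <= 1:
        raise ValueError("probability_per_request must be in (0, 1]")
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    events_per_second = probability_per_request * requests_per_second
    return 1 / events_per_second / SECONDS_PER_DAY


# Hypothetical load: 2,000 requests per second, one-in-a-billion failure.
days = expected_days_until_event(1e-9, 2_000)
print(f"{days:.1f} days")  # prints "5.8 days"
```

Under those assumed numbers the "impossible" event arrives in under a week, and every doubling of traffic halves the wait.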
We have to build confidence in our core platforms. It doesn’t mean that we have to make sure that things don’t break — this is a fairy tale, an impossible dream. Things will break more and more often. We need tools and processes that’ll help us be aware of what’s going on and be ready for these kinds of troubles, be confident about our actions and know exactly how to mitigate the impact of these unpreventable events.
While it might seem that our companies are still young and small, that won’t be the case for long. Multiple datacenters or tens of thousands of AWS nodes, custom protocols, in-house replacements for all common pieces of infrastructure: all of this requires a certain operational culture. One can’t just deploy shit to production. One can’t just pick some datastore and start writing data to it. I’ll say it again: the majority of those open-source products are not even remotely meant to be used at such scale.
We have to build operational expertise; that’s the basis for the trust one would expect from the platform. It should be clear that it’s impossible to hire somebody who knows for sure how to tune, let’s say, Redis for your company’s specific needs. This can only be learned in a real production environment. That’s hard and painful, and I’m quite sure that a lot of engineers won’t like it. This means launch checklists, mandatory load testing and Chaos Monkey scenarios. This means QA, acceptance tests, staging, and no root access to production. This means immutable infrastructure, isolation and strict resource control, for all I know.
Trust can’t be downloaded from GitHub or installed with apt-get, but it can be built by trial and error, by improving visibility and maintainability, by automation and testing and, most importantly, by accumulating and sharing expert knowledge and a culture of reliable and predictable infrastructure.
Tools shape the mind and processes shape cognition.