SRE, noun. See also: confidence, trust.

Move fast and break things, but always know how to fix them.

Building infrastructure services is not an easy task. These services are going to be the foundation for tens or even hundreds of other services in our companies. Foundation means confidence and trust: confidence in the platform and in the core services it provides. One can’t build a product on a shaky foundation and on the groundless expectation that something will work unconditionally just because it is supposed to. We all build our systems on top of the hard work of other engineers, be it Linux, Cassandra, Elasticsearch, Docker or whatever else. But I’m sure it’s no secret that this work is by no means going to magically suit our needs and use cases. Take Graphite, for example. It’s a great piece of technology used in many different environments by a great many businesses around the world, but at some point it simply stops being enough.

At the scale our companies are going to reach soon, a one-in-a-billion chance means next Thursday.
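To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The function name `days_until_expected` and the request rates are hypothetical, picked only for illustration, and the calculation assumes every request fails independently with the same probability.

```python
SECONDS_PER_DAY = 86_400


def days_until_expected(probability: float, events_per_second: float) -> float:
    """Average number of days between occurrences of an event that
    happens with the given probability on each independent request."""
    seconds = 1.0 / (probability * events_per_second)
    return seconds / SECONDS_PER_DAY


# Hypothetical traffic levels, not measurements from any real system.
for rate in (10, 1_000, 100_000):
    days = days_until_expected(1e-9, rate)
    print(f"{rate:>7} req/s -> one-in-a-billion event about every {days:.2f} days")
```

Under these assumptions, a service handling a thousand requests per second meets its one-in-a-billion event roughly every week and a half, and a hundred times more traffic brings it down to a few hours.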

We have to build confidence in our core platforms. That doesn’t mean making sure things never break; that is a fairy tale, an impossible dream. Things will break more and more often. We need tools and processes that help us stay aware of what’s going on and be ready for this kind of trouble, so that we can be confident in our actions and know exactly how to mitigate the impact of these unpreventable events.



