Burning Houses and Happy Operations

Steve Casale
Published in Built to Adapt
Apr 27, 2016

IT operations has never had it easy. Operations teams get noticed when things go wrong, which can snowball into long nights, lost weekends, and takeout dinners. The specialization is still associated with dive-and-catch saves and death marches, and caricature or not, that image carries a hard kernel of truth.

Operations need not be that way, says Pivotal’s Topher Lubaway, an operations person who plies his trade on the continuous part of Cloud Foundry, keeping the platform’s services up and running. Where many see burning houses and a stormy horizon, he and his team see and practice something else: using automation to get past the storm, and embracing failure instead of fearing it.

I recently caught up with Topher at Pivotal’s San Francisco offices, where we talked about fighting fires, DevOps, automation, and sleeping better at night. Below is an edited excerpt of our conversation.

You’ve compared the conventional way of doing operations to fighting house fires. Explain.

This is an industry term that we use a lot: we fight fires, or firefighting becomes the blocker for us and anyone in software operations. The analogy means that as software engineers we build houses or structures for other things, just like any engineering team would. But as the house gets bigger and bigger, not everything works out according to plan. Eventually, your job becomes addressing its problems, such as a fire in the kitchen.

Over time those fires expand. They become your entire workload, so you’re never building a new room or new feature that keeps customers happy. All you’re doing is putting out fires by fixing the existing structures in your house.

And that becomes everybody’s day job.

Exactly. Just keeping the systems running becomes a sink for 100% of your time. It makes it harder to do the things that make it easier for everyone to deploy, for production to remain stable, and for new features to be added. All the things we want to do that serve business needs and customers are prevented because we’re fixing all the things that are broken.

So how do typical operations teams deploy?

Fires have a lot to do with the expectations of many of the large corporations we work with. You’ll see teams deploy on Fridays because, in a perfect world, a Friday deployment that goes well causes the least disruption: fewer people are using the product, which increases the chances of deploying smoothly. In this world, failure in production is a possibility, but it’s not an option.

What really happens, though, is that deployments do not always go well, something does break, and you have an entire floor of really annoyed engineers working on Saturday, not seeing their kids. That burns them out, which makes products more likely to fail, so the next Friday deploy and the one after that create even more problems. The problem keeps compounding while they keep trying to fix it with the same approach. It’s really kind of crossing your fingers and hoping that the failures are not terminal.

What’s the difference with how Pivotal teams do operations, or DevOps?

We live our lives at Pivotal under the expectation that there will be problems, and that something can fail at any time in production. At its heart, DevOps, like Agile, means trying things in production over and over again, with the understanding that you’re going to mess something up, but that’s OK. We’ll get better, and we’re building a system that is designed to respond to failure, both in how we test and deploy, and in what we deploy on.

In some ways, the platform works because it is the sum of learnings from little failures, which lets us build something that constantly adapts.

Where does a platform like Cloud Foundry come into play here?

We’re very big on automation. At its core, a good platform enables better response times, which means happier developers. It has the tools to take on a host of necessary but lower-value processes, automatically. We use things like Concourse, a pipeline tool that lets any process or function be triggered automatically by any other. The system can detect and immediately fix a problem without any human intervention, then alert us that something happened that wasn’t supposed to. And if the problem does require intervention, we can make an immediate adjustment.
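The detect, fix, and alert loop Topher describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Concourse's API or Pivotal's tooling: `Check`, `Outcome`, and `run_checks` are hypothetical names, and the health probes and fixes are stand-in functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Check:
    """One automated check: how to detect a problem and how to fix it."""
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe
    remediate: Callable[[], None]   # hypothetical automated fix


@dataclass
class Outcome:
    alerts: List[str] = field(default_factory=list)       # fixed without a human
    escalations: List[str] = field(default_factory=list)  # a human needs to look


def run_checks(checks: List[Check]) -> Outcome:
    """Detect each problem, try the automated fix, then report what happened."""
    outcome = Outcome()
    for check in checks:
        if check.is_healthy():
            continue
        check.remediate()
        if check.is_healthy():
            outcome.alerts.append(f"{check.name}: failed and was fixed automatically")
        else:
            outcome.escalations.append(f"{check.name}: automated fix did not work")
    return outcome


if __name__ == "__main__":
    # Simulated services: the router recovers after a restart, the database does not.
    state = {"router": False, "database": False}
    checks = [
        Check("router", lambda: state["router"], lambda: state.update(router=True)),
        Check("database", lambda: state["database"], lambda: None),
    ]
    result = run_checks(checks)
    print("alerts:", result.alerts)
    print("escalations:", result.escalations)
```

Keeping alerts separate from escalations mirrors the distinction in the answer above: the team hears about every automated fix, but only gets pulled in when the automation could not resolve the problem.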

How does all this change how you understand and approach failure?

Every component of our system is designed for failure. We expect a product to fail at any time, and that’s OK. Our deployments are essentially planned for failure to some degree. We often say, “Hey, if you could shut down now that would be great, because we’re about to destroy six of your instances and bring up another six,” and the end user will have absolutely no idea that anything has happened.

That resilience also means when a VM goes down completely and unexpectedly because an IaaS fell down, or there was an outage in Texas, we say, whatever, we’re fine, because the system is resilient enough that there is no individual impact. There might be a user somewhere that experiences something small, but the system as a whole is incredibly resilient.
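One way to replace six instances without the end user noticing is to bring each replacement up before destroying the instance it replaces. The sketch below illustrates that ordering only. It is an assumption about how such a swap could work, not a description of Cloud Foundry's actual deployment mechanics, and `rolling_replace` and the instance names are hypothetical.

```python
from typing import List


def rolling_replace(old: List[str], new: List[str], min_serving: int) -> List[List[str]]:
    """Swap old instances for new ones, one at a time.

    Returns the list of serving instances after every step, so the caller
    can confirm that capacity never dropped below min_serving.
    """
    if len(old) != len(new):
        raise ValueError("this sketch assumes a one-for-one replacement")
    serving = list(old)
    history = [list(serving)]
    for old_id, new_id in zip(old, new):
        serving.append(new_id)  # bring the replacement up first
        history.append(list(serving))
        serving.remove(old_id)  # only then destroy the old instance
        history.append(list(serving))
        if len(serving) < min_serving:
            raise RuntimeError("capacity dropped below the minimum")
    return history


if __name__ == "__main__":
    old = [f"web-v1-{i}" for i in range(6)]
    new = [f"web-v2-{i}" for i in range(6)]
    steps = rolling_replace(old, new, min_serving=6)
    print("final instances:", steps[-1])
    print("fewest instances serving at any step:", min(len(s) for s in steps))
```

The cost of this ordering is one instance of temporary extra capacity during the swap, which is usually a fair price for never dipping below the number of instances users depend on.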

Why not just do it yourself? What’s the advantage of Cloud Foundry over the DIY road?

Cloud Foundry is deployed across the world. It is an open source project, and we strongly believe in the open source community of committed developers who learn about the platform and give us feedback all the time on how to make it better for them and for us. Our widely adopted commercial distribution, Pivotal Cloud Foundry, builds on that. This is a well-documented platform that more and more people everywhere understand.

Many companies build their own cloud operations structures, and even if the result is the most beautiful snowflake in the world, it doesn’t help when they hire someone. That person probably has to spend six months getting used to the internal product. And even assuming it is a beautiful castle of impeccable design, that still doesn’t mean anyone outside of a relatively small number of people will know anything about it.

How does DevOps change what operations teams build and how they work together?

It gives us and others time to build things we wish we could build.

If you spend most of your time fighting fires, all you build is a better fire extinguisher. We can build things developers are interested in building to make their lives easier.

That makes everyone happier. In fact, the focus of my team is to give ourselves less of a workload. We are always striving to do less of some kinds of work through automation.

The entire team is behind the process, and is built to expect failure without it being a stressful or weird situation.

Does this approach help organizations keep good operations people?

I think part of the reason it’s hard to find good operations engineers is that they leave. If you’re in a typical operations job you get into trouble for all kinds of things that aren’t really your fault. You’re paranoid that any crash or instance of failure in production is going to be blamed on you. A contractor we had in once told us of a colleague who was fired for an outage over what was essentially a typo. Can you imagine living your life knowing that every time you press enter you could potentially get fired? That’s insane!

Here we have retros about a production deploy that did terrible things. We don’t point the finger, and we don’t blame. It’s not about that. It’s about understanding how that typo (assuming that it was a typo) was allowed to get into production. We understand that you had the best intentions, and we work together as a group to iterate on this as something to avoid in the future.

How do you sleep?

I go home at six every night. I have never worked after six, and I come into work refreshed and ready to make my life, and the lives of everyone who uses Cloud Foundry, easier.

We work together on our team to make sure we all go home at six, and have sustainable lives. Sustainable is a word we talk about a lot in my group. You need to have a sustainable life in order to be able to contribute at your fullest to making our products better.

Change is the only constant, so individuals, institutions, and businesses must be Built to Adapt. At Pivotal, we believe change should be expected, embraced and incorporated continuously through development and innovation, because good software is never finished.

Steve Casale
Built to Adapt

I write to keep the wolf at the door. I’m a scribe and editor at Pivotal.