I have trust issues with automation.

Jul 8, 2019

Occasionally, I have trust issues with automation.

Don’t get me wrong — I’m big on using automation. But when it comes to trusting automation running in production? That’s a high hurdle for me to clear.

If I don’t understand the automation I’m running, I don’t trust the automation I’m running.

Empire builders — not only the Romans

Remember the empire builders of days past? You know the types — they’d have legions of shell scripts that were complex and nobody knew what those scripts did, except them (naturally). They skimped on documentation — after all, they knew what their work did.

We let them get away with it because at the end of the day they got results. Things worked, automation ran, tasks were achieved. They took care of their empire so we could take care of ours.

Then they left. And as sure as the sun rises, what they wrote is now broken — probably because it required some very careful nurturing and management to keep functioning correctly.

Now some poor soul has to try to unpick the mess that was created.

It doesn’t get any better with modern automation tooling either — whether it’s Ansible playbooks, Puppet manifests, Chef recipes, Jenkins pipelines, or anything else.

YAML does not make for understanding

We’re replacing our custom shell scripts with custom automation in Ansible, Chef, Puppet, etc., but it’s a mistake to think that because these tools are easier to read, the automation they perform is easier to understand.

You can write some utterly beautiful automation with tools like Ansible. Simple, straightforward, and reads well. Poetry in motion.

Conversely, I’ve seen absolute rats’ nests of automation written with Ansible. An unholy collection of playbooks, roles, and undocumented variables and tags that made me dread opening the repository.

Modern automation tools are designed for scale — they bring the capability to modify tens, hundreds or thousands of servers during a single run. That’s a massive ‘blast radius’ if anything goes wrong — the recent Google Cloud outage is a good example of automation run a bit too far. And if it does go wrong, do you know enough about what that automation was doing to perform corrective actions?

Perhaps you have a general idea, but let’s take Ansible for example. Are there tags sprinkled throughout the plays and roles? Do you know what they’re all for? Are they all documented? What about their variables? Are they all documented?

It’s the corner cases I don’t know about that scare me. Running that against who knows how many servers in production…well, it’s nerve-wracking.
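One way to start answering the questions above about undocumented variables is a quick audit script. What follows is a minimal sketch, not a real Ansible tool: the function name `undocumented_variables` and the sample data are hypothetical, and it assumes variables are referenced with Jinja-style `{{ name }}` expressions and documented by name somewhere in a text file such as a README. It will also flag loop variables and gathered facts, so treat the output as a list of questions to ask, not a verdict.

```python
import re
from typing import Set

# Matches the first identifier inside a Jinja-style "{{ ... }}" expression.
VAR_REF = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def undocumented_variables(playbook_text: str, docs_text: str) -> Set[str]:
    """Return variable names referenced in the playbook text but never
    mentioned as a whole word in the documentation text."""
    referenced = set(VAR_REF.findall(playbook_text))
    return {
        name
        for name in referenced
        if not re.search(rf"\b{re.escape(name)}\b", docs_text)
    }


if __name__ == "__main__":
    # Hypothetical playbook fragment and README contents.
    playbook = (
        'name: "{{ pkg_name }}"\n'
        "state: \"{{ pkg_state | default('present') }}\"\n"
    )
    readme = "pkg_name: the package to install"
    print(sorted(undocumented_variables(playbook, readme)))
```

The same idea extends to tags: collect every tag used in the plays and roles, then check each one against the documentation.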

Automation for the people!…that come after you

When writing automation, remember: you’re not just doing it for you. You’re doing it for the people who come after you.

Will they understand what you’re doing? When to use it? How to use it? How likely is it that they will accidentally misuse it?

Here are my recommendations to improve the longevity and trustworthiness of your automation:

Document it well, especially for automation that is run infrequently. The less frequently we run automation, the more likely it is we’ve forgotten all the nuts and bolts about how it works. Most importantly we will have forgotten the why of it — why it was written in the first place.
Never automate something you don’t understand. If you do not understand the manual process that you are automating, learn it until you do.
Watch the scope creep. Try to avoid doing multiple things at once with a piece of automation. Where possible, follow the Unix philosophy: do one thing, do it well. Thoroughly document interdependencies, variables, tags, toggles, and so on.
Understand that not everything needs to be automated. Every piece of automation you keep around, you need to maintain. Otherwise, what was the point in keeping it? Consider the cost-benefit relationship before you crack open the editor to create yet another playbook that will end up in a repository never to be run again. If you write it as a once-off to save you time — decide if you need to keep it at all. If you’ll never use it again, throw it away.
Take responsibility. When you write a piece of automation, take responsibility for it. That includes keeping it and its documentation up to date. Don’t toss it into a repository as a ‘commit and forget’.
Write your automation defensively. I am a big fan of the Ceph playbook that handles cluster updates (an oddly specific example, but bear with me). It operates over the cluster serially. It checks the cluster health before and after every host. It refuses to continue if the health isn’t appropriate. If a host fails, the blast radius of that failure is limited to the failing host. It’s defensive at every step, and consequently, it works very well.
Check for preconditions before you start your automation. Verify system health as you proceed. Never make any assumptions about state — check it before continuing. Fail early and fast if something isn’t right.
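The defensive pattern described in the last two recommendations (work serially, check health before and after every host, refuse to continue if something is wrong) can be sketched in a few lines. This is an illustration of the idea only, not the Ceph playbook's actual logic: `rolling_update`, `RolloutAborted`, and the two callbacks are hypothetical names, and the simulated cluster stands in for whatever real health check and update step you have.

```python
from typing import Callable, Iterable, List


class RolloutAborted(Exception):
    """Raised when a health check fails and the rollout must stop."""


def rolling_update(
    hosts: Iterable[str],
    cluster_is_healthy: Callable[[], bool],
    update_host: Callable[[str], None],
) -> List[str]:
    """Update hosts one at a time, checking cluster health around each one.

    Returns the list of updated hosts. Raises RolloutAborted as soon as a
    check fails, so no further hosts are touched after a failure.
    """
    updated: List[str] = []
    for host in hosts:
        # Precondition: never assume state, check it before touching anything.
        if not cluster_is_healthy():
            raise RolloutAborted(
                f"cluster unhealthy before {host}; updated so far: {updated}"
            )
        update_host(host)
        updated.append(host)
        # Postcondition: confirm this host did not break the cluster.
        if not cluster_is_healthy():
            raise RolloutAborted(
                f"cluster unhealthy after {host}; updated so far: {updated}"
            )
    return updated


if __name__ == "__main__":
    # Simulated cluster (hypothetical data): updating "node2" breaks health.
    state = {"healthy": True}

    def fake_health() -> bool:
        return state["healthy"]

    def fake_update(host: str) -> None:
        if host == "node2":
            state["healthy"] = False

    try:
        rolling_update(["node1", "node2", "node3"], fake_health, fake_update)
    except RolloutAborted as exc:
        print(f"Stopped: {exc}")
```

In the simulated run, the rollout stops right after node2 and never touches node3. That is the point: the blast radius of a failure is one host, not the whole cluster.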

Automation: play the long game

Our modern automation tooling enables us to manage our systems at scale. The simplicity of just running a piece of automation can lull us into a false sense of security, a false sense of understanding.

By thinking carefully about what we’re automating and why, writing defensively, and accepting that it’s OK not to automate everything, we can ensure that the automation we do have stands the test of time.

Written by Adam Goossens