DevOps — where’s the light switch?

Robert Sweetman
Published in Version 1
Sep 8, 2023

One of the fun aspects of being a DevOps engineer is that there’s a huge amount of surface area to get a handle on. There’s AWS, Azure, networking, OS-level stuff, Python, Bash, PowerShell, Terraform, Ansible and monitoring, and we’ve not even mentioned containers, let alone serverless.

Now, let’s not deny that continually getting to mess with new things is great, but at the same time you’re always smashing your head into something…

Over the last few months I’ve been trying to come up with some generic strategies to shorten the amount of time it takes me to achieve a goal.

I certainly began my career writing code and expecting it to work… I’m not embarrassed to admit that while this is “an” approach, it definitely has side effects. Among these are self-doubt, frustration and off-the-scale imposter syndrome.

Here are some things that I try to consider before I start typing anything, especially if what I’m doing involves sending messages or values for parameters across any sort of boundary.

Let’s avoid chucking things into the void until we’ve found the light switch!

Where are the logs?

So many times when you’re trying to automate something, from installing a program using Ansible to running a command “at” AWS, you’re effectively firing off random commands into a dark room.

Yes, you should ultimately be able to see “thing” appear on the machine or in the UI, but what if it doesn’t? Going round and round a loop of (somewhat) randomly trying things can feel like you’re doing something, but it’s not going to lead to calmly moving forward…

It’s worth investing the time to understand where the results of your vague, hand-wavey actions are going to end up being logged. This will avoid a huge amount of frustration on an ongoing basis. It won’t be the last time you attempt this particular thing, so you might as well figure it out once and for all.

This goes double for when the implementation/feedback loop is longer than a couple of minutes and yes, I’m looking at you, AWS AMI builder…
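Concretely, here’s a quick sketch of the first places I’d check (assuming a systemd-based Linux box and the AWS CLI v2; the service and log-group names are made up):

    # Service and system logs on the target machine itself
    journalctl -u my-service --since "10 min ago"
    tail -f /var/log/syslog /var/log/messages 2>/dev/null

    # Cloud-side: tail a CloudWatch log group while things happen
    aws logs tail /aws/imagebuilder/my-pipeline --follow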

Can you shorten the feedback loop?

Waiting for things to resolve can be painful: the longer it takes for something to occur, the more painful it’s going to feel, and in some ways it’s an inhibitor of changes and improvements.

It’s challenging to be enthusiastic about small wins if each go-round takes hours. It may not be ‘strictly’ broken, but if making it better involves waiting 90 minutes to see if something worked, and rolling it back takes (nearly) as long, that’s additional friction to overcome.

If there’s no way to speed things up, is there a way of validating changes in a more easily controlled and faster environment?

One thing I spent time on was building a RHEL 6 container, ’cause throwing Ansible code at that and seeing it fail was way faster than ssh-ing onto a machine and trying not to break a live instance.
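Roughly, the pattern looks like this (the image, container name and playbook are placeholders, and your Ansible version needs to support whatever Python the image ships with):

    # Spin up a throwaway container to act as the test target
    docker run -d --name rhel-test centos:6 sleep 3600

    # Point Ansible at it via the docker connection plugin (the trailing
    # comma turns the single container name into an ad-hoc inventory)
    ansible-playbook -i 'rhel-test,' -c docker site.yml

    # Iterate, fail fast, throw it away, repeat
    docker rm -f rhel-test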

Obviously, you still need to test what you’ve done, but quick, local iterations to get something close enough are much better than building a Tower of Babel in the cloud and sacrificing a chicken while hoping that, one to two hours later, you won’t discover you missed a closing bracket somewhere.

Did ‘thing’ cross the boundary?

Often you’re passing values (tags, parameters, messages etc.) from something in one place to be picked up somewhere else, hoping it’s all got the correct permissions.

First, see if there’s a way of establishing that the target of whatever you’re doing is even contactable from wherever you’re starting.

Next, you can see whether sending something simple is possible. Rather than ‘Can I establish a connection, run this complicated command, get a value and write it back to a log’, maybe first try writing a value to a file on the target and making sure it’s been updated.

Now you know you can access the target. You’ve vastly decreased the area in which “something” might have failed while ruling out networking and permissions as issues — nice!!
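As a sketch, that staged approach might look like this (the hostname, port and paths are all invented):

    # 1. Is the target even reachable from here?
    nc -zv target-host 443

    # 2. Can I do the simplest possible thing on it?
    ssh target-host 'echo boundary-test > /tmp/boundary-test && cat /tmp/boundary-test'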

If something still isn’t working, did the message you’re trying to send even leave the local environment? Are you posting something to a port? Is that port even there/listening? Next, did whatever is supposed to pick it up at the other end even try to? Can it pick up anything at all?

Basically, this whole approach can be summed up as splitting the journey into testable steps.

Debugging: don’t change everything everywhere all at once

As The Hitchhiker’s Guide to the Galaxy says: “Don’t Panic”.

Take a deep breath, try to come up with a reasonable theory as to what might be wrong and change one thing. See if it works.

If it doesn’t work, make sure you can see any error messages or logs that explicitly tell you why it didn’t.

At this stage, if you still don’t know where command outputs are going, go and figure that out. You don’t yet know whether it’s going to take five iterations/changes to fix something or fifty.

Sooner or later, as you get more frustrated or your PM asks for the nth day how that item is getting on, you’re going to have to figure out where/how to see this sort of information.

Go find the light switch! Add a new light switch if you have to.
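In a shell script, ‘adding a light switch’ can be as simple as this (the log path is arbitrary):

    # Fail fast and echo every command as it runs
    set -euxo pipefail

    # Send everything (stdout and stderr) somewhere you can read it afterwards
    exec > >(tee -a /tmp/debug-run.log) 2>&1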

READ error messages

Yes, I am clearly guilty of this, especially in the early days. Your senior-level colleagues are (mostly) not omniscient beings.

They do know, with a humility born of years of suffering, that reading error/log output is a fabulously quick and easy way of understanding what’s not working and why.

Also, don’t skim them. READ them, slowly.

READ the docs

If you’re trying to implement something that’s new to you there’s no getting around this. Please, please go read the docs. There might even be something in there that tells you whether what you’re attempting to do will even work at all! Of course, this has NEVER happened to me… ever, honest…

Run commands/scripts directly on the target first

It’s not ‘always’ possible but do try to run whatever command you intend to automate on the actual instance or a similar test instance first.

This is absolutely going to highlight odd quirks and things you might not have considered as part of whatever it is you are trying to automate.
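For example, something like this (the instance ID and package name are made up):

    # Get onto the actual box and run the command by hand first
    aws ssm start-session --target i-0123456789abcdef0

    # ...then, on the instance, run exactly what you plan to automate
    sudo yum install -y some-package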

Use hardcoded values first (but NOT credentials!) if needed and run things locally. Then you’ll quickly find out either that the variable/value doesn’t appear in the environment from elsewhere, or that something else your script is ‘supposed’ to be doing crashes ’cause you’re missing a dependency.
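A minimal sketch of what I mean (the variables and script name are invented):

    # Hardcode the inputs first (never credentials!) to prove the logic
    REGION="eu-west-1"        # later: fetched from metadata or a parameter store
    APP_VERSION="1.2.3"       # later: injected by the pipeline

    ./install-thing.sh "$REGION" "$APP_VERSION"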

You get a lot more information when something fails right in front of you as opposed to trying to divine it magically from some console output.

Debug/verbose flags are your friends!
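Most of the tools mentioned above have a verbosity dial; the playbook, host and script names below are placeholders:

    ansible-playbook -vvv site.yml       # connection-level detail from Ansible
    TF_LOG=DEBUG terraform plan          # Terraform's internal logging
    curl -v https://target-host/health   # the full request/response exchange
    bash -x ./my-script.sh               # trace a shell script line by line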

Hopefully, these tips will help, in summary:

  1. Where are the Logs?
  2. Speed up the Feedback!
  3. Did it cross the boundary?
  4. Debug in steps, don’t panic!
  5. READ Error messages!!
  6. READ the docs, please, please read them
  7. Run commands before you try to automate them

Take things one step at a time, be measured, get all the info you can about what’s going on and the whole process will be much more fun! Good Luck!

About the Author:
Robert Sweetman is a Consulting Engineer here at Version 1.
