Some thoughts on bash automation usability

I’ve spent the bulk of my career as an Infrastructure-as-a-Service *nix systems worker, so a predilection for automation and reproducibility is probably the closest thing this field has to a core competency, at least in an abstract sense (as opposed to concrete skills like encyclopedic knowledge of the Linux kernel’s internals, or memorizing the flags for the commands you use every day; what I’m describing underlies a lot of that, if you spend any amount of time writing bash scripts). When making decisions, you’re constantly asking yourself: “How can I do this quickly, in a way that won’t vary, and that controls what I can, and cannot, mess up?”

Tooling is important for an Ops team, and even the best Ops teams, the ones maintaining the Internet’s critical infrastructure, sometimes slip when it comes to locking down potentially destructive (or merely overzealous) actions through automation tooling:

If any Systems Administrator, Operations Engineer, or DevOps professional tells you they haven’t done something unintentionally that, in retrospect, feels like they should’ve lost their job, they’re probably lying. They all have “that” story; perhaps you, for example, were testing I/O on a virtual disk (something for which a tool like hdparm may not be reliable), and switched the targets for `if` and `of` in a `dd`-based test, with /dev/xvda in the absolute worst place. Maybe yours is worse, maybe it’s funnier, but it’s happened at some point, for whatever reason (fatigue, carelessness, poorly documented options; the list goes on).
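
To make that concrete, a hypothetical version of that swap might look something like the following; the file path and sizes are invented, and the second command is shown purely as a cautionary example:

# Intended: read-throughput test, streaming from the disk into a throwaway file
dd if=/dev/xvda of=/root/ddtest.out bs=1M count=1024

# The "that" story: the same command with if and of transposed, which writes the
# old test file back over the first gigabyte of the live disk. Do not run this.
# dd if=/root/ddtest.out of=/dev/xvda bs=1M count=1024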

Because I’ve never worked at AWS, I don’t know how their approach to automation manifests in the tools they build, but I can say that how you produce a tool (language, etc.) matters far less than what it actually does. What matters is knowing how to segment functionality and create failsafes that prevent something careless, like a mistyped command-line argument or a poorly documented flag, from creating a situation where your reach exceeds your grasp. Those failsafes should not make the tool a burden to use, though; remember that you’re automating to avoid keystrokes and excessive confirmation, otherwise you might as well just `for` loop through a list of servers and remotely execute a command you write out by hand every single time.

I have a few guidelines I like to keep in mind when I write scripts (Python; Ruby, usually in the form of rake tasks; bash; and so on):

  1. Make it excessively documented: you can never have too much exposition and clarity, whether in a manpage or a help prompt; just make sure everything the tool can do is documented accurately. A poor carpenter blames their tools, but the problem is that these tools absolutely can lie to you if the documentation is bad. You can follow the directions to the letter, but without access to the code (as a user, rather than the author), you’re putting your faith in the developer.
  2. Wrap each action in a function; this can prevent different activities (triggered by a flag or argument, for example) from overlapping and running when you don’t intend them to. This has to be backed by very clear, concise control structures.
  3. More pragmatically: if you are doing something destructive, add something like a `--test` or `--dry-run` flag with very verbose output, so you know exactly what would happen if it actually ran (see the sketch after this list).
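
As a rough sketch of points 2 and 3 (the function and variable names here are invented for illustration, not taken from any real tool):

#!/bin/bash
# Sketch: each action lives in its own function, and anything destructive is
# routed through a wrapper that honors a --dry-run flag.
DRY_RUN="false"

run_cmd () {
  # Always print what would be executed; only execute when not in dry-run mode.
  echo "+ $*"
  if [[ "$DRY_RUN" == "false" ]]; then
    "$@"
  fi
}

remove_host_container () {
  # Destructive action wrapped in its own function; assumes APP_NAME is exported.
  run_cmd ssh "root@$1" docker rm -f "$APP_NAME"
}

case "$1" in
  remove)
    [[ "$3" == "--dry-run" ]] && DRY_RUN="true"
    remove_host_container "$2"
    ;;
  *)
    echo "Usage: ./example.sh remove <host> [--dry-run]"
    ;;
esac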

I’d like to use the example of a script I wrote recently to manage infrastructure for an app I built. The problems I encountered mostly stemmed from the fact that I hadn’t planned for anyone to actually use it; its popularity took me by surprise, and I needed to scale. I did that by deploying more instances on DigitalOcean, pulling my application down from my Docker registry, and running it behind a load balancer. I was doing this manually, and pushing out frequent enough updates, that I debated moving the application to a more scalable, compatible container orchestration platform (if you read my posts at all, you’ll know how I normally feel about planning for failure, and how I address it with tooling like Kubernetes and Docker Swarm), but ultimately I just wrote a script to manage the deployment quickly in the interim.

#!/bin/bash
# deploy.sh: manage the app's containers across DigitalOcean droplets named $APP_NAME-*

# Pull the latest image from the registry on every droplet.
pull () {
  for drop in $(doctl compute droplet list | grep "$APP_NAME-" | awk '{print $3}'); do
    ssh "root@$drop" docker pull registry.com/jmarhee/$APP_NAME
  done
}

# Destructive: force-remove the running container on every droplet.
clean () {
  for drop in $(doctl compute droplet list | grep "$APP_NAME-" | awk '{print $3}'); do
    ssh "root@$drop" docker rm -f $APP_NAME
  done
}

# Remove and re-run the container on every droplet (repeats clean/run_c inline).
rebuild_c () {
  for drop in $(doctl compute droplet list | grep "$APP_NAME-" | awk '{print $3}'); do
    echo -e "$drop: \n"
    echo "Deleting: "
    ssh "root@$drop" docker rm -f arcologyio  # note: container name hard-coded here rather than $APP_NAME
    echo "Running: "
    ssh "root@$drop" docker run -d -p 80:4567 -p 443:4567 --restart=always --name $APP_NAME registry.com/jmarhee/$APP_NAME
    echo -e "\n"
  done
}

# Start the container on every droplet.
run_c () {
  for drop in $(doctl compute droplet list | grep "$APP_NAME-" | awk '{print $3}'); do
    ssh "root@$drop" docker run -d -p 80:4567 -p 443:4567 --restart=always --name $APP_NAME registry.com/jmarhee/$APP_NAME
  done
}

# Show docker ps output and run a quick HTTP connectivity check against every droplet.
status () {
  for drop in $(doctl compute droplet list | grep "$APP_NAME-" | awk '{print $3}'); do
    echo -e "$drop: "
    echo -e "---------------\n"
    ssh "root@$drop" docker ps
    echo -e "\n"
    echo -e "Connectivity Test $drop"
    curl -Ik -s "http://$drop" | grep HTTP
    echo -e "\n---------------\n"
  done
}

main () {
  if [[ $1 == "pull" ]]; then
    pull
  elif [[ $1 == "deploy" ]]; then
    pull && run_c
  elif [[ $1 == "rebuild" ]]; then
    if [[ $2 == "soft" ]]; then
      rebuild_c
    else
      pull && rebuild_c
    fi
  elif [[ $1 == "clean" ]]; then
    clean
  elif [[ $1 == "status" ]]; then
    status
  else
    echo "Usage: ./deploy.sh {pull | deploy | rebuild | clean | status}"
  fi
}

main "$@"
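
One thing worth noting before picking it apart: the script never sets APP_NAME itself, so it assumes the variable is already exported in the calling environment, something like this (the app name is just a placeholder):

export APP_NAME=myapp
./deploy.sh status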

So, you’ll notice a few things:

  1. I didn’t follow my own rules (I’ll detail this a little bit in a moment).
  2. I heavily leverage doctl as a meta host-discovery tool (supremely bad practice, but for my purposes it worked, since I knew the droplet names I was looking for would be unique enough to act as an effective scope of servers).
  3. I use a main function to implement a simple enough if/else statement to parse out junk arguments and get me the action I want.
  4. This is a very specific script that addresses a very specific (hopefully, temporary) need, so it does a lot of things I just didn’t care to do manually, which always introduces a risk if not implemented (and tested, and designed to be self-checking) properly.

First, let’s talk about what I did right:

By filtering on the first argument ( $1 ) when I run the script (`deploy.sh`, we’ll call it), functions don’t get fired off when I don’t need them to, because the argument itself is not what invokes the function; and secondary arguments (i.e. $2 ) are only consulted when a branch explicitly references them, so a $1 condition has to match first. The script is (relatively) verbose; most actions are explained, so you know what you’re seeing. I use pretty specific, pointed options for arguments (`rebuild` vs. `deploy`, for example, is a meaningful distinction). And I keep the only purely destructive function (`clean`) in its own space, so it won’t be invoked while trying to call something else that doesn’t, at least, try to clean up after itself.
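
To make that routing concrete, here is roughly how a few invocations map to branches in main (the trailing comments are annotations, not script output):

./deploy.sh pull           # $1 matches "pull", so only the pull function runs
./deploy.sh rebuild soft   # $1 selects the rebuild branch; only then is $2 checked for "soft"
./deploy.sh rebuild        # same branch, but $2 is empty, so it falls through to pull && rebuild_c
./deploy.sh soft           # matches no $1 condition, so only the usage message prints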

So, where can this go wrong?

Well:

  1. rebuild_c performs the same activity as clean, but it does so manually rather than by invoking the function, which is bad programming practice for a few reasons; one major reason here is that the behavior is not predictable. It also provides no escape hatch if the removal succeeds but the subsequent docker run (the same duplication issue, this time with run_c, i.e. deploy) fails for whatever reason. This can be destructive and result in more work: you’d have to pull and then run_c (deploy) again by hand, which is about as many steps as doing all of this work yourself. (A sketch of one fix follows this list.)
  2. Documentation is bad: the fact that the commands are somewhat self-explanatory does not tell you whether a command is destructive, or even what any of them actually do. A good naming convention sometimes (usually… okay, always) isn’t enough.
  3. There’s no error handling where actions are chained together (and, in rebuild_c’s case, the chain is written out inline rather than by invoking the existing functions, compounding the issue above by introducing a new sequence of events to debug).
  4. So many other things.
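
To be concrete about the first and third points, one possible fix (this is a sketch, not something the script above actually does) would be a rebuild path that reuses the existing functions and bails out explicitly when any stage fails:

# Sketch only: not part of deploy.sh as written above.
rebuild () {
  pull || { echo "pull failed; aborting rebuild" >&2; return 1; }
  clean || { echo "clean failed; aborting before redeploy" >&2; return 1; }
  run_c || { echo "run_c failed; containers may be left removed" >&2; return 1; }
}

Even this leans on each function returning a meaningful exit status, which the one-line for loops above don’t really guarantee (they return the status of the last iteration only), but it at least keeps the remove-then-run sequence defined in one place.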

My point in highlighting this specific script is that it came into being through the exact type of circumstance that can make a bad practice a permanent one, and thus difficult to get rid of. Improving such tools is important if you plan to use them on any quotidian basis, because if one does become a long-standing piece of tooling, you’d like it to work properly, and in a way that your teammates, maybe many employee-generations removed from the author, can understand.

My above observations are by no means a perfect (or anywhere near complete) accounting of what I did right or wrong, or of what you, personally, should do (your tools should, after all, reflect your needs), but I hope they leave you with this one idea:

Your tools need to evolve along with their users (not only the number of users, but all sorts of factors you cannot predict); they need to be maintained (what if you upgrade a system and the scripts no longer run?); and they need to become safer, stricter, and more predictable as those other factors change. Your tools need to be a living piece of your ecosystem to continue to be useful; if a tool isn’t making your job easier, then perhaps it’s time to re-assess your approach, and ask whether that automation is truly the solution you and your team need.

When some operational carefulness becomes quotidian, the opportunity to mess something up does not, by any stretch of the imagination, disappear, but it does get mitigated and become manageable, and hopefully the fallout from those unfortunate moments gets smaller and more trivial (combined with other operational practices around how immutable the environment is, how reliant a component service is on each individual instance, etc.) as these tools evolve.