DevOps: AKA Professional Yak Shaver

Four things you can do to minimise yak shaving

Fernando Villalba
The Startup
7 min read · Aug 7, 2020


You wake up in the morning in a very good mood; you have some exciting ideas for your new project. Let's say it's migrating your microservices to Kubernetes, improving monitoring across the whole company, overhauling your CI/CD pipelines, or whatever tickles your fancy.

But first you need to do a very simple task for another team: create an AWS resource to spin up a new service and give it the correct role permissions. It should be a five-minute thing, and then you can start working on your dreams.

So you log in to your AWS console with your two scripts, one to log in with your OTP and another to assume a role. Then you separately log in to the console (again!) and assume another role, which should be easy because you have done this before and it should be in your history. But wait, AWS only keeps the last five roles in your console history, so you have to enter all that account information and the role manually again, none of which you remember, so you have to look it up. You start getting into a bad mood because it feels ridiculous having to do this every day. (And yes, I know there is an extension to solve this, but let's say your company policy does not allow it.)

You are finally done with all that crap and proceed to deploy your resource, but then you realise that the naming of some dependencies is wrong and needs changing. It should be a simple, non-breaking change, you think, so you do it; it's only a dev environment and you are feeling optimistic. Later you realise it breaks something else.

Not a problem. You still cling to your morning cheerfulness, because this also has a fairly easy solution that you luckily discovered in only ten minutes: you just need to quickly change some config on a legacy server and reload it. You go to ssh into the machine, only to realise you cannot connect because you forgot to log in to the VPN. You try to log in to the VPN, but something is wrong and it keeps denying you access. After twenty attempts you realise that your password, which is managed by a third-party SAML authenticator, has expired, so you go and change that and finally manage to get on the VPN.

Finally! You try to ssh into the server, only to realise that the key pair used when this server was provisioned is not in your hands, and you need to find the person who can either make the change themselves or send you the key over Signal or Keybase. Suddenly your five-minute task is taking more than an hour of your time… almost there, you say to yourself.

You manage to track the person down after half an hour and they amend the configuration (killing their flow of work in the meantime). At last you can happily watch your changes take effect and your new service in action. Except that a colleague on a different team tells you this service also needs to connect to yet another service, and that is not happening.

Why, you wonder? The security group is fine, everything looks good, you can connect to that other service and so can everyone else who needs to. You stare at the configuration for a while and then you notice it: the peering between the new VPC and that other service was never set up.

No problem. You create a Terraform module for the VPC peering, which should be easy: you add another entry, run a plan… and then you realise that the naming of the new VPC is wrong, so you need to either change it again or change your Terraform code to account for the variation. This ends up taking you a few more hours because, believe it or not, it wasn't the last of your issues for the day!

And on and on and on… when you are finally done provisioning your simple service you look at the clock and it's already 4:30 PM. But no problem, tomorrow is another day…

Does that story sound familiar? I swear I am not exaggerating one bit when I say I have had days far worse than the one described above, and eventually it really starts to grate on you.

I want to feel like DevOps and infrastructure engineering is Dr Manhattan terraforming Mars, but in reality it is more like Walter White changing a lightbulb, every single day.

Mitigating Yak Shaving

But does it have to be like this? Can you eliminate this kind of thing completely? I am not sure you can ever get rid of yak shaving entirely, but there are a few things you can do when you first build out a new infrastructure that will really help you down the line.

Use GCP instead of AWS

I know, I know, I am very opinionated in favour of GCP over AWS, but I have good reasons for this. GCP will not prevent you from creating yak-shaving nightmares, especially if you plan badly and don't design your infrastructure well, but it can help a lot. Let's look at how much easier the scenario described above would have been if I had been using GCP:

  • Logging in: In GCP you log in once to your Google account and that applies to your CLI as well. GCP also doesn't force a draconian policy of re-logging in every 12 hours, so say goodbye to all that idiotic multiple daily logging in. Managing projects is also easier in GCP than in AWS because it just works out of the box.
  • Assuming roles: You don't need to keep assuming roles or remembering account names in GCP. If you have access to something, you see it in the console; that's it.
  • No VPNs: GCP has a very cool service called Identity-Aware Proxy (IAP) that does away with tedious VPNs; it also lets you create richer policies about which devices can log in and from where. VPNs are a thing of the past, and Google themselves don't use them. Perhaps you can implement something similar in AWS, but once again it is easier to do in GCP (see the first sketch after this list).
  • Adding instance metadata on the fly: If you have access to the project and need to connect to an instance whose metadata does not have your ssh key, that is not a problem at all. You just add your public key and GCP pushes it to the instance immediately; there is no need to stop the instance as you have to do in AWS (see the second sketch after this list). And yes, I know we shouldn't be sshing into machines in this day and age, but when it is needed, it's a godsend. Or you can just use the shell console, which works really well, with no need for key pairs at all.
  • VPCs: You can do shared VPCs in both AWS and GCP, but in GCP you can also have global VPCs, meaning one VPC can hold resources all over the world. This can simplify your infrastructure a lot and save you plenty of frustration.
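
To illustrate the IAP point, here is a minimal Terraform sketch of a firewall rule that only admits SSH from Google's published IAP TCP-forwarding range, so instances need neither a public IP nor a VPN. The network name is made up:

    # Minimal sketch: restrict SSH to Google's published IAP
    # TCP-forwarding range (35.235.240.0/20), so every connection has
    # to go through Identity-Aware Proxy. "my-network" is a made-up name.
    resource "google_compute_firewall" "allow_iap_ssh" {
      name    = "allow-iap-ssh"
      network = "my-network"

      allow {
        protocol = "tcp"
        ports    = ["22"]
      }

      source_ranges = ["35.235.240.0/20"]
    }

You would then connect with gcloud compute ssh --tunnel-through-iap instead of hopping on a VPN first.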
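
And for the instance metadata point, a sketch of registering a project-wide public SSH key, which GCP propagates to running instances on the fly. The username and key material are placeholders:

    # Sketch: add a project-wide public SSH key via metadata. GCP pushes
    # this to instances without a stop or re-provision. The user "jane"
    # and the truncated key material are placeholders.
    resource "google_compute_project_metadata_item" "jane_ssh_key" {
      key   = "ssh-keys"
      value = "jane:ssh-ed25519 AAAAC3NzaC1lZDI1... jane@example.com"
    }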

Again, I am not saying that using GCP automatically exempts you from designing infrastructure well. I am saying that out of the box it is much easier and simpler to use, and much of the drudgery that is impossible to do away with in AWS simply doesn't exist in GCP.

Agree on standards early

Agree on naming standards early in your company and continuously plan for the future. It is harder than it sounds, I know, but it definitely pays off. Ensure the standards are documented and, most importantly, try to design automation that breaks when they are not followed. Automation that relies on standards generally yields simpler code, because you don't need to accommodate every possibility, and if it fails whenever the standards are violated, you know you need to fix the name to comply. A small sketch of this idea follows.
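
For example, here is a minimal Terraform sketch of a naming check that fails at plan time; the <env>-<team>-<service> convention is a hypothetical standard, not a prescription:

    # Sketch: make `terraform plan` fail when a name breaks the standard.
    # The <env>-<team>-<service> convention here is hypothetical.
    variable "service_name" {
      type        = string
      description = "Service name following <env>-<team>-<service>, e.g. dev-payments-api"

      validation {
        condition     = can(regex("^(dev|staging|prod)-[a-z0-9]+-[a-z0-9]+$", var.service_name))
        error_message = "The service_name must match <env>-<team>-<service>, e.g. dev-payments-api."
      }
    }

A name that violates the standard now fails before anything is provisioned, long before it can become someone's yak.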

Design VPCs well

Unless you have a very strict and compelling reason to start churning out VPCs like there is no tomorrow, my recommendation is this: create one shared VPC for your lower-tier environments (dev, staging, etc.) and another one for production, and provision all your resources there. Do the segmentation via subnets and/or firewall rules and security groups, rather than via VPCs.

Otherwise, if you start creating a VPC for every little segmentation in your organisation, you are going to find yourself with a real mess of peerings, DNS hosted zones and so on. This gets further complicated if your VPCs are spread across multiple accounts, so bear that in mind.
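
As a rough Terraform sketch of that layout on GCP (the names, region and CIDR ranges are all made up for illustration), one network holds the lower-tier environments, segmented by subnet:

    # Sketch: a single shared network for lower-tier environments,
    # segmented by subnet rather than by separate VPCs. All names,
    # regions and CIDR ranges are illustrative.
    resource "google_compute_network" "nonprod" {
      name                    = "nonprod-shared"
      auto_create_subnetworks = false
    }

    resource "google_compute_subnetwork" "dev" {
      name          = "dev"
      network       = google_compute_network.nonprod.id
      region        = "europe-west1"
      ip_cidr_range = "10.10.0.0/20"
    }

    resource "google_compute_subnetwork" "staging" {
      name          = "staging"
      network       = google_compute_network.nonprod.id
      region        = "europe-west1"
      ip_cidr_range = "10.10.16.0/20"
    }

A mirror of this for production keeps the two worlds apart, with firewall rules doing the fine-grained segmentation and no pile of peerings to maintain.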

Design to handover

Work on things that you can hand over to other people so they can easily do them themselves. Create automation services rather than complex abominations in which you are a dependency. Stop worrying that you will lose your job, and work as if you were going to leave tomorrow.

Whenever I create something, I aim for automation simple enough that I can delegate tasks to others. This is not because I am lazy (well, I am, shhhh); it is because I don't want to be a bottleneck to anything. I want others to be able to pick up my work without needing me at all, which lets me focus on the next project instead of day-to-day tasks.

This is a lot easier said than done, and you will not always fully succeed, but it should always be your aim.

Conclusion

This is by no means an exhaustive list of ways to minimise your drudgery down the line, just some ideas drawn from my experience. Always aim for the simplest possible solution to any problem, and don't install and add things because they are cool. Focus on the problem you need to solve and do that; complexity will always come, you don't need to invite it over for dinner.

Also remember that if your operation is very small you may not even need AWS or GCP. Providers like Linode or DigitalOcean do a very good job and are super simple to use. Don't choose complex tools just because they are a trend.
