Our Journey with Salt
I recently started a new gig, and one of the tools in our toolchain is SaltStack. As far as configuration management tools go, it has its ups and downs. None of them are perfect, so there’s not a lot of value in getting worked up over what, in the end, is almost always just personal preference.
With that in mind, I noticed a few things right away about our SaltStack codebase. It had the familiar signs of a first go-round with a product. Hindsight always provides a certain clarity, but it was also apparent to everyone that it was time to make some serious structural changes.
When we hit the web to find places that discussed best practices, tips, tricks, etc., we were surprised by the sheer silence of the community (or the terribleness of our Google searches). While we don’t have any tremendous insight, each of our Operations Engineers has experience with different config management tools, so with our collective wisdom we decided to chart our course and document it along the way.
What Problem Are We Solving?
The top.sls file
The first and biggest problem with our codebase is that the top.sls file is almost entirely driven by hostname. Salt’s powerful glob matching makes this an attractive choice at first, but it quickly becomes a trap.
- Is your host naming convention really that strict? Or do you find a lot of exceptions to the rule?
- Do you plan on moving to the cloud anytime soon? You may not have complete control over your host naming convention.
- When you bring up new nodes with new names but the same function, what do you do?
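For illustration, a hostname-driven top.sls tends to look something like this (the glob patterns and formula names here are hypothetical, not our actual top file):

    base:
      # hypothetical hostname globs -- Salt matches minion IDs by glob by default
      'web-prod-*':
        - webserver
      'db-prod-0[1-9]':
        - database

Every naming exception means another glob, and a node whose name doesn’t fit the convention silently gets nothing.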
To summarize the problem, we’re treating machines like pets instead of cattle. I knew this was something we wanted to move away from.
Pillar Data
Pillar data is one of those awesome tools with so much potential. Unfortunately, the implementation is rather unopinionated, which leads to confusion and disarray. Are your folders environments? Applications? Both? What’s the delineation? The lack of clarity leads to a mixing and matching of terms and concepts, making it difficult to know where exactly to make changes to pillar data. It can also create a scenario where adding a new node means first creating a bunch of pillar data just to keep the formulas happy, which leads me to my next point.
The code is designed to rebuild today’s infrastructure
If you’re doing disaster recovery, this approach is great. But spinning up a brand-new environment with our existing codebase is problematic because of all the assumptions being made. New environments require a lot more pillar files to be created and maintained by hand. Not a good place to be.
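To make the maintenance burden concrete, here’s a sketch of the kind of hand-built layout we mean (directory and file names are hypothetical):

    # hypothetical pillar tree -- every environment is a hand-copied subtree
    pillar/
      production/
        webserver.sls
        database.sls
      staging/
        webserver.sls
        database.sls

Standing up a new environment means copying and hand-editing that whole subtree before a single node can come up cleanly.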
A limited concept of environments
Some things happen because we want them to happen everywhere. But do we really? I spun up a Vagrant instance to do some local testing, and it automatically went out and registered itself with Sensu.
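One way to guard against that (a sketch, assuming the environment grain we introduce below and a hypothetical sensu formula) is to assign monitoring only to nodes whose environment grain says they should have it:

    base:
      # hypothetical: assumes an 'environment' grain and a 'sensu' formula
      'environment:production':
        - match: grain
        - sensu

A local Vagrant box would carry environment: local and never match.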
These are a few of the major pain points that we are trying to address, but obviously we’re going to do it in stages. The very first thing we decided to tackle was formula assignment.
Assigning via hostname has its problems, so we opted to leverage grains on the node to indicate what type of server it is.
With a custom role grain, we can identify what type of server the node is and, based on that, which formulas should be applied to it. So our top.sls file might look something like this:
    base:
      'role:platform_webserver':
        - match: grain
        - webserver
Nothing earth-shattering yet, but still a huge upgrade from where we are today. The key is getting the grain populated on the server instance before the Salt Provisioner bootstraps the node. We have a few ideas on that, but truth be told, even if we have to manually execute a script to populate those fields in the meantime, that’s still a big win for us.
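For instance, such a script could drop a static grains file onto the box before the minion first checks in (a minimal sketch; the role value is one of our hypothetical examples):

    # /etc/salt/grains -- static grains, read by the minion at startup
    role: platform_webserver

Running salt-call grains.setval role platform_webserver on the node writes the same entry for you.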
We’ve also decided to add a few more grains to each node to make them more useful.
- Environment: identifies the node as being part of development, staging, production, etc. This will be useful to us later when we need to decide what sort of Pillar data to apply to a node.
- Location: identifies which datacenter the node resides in. It’s easier than trying to infer it from an IP address, and it allows us a special case of local for development and testing purposes.
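Put together, a node’s static grains file might end up looking something like this (the values are hypothetical examples):

    # /etc/salt/grains -- hypothetical example values
    role: platform_webserver
    environment: production
    location: dc1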
With these items decided on, our first task will be to get these grains installed on all of the existing infrastructure and then rework our top file. Grains should be the only thing that dictates how a server gets formulas assigned to it. We’re making that an explicit rule mainly so we have a consistent mental model of where particular functions or activities happen and how changes will ripple throughout.
Move Cautiously, But Keep Moving
Whenever you make changes like this to how you work, there are always going to be questions, doubts, or hypotheticals that come up. My advice is to figure out which ones you have to deal with, which ones you need to think about now, and which ones you can punt on until later. Follow the principle of YAGNI as much as possible. Tackle problems as they become problems, and pay no attention to the hypotheticals.
Another point is to be clear about the trade-offs. No system is perfect. You’ll constantly be making design choices that make one thing easier but another thing harder. Make that choice with eyes wide open, document it, and move on.
It’s so easy to get paralyzed at the whiteboard as you come up with a million and one reasons why something won’t work. Don’t give in to that pessimistic impulse. Keep driving forward, keep making decisions and tradeoffs. Keep making progress.
We’ll be back after we decide what in the hell we’re going to do with Pillar data.