Configuration Mismanagement

Repost: http://www.scriptcrafty.com/configuration-mismanagement-or-why-i-hate-puppet-ansible-salt-etc/

This is going to make a whole bunch of people angry so before everyone gets their pitchforks ready here’s some background information to set the stage.

Act 1

At my very first programming job I was basically tasked with writing some tests for power plant control software. The place was a Microsoft shop (Visual Studio, C#/.NET, SQL Server, Virtual Machine Manager, PowerShell, Active Directory, etc.) and it was before Microsoft became cool again by open sourcing all their awesome tools. I went into the job thinking who in their right mind would be using Microsoft tools when the open source world was full of such goodness. After a few months I realized how wrong I was and to this day I have not found a more productive stack than what I had at that first job.

It didn’t take me long to figure out that to do my job properly I needed what the professionals called configuration management. I of course did not know the proper terms and had no preconceptions about what I needed to do to get everything in working order. I just asked myself what is the bare minimum that I need and what would an automation framework for doing what I need look like. Nothing that requires rocket science.

Well, the first thing I needed was a computer that was in a known good state and had the latest version of the project deployed on it. The second thing that I needed was a tool to do that. Fortunately the folks I worked with were smart enough to generate versioned artifacts and put them somewhere accessible on the internal network. They were also smart enough to version their database schemas. Those two pieces (code and schema versions) were the first half. The second half was a working computer and since we were using Virtual Machine Manager that was in theory taken care of as well. All I had to do was orchestrate the process somehow and PowerShell with its remote management capabilities was just what the doctor ordered. So I wrote a basic PowerShell script to glue some bits together and set up a pool of VMs running the latest version of the code at the press of a button. It took me about a week to learn PowerShell along with all its associated APIs for talking to Virtual Machine Manager and SQL Server to get them to do what I wanted and another week to weed out all the bugs and set up proper recovery mechanisms.

Once I had things working what used to take a day or two now took just a few minutes. The feedback loop from cutting a version of the software to having it up and running was shortened and developers no longer lost the context on what was going on with a specific version of the code. So it was a win all around.

Throughout this entire process I had no clue what chef, puppet, ansible, etc. were all about. I just did what felt like the obvious thing to do and my intuitions were right. Reducing feedback loops in software systems is always a good thing and I took the pieces I had and worked towards that goal. I was young and had no religious affiliations with any tools. If something was easy to understand and helped tighten feedback loops then it was a tool that I wanted in my tool belt. Little did I know what was awaiting me at places that knew how to do things “right”.

Act 2

Some time later I got a job at another place with a substantial raise. The first place was shortchanging me by quite a bit and at the time it seemed ok but when I found out how much I was gonna get paid in the vicinity of SF I could no longer justify the low salary. So I packed my bags and moved to the technology echo chamber.

The new place had quite a few smart folks and I learned a ton from them just like at my first job but this time around things were slightly different. The stuff I had built at my first job taught me that my intuitions about software systems in general were mostly correct so I was more sure of myself at the second job but I was still not sure enough to assert myself when it came to design and architecture.

I had no product development background so at this job I was tasked with stuff related to the build and deployment pipelines. The only things I had on my resume were about building solid pipelines and orchestration mechanisms so this was a pretty good fit for what I had done at my first job except this place was not a Microsoft shop. Fortunately the folks that hired me were smart enough to realize that solid engineers come from all sorts of backgrounds and didn’t hold my Microsoft experience against me.

Their pipeline for building deployable artifacts revolved around debian packages with basic pre and post install scripts. The guy that had built this part was pretty smart and he had made all the right decisions for this part. The part where he messed up a little bit was with the configuration management. He had looked around him to see what everyone else was doing (puppet) and had just copied that thinking he could rely on ambient knowledge and experience. He probably should have trusted his intuitions a bit more and not bowed to peer pressure but by the time I got there the system was in place and it was working, except for one little issue.

Whenever we had to make a change in the production environment related to deploying the application or modifying any configuration related to the application it was more hassle than intuitively felt justifiable to me. The hassle came from the unnecessary friction and communication overhead. A typical conversation would go like this:

dev) Hey I need to change the database username/password how do I do that? I don’t see any place in the code base where I can do that.
ops) Oh that’s because we have this extra place where we keep that stuff. In order to make those changes you need to make sure you’re reading the settings from file foobar.json and it’s placed there with this thing *points to the repo that contains puppet modules for configuring DB settings*.
dev) Umm, why can’t I just have a token in the code base I can use to query a service to give me that data?
ops) We’re working on it but for now we have this puppet thing *points again to the puppet thing*.
dev) *Looks at puppet thing and thinks a little bit *. Ya that makes no sense. I know how to write Python. What is the weird arrow syntax about? I see you call this thing a class. Is that like a class in Python?
ops) No, different class. Completely unrelated to classes in OOP languages.
dev) What about this module thing? Is that like a Python or Ruby module?
ops) No, different semantics. Only Puppet understands Puppet modules. You know what I’ll just do it for you.
* Some time passes *
dev) Hey, my server is crashing. I can’t read that JSON file you pointed me to.
ops) Oops, I had a typo. Called it foobaz.json instead of foobar.json. Let me fix and push the changes.
* Customers see 503 responses because server reloaded when changes were pushed *
dev) Hey folks are seeing 503s and I still can’t connect to DB.
ops) What was the password you wanted?
dev) “asdfadsfjffjkkdjf”.
ops) Oh, I put in “asdfadsfjffjkkdjff”. There was an extra ‘f’. Let me fix and push changes.
* More 503s *
dev) So this is a little suboptimal. I’d like to make those changes and test them locally. How can I do that?
ops) Well we don’t have a good way to do that. It would mean duplicating some logic and using if statements and other things Puppet doesn’t like because the Puppet bible says those things are bad. So just do what you need to do locally and then tell it to me in English and I’ll translate it to Puppet. Kinda like a human compiler.
dev) Great, that sounds awesome. I wasn’t planning on learning Puppet anyway. Python is enough for what I need to do.

It doesn’t take a genius to see that the process is broken, that there is some kind of impedance mismatch, and that the tools are not helping. At the back of my mind I had a few nagging questions and thoughts: Why the fuck are build/ops folks making decisions about how the application should be deployed and configured? Why is the database configuration in one place, the schema definition in another place, and why do we enforce these constraints by convention instead of code? How the hell would the developer know to read from foobar.json instead of foobaz.json? I don’t tell them what to put in their requirements.txt so why am I the one making decisions about how to deploy and configure their application or what DB to connect to and what password to use? Something here doesn’t add up. I’m never one to just let cognitive dissonance go unresolved and when the dissonance rose to peak “what the fucking hell” I dropped whatever I was doing and started prototyping the kind of system I would want to use if I was writing the application.
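To make the "by convention instead of code" complaint concrete, here is a minimal Python sketch of what enforcing the constraint in code could look like. It is not what that team ran. The key names in REQUIRED_DB_KEYS, the ConfigError class, and the load_db_settings function are all hypothetical and only there for illustration. The idea is that a typo like foobaz.json or a missing key fails immediately with a precise message instead of surfacing later as a crashing server.

```python
import json
from pathlib import Path

# Hypothetical: the keys the application expects. Not taken from any real system.
REQUIRED_DB_KEYS = ("host", "username", "password")


class ConfigError(Exception):
    """Raised when the settings file is missing, malformed, or incomplete."""


def load_db_settings(path):
    """Load database settings and fail loudly if the file is not what the app expects."""
    settings_file = Path(path)
    if not settings_file.is_file():
        raise ConfigError(f"settings file not found: {settings_file}")
    try:
        settings = json.loads(settings_file.read_text())
    except json.JSONDecodeError as exc:
        raise ConfigError(f"{settings_file} is not valid JSON: {exc}") from exc
    if not isinstance(settings, dict):
        raise ConfigError(f"{settings_file} must contain a JSON object")
    missing = [key for key in REQUIRED_DB_KEYS if key not in settings]
    if missing:
        raise ConfigError(f"{settings_file} is missing keys: {', '.join(missing)}")
    return settings
```

Because the check lives next to the application code, a developer can run it on a laptop against a throwaway file before anything gets pushed anywhere.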

My idea was very simple. Just like the application had requirements.txt for making the virtualenv and putting it in the debian package I was going to do exactly the same with the configuration that was managed with puppet. The problem is that puppet does not like running in standalone mode (it can be done but is too much hassle and I’ve since learned that if a tool takes something obvious and intuitive and turns it into a hard problem then that tool should be replaced as soon as possible). I didn’t have to write anything new because chef combined with librarian-chef would do exactly what I needed it to do without putting up bullshit barriers. Just like there was a requirements.txt there would now be a Cheffile that listed all the configuration dependencies of the application. At build time I would pull in all the recipes and hook it up to the post-install script. I forget the actual call syntax but I basically just added a line to the post-install script that called chef-solo and ran the recipes. After convincing a few folks that this was a good idea I converted all the application specific bits in puppet to chef recipes, pointed the developers at the recipes, and told them they were now in charge of all configuration related to their application.
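For a rough picture of that post-install hook, here is a sketch in Python. It is not the original line, which is long forgotten. The function names and any paths passed in are hypothetical. The only real thing it leans on is that chef-solo accepts -c for its config file and -j for a JSON attributes file. The runner parameter is just there so the step can be exercised without chef installed.

```python
import subprocess


def chef_solo_command(config_path, attributes_path):
    """Build the chef-solo invocation for the recipes bundled in the package."""
    # -c points at the solo config, -j at the JSON attributes (which carry the run list).
    return ["chef-solo", "-c", config_path, "-j", attributes_path]


def run_post_install(config_path, attributes_path, runner=subprocess.run):
    """Run the bundled recipes; check=True makes a failed chef run fail the install."""
    return runner(chef_solo_command(config_path, attributes_path), check=True)
```

A post-install script would call run_post_install with wherever the package dropped its solo config and attributes file, so the configuration ships and versions together with the application.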

At first there was some resistance but everyone warmed up to the new way of doing things quickly enough. Not only was this a much better way to do things but it was also locally testable. No one had to learn any puppet DSL bullshit to figure out if what they were doing made sense. They just had to write some ruby and then run it with chef-solo and see the results directly. If something didn’t work the error would tell them exactly where things went wrong and instead of trying to make sense of puppet DSL errors they would get a callstack and the exact line where things were failing. Remember the bit about feedback loops? This reduced the feedback loop and put the people that should have been in charge of configuration in charge of configuration. Most importantly it turned the problem of deployment and configuration into a collaborative problem instead of an adversarial one driven by divisions in language choice and tool culture and offloaded the testing burden from one place to many. It was again a win all around.

Act 3

Since those two experiences I’ve worked at a few more places and I’ve seen the same mistakes repeated over and over again: centralized logic where none is required; weird DSLs and templating languages with convoluted error messages; deployment and configuration logic disembodied from the applications that require them and written by people who have no idea what the application requires; weird configuration dependencies that are completely untestable in a development environment; broken secrets/token management and the heroic workarounds; divergent and separate pipelines for development and production environments even though the whole point of these tools is to make things re-usable; and so on and so forth.

I sometimes wonder if I’m the only sane person left in the world. How is it that these things are obvious to me but not obvious to others? I don’t think I’m that much smarter than the people that built these systems and tools but how the hell did they fuck up so badly? Why the fuck do I need a centralized store for configuring an application server? Why isn’t the default mode to run everything standalone and then throw up huge warnings telling people they should think long and hard about using a centralized configuration server? Why don’t any of these tools have an obvious local testing story? Why do I need to go through heroic efforts to set up a testing environment to verify that the changes I’m making are not gonna choke on a syntax error and then give me some convoluted error message? Why aren’t these tools built for developers first and operations folks second? What is the point of shrouding the obvious in YAML or some other weird DSL? Why can’t I package and version the configuration the same way I can package and version an application and then deploy it with apt-get, yum, etc. and why isn’t this part of the toolset? Finally and most importantly why am I the only one that seems to care about these things?

Conclusion

So that, ladies and gentlemen, is why I hate anything and everything that uses YAML or some other weird custom DSL to do the obvious. The current top offenders are salt, puppet, and ansible and they’re gaining more followers by the day. Everything that seems intuitive and obvious to me is either an anti-pattern or straight-up impossible to do with any of those tools. Instead of writing libraries of idempotent components they had to layer custom nonsense on top and completely change sensible programming language semantics. Somehow I’m the only one that ended up with the right set of experiences that taught me to avoid all the things these tools champion. I’m sure the madness will stop at some point but not before a whole bunch of hair is lost trying to figure out why some snippet of YAML is not creating the right Unix user (this by the way has happened so many times now that I’ve stopped counting).
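To show what a "library of idempotent components" might mean in a plain programming language, here is one such component sketched in Python. The ensure_file name and its default mode are made up for illustration and do not come from any existing tool. It makes a file hold exactly the given content and permissions, and running it a second time changes nothing.

```python
from pathlib import Path


def ensure_file(path, content, mode=0o640):
    """Make the file at `path` hold exactly `content` with `mode`.

    Returns True if anything had to change and False if it was already correct.
    """
    target = Path(path)
    changed = False
    if not target.exists() or target.read_text() != content:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        changed = True
    # Compare only the permission bits, not the file type bits.
    if (target.stat().st_mode & 0o777) != mode:
        target.chmod(mode)
        changed = True
    return changed
```

The return value does the job of a "changed" report: a caller can decide whether a service needs reloading, and a failure is an ordinary exception with a call stack instead of a DSL error.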
