OpsMop Push Mode Is Available!

Today, nearly 6 years after creating Ansible, I want to share an even better remote SSH-driven config management, deployment, and orchestration system with you.

I previously dropped news of OpsMop, but even though it has a neat language (I hope you have been following!), a config system without remote support is only interesting to purists. Remote support is the real proof it has legs. If you want to read about initial developments before push mode (and the reasons for the project), I’ve written about this previously here and here.

I hope you are excited to see this program finally gain asynchronous remote powers. More importantly, I hope you’ll help me test it out and make it better, and if you aren’t excited, that we can work together to make something more awesome.

Let’s get on with it!

OpsMop Push Mode: Hey This Looks Familiar

Docs here: OpsMop Push Mode — code example here: push_demo.py. The docs will tell you how to run it and more about what it does. Read these now if you want and come back to the blog.

If you understand the basic structure of Ansible and imagine how an evolution of that system would look in Python 3 (but without sharing any code with it, and being able to revisit many things), you should feel right at home. Many of the concepts are the same, just a little different.

It’s not just push mode that is out today.

Today I also added "--limit-hosts", "--limit-groups", and "--extra-vars", and recently I also added "--tags". By today, I mean I added them literally today. I added colorization last night. I’m regularly adding features like that in 30 minutes or so, because the architecture here is pretty accommodating. Good stuff. I think you will enjoy working on it; in the past, the core of ansible was a little hard to get involved in.

When I was coding ansible before, the pace of development, probably coupled with a bit of inexperience, led me to leave a lot of things un-abstracted. I had always felt that the software I wrote for proprietary products was better software, and it was, because on ansible my time got stretched, and quickly there was code in it accommodating a lot of different ideas that weren’t planned for originally. There were huge mixes of dicts and lists, and while it worked (and pretty darn well), it was often difficult to extend. Over time, we didn’t do enough code review in merging (though we did a fair amount; some projects do apparently zero), and things got harder to debug. Despite tests, fear of corner cases, supporting too many modules, and wide distro support limited the ability to refactor. This meant not only that velocity was slower than what I’m experiencing now, but also that the potential for future changes and future surprises was reduced.

I remember adding roles, pretty sick from some plague, in a hotel room in California in 2013. They worked pretty great, but I’m not really positive I would have been able to add them if I *weren’t* sick. There are other features that I swear were only possible with 6 glasses of sweet tea, and others that I almost got to but probably needed 7. (Thank you Mellow Mushroom, and I’m glad I didn’t die, yay caffeine).

So anyway, I’m able to add things quickly, a lot of great features are going in, and I think they’re really clean and organized. Adding push mode was a great experience. This, to me, means the project has great potential.

Does it SSH? Yes! What about rolling updates? Serial batch size control? Change reporting? YES! All of that is there, and some of it is better and more flexible. Other things are less crufty. There’s potential, and the modules are going to be rolling in.

What did I mean by less cruft? For instance, we still have patterns, but host patterns and group patterns are clearly demarcated. It’s the little stuff. Things that used to take comma-delimited strings often take lists (because it’s Python). Everything feels consistent.

The most significant thing: the core is fast as hell.

Understanding Speed And The History of Ansible

The SSH implementation of Ansible has been talked about before, but it was all based on the evolution of an original architecture I started back in 2012. And that architecture came from a combination of the “no-agents” requirement I gave myself, coupled with the need to do something reasonably easy.

No-agents came from some pretty simple problems. puppetca back in the day (I have zero experience with modern Puppet and don’t want to knock it at all!) was fiddly with NTP and DNS, and I always had trouble setting it up. Also, with an agent system, if the agents die, you can’t talk to those systems. Further, in any enterprise software application, upgrading agents can be a tricky dance in terms of ordering and recovery if something goes wrong. I was tired of having to encounter Puppet at work, so ansible was about, literally, giving developers time back.

It was never supposed to be a company. I told myself many times “if this is ever complex enough to support a consultancy, I will have failed”. And maybe I did. But despite lots of chiding from the competition (“First they ignore you, then they laugh at you…”), I proved that SSH management of systems was not only feasible, it was pretty great.

Ansible started out with just paramiko, because at the time SSH was really slow: the old sshd in CentOS didn’t support ControlPersist. Evolutions of that moved from paramiko to /bin/ssh, added ControlPersist support and defaulted to SSH when it was available (eventually all distros added it, awesome), upgraded efficiency with “pipelining”, and experimented with alternative transports like “accelerated” and “fireball” mode (maybe not in that order). Things got pretty good given the architecture they had to contend with. And it wasn’t just limited by that architecture; that was all I knew how to do at the time.

Ultimately though, ansible still forked a giant ton, relied on a lot of context switching, and sent a lot of data, all of which is well documented by David Wilson’s mitogen blogs. Honestly, he’s analyzed ansible’s performance way more than I ever have.

There were other performance tuning adventures that I learned from too...

One of them: midway through the course of ansible’s development we had a customer with about 100–200 machines that were regularly around 99–100% utilization (this is bad not just for ansible but for app efficiency in general; don’t do that!). For ansible, most resources would execute quickly, but out of those 200 or so, one would always take a little bit longer, sometimes even 10 seconds longer. The problem was that each task ran on every host at the same time, and all 200 had to complete before the run could continue, so a straggler didn’t cost 10 seconds once; it could add 10 seconds to every one of 20 tasks. This exact problem is the reason I added the requirement for the “free” strategy in Ansible 2, though I didn’t implement it and haven’t observed its output ever. I am really happy with the async output in OpsMop.

Another problem with ansible is that the modules are transferred on every run. We had one user who was trying to configure devices in developing countries over satellite links. I believe I coached him into using local mode.

To improve module transfer efficiency, I always wanted to transfer *all* of the modules at once, and then only replace the module set when it was out of date (probably using a checksum test at the beginning). The idea was originally conceived by Seth Vidal (who was so helpful on so many levels and was the main reason I continued the project in early 2012). Ansible did acquire a feature by that name after I was no longer with the company, but it did something different and did not reduce module transfer in the same way.

Aside: I also want to thank Jesse Keating, then at Rackspace (now GitHub), who was amazing for performance testing against very large clusters. Most improvements in Ansible fork performance came from his help testing.

Those reading this far may be wondering: is he saying ansible didn’t scale? FAR FROM IT.

Ansible scaled pretty well for all manner of use cases thrown against it, in my experience. Most of the time, you do NOT want to configure all of your systems at once; this will likely melt an update server, and if there is a problem, you will be making thousands of problems.

There were a lot of myths thrown against us by the competition, but they never stuck in IRC. What I am saying is that Ansible has the potential to achieve much lower latency and execute configuration runs TONS faster, and this is what OpsMop is able to do: by shedding those original architectures and, more importantly, by taking advantage of mitogen. It is clearly way faster per task in local mode and push mode, and that may not matter to you at all. As XKCD says, “compiling!”. But for many of you, it does!

(In fact, the scale-out of this current architecture needs your help with testing and tuning too; I’m sure it’s not perfect.)

Anyway, there was room for improvement. Let’s talk about details.

Why Can OpsMop Go Faster?

Partly, OpsMop is faster because it doesn’t need to execute shell scripts, and it doesn’t need to fork.

Since mitogen had already done a lot of good work improving on ansible, it made sense to evaluate using mitogen directly.

At first I had thought it made sense to just rsync the policy folder in OpsMop and execute it once, but mitogen ended up being far cleaner. It has support for asynchronous callbacks, infinitely nested bastions, and so much goodness. Had it been around when ansible was lifting off the ground, it would have solved many headaches in trying to make sudo work reliably on different platforms. It’s all really, really awesome as a library.

As I started using it, I noticed it was really easy to keep connections open *between* roles, and lots of asynchronous magic allowed the callbacks to be written very easily, which lent itself to having a really good CLI for opsmop push output.

Further, OpsMop makes only one remote function call per host per role, whereas the other architecture makes a set of operations per task, and each task costs significantly more than one SSH op.
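To make the contrast concrete, here is a tiny standalone mitogen sketch. This is not OpsMop’s internals, just the underlying model it builds on: a persistent SSH connection and one remote function call that does a whole unit of work (the hostname and the role function are made up for illustration):

```python
import mitogen.utils

def apply_role(role_name):
    # Runs *on the remote host*, inside the mitogen context. A whole
    # role's worth of work can live behind one call like this, with no
    # per-task forking or module copying.
    return "applied role: %s" % role_name

@mitogen.utils.with_router
def main(router):
    # One SSH connection, established once and reused across calls.
    host = router.ssh(hostname="web1.example.com")
    print(host.call(apply_role, "webserver"))

if __name__ == "__main__":
    main()
```

Calls like this can also be dispatched asynchronously across many hosts, which is where the parallelism below comes from.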

The result is parallelism with less traffic once connections are established.

The earlier example, where the nodes were all close to 99% CPU, is less of a problem here because each node does not have to be doing the same task at the same time, just the same role, and the output stays decently clean and nicely summarized.

In fact, even in this mode, if you want to log in to a remote host and view the log as if it were running locally, you can do that: it’s ~/.opsmop/opsmop.log (the path is configurable, but this was a good default given unknown remote permissions). You’ll see the actions taken on just that host, which is pretty perfect if you are using a log aggregation system to funnel those away to Splunk, SumoLogic, or Loggly.

More Things Are Also Added

Of course, push mode brings things into the program that don’t have to exist in local mode.

We also need an inventory system, and logic to decide default usernames, and a lot of other things and … WHOA … quickly we have implemented almost all of the things I did back in early-to-mid 2012. But that’s good. Let me say that again: we have already implemented most of what I did back in 2012. This is quite usable and largely complete structurally, if not in module count, only a few weeks in!

Of the things that are new, the most significant is inventory. If you are on a cloud system the best inventory is dynamic, and I’ll be looking for someone to help write an AWS inventory class! I’ve already supplied an example TOML inventory, which is good for physical infrastructure, and I like its structure better. The nice thing about the TOML version is that the data structure you get from loading the TOML is *EXACTLY* the same data structure any other inventory class would have to provide.
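As a rough illustration (the real schema lives in the opsmop-demo repo; the group names and variables below are made up for this sketch), loading such a file hands you plain nested data:

```python
import toml  # the third-party "toml" package; Python 3.11+ has tomllib built in

# Hypothetical inventory shape -- the actual schema is defined by the
# opsmop-demo repo, not by this sketch.
data = toml.loads("""
[groups.webservers]
hosts = ["web1.example.com", "web2.example.com"]

[groups.webservers.vars]
ssh_username = "opsmop"
""")

# This dict is the same kind of structure any dynamic inventory class
# would otherwise have to construct programmatically.
print(data["groups"]["webservers"]["hosts"])
```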

Inventory in ansible was always somewhat controversial, and I’m honestly not sure why, but in OpsMop, inventory works via subclasses of “Inventory”, and any object that returns lists of Host and Group objects is fair game. Both objects can have variables, and some variables can influence SSH usernames, python paths, and stuff like that. It should seem familiar, but it is executed in process rather than forking to run a program that emits JSON. Ansible 2 has probably changed, and I have zero idea how that inventory system works now.
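Here is a minimal sketch of what a dynamic inventory class could look like. Only the names Inventory, Host, and Group come from this post; the import paths, constructor signatures, and method name are assumptions for illustration:

```python
# Hedged sketch: import paths and signatures below are guesses, not the
# documented OpsMop API. The shape of the idea is what matters.
from opsmop.inventory.inventory import Inventory
from opsmop.inventory.host import Host
from opsmop.inventory.group import Group

class CmdbInventory(Inventory):

    def load(self):
        # Any in-process data source is fair game here: a cloud API,
        # a CMDB query, a flat file. No forking a script that emits JSON.
        web1 = Host(name="web1.example.com", variables=dict(ssh_username="opsmop"))
        web2 = Host(name="web2.example.com")
        # Groups carry variables too, which can influence SSH usernames,
        # python paths, and so on.
        return [Group(name="webservers", hosts=[web1, web2],
                      variables=dict(python_path="/usr/bin/python3"))]
```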

This is all covered in the push docs.

Why Is OpsMop More Flexible?

So we’ve talked about why OpsMop can be faster, and about its new inventory subsystem.

Another thing that comes into play when talking about an orchestration system is flexibility: whether it can adapt to the workflows you need.

When we talk about “Infrastructure as Data”, typically this means sourcing data from various plugins or facts. This is of course 100% fine, but it also carefully constrains the attach points. Some users probably WANT this level of constraint, but others want a more free-flowing system, where they are free to pull in external data in more direct ways. This is where a pure “Infrastructure as Code” system is superior. In the middle, of course, is “Infrastructure as Custom Language Only We Can Parse”, which often has easier data entry than “Infrastructure as Data”, but does not have the flexibility of pure “Infrastructure as Code”. It’s all about tradeoffs.

Bottom line though, I believe you want flexibility. If you want to do a rolling update with a distributed lock, that’s damn simple in OpsMop.

OpsMop is more flexible because it is pure code. Rather than having to rely on delegating to a load balancer module, or writing a mix of facts, filters, and plugins, pre and post hooks in the roles can do *anything*. You can do mostly whatever you want.

If you decide ANY random setting should come from ANY data source, you can do that.
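As a hedged sketch of what that could look like: the post only promises that roles have pre and post hooks, so the method names, the Role import path, and the etcd client below are illustrative assumptions, not OpsMop’s documented API:

```python
# A rolling update guarded by a distributed lock. The lock service is
# arbitrary -- etcd here, but Consul or ZooKeeper would work the same way.
import etcd3  # third-party etcd3 client, assumed installed

from opsmop.core.role import Role  # import path is a guess

class RollingWebUpdate(Role):

    def pre(self):
        # Plain code before the role touches its batch of hosts:
        # grab a cluster-wide deploy lock.
        self.lock = etcd3.client().lock("webserver-deploy")
        self.lock.acquire()

    def post(self):
        # And plain code after: release the lock, ping a health check,
        # pull a setting from any data source you like -- anything at all.
        self.lock.release()
```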

Learning Push Mode

If you’re still here, I hope you are interested to try things out.

First off, the opsmop-demo repo is the best place for grokking the language. It’s a little abstract: it doesn’t really model real applications, it teaches the language more than anything. Here you go. The file in question is https://github.com/opsmop/opsmop-demo/blob/master/content/push_demo.py

push_demo.py is a little long, because I’ve added functions for most things push mode can do, language-wise. Your examples will likely be *SHORTER*, without all the special functions. But basically you can understand that behavior is described in Python, and there are some very nice abstractions and declarative layers available.

I really should have a “push_demo.py” and an “advanced_push_demo.py”, and will probably break these up in the near future.

Once you have read the demo file, consult the push-mode documentation, or do both at the same time, and try out push_demo on some of your own systems.

You’ll need to edit the TOML inventory file to make it work, and possibly create a ~/.opsmop/defaults.toml to set your login preferences.

What’s nice about OpsMop push mode is not only that it is exceptionally fast, but also that when it runs, it produces a report of what changed on each host. This is all thanks to the very well organized type/provider model in OpsMop, which is something I *did* want in Ansible, but it was quickly buried under a massive volume of pull requests (thank you for the pull requests, btw!).
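For the curious, the type/provider split works roughly like this. This is a conceptual sketch, not code lifted from OpsMop: a type is a pure description of desired state, a provider plans and applies changes, and the recorded plan is exactly what feeds the change report:

```python
from pathlib import Path

class Motd:
    # The "type": a declarative description of desired state, no logic.
    def __init__(self, message):
        self.message = message

class MotdProvider:
    # The "provider": compares desired state to reality, records a plan,
    # then applies it. The recorded plan doubles as the change report.
    def __init__(self, resource):
        self.resource = resource
        self.planned = []

    def plan(self):
        path = Path("/etc/motd")
        current = path.read_text() if path.exists() else ""
        if current != self.resource.message:
            self.planned.append("rewrite /etc/motd")

    def apply(self):
        for action in self.planned:
            print("changed:", action)  # ...perform the change here...
```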

Feedback

Sorry, this blog is all over the place. Read the docs. Try it out. Tell me what you think.

OpsMop push mode is, I think, pretty great, but it can also be improved, and that’s going to take your input. Likely, some use cases important to you are missing, but I want to hear about them and add support for many of them.

As I’ve probably said a few times before, I’m really in Open Source software development for the interactions with people like you. If you have ideas on OpsMop and think you may want to use it going forward, or possibly want to add something or have a language idea to share, stop by and post on the official Discourse forum. Let’s get to know one another and make some awesome software together.

Thank you!