Scaling Ansible

Converging Hundreds of Hosts in Less than 5 Minutes

Devon Finninger · 6 min read · Oct 8, 2019

Our team has used Ansible extensively over the past few years, eventually adopting it for configuration management across hardware, virtual machines, and cloud instances. This post discusses some of the constraints we found with a vanilla Ansible setup and the ansible_puller tool we built and open-sourced to meet our needs.

Starting with Local Ansible

Ansible is a configuration management framework used throughout the software industry to control what actually gets put inside of a server or network appliance. Its strengths lie in its dead-simple setup and its vast array of built-in (and third-party) modules. With a single pip install and a YAML file you can install packages on a server, send a message to Slack, or manage an L2 network interface on a Juniper device. The list of built-in modules is huge.
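For instance, installing Ansible and applying a one-file playbook can look like the sketch below (the package, host group, and file names are illustrative, not pulled from our environment):

```
# Minimal sketch: install Ansible, write a one-file playbook, apply it.
pip install ansible

cat > site.yml <<'EOF'
- hosts: webservers          # illustrative inventory group
  become: true
  tasks:
    - name: Ensure nginx is installed
      apt:
        name: nginx
        state: present
EOF

ansible-playbook -i inventory.ini site.yml
```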

There’s no need to install a central server to get started. Ansible piggybacks off of your SSH configuration to handle access to your servers. Anyone on our team could run an Ansible playbook to bring more servers online, change the configuration of running servers, or deploy code.
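In practice, a plain-text inventory and your existing SSH access are all it takes; a rough sketch (the hostnames are placeholders):

```
# Ansible reuses whatever SSH access you already have (keys, ~/.ssh/config, agent).
cat > inventory.ini <<'EOF'
[webservers]
web01.example.com
web02.example.com
EOF

# Ad-hoc connectivity check over SSH
ansible -i inventory.ini all -m ping

# Apply a playbook from a laptop, just as any team member could
ansible-playbook -i inventory.ini site.yml
```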

Migrating to a Jenkins-managed Ansible runner

It was quickly apparent that we needed to automate our Ansible infrastructure. Running Ansible exclusively by hand can lead to human error and configuration drift. Having a consistent, automated run interval for Ansible can ensure that your latest configuration is always present in your environment.

The natural choice for us was Jenkins. It was already managing our builds at the time, and it was easy to write a Jenkinsfile that ran Ansible for us. We already kept all of our Ansible configuration in a Git repository, so Jenkins gave us a steady drumbeat of Ansible runs, enforcing consistent configuration and removing a step from infrastructure change rollout. Jenkins also came with a lot of nice features: LDAP integration, a UI for kicking off jobs, run history, and centralized logs.
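Under the hood, the job was little more than a scheduled shell step; roughly the shape below, though the repository URL and inventory path here are stand-ins for our own:

```
# Approximate shape of the scheduled Jenkins job (URL and paths are illustrative).
set -euo pipefail

git clone --depth 1 https://git.example.com/infra/ansible.git
cd ansible

ansible-playbook -i production site.yml
```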

As our infrastructure grew, so did our runtimes. By default, Ansible runs tasks in lockstep: each task has to finish across every targeted node (in parallel batches limited by the forks setting) before the next task starts. There are a number of ways to tune the execution, but eventually our production site.yml job was taking 6+ hours to complete. Changes that a developer committed in the morning could roll out to production at 9pm. Sharding our runs by Ansible inventory group solved the runtime issue for most of the environment, but our largest cluster still took 3 hours to complete.
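The tuning we applied looked roughly like the following (the fork count and group names are illustrative):

```
# Raise parallelism beyond the default of 5 forks...
ansible-playbook -i production site.yml --forks 50

# ...and shard runs by inventory group so smaller clusters
# aren't stuck waiting behind the largest one.
for group in web db cache; do
    ansible-playbook -i production site.yml --limit "$group"
done
```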

The long rollout time caused two main issues for maintaining a stable infrastructure. One, afternoon changes could still take effect during off-hours, which pushed people to commit changes in the morning; that increased the change density per run and made it harder to triage the root cause of incidents. Two, our node groups received changes at different times. Smaller clusters would get updates every 30 minutes, but the largest cluster could lag by up to 6 hours (accounting for the current run and the pending run). This meant each change had to be multi-phase and rolled out over the course of a couple of days to ensure that the entire environment was updated consistently.

Our growth forecast continued to trend upward and we needed to find a way around the linear correlation between the number of servers and Ansible runtime.

In search of a lower run time

The first step in investigating ways to reduce our runtime was to read through Ansible’s whitepapers, specifically Scaling and Performance of the Ansible Management Toolchain. The “Architectural Topologies” section discusses a few modes of running Ansible. The most promising topologies were “local” and “pull”, both of which would effectively decentralize our Ansible runs. A decentralized run makes each server responsible for running its own Ansible configuration instead of having a central server reach out and operate on each node in sequence. This has some great benefits for scaling: there is no bottleneck on outgoing SSH connections, network transfer of artifacts only happens once, fact gathering occurs locally, and so on. Instead of placing heavy requirements on a single node, we take up a small slice of each individual node’s resources.
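In the “local” topology, each host converges itself; a minimal sketch:

```
# Run on the target host itself: no SSH fan-out, facts are gathered locally.
# The trailing comma makes "localhost," an inline one-host inventory; the
# playbook's host patterns need to match it (or be narrowed with --limit).
ansible-playbook -i localhost, -c local site.yml
```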

The “pull” topology comes with a pre-made script, ansible-pull, shipped in the Ansible bundle, so we initially focused our efforts on getting it to work in our environment. The ansible-pull script does two things: clone a remote Git repository, then run Ansible with a local connection scheme.
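A typical invocation looks something like this (the repository URL is a placeholder):

```
# Clone (or update) the playbook repo, then run the playbook against the local host.
ansible-pull -U https://git.example.com/infra/ansible.git site.yml

# ...usually wired up as a cron entry, e.g. every 30 minutes:
# */30 * * * * ansible-pull -U https://git.example.com/infra/ansible.git site.yml
```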

But a few issues prevented us from adopting ansible-pull out of the box. It is meant to be run via cron, which removes the centralized visibility we had with the Jenkins setup: no more logs or success/fail statuses. We could not easily enable or disable jobs, since that required interacting with cron on each individual host. Also, the dependency on Git is not configurable; the remote resource cannot be hosted via any other mechanism. Given the size of our clusters and the load already present on our internal Git servers, we had to drop the stock ansible-pull script.

So, considering that the “pull” topology is essentially a nice wrapper around the “local” execution topology, we decided to back up and see what could be accomplished with some custom code running Ansible locally.

Building Ansible-Puller

Within a couple of days we had a minimum viable prototype built in Bash. It would curl a tarball of Ansible playbooks from an HTTP server, explode it onto the filesystem, run Ansible in local mode, then clean up. We already had the infrastructure to compile and push artifacts from our build system, so we leveraged that to generate the tarball of our Ansible playbooks. Our runtimes in staging had been close to an hour, and they dropped to a couple of minutes. That was more than enough to warrant further investment.
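The whole thing fit in a handful of lines; a sketch of its shape (the artifact URL and paths are stand-ins, not our real ones):

```
#!/usr/bin/env bash
# Prototype shape: fetch a pre-built tarball of playbooks, unpack it,
# converge the local host, then clean up.
set -euo pipefail

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

curl -fsSL https://artifacts.example.com/ansible/latest.tar.gz -o "$workdir/ansible.tar.gz"
tar -xzf "$workdir/ansible.tar.gz" -C "$workdir"

cd "$workdir"
ansible-playbook -i localhost, -c local site.yml
```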

Our next step was to integrate some form of monitoring. Since our stack uses Prometheus, the first pass was simply writing results to the filesystem and relying on the Prometheus node_exporter’s textfile collector. This worked, but our Bash script was getting fairly long, and it didn’t handle production necessities like managing a virtualenv for Ansible or allowing for easy testing.
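The textfile collector simply reads *.prom files that node_exporter picks up from a directory, so reporting a run boiled down to something like the following sketch (the directory and metric names are illustrative, not the ones we shipped):

```
# Record the outcome of a run where node_exporter's textfile collector will find it.
textfile_dir="/var/lib/node_exporter/textfile_collector"

start=$(date +%s)
if ansible-playbook -i localhost, -c local site.yml; then success=1; else success=0; fi
duration=$(( $(date +%s) - start ))

cat > "${textfile_dir}/ansible_run.prom" <<EOF
ansible_run_success ${success}
ansible_run_duration_seconds ${duration}
ansible_run_last_timestamp_seconds $(date +%s)
EOF
```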

We migrated the project away from Bash to Go to gain the benefits of using a more robust programming language while maintaining a single deployable artifact.

Now we were able to add back a lot of the features we’d had with the older Jenkins setup. The Prometheus client library made it trivial to implement monitoring and link Ansible into our alerting infrastructure. With the new metrics set up, we could create dashboards to watch infrastructure changes roll out to our cluster.

High-level status for one of our clusters
Detailed Ansible Status
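Because each host exposes its own metrics over HTTP, checking on a node is just a scrape away; a hedged sketch, since the port and paths below are placeholders rather than the tool’s actual ones:

```
# Scrape the puller's Prometheus endpoint on a host (port and path are assumptions).
curl -s http://localhost:8080/metrics | grep ansible
```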

Since we were already running an HTTP server to export Prometheus metrics, we also added a small controller UI so that operators wouldn’t need to log into a host to check status, launch a job, or disable runs for a period of time.

Ansible-Puller UI
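The same HTTP interface can be poked directly from the command line; the endpoint names and port below are hypothetical illustrations, so check the project README for the real API:

```
# Illustrative only -- endpoint names and port are assumptions, not the actual API.
curl -s http://localhost:8080/               # status page / UI
curl -X POST http://localhost:8080/disable   # pause runs on this host
curl -X POST http://localhost:8080/enable    # resume runs
```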

Why not Ansible Tower/AWX?

When we set out to write Ansible-Puller, Ansible Tower hadn’t been open-sourced and we didn’t have the budget to spend on infrastructure tooling. Even now that the AWX project is out, its rolling/breaking releases are a lot to keep on top of, and all of our needs are currently met with Ansible-Puller.

Open-sourcing Ansible-Puller

We’ve found Ansible-Puller to be indispensable to our infrastructure operation and configuration. It is easy to use and operate, and since it’s written in Go, there are no system runtime dependencies: just drop the binary onto a machine and watch it go! The project has been so useful to us that we released it on GitHub. You can check it out here: https://github.com/teslamotors/ansible_puller
