How we migrated infrastructure management from Chef to Ansible

Roman Kuchin
Published in Pipedrive R&D Blog
Apr 4, 2023

Intro

Migrating from one automation tool to another can be complex and challenging, especially when dealing with cloud infrastructure management. Our team has recently decided to replace Chef with Ansible in our fast-changing cloud environment, and we faced several obstacles along the way.

Our infrastructure

We have thousands of Terraform-created, Chef-managed virtual machines in around 25 different environments, some of which are in on-premises OpenStack, and some in AWS.

Why didn’t Chef fit our VM management?

We used to run one Chef server per environment and handle all infrastructure changes via pull requests to the Chef repo. When we merged a pull request, all changes were applied to all virtual machines within 30 minutes. For testing purposes, we used a development environment and several test envs. Still, there was always something we couldn't properly test outside of production.

Another issue we faced was the price of licenses for new versions of Chef server. We couldn't upgrade our Chef servers and Chef clients without a huge financial investment. As a result, we were several major versions behind, meaning we would have had to rewrite a lot of things anyway to bring Chef up to date.

Despite all of the above, we still didn’t see enough justification for rewriting the entire infrastructure code.

Looking for more motivation

Ansible allows us to:

  • Use up-to-date versions
  • Simplify our structure, as Ansible provides fewer ways to overcomplicate things
  • Benefit from easier server installation and new region bootstrapping
  • Run playbooks from any given branch on any given VM
  • Use push and pull models at the same time

Another argument for adopting Ansible is that Chef is considered dated, whereas Ansible is still gaining traction. Adopting it allows our engineers to use modern tools and makes it easier to hire new talent. After many meetings, demos, how-tos and hands-on sessions within our team, we made the decision to go ahead and move our infrastructure management to Ansible.

Previous use of Ansible

Although it may seem like we had no previous experience with Ansible, that was not the case. We used Ansible for network and hardware management, as well as for installing and upgrading our on-premises OpenStack.

Architecture

The default model for Ansible is push, but in our case, using a management host to manage our infrastructure at our scale wasn't feasible. Instead, we decided to use the pull model, which is similar to Chef's, and run Ansible-pull from the master branch every hour. To streamline the process, we created a small wrapper around Ansible-pull, called Ansible-client, which triggers Ansible-pull with predefined parameters.
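
For illustration, here's a minimal sketch of what such a wrapper can look like, assuming Python and placeholder values for the repository URL, inventory path and playbook name (the flags are standard ansible-pull options):

```python
#!/usr/bin/env python3
"""Sketch of an Ansible-client style wrapper around ansible-pull.

The repository URL, inventory path and playbook name are placeholders.
"""
import subprocess
import sys

# Predefined parameters baked into the wrapper so every VM runs
# ansible-pull the same way.
REPO_URL = "git@github.example.com:infra/ansible.git"  # placeholder
BRANCH = "master"
INVENTORY = "/etc/ansible/dynamic_inventory.py"        # placeholder
PLAYBOOK = "site.yml"                                  # placeholder

def main() -> int:
    cmd = [
        "ansible-pull",
        "--url", REPO_URL,
        "--checkout", BRANCH,
        "--inventory", INVENTORY,
        "--only-if-changed",   # skip the run if the repo hasn't changed
        PLAYBOOK,
    ]
    # Scheduling (e.g. an hourly cron job or systemd timer) is assumed
    # to live outside this script.
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(main())
```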

Challenges and solutions

Before we could use Ansible in production, several issues needed to be addressed:

Inventory

The only way to use Ansible was with a dynamic inventory. Virtual machines need to find themselves in the Ansible inventory within a few seconds of being created; otherwise, Ansible-pull would fail. The built-in AWS inventory worked for us out of the box, but we had to write our own inventory for OpenStack, as the native OpenStack inventory was too slow.
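
For context, a custom dynamic inventory is simply an executable that Ansible invokes with --list (and optionally --host) and that prints JSON. A stripped-down sketch of that shape, with the OpenStack API query replaced by a stub:

```python
#!/usr/bin/env python3
"""Sketch of a custom dynamic inventory script; the OpenStack query is stubbed."""
import json
import sys

def fetch_openstack_hosts():
    # In the real script this would query the OpenStack API directly
    # (the stock inventory was too slow for us); stubbed out here.
    return {"web-1": {"env": "dev"}, "web-2": {"env": "dev"}}

def build_inventory():
    hosts = fetch_openstack_hosts()
    return {
        "all": {"hosts": sorted(hosts)},
        # _meta lets Ansible skip calling --host for every single machine.
        "_meta": {"hostvars": hosts},
    }

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--host":
        print(json.dumps({}))  # per-host vars are already served via _meta
    else:
        print(json.dumps(build_inventory()))
```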

Source of Truth

To always know which variables and roles are applied to a virtual machine, we needed a source of truth. With Chef, all nodes and their attributes are stored in the Chef database, whereas Ansible has no equivalent by default. We decided to use the cloud itself as the source of truth: we query node lists and tags through the AWS and OpenStack APIs and always trust this information.
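
On the AWS side, for example, node lists and tags can be pulled straight from the EC2 API. A minimal sketch using boto3; the region and the way tags are handled are illustrative, not our exact setup:

```python
"""Sketch: treating the cloud API as the source of truth for nodes and tags."""
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # region is illustrative

def list_nodes_with_tags():
    nodes = {}
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                nodes[instance["InstanceId"]] = tags
    return nodes

if __name__ == "__main__":
    for node_id, tags in list_nodes_with_tags().items():
        print(node_id, tags)
```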

Reports

For visibility, we had to write several Ansible callbacks.

InfluxDB callback sends Ansible-pull statistics (see the sketch after this list):

  • How many tasks were changed
  • How many tasks were okay
  • How many failed, if any
  • How much time an Ansible run took

Statistics from all Ansible-clients can easily be checked in a Grafana dashboard for any time range.
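
A minimal sketch of a stats callback along these lines, using the influxdb Python client; the connection details and measurement name are placeholders, not our actual configuration:

```python
"""Sketch of an Ansible callback plugin that ships run statistics to InfluxDB.
Connection details and the measurement name are placeholders."""
import time

from ansible.plugins.callback import CallbackBase
from influxdb import InfluxDBClient  # pip install influxdb

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = "notification"
    CALLBACK_NAME = "influxdb_stats"

    def __init__(self):
        super().__init__()
        self._start = time.time()
        self._client = InfluxDBClient(host="influxdb.example.com", port=8086,
                                      database="ansible")  # placeholders

    def v2_playbook_on_stats(self, stats):
        duration = time.time() - self._start
        for host in sorted(stats.processed.keys()):
            summary = stats.summarize(host)  # ok / changed / failures / ...
            self._client.write_points([{
                "measurement": "ansible_pull",
                "tags": {"host": host},
                "fields": {
                    "ok": summary["ok"],
                    "changed": summary["changed"],
                    "failed": summary["failures"],
                    "duration_s": duration,
                },
            }])
```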

Slack callback sends reports about failed and changed tasks. Unfortunately, there’s no way for callbacks to identify whether a failed task was rescued later, but we can work through that.

Attribute search in Ansible

The ability to search nodes by their attributes is a crucial feature for infrastructure teams. While Chef offers this feature out of the box, Ansible doesn't. To solve this problem, we created a role, included in all playbooks, that pushes selected node variables to CouchDB. This has resulted in a database (Ansible-db) with a wealth of useful information.

For example, we can execute a query to list all Kafka nodes that aren't running on Ubuntu20. This attribute search is even more convenient than in Chef, as each Chef server can only see nodes from its own environment. To get a complete list in Chef, we had to loop over 20 servers, whereas, with Ansible-db, we can achieve the same result in a matter of seconds using a single query. Replacing Chef queries with Ansible-db in lots of scripts was a trivial task.
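
As an illustration, such a query can be expressed against CouchDB's Mango /_find endpoint. The database URL and document field names below are assumptions about the schema, not our exact ones:

```python
"""Sketch: querying an Ansible-db style CouchDB for Kafka nodes not on Ubuntu 20.04.
The URL and field names are assumptions about the schema."""
import requests

COUCHDB_URL = "http://couchdb.example.com:5984/ansible-db/_find"  # placeholder

query = {
    "selector": {
        "roles": {"$elemMatch": {"$eq": "kafka"}},          # node runs the kafka role
        "ansible_distribution_release": {"$ne": "focal"},   # i.e. not Ubuntu 20.04
    },
    "fields": ["_id", "ansible_distribution_release"],
}

resp = requests.post(COUCHDB_URL, json=query, timeout=10)
resp.raise_for_status()
for doc in resp.json()["docs"]:
    print(doc["_id"], doc["ansible_distribution_release"])
```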

Secrets management

Ansible-vault, the built-in tool for storing secrets in Ansible, wasn't suitable for our needs: we have too many secrets and a large team, so it was only a matter of time before an unencrypted password would be pushed to GitHub.

Therefore, we decided to use HashiCorp Vault, which requires us to encrypt only one token with Ansible-vault at the time of environment creation; all other secrets are stored in a central, secure location. This approach posed its own challenges, such as high availability, backups and environment isolation (ensuring that virtual machines from one env can't access secrets from another). As a side note, we also wanted multiple environments to be able to access some common secrets easily; how we did that is a topic for a separate post.
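
To make the flow concrete, here's a sketch using the hvac Python client; the Vault address, mount point and secret path are placeholders, and the single Ansible-vault-encrypted token is assumed to have already been decrypted to disk:

```python
"""Sketch: fetching a secret from HashiCorp Vault with hvac.
Address, mount point, path and token location are placeholders."""
import hvac

# The only secret kept in the repo (encrypted with Ansible-vault) is the
# token; here we assume it has already been decrypted to a local file.
with open("/etc/ansible/vault-token") as f:
    token = f.read().strip()

client = hvac.Client(url="https://vault.example.com:8200", token=token)
assert client.is_authenticated()

# KV v2 read; mount point and path are illustrative.
resp = client.secrets.kv.v2.read_secret_version(
    mount_point="secret", path="myapp/database"
)
print(resp["data"]["data"]["password"])
```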

Inventory for local runs

One of the benefits of Ansible is the ability to run it from an engineer's laptop in check mode. However, to do this we needed an inventory, and dynamically querying all our different clouds was not an option. We decided to collect the inventory for every environment and push it to a central location, where all the "small" inventories are combined into one huge JSON file. Now, all engineers need to do is pull this precompiled inventory to their laptops. The inventory doesn't have to be perfectly up to date, as we only use it for development and troubleshooting on long-existing virtual machines; it's not for service provisioning or deploying new machines.
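
The merge step itself can be simple. A sketch of combining per-environment inventory dumps into one JSON file; the directory layout and output path are illustrative:

```python
"""Sketch: merging per-environment inventory JSON dumps into one file.
The directory layout and output path are illustrative."""
import json
from pathlib import Path

merged = {"_meta": {"hostvars": {}}}

for path in sorted(Path("inventories").glob("*.json")):  # one dump per env
    env = json.loads(path.read_text())
    merged["_meta"]["hostvars"].update(env.get("_meta", {}).get("hostvars", {}))
    for group, data in env.items():
        if group == "_meta":
            continue
        hosts = merged.setdefault(group, {"hosts": []})["hosts"]
        # Keep the host list deduplicated across environments.
        hosts.extend(h for h in data.get("hosts", []) if h not in hosts)

Path("combined_inventory.json").write_text(json.dumps(merged, indent=2))
```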

Simultaneous runs of Chef and Ansible

For a while, we had to run both Chef and Ansible in parallel, which meant duplicating work. However, it was manageable.

Migration strategy

Each sub-team had a few services to move to Ansible, and it was up to the service owners to choose how to migrate. Simple services could be rewritten in a single pull request: Chef-client was removed from the virtual machines, and Ansible took over managing the same things. Larger services required several steps, moving chunks of code to Ansible and gradually removing them from Chef. Some teams decided to build new Ansible-managed machines and destroy the Chef-managed ones, while others managed the same things with Chef and Ansible simultaneously.

Summary

Although the migration isn't complete, we have already noticed several benefits. The speed of development has increased significantly, as new code can be tested right from the engineer's laptop in check mode. We also have more convenient tools for querying data and better monitoring of our infrastructure management. Not to mention, building new regions has become much easier. So far, most of the problems we were facing have been solved, and we are steadily moving towards our goal of deprecating Chef without any significant roadblocks.
