Our experience with Ansible: Strengths and weaknesses

Ghasemi
Sahab
Apr 21, 2019

About two years ago we started experimenting with Ansible in one of our projects. Ansible is now widely used in three of our projects, automating every aspect of our provisioning and deployment processes: basic setup (configuring OS parameters, formatting and mounting disks, setting up LDAP, DNS and NTP, creating bridges and VMs), deploying the required infrastructure (Kafka, Hadoop, our monitoring solution and Kubernetes), and finally deploying our own artifacts and preparing their working environments.

Ansible is agentless: it works by pushing its Python-based modules to remote hosts, where they run the tasks. This architectural choice makes Ansible easy to use and well suited to playbooks that require coordination across hosts, but the central, stateless design implies certain weaknesses for some use cases.
Here is what we’ve seen so far of Ansible’s strengths and weaknesses, with emphasis on the parts that are easy to overlook in the early stages of working with it.

1. The Ansible language is simple and has a clear structure. In our environment it was warmly welcomed by both developers and devops, apart from some complaints from devops: “Why should we learn a new language when we can do it with shell scripts and parallel SSH?” The simplicity encourages developers to write their own deployment scripts instead of delegating the task to devops, and the clear structure helps devops write more structured and maintainable code than a bunch of shell scripts. Having a common language also promotes collaboration between developers and devops, e.g. they can work on a shared codebase and review each other’s work.
2. The separation of playbook and inventory separates the description of the deployment process from the environment in which the deployment happens. With this mechanism, we can write the deployment process once and deploy to different environments by supplying the properties of each environment. We keep all of this in a single codebase under version control and code review.

The ability to describe the deployment process step by step, with sync points between steps, is one of Ansible’s strengths and is essential for initial provisioning and deployment. Idempotency, combined with controls over how the deployment proceeds (such as serial runs), also makes Ansible suitable for upgrade and maintenance tasks.
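
As a minimal sketch of these two points, the fragment below shows one environment's inventory and a playbook that does not mention any environment-specific detail. All host names, group names, file names, and variable values are hypothetical and not taken from our projects; the serial keyword is the control over rolling runs referred to above.

```yaml
# File 1: inventories/staging.yml (hypothetical environment description)
all:
  children:
    kafka:
      hosts:
        kafka-stg-1.example.com:
        kafka-stg-2.example.com:
      vars:
        kafka_heap_size: 1g   # illustrative per-environment property
---
# File 2: kafka.yml (hypothetical playbook, reused for every environment)
- name: Rolling update of Kafka brokers
  hosts: kafka
  become: true
  serial: 1                   # change one broker at a time
  tasks:
    - name: Render broker environment file
      template:
        src: kafka-env.sh.j2  # assumed template that uses kafka_heap_size
        dest: /etc/kafka/kafka-env.sh
      notify: Restart kafka
  handlers:
    - name: Restart kafka
      service:
        name: kafka
        state: restarted
```

Assuming this layout, the same playbook would be run against staging with `ansible-playbook -i inventories/staging.yml kafka.yml`, and against another environment by pointing `-i` at a different inventory file. Because the template task is idempotent, a rerun with no changes would not restart any broker.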
3. There is an active community around Ansible. Besides the core, the community mostly contributes modules and roles, which cover a wide range of technologies. Even so, we found that while most of the modules are very helpful to us, we usually have to modify the roles to make them compatible with our standards, and sometimes writing a role from scratch is the better option.
4. When the number of servers grows, say to the order of hundreds, deployment time with Ansible exceeds our patience, even with pipelining enabled, a large enough number of forks, and even the free strategy. The main problem is that running each module on a remote host incurs some setup time, and we usually have a lot of tasks. This gets worse in large deployments where some servers behave strangely or have network issues. To work around this, we use ansible-pull to run some of our most time-consuming roles. With ansible-pull the whole role runs on the remote side, but the run is limited to that single host. Running in pull mode has its own consequences: for example, run_once, ansible_play_hosts and local_action behave differently than in normal runs, so you must make your roles pull-friendly or sometimes separate the pull-friendly parts from the rest. Clearly, these measures reduce the simplicity of working with Ansible.
5. When a task runs on a set of hosts, some of those hosts might not be reachable. In large deployments we can’t assume that all hosts are always available, so we must skip the unreachable ones. The skipped hosts must be recorded somewhere so that we can rerun the missed parts when they become available again. Pull-based approaches have no such issue (but they solve a simpler problem). A somewhat similar problem concerns handlers. Ansible’s handler mechanism defers a task to the end of the play, so that it runs once rather than several times when changes are detected. The problem arises when a handler has been marked to fire but never gets the chance, e.g. because the host is shut down. Since the mark is not recorded in a durable store, the state is lost, and even running the play again may not fire the handler, because the task that notifies it no longer reports a change.
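
One partial mitigation for the handler problem described in the last point, offered as a sketch rather than as what we run in production, is to flush handlers right after the task that notifies them instead of waiting for the end of the play. The play, file, and service names below are hypothetical.

```yaml
# Hypothetical play; file names, paths, and the service name are illustrative.
- name: Configure NTP
  hosts: all
  become: true
  tasks:
    - name: Deploy ntp.conf
      template:
        src: ntp.conf.j2
        dest: /etc/ntp.conf
      notify: Restart ntp

    # Run any notified handlers now rather than at the end of the play,
    # so a later failure or shutdown is less likely to lose the restart.
    - name: Flush handlers
      meta: flush_handlers

    - name: Continue with other, possibly slow, tasks
      debug:
        msg: "Remaining tasks would run here"
  handlers:
    - name: Restart ntp
      service:
        name: ntp
        state: restarted
```

This only narrows the window in which a pending notification can be lost; it does not make the notification durable. If the host goes down between the template task and the flush, the restart is still missed.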

Two years ago we had some difficulty choosing the right initial solution for our provisioning and deployment requirements. After we made the choice, our understanding of Ansible’s strengths and weaknesses naturally became deeper and more precise over time. We’ve learned that some of these weaknesses can be covered, either by changing the way we look at the problem or by developing simple tools, but some are more fundamental. We are now mostly happy with the choice and are working on the remaining problems. I hope you make the right choice for your projects too. Good luck!
