Understanding Ansible Strategy

I believe this must be one of the least explored corners of Ansible and, to be honest, one of the most vaguely documented as well. Let's start with: what is an Ansible strategy? A strategy is how you control the execution flow of a playbook, documented here. Why would you want to control it? For 90%+ of use cases the default works fine, and it just works.

The most common use case for controlling the flow is when you don't want the same action performed on more than one server at a time, for example, rebooting a node of a cluster. For the most part that always works, magically, without any understanding of the inner workings. As per the docs, the serial attribute at the play level and/or throttle at the task level can control the parallelism of the playbook execution.
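As a sketch of those two controls working together (hypothetical group and task names, for illustration only): serial batches the hosts in the play, while throttle caps one specific task.

```yaml
# Hypothetical sketch of serial vs throttle; "cluster" is an assumed group.
- name: rolling cluster reboot
  hosts: cluster
  serial: 2                  # process the cluster two hosts per batch
  tasks:
    - name: reboot node
      ansible.builtin.reboot:
      throttle: 1            # never reboot more than one node at a time
```

Here serial limits how many hosts are in flight for the whole play, and throttle further limits just the reboot task within each batch.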

Recently I got the unique opportunity to work on a playbook to upgrade the firmware on a large number of network switches. In this case the challenge was to make the playbook go fast, or as fast as it possibly can, which proved to be more challenging than I thought. This is a breakdown of what I learned in that exercise.

But I do not have 100 servers to test this scenario, so I came up with the following.

✓ I need an inventory with 100 hosts… done

for x in {1..100}; do echo host$x; done > inventory

✓ I need a task which runs for a long time, without actually waiting 15 minutes like the real TFTP copy (a 10-second sleep is enough for this demonstration)

Note: I'm using the same script behind two different symlinks, so I can change one script easily while still identifying the two commands individually. I'm logging the output to a file, so I can see when each command started and finished.

$ cat longcommand.sh
#!/bin/bash
inventory_host=$1
printf "$inventory_host : $0 starting\n" >> taillog
sleep 10
printf "$inventory_host : $0 finished\n" >> taillog

$ ln -s longcommand.sh firstcommand
$ ln -s longcommand.sh secondcommand

✓ I need a playbook with a few tasks, preferably long running…

Note: I do not have 100 servers to test this playbook, so I'm going to delegate all my tasks to localhost. This way I'm tricking the Ansible engine into creating multiple task queues (TQMs) without actually having real inventory targets.

---
- name: long running job
  gather_facts: false
  hosts: all
  tasks:
    - name: first task
      command: "./firstcommand {{ inventory_hostname }}"
      delegate_to: localhost
    - name: second task
      command: "./secondcommand {{ inventory_hostname }}"
      delegate_to: localhost

This setup allowed me to monitor what happens under the hood while the playbook is running, by tailing the taillog file.

Average day for a playbook

I can run this playbook, just like any other:

ansible-playbook -i inventory playbook.yml

Without any additional configuration, this is what you will see when you run the playbook:

host3 : ./firstcommand starting
host1 : ./firstcommand starting
host4 : ./firstcommand starting
host5 : ./firstcommand starting
host2 : ./firstcommand starting
host3 : ./firstcommand finished
host1 : ./firstcommand finished
host5 : ./firstcommand finished
host4 : ./firstcommand finished
host2 : ./firstcommand finished
host6 : ./firstcommand starting
host7 : ./firstcommand starting
host8 : ./firstcommand starting
host9 : ./firstcommand starting
host10 : ./firstcommand starting
...
...

Looking at this output:

  1. Only firstcommand runs at first: although host1 to host5 completed the first task, the playbook moves on to host6 through host10 with the first command, rather than starting the second task.
  2. A maximum of 5 tasks ran at any given time.
  3. And magically all my tasks are synchronised, which is nice for demonstrating the concept, but in real life jobs don't finish at exactly the same time even when they start together.

I don’t have enough Forks

Let's start with the limit of 5 jobs. This is controlled by the Ansible setting forks in the ansible.cfg file or the environment variable ANSIBLE_FORKS, which translates to 5 worker queues Ansible tasks can run on. Both serial and throttle are still bound by this limit. For example, if I set serial to 10 in my playbook without changing forks:
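For reference, forks can be raised in ansible.cfg (the config setting behind DEFAULT_FORKS) or overridden per run with the ANSIBLE_FORKS environment variable:

```ini
# ansible.cfg
[defaults]
forks = 20
```

The same effect without touching the file: ANSIBLE_FORKS=20 ansible-playbook -i inventory playbook.yml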

---
- name: long running job
  gather_facts: false
  hosts: all
  serial: 10
  tasks:
    - name: first task
...
...

My results in the taillog remain the same:

host3 : ./firstcommand starting
host4 : ./firstcommand starting
host5 : ./firstcommand starting
host1 : ./firstcommand starting
host2 : ./firstcommand starting

host3 : ./firstcommand finished
host2 : ./firstcommand finished
host5 : ./firstcommand finished
host4 : ./firstcommand finished
host1 : ./firstcommand finished
host8 : ./firstcommand starting
host6 : ./firstcommand starting
host7 : ./firstcommand starting
host9 : ./firstcommand starting
host10 : ./firstcommand starting
host7 : ./firstcommand finished
host6 : ./firstcommand finished
host8 : ./firstcommand finished
host9 : ./firstcommand finished
host10 : ./firstcommand finished
host1 : ./secondcommand starting
host3 : ./secondcommand starting
host4 : ./secondcommand starting
host5 : ./secondcommand starting
host2 : ./secondcommand starting
host1 : ./secondcommand finished
host2 : ./secondcommand finished
host5 : ./secondcommand finished
host3 : ./secondcommand finished
host4 : ./secondcommand finished
host9 : ./secondcommand starting
host6 : ./secondcommand starting
host7 : ./secondcommand starting
host10 : ./secondcommand starting
host8 : ./secondcommand starting
...
...

In this case, the playbook processes hosts in batches of 10, completing both tasks for one batch before moving to the next, but still runs only 5 tasks at a time because forks is unchanged.

Important: serial is merely a way of processing a playbook in batches of hosts rather than parallelising.

If you want more jobs to run at the same time, increase forks (or the ANSIBLE_FORKS environment variable).

Strategy

In my case, not only did I want to run 50 switch upgrades at any given time, I also wanted to reboot the switches that had successfully copied the firmware, without having to wait until the TFTP copy completed on all the switches.

Let's add some randomness to longcommand.sh, like the following:

#!/bin/bash
inventory_host=$1
printf "$inventory_host : $0 starting\n" >> taillog
sleep $((2 + RANDOM % 8))
printf "$inventory_host : $0 finished\n" >> taillog

This better simulates how real-world tasks run: starting together but finishing at different times. If we run the same playbook as before (serial: 10), my output looks like the following.

host2 : ./firstcommand starting
host3 : ./firstcommand starting
host1 : ./firstcommand starting
host5 : ./firstcommand starting
host4 : ./firstcommand starting
host1 : ./firstcommand finished
host6 : ./firstcommand starting

host2 : ./firstcommand finished
host7 : ./firstcommand starting
host3 : ./firstcommand finished
host8 : ./firstcommand starting
host4 : ./firstcommand finished
host5 : ./firstcommand finished
host9 : ./firstcommand starting
host10 : ./firstcommand starting
host6 : ./firstcommand finished
host7 : ./firstcommand finished
host9 : ./firstcommand finished
host8 : ./firstcommand finished
host10 : ./firstcommand finished
host1 : ./secondcommand starting
host2 : ./secondcommand starting
host4 : ./secondcommand starting
host3 : ./secondcommand starting
host5 : ./secondcommand starting
...
...

As we can see, when one host completes (host1) another host starts (host6), but the second task is not started until the whole batch is complete. In my case the playbook would not start rebooting my switches until all 10 switches had downloaded their firmware over TFTP… which is not the optimal path.

The flow of an Ansible playbook is governed by the strategy plugin in use, which defaults to linear. As of Ansible 2.10.7, Ansible ships 4 strategy plugins.

→ ansible-doc -t strategy -l
debug Executes tasks in interactive debug session
free Executes tasks without waiting for all hosts
host_pinned Executes tasks on each host without interruption
linear Executes tasks in a linear fashion
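Besides the per-play strategy keyword used below, the strategy can also be set globally in ansible.cfg (the DEFAULT_STRATEGY setting):

```ini
# ansible.cfg
[defaults]
strategy = free
```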

For the switch upgrades, what I need is the free strategy, where tasks can continue without waiting for all the hosts in the batch. Let's change the playbook to look like the following:

---
- name: long running job
  gather_facts: false
  strategy: free
  serial: 10
  hosts: all
  tasks:
    - name: first task
...
...

Running this playbook I get:

host5 : ./firstcommand starting
host2 : ./firstcommand starting
host3 : ./firstcommand starting
host1 : ./firstcommand starting
host4 : ./firstcommand starting
host5 : ./firstcommand finished
host1 : ./firstcommand finished
host6 : ./firstcommand starting
host7 : ./firstcommand starting
host2 : ./firstcommand finished
host8 : ./firstcommand starting
host8 : ./firstcommand finished
host9 : ./firstcommand starting
host6 : ./firstcommand finished
host10 : ./firstcommand starting
host10 : ./firstcommand finished
host9 : ./firstcommand finished
host1 : ./secondcommand starting
host3 : ./firstcommand finished

Still, this outcome is not very different from the previous one. Even with the free strategy, in this setup my second task did not start until all the first tasks had completed. The problem is that although some hosts have completed the first task, there are not enough free worker queues to start the second task. Which leads to the conclusion: for the free strategy to actually work, you need to tweak forks, ideally together with serial, to control the flow.

For tuning the free strategy, the best approach is to size the settings as follows:

forks (or ANSIBLE_FORKS env) > serial > throttle
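Putting that sizing rule together, a sketch of what the switch-upgrade play could look like (the task names and commands are hypothetical placeholders, not the actual upgrade playbook), with forks = 20 assumed in ansible.cfg:

```yaml
# Assumes forks = 20 in ansible.cfg; commands are placeholders.
- name: upgrade switches
  hosts: all
  gather_facts: false
  strategy: free
  serial: 10                 # at most 10 hosts in flight per batch
  tasks:
    - name: copy firmware via TFTP
      command: "./firstcommand {{ inventory_hostname }}"
      delegate_to: localhost
    - name: reboot switch
      command: "./secondcommand {{ inventory_hostname }}"
      delegate_to: localhost
      throttle: 5            # cap concurrent reboots below serial
```

With 20 forks, 10 hosts per batch, and at most 5 reboots at once, hosts that finish the copy early can move straight on to the reboot task without starving the copy task of workers.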

This round we will set forks = 20, which creates 20 worker queues. But we do not want the first task to consume all the slots, so we keep serial: 10. Looking at the taillog:

host4 : ./firstcommand starting
host5 : ./firstcommand starting
host3 : ./firstcommand starting
host1 : ./firstcommand starting
host9 : ./firstcommand starting
host2 : ./firstcommand starting
host6 : ./firstcommand starting
host8 : ./firstcommand starting
host7 : ./firstcommand starting
host10 : ./firstcommand starting
host4 : ./firstcommand finished
host6 : ./firstcommand finished
host4 : ./secondcommand starting
host6 : ./secondcommand starting
host5 : ./firstcommand finished
host1 : ./firstcommand finished
...
...

This execution path is truly free-flowing and dynamic. These settings are ideal for this playbook, however, and might not work for more complicated playbooks.

Conclusion

TL;DR: Ansible provides several mechanisms to control the flow of a playbook. Each attribute controls a specific aspect of the flow, but on its own each has limited use cases. To optimise the flow of a playbook, we need to consider multiple factors together, and hopefully this story helps you optimise yours. In a nutshell: if you're using the free strategy, you need more forks than your inventory size (or at least than your serial batch size) for it to actually perform as described.

P.S: All the scripts available @ https://github.com/simply-ansible/ansible-strategy

Kosala Atapattu (කෝසල අතපත්තු)

I'm an IT Consultant with over two decades of operational expertise. I'm a UNIX fanatic. I love Ansible. I love anything with containers. Pythonista!!