Ugly way to cram your staging into random servers (Ansible)

George Shuklin
Published in OpsOps · Apr 18, 2020

I call this ‘dirty Ansible’. It’s not the shiny best practice, but it’s so convenient that it’s hard to resist.

The problem: You run your playbooks against a (normal) staging. There is an environment for staging and an environment for production. You’ve tested the code on staging, and it looks good. But who’s gonna check your production inventory?

From my Ansible experience, errors in the environment are the most devastating ones. If you have a wrong task or a wrong variable in a template, the task just fails, and that’s all.

But imagine for a second that you put a string instead of a list into the inventory, and someone iterates over that …string. Good luck restoring your production after that. Or maybe you’ve just forgotten to add that new group to the production inventory.
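To make it concrete, here is a hypothetical sketch (the variable, file names and addresses are made up) of how such a mistake slips through without any task failing:

# group_vars/web.yaml — a string where a list was intended
allowed_ips: "10.0.0.5"        # should have been ["10.0.0.5", "10.0.0.6"]

# templates/allow.conf.j2 — Jinja2 happily iterates over the string, character by character
{% for ip in allowed_ips %}
allow {{ ip }};
{% endfor %}

The template task finishes green, but the rendered config contains lines like ‘allow 1;’, ‘allow 0;’, ‘allow .;’ — and that goes straight to production.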

Anyway, how do you test a production inventory? If you change anything in the production inventory (hosts, etc.) to make it testable, it’s no longer production.

Here is the dirty Ansible trick I started to use a few weeks ago for that. I call it ‘hijacking ansible_host’.

The core idea:

ansible-playbook -i prod.yaml site.yaml -f 1 -e ansible_host=my.VM

There is so much in this line that I need to explain it piece by piece.

  1. We use the production inventory. All variables, hostnames, etc.
  2. We assign ansible_host to some sacrificial host for test purposes. That means that for every host in the production inventory, Ansible goes to the sacrificial VM. Emphasis on every. If you have 100 hosts in your production inventory, all of them suddenly become your ‘my.VM’ from Ansible’s point of view.
  3. -f 1 is a neat trick to avoid lock conflicts between ‘hosts’. If two hosts want to install some package, normally there is going to be a lock conflict in the package manager (as both of them are actually executed on the same my.VM). -f 1 serializes all access, therefore removing the conflicts.
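Before running the whole site.yaml like this, one cheap way to sanity-check the setup (my addition, not part of the original trick — it just reuses the same override with Ansible’s ad-hoc ping module) is:

ansible all -i prod.yaml -f 1 -e ansible_host=my.VM -m ping

Every host from prod.yaml should answer, and they all answer from my.VM.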

Basically, that’s it. You have a self-contained ‘production’ on a single VM. If there is some craziness in the production inventory, it will pop up here. It’s not a silver bullet (e.g. if you have a resource ID conflict, say /dev/sda claimed by every host, this trick will fail), but it is very similar to a smoke test.

If you have such a resource conflict and your servers can work alone (without their neighbors), you can use a secondary form of the same trick:

ansible-playbook -i prod.yaml site.yaml -f 1 -e ansible_host=my.VM --limit srv1
# reinstall srv1
ansible-playbook -i prod.yaml site.yaml -f 1 -e ansible_host=my.VM --limit srv2
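If there are more than a couple of servers, the same idea can be wrapped in a trivial shell loop (a sketch — the host names and the reinstall step are placeholders for whatever your site uses):

for srv in srv1 srv2 srv3; do
    ansible-playbook -i prod.yaml site.yaml -f 1 -e ansible_host=my.VM --limit "$srv"
    # reinstall/reimage my.VM here before the next iteration (site-specific step)
done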

There is one more small addition I use here: I assign the production IP addresses to the sacrificial VM. It’s safe in my case (we have port security for VMs, which prohibits egress traffic from unauthorized IPs), and it allows me to get a full (non-functional from the outside) copy of my production.

The playbook for this is simple:

---
- hosts: all
  gather_facts: true
  tasks:
    - name: check if there is a single server
      run_once: true
      assert:
        that:
          - ansible_play_hosts_all|length == 1
        msg: use one server at a time
    - name: Assign hijacked IP
      become: true
      ip_address:
        address: "{{ lookup('dig', inventory_hostname) }}"
        name: '{{ ansible_default_ipv4.interface }}'

It requires the production hostnames to resolve, since the dig lookup turns inventory_hostname into its production IP.
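If the hostnames don’t resolve from the machine running Ansible, a variant of the same task (a sketch; prod_ip is a hypothetical per-host variable you would have to define yourself) can take the address straight from the inventory instead of the dig lookup:

    - name: Assign hijacked IP (no-DNS variant)
      become: true
      ip_address:
        address: "{{ prod_ip }}"                      # hypothetical host var holding the production IP
        name: '{{ ansible_default_ipv4.interface }}'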

The combination of all those tricks allows me to run a final smoke test before the actual deploy into production. It’s dirty, but it’s good.


I work at Servers.com, most of my stories are about Ansible, Ceph, Python, Openstack and Linux. My hobby is Rust.