Rules for writing Ansible roles

Ghasemi
Published in Sahab
6 min read · May 31, 2019

Ansible roles are self-contained, reusable packages built to deploy or configure a specific piece of software or tool (we call it the target service here). Roles can easily be shared through platforms like Ansible Galaxy, and you can find a role for almost everything. Indeed, this is a major advantage of using a configuration management tool like Ansible over writing everything from scratch with custom scripts.
Unfortunately, since there are no strict guidelines or processes for writing and distributing roles, their quality varies widely, and we often find it difficult to use diverse roles in one project without changing them. In our company we agreed on 32 rules to follow when someone develops a custom role. These rules are checked in a peer review process; after the review completes, an automatic CI process tests and packages the role. Here is a compilation of some of those rules, grouped into 5 more general guidelines.

Defaults
Nothing is more surprising than finding a variable in the middle of a role that is not mentioned anywhere else. Define every variable that a user might change in your defaults; some of them will have default values and some will not. This makes defaults the single source that defines all of your role’s inputs. Above each variable, put comments that describe it, with samples and the possible values or expected structure. This way we can generate docs systematically from our roles.
If a variable is required or must have a particular value, add some asserts at the beginning of your role logic and fail if the value is not what you expect.
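As a rough sketch of these two points, the fragment below shows a documented defaults file followed by an assert at the top of the role's tasks. The role name `myservice` and both variable names are hypothetical, chosen only for illustration.

```yaml
# defaults/main.yml (hypothetical role "myservice")
---
# TCP port the service listens on. Integer between 1 and 65535.
# Example: myservice_port: 9090
myservice_port: 8080

# Directory where the service stores its data. REQUIRED, no default.
# Example: myservice_data_dir: /data/myservice
# myservice_data_dir:

# tasks/main.yml (first task of the role)
---
- name: Validate role inputs before doing anything else
  ansible.builtin.assert:
    that:
      - myservice_data_dir is defined
      - myservice_data_dir is string
      - myservice_data_dir | length > 0
      - myservice_port | int > 0
      - myservice_port | int < 65536
    fail_msg: "Set myservice_data_dir and use a myservice_port between 1 and 65535"
    quiet: true
```

Keeping the required variable in the defaults file as a commented-out entry is one possible convention: the variable stays documented in the single source of inputs without silently receiving a value.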
Another important consideration is which vars should have default values. Initially we followed the convention-over-configuration principle and provided default values for most vars, e.g. the default location of the service’s data directory is under /var. This makes the role simple to use, since the user does not need to provide much input for it to work correctly. It shines when you have two roles a and b where role b needs the service provided by role a: role b assumes the default values of role a, such as its port and host group, so you can put both roles into your playbook and they start working with each other without much configuration. Over time we realized that this has surprising consequences, since those defaults hide some facts from the role user, e.g. the data gradually fills the OS disk unnoticed, or role b starts working with an unintended service. We first tried tagging some vars as important, but that was not enough. Our current convention is that if you feel a var needs special attention, make it required (don’t give it a default, and ensure it is defined at the beginning of your role logic).
Finally, if the service configured by your role has its own defaults for some settings and you make those settings configurable, stick with the service defaults unless you have a very strong reason to change them, or make the variable required so that your user must provide it.

Test it
Without tests, your role is just a bunch of static files. A test ensures that your role works and also demonstrates a sample of typical usage. With such a test, others can improve your role or fix bugs without breaking anything. Molecule is a common choice of test tool these days. With Molecule you define your test environment; it provisions the required VMs and applies a preparation playbook to them. It then runs your tests, written as Ansible playbooks or Python code.
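To give an idea of what a test written as an Ansible playbook can look like, here is a minimal sketch of a verify playbook. It assumes a Molecule scenario named `default` that uses the Ansible verifier; the port number is a hypothetical value, not something taken from a real role.

```yaml
# molecule/default/verify.yml (assumed scenario layout)
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check that the service accepts connections on its port
      ansible.builtin.wait_for:
        host: 127.0.0.1
        port: 8080        # hypothetical port of the target service
        timeout: 10
```

The playbook runs against the instances that Molecule provisioned, after the role itself has been applied to them.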
Besides this type of test and some form of lint checking, both of which run at build time, a role must have two other types of tests that run at runtime: pre-checks and post-checks. Pre-checks test that the role gets the input it expects and that its prerequisites are available in the target environment. Post-checks, on the other hand, test that the role has been applied to the target environment with the expected effects. As an example, suppose you are writing a role that configures the OS to work with an LDAP server. In the pre-check you must test the connection to the LDAP server, and in the post-check you can verify that your configs are correct by checking that ldapsearch actually works. In your Molecule test you can bring up an LDAP server, add an entry, and ensure that after your LDAP client role is applied, one can get that entry with the same value.
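The LDAP example could be sketched roughly as follows. The variable names (`ldap_client_server_host`, `ldap_client_server_port`, `ldap_client_base_dn`) are hypothetical, and the post-check assumes the server allows an anonymous simple bind; adapt both to your environment.

```yaml
# tasks/main.yml (excerpt); all variable names are hypothetical
---
- name: Pre-check | LDAP server is reachable from the target host
  ansible.builtin.wait_for:
    host: "{{ ldap_client_server_host }}"
    port: "{{ ldap_client_server_port }}"
    timeout: 5
  tags: [precheck]

# The tasks that install and configure the LDAP client would go here.

- name: Post-check | ldapsearch can read the base entry
  ansible.builtin.command:
    argv:
      - ldapsearch
      - -x            # simple (here: anonymous) authentication
      - -H
      - "ldap://{{ ldap_client_server_host }}:{{ ldap_client_server_port }}"
      - -b
      - "{{ ldap_client_base_dn }}"
      - -s
      - base
  changed_when: false   # a read-only check must never report a change
  tags: [postcheck]
```

Both checks fail the play when the condition does not hold, so a broken prerequisite or a broken result is reported on every run, not only in CI.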
Since post-checks run on the target environment every time your role is applied, prefer putting tests in your post-checks rather than in your Molecule test, unless the test makes the environment dirty or needs assumptions or prerequisites that are not necessarily available in every environment.

Make it configurable
One common reason for having to change a role is that you cannot configure the target service as you wish. There are generally two approaches to allowing custom configuration by the user of your role: either define a variable for each config item, or take a dictionary of key-value pairs as config items. The latter has the advantage of not limiting the user to a fixed set of items, and the names of the config items (the keys in that dict) are the exact names used by the target service. The former, on the other hand, results in clearer config items, each with appropriate defaults, docs, and checks, and you can cleanly use such a var in various places of your role. Given these trade-offs, it seems best to define important or common vars with the former approach but allow custom configs through the latter.
Since vars in Ansible are internally flattened into a single namespace for each host, start each var name with the name of the role. This way you avoid clashes in large inventories.
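A minimal sketch of combining the two approaches, with every variable prefixed by the role name, might look like this. The role name `myservice`, the config file path, and the `key = value` file format are all assumptions made for illustration.

```yaml
# defaults/main.yml (hypothetical role "myservice")
---
# Common setting exposed as its own documented variable.
myservice_port: 8080

# Any other setting, keyed by the exact name the target service uses.
# Example: myservice_extra_config: {max_connections: 200, log_level: warn}
myservice_extra_config: {}

# tasks/main.yml (excerpt)
---
- name: Render the service configuration file
  ansible.builtin.copy:
    dest: /etc/myservice/myservice.conf   # hypothetical path
    mode: "0644"
    content: |
      port = {{ myservice_port }}
      {% for key, value in myservice_extra_config | dictsort %}
      {{ key }} = {{ value }}
      {% endfor %}
```

Sorting the dictionary keeps the rendered file stable between runs, so the task reports a change only when a value actually differs.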

Keep it simple
This may sound contradictory to the rule above, but don’t define too many vars or add unnecessary flexibility to your role, since that makes the role complex and hard to understand and use. If you need a var internally throughout your role, put it in vars. To keep roles simple, give your role one specific goal. Don’t define a super role that does many things; instead, break it into several smaller roles.

Consider practical issues
We want to use roles in production environments with lots of unpredictable events, so we must write our roles with this concern in mind. The list of considerations here is open-ended; some of them are as follows.

  • If your role calls an API of some external service, you must use an appropriate timeout and retry on failure. If there are multiple instances of that service in your environment, you must either put the service behind some HA mechanism or implement switching between instances in your role.
  • If your role takes a list of hosts as input, some of them might be unreachable. So, for example, if your pre-check verifies that some service is available on all of them, handle the case where some are unavailable gracefully instead of failing, unless the service of your role strictly requires all of them.
  • Tag important or time-consuming parts of your role, such as pre-checks and post-checks, with appropriate tags so the user can skip them or run only those parts when needed.
  • We sometimes need to rerun a role on a set of hosts because of configuration changes or just to ensure everything is OK. In such situations we are interested in what new changes the role causes. Hence, your role must report a change only when something has actually changed. With Molecule you can automatically run an idempotency test, which detects some of these issues by rerunning your role and asserting that no change is reported.
  • If your role does something dangerous, such as formatting or deleting, either refactor those parts into a separate role or, if that is not possible, guard them with vars or prompts. In other words, make it hard for your user to do unwanted or surprising things.
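Several of the points above can be illustrated with a short task excerpt. This is only a sketch: the variable names, the `/health` endpoint, and the backend port are hypothetical, and your own role will need different checks.

```yaml
# tasks/main.yml (excerpt); all names, URLs and ports are hypothetical
---
- name: Pre-check | external API answers, with timeout and retries
  ansible.builtin.uri:
    url: "{{ myservice_registry_url }}/health"
    method: GET
    status_code: 200
    timeout: 10
  register: myservice_health
  until: myservice_health is succeeded
  retries: 5
  delay: 3
  tags: [precheck]

- name: Pre-check | probe every backend, tolerating individual failures
  ansible.builtin.wait_for:
    host: "{{ item }}"
    port: 9200
    timeout: 3
  loop: "{{ myservice_backend_hosts }}"
  register: myservice_backend_probe
  ignore_errors: true
  tags: [precheck]

- name: Pre-check | require at least one reachable backend
  ansible.builtin.assert:
    that:
      - myservice_backend_probe.results | select('succeeded') | list | length > 0
    fail_msg: "None of the hosts in myservice_backend_hosts is reachable"
  tags: [precheck]

- name: Wipe the data directory (dangerous, opt-in only)
  ansible.builtin.file:
    path: "{{ myservice_data_dir }}"
    state: absent
  when: myservice_allow_data_wipe | default(false) | bool
  tags: [never, wipe]
```

With this layout a user can skip the checks with `--skip-tags precheck`. The destructive task runs only when it is requested explicitly with `--tags wipe` and the guard variable is also set to true, because tasks tagged `never` are skipped by default.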

Some of these rules are debatable and you may not agree with all of them; that’s OK. The important point is to build a baseline for further consideration and discussion, and to improve this baseline over time until we reach a solid foundation and set of patterns for developing roles. Until then, good luck!
