A Tale of Configuration Management

OpsLyft
Guardians of Cloud

--

As in any growing product startup, the tech stack evolves with time. It is iterated on over and over (sometimes rebuilt from scratch) as the company scales towards its engineering milestones. For us, these milestones were definitive engineering processes and practices, validated and confirmed to fit our use cases. One such evolution I experienced first-hand was in Configuration Management.
This article is about my Configuration Management journey at Indix HQ and how we ended up standardising on Ansible for infrastructure creation, provisioning and deployments.

Previous State

Circa August 2015, we at Indix used a variety of tools and languages for infrastructure creation, provisioning and deployment.

  • Infrastructure creation: AWS command-line tools, fog, knife-ec2
  • Provisioning: Chef, chef-solo
  • Deployment & orchestration: Capistrano, Fabric, shell scripts, Go-CD commands, beanstalk, etc.

There were some shortcomings with this approach:

  • Failures happened because there was no tight coupling between infra creation, provisioning and deployment; a failure at one stage led to cascading failures later.
  • The tooling needed to get things into production was complex. Expecting most developers to learn multiple tools and pick up a new language (Ruby / Python) was not optimal. This led to less participation from the wider group of developers in taking care of the systems they built, and caused a lot of burnout for the DevOps team.
  • There was no single source of truth for the state of the system. We largely used Chef in pull-based mode, where agents keep polling the central Chef server to learn what changes should be applied to the machine. We experienced state discrepancies where the GitHub code, the Chef server state and the machine's internal state didn't match, whether due to a lack of proper checks and balances in our way of working, developer negligence, or a chef-client run failing silently without raising any alerts. The end result was invariably production downtime or the failure of some or many of our services. On top of this, we also saw the central Chef server become a single point of failure, leaving our systems in a bad state.

Our Experience & Learning

  • Too many tools to learn, each with a significant learning curve, plus very tight coupling of provisioning code across different teams / projects.
  • Extremely difficult to quickly create an exact replica of the production environment for experimentation, so as to try out new code changes.
  • Difficult to tweak hardware (Amazon machine types & volume types), which is required for tuning and battle-hardening any backend system.
  • No consistent or simple way to configure basics ranging from monitoring and logging to system sanity checks.
  • Neither idempotent nor immutable: we couldn't simply rerun our entire infra creation, provisioning and deployment the same way multiple times and be certain that things would keep working fine.

Besides the aforementioned pain points, we also realised we wanted a quicker developer environment setup using VirtualBox or Docker. This would give us the ability to quickly test things locally and to write integration tests.
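As a rough sketch of the kind of setup we had in mind: assuming a locally running Docker container named devbox (the container name, inventory path and playbook name here are all hypothetical), Ansible's docker connection plugin lets the very same playbooks run against a local container:

    # inventory/dev (hypothetical): point Ansible at a local Docker container
    [dev]
    devbox ansible_connection=docker

    # run the same playbook against the local dev box
    ansible-playbook -i inventory/dev site.yml --limit dev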

So, to summarise, we wanted a system which is:

  • Dead simple. This means a very low learning curve and more developer participation. The indirect consequence is that people move fast and get things done rather than depending entirely on the DevOps team.
  • A single (or minimal) tool for infra creation, provisioning and deployment.
  • One way to do things, not five different languages and frameworks. Something like YAML.
  • Idempotent and immutable. This means you can run the same script to create, provision and deploy infra multiple times and it should simply work. It also means that recreating infra, provisioning and deployment isn't an exercise in frustration but part of the way we operate.
  • Easy, as a consequence of the above, to replicate complete environments with minimal overhead. This also ensures that our continuous delivery system is used just to trigger runs and holds no state or intelligence of its own.
  • Able to quickly create local dev setups and test environments, and to write integration tests.

Ansible Ecosystem

We actively followed Ansible for over a month before seriously trying it out. It seemed to solve all the problems described earlier in a really simple and elegant fashion.

The major problems that it addresses are:

  • A single place to create infra, provision, deploy and orchestrate (see the sketch after this list).
  • You write only in YAML.
  • The primary mode of operation is push-based, which means the system is immutable post-deployment. It also offers the flexibility of a pull-based mode, which is useful for, say, a Hadoop cluster where nodes may come and go. In both cases, the GitHub source remains the single source of truth (the push and pull invocations are shown after the list).
  • Great documentation and solid pre-built modules. For complex things like machine creation, or ensuring that only a single machine in a group exists, you don't have to write anything yourself. Most modules are idempotent, and not just for provisioning (where the same holds for Chef & Puppet) but also for infra creation and deployment.
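To make the first point concrete, here is a minimal sketch of a playbook that creates a machine, provisions it and deploys onto it in one run. All specifics (AMI id, key pair, instance type, tags, package and template paths) are placeholder assumptions, not our actual setup; the ec2 module's exact_count / count_tag pair is what makes machine creation idempotent.

    # site.yml (sketch): infra creation, provisioning and deployment in one place
    - name: Create infrastructure
      hosts: localhost
      connection: local
      gather_facts: false
      tasks:
        # exact_count + count_tag make reruns idempotent: nothing new is
        # created if a matching tagged instance already exists
        - name: Ensure exactly one app server exists
          ec2:
            region: us-east-1
            image: ami-12345678          # placeholder AMI
            instance_type: m3.medium
            key_name: deploy-key         # placeholder key pair
            exact_count: 1
            count_tag:
              Name: app-server
            instance_tags:
              Name: app-server
            wait: yes
          register: created

        - name: Add the instance to an in-memory group for the next play
          add_host:
            name: "{{ item.public_ip }}"
            groups: app
          with_items: "{{ created.tagged_instances }}"

    - name: Provision and deploy the application
      hosts: app
      become: true
      tasks:
        - name: Install the runtime
          apt:
            name: openjdk-7-jre-headless   # placeholder dependency
            state: present

        - name: Render the application config
          template:
            src: templates/app.conf.j2     # placeholder template
            dest: /etc/app/app.conf
          notify: restart app

      handlers:
        - name: restart app
          service:
            name: app
            state: restarted

Running this twice is safe: the second run finds the tagged instance, skips creation, and simply re-applies provisioning and deployment, which is exactly the rerun-and-it-just-works property we wanted.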
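And for the push-versus-pull point, the difference is only in how a run is invoked (the inventory path, repo URL and playbook name below are hypothetical):

    # push mode: a control machine or CI agent pushes changes out over SSH
    ansible-playbook -i inventory/production site.yml

    # pull mode: each node periodically pulls the playbook from git and applies it
    ansible-pull -U https://github.com/example/infra.git site.yml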

Current State

As of now, most of Indix's infra is under configuration management via Ansible. This includes migrating legacy systems with no configuration management at all, as well as a few Chef-based setups, over to Ansible. Any new system being built by developers is brought up using Ansible.
Owing to its simplicity, the best and perhaps most important milestone we have been able to achieve is that developers now build their systems end to end: they not only write application code but also write the Ansible scripts for the application's infrastructure creation, provisioning and deployments. This has become a standardised cultural practice, one that eases the DevOps team's load and distributes ownership and responsibility between DevOps and developers.

We at OpsLyft help organisations meet their DevOps goals and win with them. Get in touch with us at contact@opslyft.com for further assistance.
