A Data Engineer’s perspective on IaC

Deepak Rout
8 min read · Oct 20, 2018

In the era of digital transformation, cloud journeys, and the DevOps culture shift, we come across new concepts and related tools and technologies almost daily. The jargon, concepts, and metaphors used to make my head spin when I was learning initially. It took me a while (years, I would say) to grasp some of the concepts, and I am still struggling to understand many of them because everyone practices them very differently. This article is focused on my perspective on a very small part of DevOps: infrastructure automation, popularly abbreviated as IaC (Infrastructure as Code). If I were asked to explain IaC in one line, I would say "define, deploy, and upgrade infrastructure by writing code". During the learning phase, terms like "snowflake servers", "configuration drift", "idempotence", and "configuration management" are definitely confusing when you are not from a system admin background.

During the early days of my career, when I was working as an ETL developer (data engineer is a relatively new job title), I was partially engaged in a project whose scope was to upgrade existing application server hardware as part of a server life cycle management policy. Yes, servers have a life span just like automobiles do; even though they have fewer moving mechanical parts, there is wear and tear, and hardware gets replaced every couple of years for various reasons. System admins would know this much better. My role in that project was small: the responsibility was just to validate that the applications ran properly on the new server, make sure the configurations were correct, all connections worked, users were available, SSH keys were present, and so on, and fill in a checklist after running some tests. The point of this story is that on-premises servers in a data center also get destroyed and replaced by new ones. But in a cloud environment, resources are meant to be disposable. Creating resources and destroying them after usage has become the usual pattern in cloud environments, for economic reasons. An apples-to-apples comparison can't be made between the servers we get in the cloud and traditional servers. Honestly, for a very long time I never noticed the small "v" in the vCPUs we provision in AWS and kept thinking of them as CPUs with multiple cores, like the ones we used in traditional servers. The small v makes a lot of difference. 🙇

If you are working as a DevOps engineer, or even as an application developer in a DevOps culture, or have gone through some materials or books, there is a very high chance you have come across the "Pets vs. Cattle" metaphor. This is a famous and perhaps overused analogy for explaining on-premises versus cloud servers.

Pets Model

Each and every server is given a nice name. You can probably recall the server names from your old assignments and how everyone on your team remembered them, including their IPs. I can remember servers being given sports team names like Mets and Rangers in one of my assignments. Pets are unique and lovingly hand-raised, and when they are sick, a veterinarian nurses them back to health.

Cattle Model

Servers are named with incremental numbers, for example etlserver001, etlserver002, and are identified by tags or IDs for any internal operation. Cattle are identical to each other and are identified by tags attached to their ears, and when one gets sick it gets replaced with a new one to maintain the total count of the herd. The replacing-a-sick-cow part of the analogy I can't easily digest or explain, being an Indian and knowing how Indian culture values cows 😕. You usually see this model in clusters and in cloud environments.

Cloud technology is also moving toward containerizing applications; Kubernetes and Amazon's Elastic Container Service are evolving technologies, and maybe we will hear more metaphors like this in the near future. I remember another metaphor, "Flock of Birds", being used for serverless infrastructure, where resources have neither a name nor a tag; each is just a unit of compute, allocated dynamically. Now let's switch back to the original topic of discussion.

Today, if you open any job portal and search for a data engineering job, I am sure it will mention DevOps as a desired skill, and specifically tools like CloudFormation, Terraform, Chef, Puppet, Ansible, SaltStack, and OpenStack Heat. In fact, companies are looking for knowledge of all these tools, not a particular one. This is a scary and overwhelming number of tools to learn, and it often confused me about where to start.

If you understand what IaC is, you of course know its advantages and why everyone is making it a practice in the cloud environment. There are a few categories of tools for implementing IaC, and the blurry boundary between these tools makes them even more confusing.

  • Implementing Ad-hoc Scripts

This is an old-school way of implementing IaC. For example, AWS offers different SDKs, and you can call the APIs in your favorite programming language to set up the infrastructure. This can be a nightmare to maintain if you don't understand the developer's thought process behind the code and don't have the complete picture of the infrastructure.
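To give you a feel for it, here is a minimal sketch of such an ad-hoc script using the Python SDK (boto3). I'm assuming boto3 is installed and AWS credentials are configured; the AMI ID, key pair, and security group are placeholders, not real values.

```python
# ad_hoc_provision.py: a minimal ad-hoc provisioning script using the AWS SDK.
# Assumes boto3 is installed and AWS credentials/region are already configured.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single EC2 instance. The AMI ID, key pair and security group
# below are placeholders for illustration, not values from the article.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "etlserver001"}],
    }],
)
print("Launched:", instances[0].id)
```

It works, but six months later only the original author remembers why the security group is hard-coded, which is exactly the maintenance nightmare I mentioned above.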

  • Provisioning Tools / Orchestration Tools

Their primary focus is to provision resources in the cloud. When I say resources, I don't mean only servers; it can be a load balancer, an auto scaling group, a firewall/security group, or any other resource you can think of. I am not familiar with GCP or Azure, so I am using AWS terminology, but there would be similar concepts with other cloud providers as well. Mostly, provisioning tools follow immutable infrastructure. What this means is that a change in configuration doesn't actually change the server; it creates a new server. Say I am using a provisioning tool (e.g. Terraform) to provision an EC2 server and later decide to attach a higher-capacity disk volume. These tools won't upgrade the existing server; they will create a new one and destroy the old one. Each change results in a new resource. There are a few exceptions, but the immutable model mostly means creating a new resource. Remember the cattle analogy: we add a new animal when one is sick.
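To make the immutable idea concrete, here is a rough Python/boto3 sketch of the "replace, don't modify" pattern. This is only my illustration of the concept, not how Terraform behaves internally; the helper name and parameters are made up.

```python
# immutable_replace.py: illustrating "each change creates a new resource".
# A conceptual sketch only; this is NOT how Terraform works internally.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

def replace_instance(old_instance_id: str, new_launch_params: dict) -> str:
    """Create a replacement server first, then destroy the old one."""
    # 1. Bring up the new server with the desired configuration
    #    (bigger disk, new AMI, whatever changed).
    new_instance = ec2.create_instances(MinCount=1, MaxCount=1, **new_launch_params)[0]
    new_instance.wait_until_running()

    # 2. Only once the replacement is up, terminate the old server.
    old_instance = ec2.Instance(old_instance_id)
    old_instance.terminate()
    old_instance.wait_until_terminated()

    return new_instance.id
```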

CloudFormation (for AWS) and Terraform are the most widely used provisioning tools in cloud environments. Since CloudFormation works only with AWS, the clear winner in terms of popularity and usage is Terraform.

  • Configuration Management Tools (CM Tools)

The primary focus of these tools is to install and manage software on existing servers once the server is booted and running. This is powerful because they can install software on multiple servers in parallel and take care of idempotence (if a service is already installed and running, the tool will do nothing when asked to install it again). These tools also work with mutable infrastructure: as software gets installed, the servers mutate and are no longer in their old state. Remember the cattle analogy? If you have to vaccinate some cows, would you replace them or just give them the vaccination? Think of installing a security patch on running servers. With provisioning tools there is no way to install something once bootstrapping is over; recreating the server is the only option left (CloudFormation does offer a stop-and-restart option with modified user data, though). CM tools are very helpful in such scenarios. Some of the popular tools in this category are Chef, Puppet, Ansible, and SaltStack.
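Idempotence simply means that running the same step twice leaves the server in the same state. A toy Python sketch of the idea (the package check and the install command are illustrative; real CM tools do this with far more care):

```python
# idempotent_install.py: a toy illustration of idempotence, not real CM-tool code.
import shutil
import subprocess

def ensure_nginx_installed() -> None:
    """Install nginx only if it is not already present.

    Running this once or ten times leaves the server in the same state,
    which is all idempotence means here.
    """
    if shutil.which("nginx"):
        print("nginx already installed, nothing to do")
        return
    # Illustrative install command for a Debian/Ubuntu host.
    subprocess.run(["apt-get", "install", "-y", "nginx"], check=True)
    print("nginx installed")

ensure_nginx_installed()
ensure_nginx_installed()  # second run is a no-op
```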

If you read the above two paragraphs one more time, you will notice I used the term primary focus. The tools in each category have some overlap in functionality, i.e. provisioning tools can do configuration management and vice versa. There is no clear boundary, and you can use them somewhat interchangeably. Each of the tools mentioned above has ways to do IaC for all major cloud providers, and you don't even need a second tool. But it is a popular pattern to use a provisioning tool and a CM tool in conjunction and assign each of them its core responsibility. Having said this, I agree we could write everything in the user data section during EC2 bootstrapping (see the sketch below), but a CM tool is a better way of handling complicated application software and environment setup steps.
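For context, user data is just a script handed to the instance at first boot. A rough boto3 sketch of that approach (placeholder AMI, made-up bootstrap script):

```python
# user_data_bootstrap.py: cramming all setup into EC2 user data at launch.
# Assumes boto3 and configured AWS credentials; the AMI ID is a placeholder.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Everything the server needs goes into one shell script that runs at first boot.
# It works, but a CM tool handles complicated setups far more gracefully.
user_data = """#!/bin/bash
apt-get update -y
apt-get install -y nginx
echo "bootstrapped" > /var/tmp/bootstrap.done
"""

ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,  # boto3 base64-encodes this for us
)
```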

Another classification of these tools is by procedural versus declarative language. Procedural means you write the code as a series of steps, and the steps are executed sequentially to achieve the end goal. Declarative, on the other hand, means you define the end state of the infrastructure and the tool figures out how to achieve it.
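A toy Python sketch of the difference, assuming all we care about is a server count (the function names are invented for illustration):

```python
# procedural_vs_declarative.py: a toy contrast, not real tool behaviour.

def procedural_scale_up(servers, count_to_add):
    """Procedural style: 'add N servers'. Running it twice adds 2*N servers."""
    start = len(servers)
    return servers + [f"etlserver{start + i + 1:03d}" for i in range(count_to_add)]

def declarative_reconcile(servers, desired_count):
    """Declarative style: 'there should be N servers'. Re-running changes nothing."""
    if len(servers) < desired_count:
        return procedural_scale_up(servers, desired_count - len(servers))
    return servers[:desired_count]  # scale down if there are too many

servers = ["etlserver001"]
servers = declarative_reconcile(servers, desired_count=3)  # creates 2 more
servers = declarative_reconcile(servers, desired_count=3)  # no change on the re-run
print(servers)  # ['etlserver001', 'etlserver002', 'etlserver003']
```

The declarative function converges on the desired state no matter how many times you run it, which is the property declarative tools aim for.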

Each of the above-mentioned tools works differently, but they all call the AWS APIs under the hood and achieve the same goal in the end.

Do I need to learn all these tools?

Not really. I would suggest having a good understanding of CloudFormation if you are primarily working on AWS, and of Terraform as well, since it is widely popular and fun to learn. Ansible is easy to learn as a CM tool; Chef and Puppet have a learning curve and lots of settings. By the way, I don't write IaC as part of my daily job, and I am no expert; this is just my observation from studying various IaC codebases and infrastructures. If you know the required parameters for the different resources in a cloud environment and how they are connected, learning any IaC tool will not be a big challenge, in my personal opinion.

Do you like using the JSON format for writing CloudFormation or Terraform?

My understanding is that YAML is the better format for CloudFormation (though it comes with whitespace headaches) and HCL for Terraform. A good editor plugin for syntax highlighting and auto-completion always helps. The reason I am not inclined toward JSON is that JSON is meant to be a format generated by machines and interpreted by machines. You can't write comments in JSON (it's simply not supported), but an IaC configuration/template needs comments so you can understand it later. If you don't write comments in the code you develop, karma will hunt you down. It's a personal preference at the end of the day.
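If you want to see the comment limitation for yourself, here is a quick Python check (assuming PyYAML is installed; the template snippets are made up):

```python
# comments_in_json_vs_yaml.py: JSON rejects comments, YAML simply ignores them.
# Assumes PyYAML is installed ("pip install pyyaml").
import json
import yaml

yaml_template = """
# Instance size for the nightly ETL jobs  <- a comment is fine in YAML
instance_type: t2.micro
"""
print(yaml.safe_load(yaml_template))  # {'instance_type': 't2.micro'}

json_template = """
{
  // Instance size for the nightly ETL jobs  <- not valid JSON
  "instance_type": "t2.micro"
}
"""
try:
    json.loads(json_template)
except json.JSONDecodeError as err:
    print("JSON parser rejected the comment:", err)
```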

Finally, a quick comparison

Thanks for reading till the end. If you liked this article, please Clap 👏 and share it with your network.
