Accelerating Data Engineering in the Cloud with Infrastructure as Code — A DataOps Perspective

Anne Nasato · Published in Slalom Build · Jul 15, 2020

Increased adoption of the cloud has resulted in the breaking down of many traditional roles. It’s also forced a lot of people out of their comfort zone. As a data engineer, I see this most often with database administrators (DBAs) who used to work with, for example, Oracle databases, but are now required to own their organization’s AWS RDS system.

The breaking down of these silos has also forced data engineers to fill more DevOps-esque roles; the common name for this is DataOps. If you want an in-depth view of what DataOps is, check out DataKitchen's article on Medium.

TL;DR: While DataOps builds on DevOps, they are not the same. DataOps involves data pipelines, orchestration, and often many more tools than DevOps. The people working in DataOps are data experts, as opposed to software engineers. Test cases cover data as well as code, not just code.

One key component of DataOps is Infrastructure as Code (IaC). At a high level, infrastructure makes sense as it is something most can visualize. However, I quickly learned that the devil is in the details, and working with infrastructure — and its scalable deployment — was a very real education of sorts. This article walks you through a simple IaC use case.

Getting Started with Infrastructure as Code


My first foray into IaC was via AWS CloudFormation, a powerful tool to which I attribute the depth of my understanding of the AWS cloud. Azure has Azure Resource Manager (ARM), and GCP has Cloud Deployment Manager. IaC enables the scalable deployment of infrastructure via standardized templates, which are typically YAML or JSON files.

These templates contain Resources, as well as other sections such as Parameters and Outputs. There may be other sections, but these are commonly used in IaC template files. I will use AWS CloudFormation to explain these components. In CloudFormation, templates are used to build stacks.

Resources:

  • The actual infrastructure components being built.
  • AWS examples: EC2 instances, S3 buckets, IAM roles, etc.

Parameters:

  • Dynamic values to assign to resources.
  • AWS examples: resource name prefix, access keys, region, etc.

Outputs:

  • Any values to be accessible from the main template upon successful creation.
  • These can be viewed in the AWS console, imported into other stacks, or returned in describe-stacks responses.
  • AWS examples: EC2 instance IP address, S3 bucket path, VPC ID, etc.

I will further illustrate IaC templates using CloudFormation in the sections below.

Building Sandboxes for Users

I have been on two projects with distinct clients in which there has been an ask to create sandboxes for users. One was to be used by application developers who wanted access to various AWS services such as RDS, DynamoDB, and S3 in a safe environment. The other was intended for data scientists and utilized machine learning tools connected to a data pipeline.

These projects had some overlap; the major shared resources were S3 and EC2, as well as, of course, IAM. I will use a simple sandbox use case as an Infrastructure as Code example, consisting of an EC2 instance for users, an S3 bucket, and the necessary IAM permissions.

AWS CloudFormation Template, Explained

I will demonstrate our simple sandbox infrastructure using three components of a CloudFormation template: Parameters, Resources, and Outputs.


At the top of the template are the Template Format Version and Description. The Description must always immediately follow the Template Format Version; see the AWS CloudFormation template anatomy documentation for more details.
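
As a minimal sketch (the description text is illustrative), the top of the template might look like this:

AWSTemplateFormatVersion: "2010-09-09"
Description: Simple user sandbox - an EC2 instance, an S3 bucket, and supporting IAM resources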

The next section is Parameters. These are dynamic values which are input at the time of stack creation and can be referenced throughout the template. In our case, we have parameterized (i.e. asked the user to provide) their username, the name of their key-pair for SSH access to the EC2 instance, and the subnet ID for the EC2 instance.
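
A sketch of what this section might contain (the parameter names match the create-stack command further down; the AWS-specific parameter types and descriptions are my own assumptions):

Parameters:
  Username:
    Type: String
    Description: Sandbox user's name, used as a prefix for resource names
  UserKey:
    Type: AWS::EC2::KeyPair::KeyName
    Description: Name of an existing EC2 key-pair for SSH access to the instance
  Subnet:
    Type: AWS::EC2::Subnet::Id
    Description: ID of the subnet in which to launch the EC2 instance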

Next is Resources. This is the main part of our template; it is what’s actually created. For our simple sandbox use case, our infrastructure consists of an S3 bucket for artifact storage, an EC2 instance for computing, and IAM resources. The IAM resources are required so that the EC2 instance can access the S3 bucket as well as its objects.
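
A condensed sketch of what these resources might look like (the AMI ID and CIDR range are placeholders, and the logical resource names are my own):

Resources:
  SandboxS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "${Username}-s3-bucket"

  SandboxEC2Role:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "${Username}-ec2-iam-role"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3FullAccess   # full S3 access, as discussed below

  SandboxInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref SandboxEC2Role

  SandboxSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound SSH from a trusted CIDR range
      # Add a VpcId property if the subnet is not in the default VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 10.0.0.0/16   # placeholder: replace with your own IP or CIDR range

  SandboxEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0abcdef1234567890   # placeholder Amazon Linux 2 AMI ID for your region
      InstanceType: t3.micro
      KeyName: !Ref UserKey
      SubnetId: !Ref Subnet
      IamInstanceProfile: !Ref SandboxInstanceProfile
      SecurityGroupIds:
        - !GetAtt SandboxSecurityGroup.GroupId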

The final section in our template is Outputs. In this case, we output both the private and public IP addresses of our EC2 instance to make them easily accessible.
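
Continuing the sketch above, the Outputs section might look like:

Outputs:
  EC2PublicIp:
    Description: Public IP address of the sandbox EC2 instance
    Value: !GetAtt SandboxEC2Instance.PublicIp
  EC2PrivateIp:
    Description: Private IP address of the sandbox EC2 instance
    Value: !GetAtt SandboxEC2Instance.PrivateIp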

Despite this being a simple use case, there were some key decisions made around security and resource utilization which guided the formatting and content of this template.

  1. It was important to ensure the security group for the EC2 instance allowed inbound SSH over port 22. It’s generally good practice to not allow inbound access from anywhere, so it’s recommended to either add a specific CIDR range or select “My IP” from the dropdown menu if the instance is only to be accessed from your computer.
  2. The EC2 instance required an IAM role which provides it with S3 access, so it can reach our bucket as well as the objects in the bucket. In this template, we have allowed full S3 access to all S3 buckets and objects. This can be further restricted by granting the role only specific S3 permissions, by limiting access to specific S3 buckets and objects, or both. The latter can be achieved by providing a list of ARNs for the allowed resources.
  3. For this use case, the name of an existing key-pair was provided to the CloudFormation stack to enable SSH access into our EC2 instance. This is fine, and secure. Security credentials (i.e. passwords, keys, etc.) should never be provided to resources in plain text.
  4. This template could also have been divided into two separate templates: one for user onboarding, containing our more permanent resources (S3 bucket, IAM), and one for ephemeral resources (EC2 instance). This would mean that any time the user wanted an EC2 instance, a stack containing this resource and the required information would be built, and any instance inactive for a certain amount of time would be terminated. This works well, but the whole system is simplified by including everything in one template; the EC2 instance is then stopped when inactive and restarted by the user (instead of rebuilt via CloudFormation). This decision depends on how frequently the EC2 instances are used, whether Elastic IPs are involved, and the size of any EBS volumes attached to the instance. Note: stopping EC2 instances requires a CloudWatch alarm to identify inactive instances, which is not covered in this article.

We can run our CloudFormation template, therefore building our stack, either in the console by following the prompts or via the CLI using the following command:

aws cloudformation create-stack \
  --stack-name cfn-iac-stack \
  --template-url https://sample-bucket.s3.amazonaws.com/aws_cfn_sandbox.yml \
  --parameters ParameterKey=Username,ParameterValue=test-user \
               ParameterKey=UserKey,ParameterValue=test-key \
               ParameterKey=Subnet,ParameterValue=subnet-id \
  --capabilities CAPABILITY_IAM

In the above command, the following values are decided upon by the user:

  • Stack Name (cfn-iac-stack)
  • Template URL (https://sample-bucket.s3.amazonaws.com/aws_cfn_sandbox.yml — the HTTPS URL of the template stored in S3)
  • Parameter Keys and Values (as per template)
  • Capabilities (to enable the creation of IAM resources)

For more information on the create-stack command, see the AWS CLI documentation.

Mounting S3 Buckets on EC2 Instances (the secure way)

In one of my projects I was given the unique task of mounting a user’s S3 bucket on their dedicated EC2 instance. I will say this loudly and clearly: S3 was not built for this, and this goes against best/recommended practices. However, because I took the time to work on it and it may be useful to others, I will leave a step-by-step guide on how it was achieved here.

There are a few options available for going about this task. AWS now offers a service called FSx for Lustre, which integrates with S3 and enables reading from and writing to S3 objects, presented as files. This was not the option I used, mostly because I was unable to get it to work in the client's environment. There are also cost implications of using FSx for Lustre, which can be reviewed on the Amazon FSx for Lustre pricing page.

We overcame these obstacles by using an open-source tool called s3fs, which mounts a specified S3 bucket (at the root level) on an EC2 instance. We were very happy with the performance of this tool and incurred no unexpected costs with it. Better yet, many operating systems provide pre-built packages, which are listed on the s3fs-fuse GitHub project.

s3fs Permissions

It is possible to provide authentication to the EC2 instance for S3 mounts in a few different ways.

  1. AWS credentials file in ${HOME}/.aws/credentials
  2. Custom passwd file in: (a) User’s home directory ${HOME}/.passwd-s3fs, or (b) System-wide /etc/passwd-s3fs
  3. Using the iam_role option in the mount command

Both options #1 and #2 require the AWS Access Key and Secret Key in the format ACCESS_KEY_ID:SECRET_ACCESS_KEY. Because credentials should never be stored or exposed in plain text, this is a major security vulnerability, so we went with option #3.

The s3fs command to mount an S3 bucket on an EC2 instance follows the format below:

s3fs S3_BUCKET /PATH/TO/EC2_MOUNTPOINT

There are a number of options, specified with the -o flag, which can be included as part of this command. The iam_role option is one of these, and follows the format:

s3fs S3_BUCKET /PATH/TO/EC2_MOUNTPOINT -o iam_role="IAM_ROLE"

The important piece here is that the IAM_ROLE has the correct permissions on the S3_BUCKET. As this is the IAM role created in our CloudFormation template, it is easy to configure. If the IAM role already exists rather than being created in our IaC, we need to verify that it has the following permissions, or update it accordingly:
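
A sketch of a suitably scoped policy might look like the following (the bucket name is illustrative; s3fs needs to list the bucket and read, write, and delete its objects):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::test-username-s3-bucket"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::test-username-s3-bucket/*"]
    }
  ]
}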

Following this convention enables us to securely mount our S3 bucket. The entire set of commands would be as follows:

sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y
mkdir /home/ec2-user/s3-mount
s3fs test-username-s3-bucket /home/ec2-user/s3-mount -o iam_role=test-username-ec2-iam-role

In the above commands, the user has chosen /home/ec2-user/s3-mount as the S3 mount location on their EC2 instance. They have set the username parameter to test-username, which is part of the naming convention for both the S3 bucket (test-username-s3-bucket) and the EC2 instance IAM role (test-username-ec2-iam-role).

CloudFormation EC2 Instance User Data

There is a property on the CloudFormation EC2 instance resource called User Data, which is a script provided to the EC2 instance and run when the instance launches. We can leverage EC2 instance user data to further automate our infrastructure by running common commands without any user input.

In the case where we are mounting an S3 bucket on our EC2 instance, the updated EC2 Instance resource with User Data would look like:
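
As a sketch (same placeholder AMI ID and logical names as in the Resources section above), the instance resource with User Data might look like:

  SandboxEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0abcdef1234567890   # placeholder Amazon Linux 2 AMI ID for your region
      InstanceType: t3.micro
      KeyName: !Ref UserKey
      SubnetId: !Ref Subnet
      IamInstanceProfile: !Ref SandboxInstanceProfile
      SecurityGroupIds:
        - !GetAtt SandboxSecurityGroup.GroupId
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          amazon-linux-extras install epel -y
          yum install s3fs-fuse -y
          mkdir /home/ec2-user/s3-mount
          s3fs ${Username}-s3-bucket /home/ec2-user/s3-mount -o iam_role=${Username}-ec2-iam-role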

Enter Terraform


Once I felt comfortable with CloudFormation, I felt like I could build anything; that is, anything within the AWS cloud. Don’t get me wrong, CloudFormation is a very powerful tool which is highly effective for building AWS infrastructure. However, when I was introduced to Terraform, it blew me away. I really felt that I could build any infrastructure I wanted, anywhere. And I was (mostly) correct in feeling this way.

Terraform is an open-source tool that empowers users to build Infrastructure as Code for virtually any cloud or service, because it is both lightweight and extremely powerful. As this use case does not involve more than one cloud provider or any external services, Terraform may not be strictly necessary here, but it illustrates the same concepts well.

AWS CloudFormation's templates are roughly equivalent to Terraform's modules. Terraform modules are made up of .tf files, written in HashiCorp Configuration Language (HCL), which has a JSON-like structure. The file structure for our simple sandbox follows a standard Terraform module layout, minus the README and naming.tf files:

(Standard Terraform module layout courtesy of: https://www.cloudreach.com/en/resources/blog/how-to-simplify-your-terraform-code-structure/)

Here, our Parameters, Resources, and Outputs are split into variables.tf, main.tf, and outputs.tf respectively. For more information on Terraform file structure, check out the Terraform documentation on modules.

Our variables.tf file looks as follows:
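
A minimal sketch of what this file might contain (the variable names match the terraform apply command below; the descriptions are my own):

variable "username" {
  description = "Sandbox user's name, used as a prefix for resource names"
  type        = string
}

variable "user_key" {
  description = "Name of an existing EC2 key-pair for SSH access to the instance"
  type        = string
}

variable "subnet_id" {
  description = "ID of the subnet in which to launch the EC2 instance"
  type        = string
}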

The main.tf file:
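
A condensed sketch of the resources it might define (the region, AMI ID, and instance type are placeholders; the resource names mirror the CloudFormation sketch above):

provider "aws" {
  region = "us-east-1" # assumption: adjust to your region
}

# Artifact storage for the sandbox user
resource "aws_s3_bucket" "sandbox_bucket" {
  bucket = "${var.username}-s3-bucket"
}

# IAM role the EC2 instance assumes in order to reach S3
resource "aws_iam_role" "ec2_role" {
  name = "${var.username}-ec2-iam-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "s3_access" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_instance_profile" "ec2_profile" {
  name = "${var.username}-ec2-instance-profile"
  role = aws_iam_role.ec2_role.name
}

# Look up the subnet so the security group lands in the same VPC
data "aws_subnet" "selected" {
  id = var.subnet_id
}

# Allow inbound SSH only from a trusted CIDR range
resource "aws_security_group" "ssh_access" {
  name   = "${var.username}-ssh-sg"
  vpc_id = data.aws_subnet.selected.vpc_id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"] # placeholder: replace with your own IP or CIDR range
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "sandbox_instance" {
  ami                    = "ami-0abcdef1234567890" # placeholder Amazon Linux 2 AMI ID
  instance_type          = "t3.micro"
  key_name               = var.user_key
  subnet_id              = var.subnet_id
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name
  vpc_security_group_ids = [aws_security_group.ssh_access.id]

  # Same script as the CloudFormation User Data above
  user_data = <<-EOF
    #!/bin/bash
    amazon-linux-extras install epel -y
    yum install s3fs-fuse -y
    mkdir /home/ec2-user/s3-mount
    s3fs ${var.username}-s3-bucket /home/ec2-user/s3-mount -o iam_role=${var.username}-ec2-iam-role
  EOF
}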

And finally, the outputs.tf file:
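
A sketch of the outputs, mirroring the CloudFormation Outputs section:

output "ec2_public_ip" {
  description = "Public IP address of the sandbox EC2 instance"
  value       = aws_instance.sandbox_instance.public_ip
}

output "ec2_private_ip" {
  description = "Private IP address of the sandbox EC2 instance"
  value       = aws_instance.sandbox_instance.private_ip
}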

As you can see, this module includes the same User Data script used in the CloudFormation template above, applied to our Terraform EC2 instance resource.

In order to run our Terraform module, the following command is used after terraform init (I was running mine in Windows PowerShell):

terraform apply -var="username=test-username" -var="user_key=test-key" -var="subnet_id=example_subnet"

The variable values are decided upon and input by the user. For more information on Terraform CLI commands, please see the Terraform CLI documentation.

If you’re curious to learn more about Terraform, I would highly recommend checking out the Learn section on the Terraform website.

Hello, Packer


The sandbox environment infrastructure can be further improved upon by building a machine image for our EC2 instance, so that most of the work is taken out of run-time and moved to build-time. It significantly streamlines the required user data script and ensures consistency across EC2 instances. This is where Packer comes in.

Packer is a tool that packages components into machine images. For our simplified sandbox use case, this allows us to remove the first three lines of our User Data and add them to our Packer provisioner. The result is an EC2 instance on which s3fs-fuse and its dependencies are already installed; the User Data must only mount the specific user's S3 bucket on their dedicated EC2 instance. Again, this use case is simplified, so our Packer JSON file reflects this.

The first component of the Packer file is our variables. In this case, we are parameterizing our AWS Access Key and Secret Key, which Packer reads from our AWS credentials file.

The second component is our builders. This section provides the specifications of our AWS machine image. The base image we are using is the Amazon Linux AMI (which we used in both our CloudFormation and Terraform scripts above). The AMI name must be unique, so using a timestamp function {{timestamp}} helps to meet this requirement.

The final component of our Packer JSON is the provisioners section. This is where we provide our image customization. In this case, we are using a shell provisioner with the same commands as the User Data in our previous examples. The first line, sleep 30, ensures that the instance OS has time to boot up before we begin updating or installing anything. Each subsequent command is a separate, comma-separated entry and starts with sudo.
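
A sketch of what this Packer JSON might look like (the region, source AMI ID, and instance type are placeholders; leaving the key variables empty lets Packer fall back to the default AWS credential chain, including the credentials file):

{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  },
  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "us-east-1",
      "source_ami": "ami-0abcdef1234567890",
      "instance_type": "t3.micro",
      "ssh_username": "ec2-user",
      "ami_name": "sandbox-s3fs-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "sleep 30",
        "sudo amazon-linux-extras install epel -y",
        "sudo yum install s3fs-fuse -y",
        "sudo mkdir /home/ec2-user/s3-mount"
      ]
    }
  ]
}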

Prefixing each line with sudo differs from the CloudFormation template and Terraform module, as User Data is automatically run as root; therefore, we omit sudo from our User Data scripts. Our Packer provisioners, however, are not automatically run as root, so we must add sudo to each line. If it is not included, the Packer build will fail with a permissions error and no artifacts will be created.

In order to use this new image, we would update our EC2 instance resource with the new AMI ID as well as the updated User Data script, which would exclude the installation steps.
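
Using the Terraform module as an example, a sketch of the updated instance resource might look like the following (the AMI ID stands in for whatever ID the Packer build produces; all other arguments stay as in main.tf above):

resource "aws_instance" "sandbox_instance" {
  ami = "ami-0123456789abcdef0" # assumption: the AMI ID produced by the Packer build
  # ...all other arguments unchanged from main.tf above...

  # Installation steps are baked into the image; only the mount remains
  user_data = <<-EOF
    #!/bin/bash
    s3fs ${var.username}-s3-bucket /home/ec2-user/s3-mount -o iam_role=${var.username}-ec2-iam-role
  EOF
}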

Delivering the User Sandboxes

It goes without saying, but a key component of DataOps is Ops, AKA operations, AKA the actual implementation and use of these systems. This means that outlining the process around provisioning, maintaining, and decommissioning these sandboxes is part of this solution.

Because we combined all of our resources in one template, this process would be rather simple. It would probably look like the following:

(Diagram: user sandbox onboarding process)
(Diagram: user sandbox offboarding process)

And there you have it: signed, sealed, delivered, IaC is yours!

If you have any questions, please leave them in the comments below! Also, if you would like any follow-up articles that do deeper dives into any of the above topics, leave a comment. Finally, if you have a request for another high-level DataOps intro article on a different topic, leave a comment!


Thank you!
