Terraform, AWS Batch, and AWS EFS

Joe Min · Published in The Startup · Dec 3, 2020

I recently ran into a problem where I had to stand up a scalable compute pipeline and make it easy for others to deploy themselves.

For a myriad of reasons (I won’t give you the story of the long road filled with tech hardships or any of that), I went with Terraform for my declarative infrastructure, AWS Batch for my compute, and AWS EFS as my storage. However, it was pretty annoying to get everything together because there’s not a ton of documentation on how to make it all work. So… eggs.

You know how I’m filled with rage? I have no outlet for it so… eggs.

I’m going to describe how everything works here, but if you just want the code snippets (I know that’s all I ever care about) skip down to the Terraform section.

Basically, the Terraform scripts below (which I’m going to assume you know how to run, but if not, check out their docs) will stand up the AWS resources you need to have an Elastic File System (EFS) mounted in your AWS Batch jobs. You can write common resources to that filesystem (say, a large dataset that you want to read from often) and have them persist and be accessible from all future jobs. That way you don’t have to pull the large dataset from S3 with every job; instead, each job mounts the filesystem immediately and reads directly from the dataset.

Let’s imagine a scenario: You have a large dataset. Like, HUGE. The first time your job runs, you pull in or download that dataset and write it to your mounted EFS at /mnt/efs/big.data. Then, as part of that same job, you search through big.data for some matching patterns or push it through some pipeline with parameters set in your job at runtime. When the computation finishes, you write some output and push it to an S3 bucket for results.

Now you want to run the same pipeline but with slightly different parameters. Actually, you want to run it 100 more times, all at the same time, each with slightly different parameters but still all reading from the same dataset. Instead of downloading this large dataset from the internet or pulling it from S3 every time, your jobs automatically mount the filesystem where big.data lives. EFS handles file locking and concurrent reads, so you don’t have to worry about data access management, and your jobs don’t pay the hour-long overhead of downloading the dataset every time.

The idea is cool, but the documentation is sparse, so… Let’s get down to it!

Terraform

Here’s what you’re going to need.
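First, a quick note: the snippets below assume you already have a working provider configuration. None of that is shown here, so this is just a minimal sketch; the region and the provider source are assumptions you should adjust to your own setup.

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

# The region is an assumption; pick whichever region you actually deploy to
provider "aws" {
  region = "us-east-1"
}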

Let’s start with security, which is basically all of the IAM and networking stuff we need to let things talk to each other.

Storm Trooper keeping guard. Photo by Liam Tucker on Unsplash

At the very least you’ll need a VPC, its associated subnets, some roles and policies (with their corresponding attachments), and an instance profile. I’ve given some examples below. As a warning, these examples are pretty lax, and you should lock them down further if you have security concerns.

Default Setup Stuff

# Security Setup

# Retrieves the default VPC for this region
data "aws_vpc" "default" {
  default = true
}

# Retrieves the subnet IDs in the default VPC
data "aws_subnet_ids" "all_default_subnets" {
  vpc_id = data.aws_vpc.default.id
}
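One caveat: newer versions of the AWS provider deprecate the aws_subnet_ids data source in favor of aws_subnets. If you run into that, a roughly equivalent lookup is sketched below; you would also need to update the references elsewhere from data.aws_subnet_ids.all_default_subnets.ids to data.aws_subnets.all_default_subnets.ids.

# Equivalent subnet lookup on newer AWS provider versions, where
# aws_subnets replaces the deprecated aws_subnet_ids data source
data "aws_subnets" "all_default_subnets" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}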

Batch-Related Security Resources

# IAM Role for batch processing
resource "aws_iam_role" "batch_role" {
  name               = "batch_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "batch.amazonaws.com"
      }
    }
  ]
}
EOF
  tags = {
    created-by = "terraform"
  }
}

# Attach the Batch policy to the Batch role
resource "aws_iam_role_policy_attachment" "policy_attachment" {
  role       = aws_iam_role.batch_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}

# Security Group for batch processing
resource "aws_security_group" "batch_security_group" {
  name        = "batch_security_group"
  description = "AWS Batch Security Group for batch jobs"
  vpc_id      = data.aws_vpc.default.id

  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    created-by = "terraform"
  }
}

EC2 IAM Resources. Let’s give the underlying EC2 instances (the ones Batch spins up and manages for us) a role they can assume, along with an instance profile to carry it.

# IAM Role for underlying EC2 instances
resource "aws_iam_role" "ec2_role" {
  name               = "ec2_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
  tags = {
    created-by = "terraform"
  }
}

# Assign the EC2 role to the EC2 instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2_profile"
  role = aws_iam_role.ec2_role.name
}

# Attach the EC2 container service policy to the EC2 role
resource "aws_iam_role_policy_attachment" "ec2_policy_attachment" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

Batch Job IAM Resources. The jobs themselves can have an even more nuanced IAM role, so let’s give them one that can read from and write to an S3 bucket for any outputs of the computation (though S3 won’t be covered in depth here).

# IAM Role for jobs
resource "aws_iam_role" "job_role" {
  name               = "job_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      }
    }
  ]
}
EOF
  tags = {
    created-by = "terraform"
  }
}

# S3 read/write policy
resource "aws_iam_policy" "s3_policy" {
  name   = "s3_policy"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*",
        "s3:Put*"
      ],
      "Resource": [
        "${aws_s3_bucket.results_s3.arn}",
        "${aws_s3_bucket.results_s3.arn}/*"
      ]
    }
  ]
}
EOF
}

# Attach the policy to the job role
resource "aws_iam_role_policy_attachment" "job_policy_attachment" {
  role       = aws_iam_role.job_role.name
  policy_arn = aws_iam_policy.s3_policy.arn
}
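Note that the policy above references aws_s3_bucket.results_s3, which isn’t defined anywhere in these snippets. If you don’t already manage that bucket elsewhere, a minimal sketch might look like the following; the bucket name is a made-up placeholder, and extras like versioning or ACLs are left out.

# Results bucket referenced by the S3 policy above (the name is a
# placeholder; S3 bucket names must be globally unique, so pick your own)
resource "aws_s3_bucket" "results_s3" {
  bucket = "my-batch-results-bucket"

  tags = {
    created-by = "terraform"
  }
}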

EFS Security Resources. And finally, a security group for our EFS. Let’s allow anything in the Batch security group from earlier to talk to anything in this security group on the NFS port (2049).

resource "aws_security_group" "efs_security_group" {
name = "efs_security_group"
description = "Allow NFS traffic."
vpc_id = data.aws_vpc.default.id
lifecycle {
create_before_destroy = true
}
ingress {
from_port = "2049"
to_port = "2049"
protocol = "tcp"
security_groups = [aws_security_group.batch_security_group.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "No Outbound Restrictions"
}
}
Pumpkin… Batch. Get it? Photo by Freddie Collins on Unsplash

Now for our Batch resources. We’re going to have an EFS, corresponding mount targets (one per subnet), a launch template with a template file, a compute environment, a job queue, and a job definition. Strap in, here comes another wall of Terraform resources.

EFS Resources

# EFS for sharing protein databases
resource "aws_efs_file_system" "efs" {
  creation_token   = "efs"
  performance_mode = "generalPurpose"
  encrypted        = true
}

# One mount target per default subnet
resource "aws_efs_mount_target" "efs_mount_target" {
  count          = length(data.aws_subnet_ids.all_default_subnets.ids)
  file_system_id = aws_efs_file_system.efs.id
  subnet_id      = element(tolist(data.aws_subnet_ids.all_default_subnets.ids), count.index)
  security_groups = [
    aws_security_group.efs_security_group.id,
    aws_security_group.batch_security_group.id
  ]
}

# Launch template whose user data mounts the EFS on instance startup
resource "aws_launch_template" "launch_template" {
  name                   = "launch_template"
  update_default_version = true
  user_data              = base64encode(data.template_file.efs_template_file.rendered)
}

# Renders the user data script with the EFS ID and mount directory
data "template_file" "efs_template_file" {
  template = file("${path.module}/launch_template_user_data.tpl")
  vars = {
    efs_id        = aws_efs_file_system.efs.id
    efs_directory = "/mnt/efs"
  }
}
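A quick aside: the template_file data source comes from the separate hashicorp/template provider, which has since been archived. If you’re on Terraform 0.12 or later, you can skip it and use the built-in templatefile() function instead; an equivalent sketch of the launch template would look like this.

# Same launch template, but rendering the user data with the built-in
# templatefile() function instead of the template_file data source
resource "aws_launch_template" "launch_template" {
  name                   = "launch_template"
  update_default_version = true
  user_data = base64encode(templatefile("${path.module}/launch_template_user_data.tpl", {
    efs_id        = aws_efs_file_system.efs.id
    efs_directory = "/mnt/efs"
  }))
}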

launch_template_user_data.tpl

So what’s in that `launch_template_user_data.tpl` anyway? It’s kind of a magic cloud-init script that EC2 runs on launch whenever an instance is started from our launch template (and in the compute environment below, we’ll tell Batch to use that launch template for its instances). You can read more about it here: https://aws.amazon.com/premiumsupport/knowledge-center/batch-mount-efs/

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0
--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"
#cloud-boothook
#!/bin/bash
cloud-init-per once docker_options echo 'OPTIONS="$${OPTIONS} --storage-opt dm.basesize=20G"' >> /etc/sysconfig/docker
--==BOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"
packages:
  - amazon-efs-utils
runcmd:
  - mkdir -p ${efs_directory}
  - echo "${efs_id}:/ ${efs_directory} efs _netdev,tls,iam 0 0" >> /etc/fstab
  - mount -a -t efs defaults
--==BOUNDARY==--

Batch Resources. Everything from the compute environment to the actual job definition. The image specified in the job definition below is something you will need to build and push to ECR yourself.
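The job definition also references aws_ecr_repository.ecr, which isn’t defined in these snippets, so here’s a minimal sketch of that repository resource. The repository name is a placeholder, and building and pushing the image itself happens outside Terraform.

# ECR repository referenced by the job definition below; you still have to
# docker build and docker push your image to it yourself
resource "aws_ecr_repository" "ecr" {
  name = "batch-job-image"

  tags = {
    created-by = "terraform"
  }
}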

resource "aws_batch_compute_environment" "batch_environment" {
compute_environment_name = "batch-environment"
compute_resources {
instance_role = aws_iam_instance_profile.ec2_profile.arn
launch_template {
launch_template_name = aws_launch_template.launch_template.name
version = "$Latest"
}
instance_type = [
"optimal"
]
max_vcpus = 2
min_vcpus = 0
security_group_ids = [
aws_security_group.batch_security_group.id,
aws_security_group.efs_security_group.id
]
subnets = data.aws_subnet_ids.all_default_subnets.ids
type = "EC2"
}
service_role = aws_iam_role.batch_role.arn
type = "MANAGED"
tags = {
created-by = "terraform"
}
}
resource "aws_batch_job_queue" "job_queue" {
name = "job_queue"
state = "ENABLED"
priority = 1
compute_environments = [
aws_batch_compute_environment.batch_environment.arn
]
depends_on = [aws_batch_compute_environment.batch_environment]
tags = {
created-by = "terraform"
}
}
resource "aws_batch_job_definition" "job" {
name = "job"
type = "container"
parameters = {}
container_properties = <<CONTAINER_PROPERTIES
{
"image": "${aws_ecr_repository.ecr.repository_url}",
"jobRoleArn": "${aws_iam_role.job_role.arn}",
"vcpus": 2,
"memory": 1024,
"environment": [],
"volumes": [
{
"host": {
"sourcePath": "/mnt/efs"
},
"name": "efs"
}
],
"mountPoints": [
{
"containerPath": "/mnt/efs",
"sourceVolume": "efs",
"readOnly": false
}
],
"command": []
}
CONTAINER_PROPERTIES
tags = {
created-by = "terraform"
}
}
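One last optional piece: I like adding a couple of outputs so that once terraform apply finishes, you have the job queue and job definition identifiers handy for submitting jobs (for example with aws batch submit-job). These aren’t in the snippets above, just a small convenience sketch.

# Handy identifiers for submitting jobs once the infrastructure is up
output "job_queue_arn" {
  value = aws_batch_job_queue.job_queue.arn
}

output "job_definition_arn" {
  value = aws_batch_job_definition.job.arn
}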

And… well, that should be it!

Please let me know if you have any questions; I know I had a ton and wished someone had been around to answer them. I may be slow to respond, but if you write out a comment with a specific question, I will try to answer it!

Thanks!
