AWS Batch to process S3 events

Justin Plute · Published in The Startup · Dec 22, 2019

I’m a huge fan of AWS Lambda functions, but they currently have a hard limit: a function can run for at most 15 minutes before timing out.

So what if you need a job to run longer than that?

A common use case, for example, is unloading data from Amazon Redshift into an S3 bucket and then running a lengthy Redshift Spectrum SQL query for BI reporting. If you’re dealing with Big Data, that query can easily take longer than 15 minutes. In this article, we’ll upload a sample file to an S3 bucket; S3 will publish the event to AWS Lambda and invoke our function. Because AWS Batch cannot receive S3 notifications directly, the Lambda function will trigger our AWS Batch job to do the actual work. The source code is available on GitHub.

AWS Batch is a great solution for long-running workloads in AWS because it handles all the compute resource management for us; we don’t have to spin up our own EC2 instances. Furthermore, AWS now bills EC2 by the second (it used to be by the hour), so we are only charged for what we actually use.

How does AWS Batch work?

You simply define your batch jobs and submit them to a queue. In response, AWS Batch chooses where to run the jobs, launching additional AWS capacity if needed. AWS Batch then monitors the progress of your jobs, and once they finish and the capacity is no longer needed, it tears down the infrastructure it spun up to run them.
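To make that concrete, here is roughly what submitting a job looks like from the AWS CLI. The queue and job definition names below are placeholders for illustration; we’ll create the real ones with CloudFormation later in this article.

# Submit a job to an existing queue, overriding two environment variables
# (queue/definition names are placeholders)
aws batch submit-job \
  --job-name sample-s3-processor \
  --job-queue HighPriority \
  --job-definition my-batch-job \
  --container-overrides '{"environment":[{"name":"MY_BUCKET","value":"my-bucket"},{"name":"MY_KEY","value":"upload/sample.txt"}]}'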

Amazon S3 -> AWS Lambda -> AWS Batch

1. Create the AWS Batch job

For this tutorial, the AWS Batch job will be a simple Node.js runtime inside a Docker container. We’ll use Node.js to grab the file from S3 and do stuff with it.

Prerequisite: You’ll need Docker installed locally in order to build the AWS Batch Job container and push it up into Amazon ECR.

The Lambda function we’ll write later will invoke the AWS Batch Job and pass in the S3 Bucket and S3 Object Key (the file uploaded) as environment variables.

AWS Batch Job Node.js application:

// index.js — fetch the uploaded object from S3
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Passed in by the Lambda function as container overrides
const myBucket = process.env.MY_BUCKET;
const myKey = process.env.MY_KEY;

const params = { Bucket: myBucket, Key: myKey };

s3.getObject(params, function(err, data) {
  if (err) {
    console.log(err);
  } else {
    console.log("Successfully retrieved object");
    // DO STUFF with your uploaded file
  }
});
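The container installs its dependencies from a package.json, which isn’t shown above. Assuming aws-sdk is the only dependency, you can generate one next to index.js with:

# Run inside the project directory, next to index.js
npm init -y
npm install --save aws-sdk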

We’ll now write the Dockerfile that’ll be the blueprint for our Node.js container to be used by AWS Batch.

Dockerfile:

# Base Node.js image
FROM node:boron

# Create AWS Batch Job Work Directory
WORKDIR /usr/src/app

# Copy Package.json file
COPY package.json /usr/src/app

# Install AWS Batch Job Dependencies
RUN npm install

# Bundle AWS Batch Job Source
COPY . /usr/src/app

# Run AWS Batch Job
CMD [ "node", "index" ]

Prerequisite: In order to push it up to ECR, I assume you already have the AWS CLI installed on your local machine and configured with your AWS account.
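A quick way to confirm the CLI is configured and pointed at the right account (and to grab the account ID you’ll need below):

# Prints the account ID and ARN of the configured credentials
aws sts get-caller-identity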

2. Create the ECR repository

Amazon Elastic Container Registry (ECR) is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. We use an ECR repository to store our Docker image for AWS Batch to pull from.

We’ll first create the ECR repository using CloudFormation. If AWS supports it, I try to implement all infrastructure through CloudFormation. This follows the Infrastructure as Code paradigm and lets you version your infrastructure by storing the templates in version control.

MyBatchRepo:
  Type: "AWS::ECR::Repository"
  Properties:
    RepositoryName: "my-batch-job"
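Assuming this snippet lives in its own template file (I’ll call it ecr.yml here), you can create the repository with the CLI:

# Create/update the ECR repository stack
aws cloudformation deploy \
  --template-file ecr.yml \
  --stack-name my-batch-ecr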

The next step is to build our Node.js Docker container and push it up to ECR. We’ll create a simple Bash script (remember to replace your AWS account Id):

#!/bin/bash

ACCOUNT=123456789012 # REMEMBER TO REPLACE THE AWS ACCOUNT ID
DOCKER_CONTAINER=my-batch-job
REPO=${ACCOUNT}.dkr.ecr.us-west-2.amazonaws.com/${DOCKER_CONTAINER}
TAG=build-$(date -u "+%Y-%m-%d")

echo "Building Docker Image..."
docker build -t $DOCKER_CONTAINER .

echo "Authenticating against AWS ECR..."
eval $(aws ecr get-login --no-include-email --region us-west-2)

echo "Tagging ${REPO}..."
docker tag $DOCKER_CONTAINER:latest $REPO:$TAG
docker tag $DOCKER_CONTAINER:latest $REPO:latest

echo "Deploying to AWS ECR..."
docker push $REPO

The next step is to make the script executable:

$ chmod +x deploy-docker.sh

And now we can execute it:

$ ./deploy-docker.sh
# Building Docker Image...
# Authenticating against AWS ECR...
# Tagging Repo...
# Deploying to AWS ECR...

After pushing the Docker image up to ECR, we’ll create the AWS Batch job environment, again using CloudFormation.

Prerequisite: I assume you have already created a Service Role for AWS Batch. If you spin up your first AWS Batch environment in the AWS Web Console, AWS will create this Service Role for you.
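If you’d rather manage that Service Role in CloudFormation as well, a minimal sketch looks like the following: it trusts the AWS Batch service and attaches the AWS-managed AWSBatchServiceRole policy. If you go this route, point the compute environment’s ServiceRole at this role’s ARN (e.g., !GetAtt BatchServiceRole.Arn) instead of the console-created one.

BatchServiceRole:
  Type: "AWS::IAM::Role"
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: "Allow"
          Principal:
            Service: "batch.amazonaws.com"
          Action: "sts:AssumeRole"
    ManagedPolicyArns:
      - "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"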

We start by defining our Compute Environment:

ComputeEnvironment:
  Type: "AWS::Batch::ComputeEnvironment"
  Properties:
    Type: "MANAGED"
    ServiceRole: !Sub "arn:aws:iam::${AWS::AccountId}:role/service-role/AWSBatchServiceRole"
    ComputeEnvironmentName: !Sub "${Environment}-batch-environment"
    ComputeResources:
      MaxvCpus: 12
      SecurityGroupIds:
        - !Ref "SecurityGroup"
      Type: "EC2"
      Subnets: !Ref "Subnets"
      MinvCpus: 0
      InstanceRole: !Ref "ECSInstanceProfile"
      InstanceTypes:
        - "m3.medium"
        - "m3.large"
        - "m3.xlarge"
        - "m4.large"
        - "m4.xlarge"
      Tags: {"Name": !Sub "${Environment} - Batch Instance"}
      DesiredvCpus: 2
    State: "ENABLED"

And then we create the Job Queue:

JobQueue:
  DependsOn: ComputeEnvironment
  Type: "AWS::Batch::JobQueue"
  Properties:
    ComputeEnvironmentOrder:
      - Order: 1
        ComputeEnvironment: !Sub "${Environment}-batch-environment"
    State: "ENABLED"
    Priority: 1
    JobQueueName: "HighPriority"

And then we define the Job:

Job:
  Type: "AWS::Batch::JobDefinition"
  Properties:
    Type: "container"
    JobDefinitionName: !Sub "${Environment}-batch-job"
    ContainerProperties:
      Memory: 1024
      Privileged: false
      JobRoleArn: !Ref "JobRole"
      ReadonlyRootFilesystem: true
      Vcpus: 1
      Image: !Sub "${AWS::AccountId}.dkr.ecr.us-west-2.amazonaws.com/${DockerImage}"
    RetryStrategy:
      Attempts: 0

And the IAM role for the Job:

JobRole:
  Type: "AWS::IAM::Role"
  Properties:
    RoleName: !Sub "${Environment}-BatchJobRole"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Action: "sts:AssumeRole"
          Effect: "Allow"
          Principal:
            Service: "ecs-tasks.amazonaws.com"
    Policies:
      - PolicyName: !Sub "${Environment}-s3-access"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Action: "s3:GetObject"
              Resource: !Sub "arn:aws:s3:::batch-${AWS::AccountId}-${AWS::Region}/upload/*"

Next, we create the ECS Instance Profile and its IAM Role:

ECSInstanceProfile:
  Type: "AWS::IAM::InstanceProfile"
  Properties:
    Path: "/"
    Roles:
      - !Ref "ECSRole"

ECSRole:
  Type: "AWS::IAM::Role"
  Properties:
    Path: "/"
    RoleName: !Sub "${Environment}-batch-ecs-role"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Action: "sts:AssumeRole"
          Effect: "Allow"
          Principal:
            Service: "ec2.amazonaws.com"
    Policies:
      - PolicyName: !Sub "${Environment}-ecs-batch-policy"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Action:
                - "ecs:CreateCluster"
                - "ecs:DeregisterContainerInstance"
                - "ecs:DiscoverPollEndpoint"
                - "ecs:Poll"
                - "ecs:RegisterContainerInstance"
                - "ecs:StartTelemetrySession"
                - "ecs:StartTask"
                - "ecs:Submit*"
                - "logs:CreateLogStream"
                - "logs:PutLogEvents"
                - "logs:DescribeLogStreams"
                - "logs:CreateLogGroup"
                - "ecr:BatchCheckLayerAvailability"
                - "ecr:BatchGetImage"
                - "ecr:GetDownloadUrlForLayer"
                - "ecr:GetAuthorizationToken"
              Resource: "*"
      - PolicyName: !Sub "${Environment}-ecs-instance-policy"
        PolicyDocument:
          Statement:
            - Effect: "Allow"
              Action:
                - "ecs:DescribeContainerInstances"
                - "ecs:ListClusters"
                - "ecs:RegisterTaskDefinition"
              Resource: "*"
            - Effect: "Allow"
              Action:
                - "ecs:*"
              Resource: "*"

Lastly, we create the AWS Security group:

SecurityGroup:
  Type: "AWS::EC2::SecurityGroup"
  Properties:
    VpcId: !Ref "VpcID"
    GroupDescription: "Inbound security group for SSH on Batch EC2 instance"
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: "22"
        ToPort: "22"
        CidrIp: !Ref CIDR
    Tags:
      - Key: Name
        Value: !Sub "${Environment}-batch-sg"

Using the same CloudFormation template, we’ll create a Lambda function to trigger the job.

3. Create the Lambda to trigger the AWS Batch Job

LambdaFunction:
  Type: "AWS::Lambda::Function"
  Properties:
    Description: "A Lambda function that triggers AWS Batch for S3 processing"
    Handler: "index.handler"
    Runtime: "nodejs6.10"
    Timeout: 30
    Role: !GetAtt LambdaRole.Arn
    Environment:
      Variables:
        JOB_DEFINITION: !Ref Job
        JOB_NAME: !Sub "${Environment}-batch-s3-processor"
        JOB_QUEUE: "HighPriority"
    Code:
      ZipFile: !Sub |
        const AWS = require('aws-sdk');
        const batch = new AWS.Batch({apiVersion: '2016-08-10'});

        exports.handler = function(event, context, callback) {
          console.log('s3 object', event.Records[0].s3);

          const params = {
            jobDefinition: process.env.JOB_DEFINITION,
            jobName: process.env.JOB_NAME,
            jobQueue: process.env.JOB_QUEUE,
            containerOverrides: {
              environment: [
                {
                  name: 'MY_BUCKET',
                  value: event.Records[0].s3.bucket.name
                },
                {
                  name: 'MY_KEY',
                  value: event.Records[0].s3.object.key
                }
              ]
            }
          };

          batch.submitJob(params, function(err, data) {
            if (err) {
              console.log(err, err.stack);
              return callback(err);
            } else {
              console.log(data);
              return callback();
            }
          });
        };

LambdaRole:
  Type: "AWS::IAM::Role"
  Properties:
    Path: "/"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: "Allow"
          Principal:
            Service:
              - "lambda.amazonaws.com"
          Action:
            - "sts:AssumeRole"
    Policies:
      - PolicyName: !Sub "${Environment}-aws-batch-access"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - batch:*
              Resource: "*"

4. Create the S3 Bucket

We’ll now create an S3 bucket in the same CloudFormation template so we can upload our sample file. Bucket names must be globally unique, so for this example I append the AWS account ID and Region to the bucket name using CloudFormation’s built-in Pseudo Parameters (e.g., “batch-123456789012-us-west-2”). Because those values are specific to your account and Region, you can reuse this template without changes and the name is very unlikely to be taken.

S3Bucket:
  Type: 'AWS::S3::Bucket'
  Properties:
    AccessControl: "Private"
    BucketName: !Sub "batch-${AWS::AccountId}-${AWS::Region}"

Configure the S3 Event Notification

The Amazon S3 notification feature enables you to receive notifications when certain events happen in your bucket. To enable notifications, you must first add a notification configuration identifying the events you want Amazon S3 to publish, and the destinations where you want Amazon S3 to send the event notification. Currently, S3 events can only push to three different types of destinations:

  • Amazon Simple Notification Service (SNS) topic
  • Amazon Simple Queue Service (SQS) Queue
  • AWS Lambda function

In our example, we will use AWS Lambda, since AWS Batch is not a supported destination. Our Lambda function will invoke AWS Batch for us whenever one of our configured S3 events occurs.

Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      AccessControl: "Private"
      BucketName: !Sub "batch-${AWS::AccountId}-${AWS::Region}"
      NotificationConfiguration:
        LambdaConfigurations:
          - Function: !GetAtt LambdaFunction.Arn
            Event: "s3:ObjectCreated:*"
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: upload
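One gotcha: S3 must also be granted permission to invoke the Lambda function, or the bucket’s notification configuration will fail to validate when the stack is created. A minimal AWS::Lambda::Permission resource for the template above might look like this (the bucket ARN is built from the same naming convention to avoid a circular reference):

LambdaInvokePermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    FunctionName: !GetAtt LambdaFunction.Arn
    Action: "lambda:InvokeFunction"
    Principal: "s3.amazonaws.com"
    SourceAccount: !Ref "AWS::AccountId"
    SourceArn: !Sub "arn:aws:s3:::batch-${AWS::AccountId}-${AWS::Region}"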

And that’s it! Every time we upload a file into the upload directory of our S3 bucket, S3 creates an event notification and invokes our AWS Lambda function. The only thing the Lambda function does is submit our AWS Batch job to the job queue. Currently, the job doesn't do anything interesting once it's running, but I'll leave that to your imagination.
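To try it end to end, deploy the stack and upload a file under the upload/ prefix, then watch the queue. The template file name, stack name, and parameter values below are placeholders; substitute your own account ID, Region, and VPC/subnet parameters:

# Deploy the full template (named IAM roles require this capability)
aws cloudformation deploy \
  --template-file batch.yml \
  --stack-name my-batch-pipeline \
  --capabilities CAPABILITY_NAMED_IAM

# Trigger the pipeline with a sample file
aws s3 cp sample.txt s3://batch-123456789012-us-west-2/upload/sample.txt

# Watch the job move through the queue
aws batch list-jobs --job-queue HighPriority --job-status RUNNING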

Further Learning

If AWS Batch doesn’t fit your use case, another option is AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics.
