
AWS S3 Batch Operations: Beginner’s Guide

If you’ve ever tried to run operations on a large number of objects in S3, you might have encountered a few hurdles. Listing all the files and running the operation on each object gets complicated and time-consuming as the number of objects scales up. Many decisions have to be made: is running the operations from my personal computer fast enough? Or should I run them from a server that’s closer to the AWS resources, benefiting from AWS’s fast internal network? If so, I’ll have to provision resources (e.g. EC2 instances, Lambda functions, containers) to run the job.

Thankfully, AWS has heard our pains and announced the AWS S3 Batch Operations preview at the last AWS re:Invent conference. This new service (which you can access by asking AWS politely) allows you to easily run operations on very large numbers of S3 objects in your bucket. Curious to know how it works? Let’s get going.

If you don’t have access to the S3 Batch Operations preview, fill in the form on this page. It took a couple of days before I got an answer from AWS, so arm yourself with patience.

Getting Started

Now that you have access to the preview, you can find the Batch operations tab in the side menu of the S3 console:

[Image: Access Batch operations from the S3 console]

Once you have reached the Batch operations console, let’s talk briefly about jobs.

Central to S3 Batch Operations is the concept of a Job. In a nutshell, a Job determines:

  • In which buckets your objects are located
  • What operation to do on the objects
  • Which objects to run the operations on

We’ll soon create our first job. But first, let’s create a test bucket, just to experiment a little with Batch Operations.

Before you create your first job, create a new bucket with a few objects. I created a new S3 bucket named “spgingras-batch-test” in which I uploaded 3 files (file1.jpg, file2.jpg, file3.jpg):

[Image: Contents of my bucket]

I know, it’s quite small, but for demonstration purposes it’s going to be just fine.

Next you’ll need to create a CSV file that contains two columns (bucket name, object key) for each object you want the job to operate on. In my case, I want the job to operate on all 3 files, so my CSV file looks like this:

[Image: Contents of the manifest file]
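In plain text, the manifest for the three example files is just one line per object, with the bucket name first and the object key second:

spgingras-batch-test,file1.jpg
spgingras-batch-test,file2.jpg
spgingras-batch-test,file3.jpg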

Now, save the CSV and upload it to your bucket. I named the file “manifest.csv”:

[Image: manifest.csv is now in my bucket]
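If you’d rather script this step, here is a minimal sketch using boto3 (assuming the bucket and file names from this example):

import boto3

# Upload the manifest to the same bucket the job will operate on.
s3 = boto3.client("s3")
s3.upload_file("manifest.csv", "spgingras-batch-test", "manifest.csv")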

Before we can create our first job, we must create an IAM role that Batch Operations can assume. This role will allow Batch Operations to read your bucket and modify the objects in it.

Here, I’m assuming you are familiar with creating IAM roles, so I won’t give screenshots for every step.

From the IAM console, create a new IAM role. Choose any service to use the role (it’s not important, as we’ll soon overwrite the trust policy for this role):

[Image: Choose any service. Here, I chose EC2, but it can be any other (Lambda, S3, etc.)]

Don’t choose any specific permissions for this role yet. Once the role is created, update the role’s Trust Relationship to:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "batchoperations.s3.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

For permissions, create a new inline policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObjectTagging",
        "s3:PutObjectVersionTagging"
      ],
      "Resource": "arn:aws:s3:::spgingras-batch-test/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::spgingras-batch-test/manifest.csv"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::spgingras-batch-test/*"
      ]
    }
  ]
}

Be sure to replace “spgingras-batch-test” with your own bucket’s name. Save the policy, and the IAM role is ready to be used. We’re now set to create our first job.
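If you prefer to create the role programmatically, here is a minimal sketch with boto3. It assumes the two JSON documents above are saved locally as trust-policy.json and batch-permissions.json (illustrative file names), and uses “batch-role” as the role name, which is the name used later in this walkthrough:

import boto3

iam = boto3.client("iam")

# The trust relationship and inline policy shown above, saved as local files.
with open("trust-policy.json") as f:
    trust_policy = f.read()
with open("batch-permissions.json") as f:
    permissions_policy = f.read()

# Create the role with the Batch Operations trust relationship,
# then attach the inline permissions policy.
iam.create_role(RoleName="batch-role", AssumeRolePolicyDocument=trust_policy)
iam.put_role_policy(
    RoleName="batch-role",
    PolicyName="batch-operations-permissions",
    PolicyDocument=permissions_policy,
)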

Creating our First Job

What we’ll want Batch Operations to help us with is adding tags to every object in the bucket. If you have ever done this before, you’ll know that it can be a pain in the butt to update tags on millions of S3 objects. Thankfully, it can be done in no time using Batch Operations.

From the Batch Operations console, click the “Create Job” button:

[Image: Go ahead, just click it]

In the first step, choose “CSV” (1) as the Manifest format. Also, enter the path to your manifest file (2) (mine is s3://spgingras-batch-test/manifest.csv):

[Image: The first screen]

Then, click “Next”. On the second screen you will decide what operation to run on the S3 objects. Choose the “Replace all tags” operation (1) and add new tags to the list (2). I chose to add the “type” and “environment” tags, but you can choose anything you want:

[Image: Here, you decide which tags to apply to the S3 objects]

Note that this will replace all tags on all objects in the manifest. Also, it’s pretty cool that at some point in the future, you’ll be able to invoke Lambda functions on your S3 objects! Once you’re done, click “Next”.

On the following screen, you will have to choose the IAM role you created previously. Remember, this role will be used by Batch Operations to play with your bucket. For this example, I named the IAM role simply “batch-role”. Uncheck the “Generate completion report” checkbox (1) (you don’t need it for this demo) and pick the IAM role from the dropdown (2):

[Image: Uncheck “Generate completion report” and select the previously created IAM role]

Now, click “Next”. On the following screen, review the details to make sure everything is OK, and click “Create job”. The job is now created, and we can run it.
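For reference, the same job can be created through the API. Here is a minimal sketch using boto3’s s3control client; the account ID and tag values are placeholders, and “batch-role” is the role created earlier:

import boto3

s3 = boto3.client("s3")
s3control = boto3.client("s3control")

ACCOUNT_ID = "123456789012"          # placeholder: your AWS account ID
BUCKET = "spgingras-batch-test"

# The CSV manifest's ETag is required when pointing the job at it.
etag = s3.head_object(Bucket=BUCKET, Key="manifest.csv")["ETag"].strip('"')

response = s3control.create_job(
    AccountId=ACCOUNT_ID,
    ConfirmationRequired=True,       # the job waits for confirmation, as in the console
    RoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/batch-role",
    Priority=10,
    Operation={
        "S3PutObjectTagging": {      # the "Replace all tags" operation
            "TagSet": [
                {"Key": "type", "Value": "image"},         # example values
                {"Key": "environment", "Value": "test"},
            ]
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": f"arn:aws:s3:::{BUCKET}/manifest.csv",
            "ETag": etag,
        },
    },
    Report={"Enabled": False},       # no completion report, as in this walkthrough
)
print(response["JobId"])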

Running Your Job

Now that the job is created, it’s time to run it. From the Batch Operations console, click on the Job’s ID:

[Image: Find your job in the Batch operations console and click on the job’s ID]

In the job’s description screen, click on the “Confirm and run” button:

[Image: Hit that button to start the magic]

And in the next screen, confirm the details and click “Run job”. Now, go back to the Batch Operations console. Wait until your job’s status (1) is “Complete”. Spam that refresh button (2) if needed:

[Image: Refresh your job’s status until it’s marked as Complete]
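The confirmation step can also be scripted. Here is a small sketch with boto3 (same placeholder account ID as before; the job ID comes from the console or from the create_job response):

import time
import boto3

s3control = boto3.client("s3control")

ACCOUNT_ID = "123456789012"   # placeholder: your AWS account ID
JOB_ID = "your-job-id"        # placeholder: the ID shown in the console

# Equivalent of clicking "Confirm and run": move the job to the Ready state.
s3control.update_job_status(
    AccountId=ACCOUNT_ID,
    JobId=JOB_ID,
    RequestedJobStatus="Ready",
)

# Poll until the job finishes instead of spamming the refresh button.
while True:
    status = s3control.describe_job(AccountId=ACCOUNT_ID, JobId=JOB_ID)["Job"]["Status"]
    print(status)
    if status in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(10)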

Now that the job is complete, go back to your bucket and open the Properties pane of one of the objects:

[Image: Tags are all set!]

You’ll notice that all of the object’s tags have been updated. The same will be true for every other object you included in the manifest file.
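You can verify the tags without the console too; here is a quick check with boto3, using the example bucket and one of the example files:

import boto3

s3 = boto3.client("s3")

# Fetch the tags that the batch job applied to one of the objects.
tags = s3.get_object_tagging(Bucket="spgingras-batch-test", Key="file1.jpg")
print(tags["TagSet"])   # e.g. the "type" and "environment" tags set by the job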

Wrap Up

Using S3 Batch Operations, it’s now pretty easy to modify S3 objects at scale. Simply list the files you want to act on in a manifest, create a job and run it. No servers to create, no scaling to manage. With this new feature of S3, here are some ideas of tasks you could run:

  • copy S3 objects in bulk from one bucket to another (see the sketch after this list)
  • send media files to Elastic Transcoder
  • retroactively update tags on old S3 objects
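For example, copying objects in bulk only requires swapping the Operation passed when creating the job. A minimal sketch (the destination bucket name is a placeholder; the job’s IAM role would also need PutObject permissions on that bucket):

# Passed as the Operation argument of s3control.create_job(...)
# in place of the tagging operation shown earlier.
copy_operation = {
    "S3PutObjectCopy": {
        # Destination bucket for the copies (placeholder name).
        "TargetResource": "arn:aws:s3:::my-destination-bucket"
    }
}

Swap in the operation you need, point the job at a manifest, and Batch Operations takes care of the rest.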
