If you’ve ever tried to run operations on a large number of objects in S3, you might have encountered a few hurdles. Listing all the files and running the operation on each object gets complicated and time-consuming as the number of objects scales up. Many decisions have to be made: is running the operations from my personal computer fast enough? Or should I run them from a server that’s closer to the AWS resources, benefiting from AWS’s fast internal network? If so, I’ll have to provision resources (e.g. EC2 instances, Lambda functions, containers) to run the job.
Thankfully, AWS has heard our pains and announced the S3 Batch Operations preview at the last AWS re:Invent conference. This new service (which you can access by asking AWS politely) lets you easily run operations on very large numbers of S3 objects in your bucket. Curious to know how it works? Let’s get going.
Accessing the Preview
If you don’t have access to the S3 Batch Operations preview, fill in the form on this page. It took a couple of days before I got an answer from AWS, so arm yourself with patience.
Once you have access to the preview, you can find the Batch Operations tab in the sidebar of the S3 console:
Once you have reached the Batch operations console, let’s talk briefly about jobs.
Central to S3 Batch Operations is the concept of a job. In a nutshell, a job determines:
- In which buckets your objects are located
- What operation to do on the objects
- Which objects to run the operations on
We’ll soon create our first job. But first, let’s create a test bucket, just to experiment a little with Batch Operations.
Creating the Test Bucket
Before you create your first job, create a new bucket with a few objects. I created a new S3 bucket named “spgingras-batch-test”, to which I uploaded 3 files (file1.jpg, file2.jpg, file3.jpg):
I know, it’s quite small, but for demonstration purposes it’s going to be just fine.
Next, you’ll need to create a CSV file with 2 columns (bucket name, object name) and one row for each object you want the job to operate on. In my case, I want the job to operate on all 3 files, so my CSV file has 3 rows, each holding the bucket name followed by a file name (for example, spgingras-batch-test,file1.jpg).
Now, save the CSV and upload it to your bucket. I named the file “manifest.csv”:
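For a bucket with more than a handful of objects, you would probably not write the manifest by hand. Here is a minimal sketch of generating it with Python’s standard library. The `build_manifest` helper is a name I made up for illustration; the bucket and file names are the ones from this demo.

```python
import csv
import io


def build_manifest(bucket, keys):
    """Return manifest CSV text with one 'bucket,key' row per object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, key])
    return buffer.getvalue()


manifest = build_manifest(
    "spgingras-batch-test",
    ["file1.jpg", "file2.jpg", "file3.jpg"],
)

# Write the manifest locally; uploading it to the bucket is a separate step.
with open("manifest.csv", "w", newline="") as handle:
    handle.write(manifest)

print(manifest, end="")
```

The simple file names used here need no special treatment. As far as I know, AWS expects object keys in a CSV manifest to be URL-encoded, so keys containing spaces or special characters would likely need something like `urllib.parse.quote` first.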
Before we can create our first job, we must create an IAM role that Batch Operations can assume. This role will allow Batch Operations to read your bucket and modify the objects in it.
Creating the IAM Role for Batch Operations
Here, I’m assuming you are familiar with creating IAM roles. I won’t give screenshots for all steps required to create the IAM role.
From the IAM console, create a new IAM role. Choose any service to use the role (it’s not important, as we’ll soon overwrite the trust policy for this role):
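Don’t choose any specific permissions for this role yet. Once the role is created, update the role’s Trust Relationship so that the S3 Batch Operations service principal (batchoperations.s3.amazonaws.com) is allowed to assume the role.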
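A trust policy along these lines should do the job. This is a sketch built around the service principal AWS documents for S3 Batch Operations, not necessarily the exact document I pasted in the console. The snippet only prints the JSON so you can copy it into the Trust Relationship editor.

```python
import json

# Assumption: "batchoperations.s3.amazonaws.com" is the service principal
# that S3 Batch Operations uses when it assumes the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "batchoperations.s3.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```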
For permissions, create a new inline policy that lets the role read the manifest and write tags on the objects in your bucket.
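Here is a minimal sketch of such a policy for the tagging job in this demo. The `build_tagging_policy` helper is hypothetical, and the action list is my reading of what a tag-replacement job needs: tagging permissions on the objects and read access to the manifest. If you keep the completion report enabled, the role would presumably also need `s3:PutObject` on the report location.

```python
import json


def build_tagging_policy(bucket, manifest_key="manifest.csv"):
    """Build an inline policy for a tag-replacement job on one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Let the job write tags on any object in the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObjectTagging", "s3:PutObjectVersionTagging"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                # Let the job read the manifest file.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"{bucket_arn}/{manifest_key}",
            },
        ],
    }


policy = build_tagging_policy("spgingras-batch-test")
print(json.dumps(policy, indent=2))
```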
Be sure to replace “spgingras-batch-test” with your own bucket’s name. Save the policy, and the IAM role is ready to be used. We’re now set to create our first job.
Creating Our First Job
What we want Batch Operations to help us with is adding a tag to every object in the bucket. If you have ever done this before, you’ll know that it can be a pain in the butt to update tags on millions of S3 objects. Thankfully, it’s a cinch with Batch Operations.
From the Batch Operations console, click the “Create Job” button:
In the first step, choose “CSV” (1) as the Manifest format. Also, enter the path to your manifest file (2) (mine is s3://spgingras-batch-test/manifest.csv):
Then, click “Next”. On the second screen, you will decide what operation to run on the S3 objects. Choose the “Replace all tags” operation (1), and add new tags to the list (2). I chose to add the “type” and “environment” tags, but you can choose anything you want:
Note that this will replace all tags on all objects in the manifest. Also, it’s pretty cool that at some point in the future, you’ll be able to invoke Lambda functions on your S3 objects! Once you’re done, click “Next”.
On the following screen, you will have to choose the IAM role you created previously. Remember, this role will be used by Batch Operations to play with your bucket. For this example, I have named the IAM role simply “batch-role”. Uncheck “Generate completion report” (1) (you don’t need that for the demo) and pick the IAM role from the dropdown (2):
Now, click “Next”. On the following screen, review the details to make sure everything is OK, and click “Create job”. The job is now created, and we can run it.
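The same choices can in principle be expressed through the API instead of the console. The sketch below assumes the `create_job` call of boto3’s `s3control` client and mirrors the console steps: a CSV manifest, a tag replacement, no completion report, and the IAM role. The account ID, ETag, and tag values are placeholders I made up; the article only fixes the tag names “type” and “environment”.

```python
import uuid


def build_create_job_request(account_id, role_arn, manifest_arn, manifest_etag, tags):
    """Assemble keyword arguments for s3control.create_job (tag replacement)."""
    return {
        "AccountId": account_id,
        # Keep the manual "Confirm and run" step, as in the console flow.
        "ConfirmationRequired": True,
        "Operation": {
            "S3PutObjectTagging": {
                "TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]
            }
        },
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {"Enabled": False},  # no completion report for the demo
        "Priority": 10,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
        "Description": "Replace all tags on the objects listed in the manifest",
    }


def create_job(request):
    """Submit the job; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # third-party dependency, imported lazily

    client = boto3.client("s3control")
    return client.create_job(**request)["JobId"]


request = build_create_job_request(
    account_id="123456789012",  # placeholder account ID
    role_arn="arn:aws:iam::123456789012:role/batch-role",
    manifest_arn="arn:aws:s3:::spgingras-batch-test/manifest.csv",
    manifest_etag="replace-with-the-manifest-etag",  # placeholder
    tags={"type": "image", "environment": "test"},  # hypothetical values
)
print(request["Operation"])
```

Calling `create_job(request)` would submit the job and return its ID; I have left that call out so the snippet runs without credentials.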
Running Your Job
Now that the job is created, it’s time to run it. From the Batch Operations console, click on the Job’s ID:
In the job’s description screen, click on the “Confirm and run” button:
On the next screen, confirm the details and click “Run job”. Now, go back to the Batch Operations console. Wait until your job’s status (1) is “Complete”. Spam that refresh button (2) if needed:
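If spamming the refresh button gets old, the confirm-and-wait step can be scripted. This is a rough sketch that assumes the `describe_job` and `update_job_status` calls of boto3’s `s3control` client, and assumes that a job awaiting confirmation reports the status “Suspended”. The `confirm_and_wait` helper and its parameters are my own invention.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled"}


def confirm_and_wait(client, account_id, job_id, poll_seconds=5.0, max_polls=120):
    """Confirm a job once it awaits confirmation, then poll until it ends."""
    confirmed = False
    for _ in range(max_polls):
        job = client.describe_job(AccountId=account_id, JobId=job_id)["Job"]
        status = job["Status"]
        if status == "Suspended" and not confirmed:
            # Equivalent of clicking "Confirm and run" in the console.
            client.update_job_status(
                AccountId=account_id, JobId=job_id, RequestedJobStatus="Ready"
            )
            confirmed = True
        elif status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

You would pass in `boto3.client("s3control")` along with your account ID and the job’s ID. The function returns the final status, so anything other than “Complete” deserves a closer look.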
Now that the job is completed, go back to your bucket. Open the Properties pane of one of the objects:
You’ll notice that all tags of the object have been updated. The same will be true for every other object you included in the manifest file.
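Checking every object by hand stops being practical well before you reach millions of objects. Here is a small sketch of doing the check in code, assuming the `get_object_tagging` call of boto3’s `s3` client. The helper names and the expected tag values are hypothetical.

```python
def tags_as_dict(tagging_response):
    """Convert a get_object_tagging response into a plain {key: value} dict."""
    return {tag["Key"]: tag["Value"] for tag in tagging_response.get("TagSet", [])}


def find_mismatched(s3_client, bucket, keys, expected_tags):
    """Return the keys whose tags differ from expected_tags."""
    mismatched = []
    for key in keys:
        response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
        if tags_as_dict(response) != expected_tags:
            mismatched.append(key)
    return mismatched
```

With `boto3.client("s3")` as the client, an empty list from `find_mismatched` would mean every object listed carries exactly the tags the job was supposed to set.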
Using S3 Batch Operations, it’s now pretty easy to modify S3 objects at scale. Simply list the files you want to act on in a manifest, create a job and run it. No servers to create, no scaling to manage. With this new feature of S3, here are some ideas of tasks you could run:
- copy S3 objects in bulk from one bucket to another
- send media files to Elastic Transcoder
- retroactively update tags on old S3 objects