Batch Operation — S3
Perform bulk operations on S3 objects at scale
Credits for SDK testing: Parikshit Maheshwari
While working with S3 as a data lake, we often have to perform operations that are bulk in nature. For example, once we drop a table partition in Hive, we apply a deletion-policy tag so that all objects in that partition get deleted automatically from S3 depending upon the tag value. Tagging such a huge number of objects one by one is time-consuming, and if we use Lambda it may even time out. To avoid such cases, we can use S3 Batch Operations.
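For context, the deletion policy itself can be a tag-based lifecycle rule. Below is a minimal boto3 sketch; the tag key, value, and expiry period are assumptions for illustration, not taken from the original setup.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="<bucket-name>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-partition-objects",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "delete-after", "Value": "7"}},  # assumed tag key/value
                # Expiration counts days from object creation, not from when the tag was applied
                "Expiration": {"Days": 7},
            }
        ]
    },
)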
The architecture of a batch operation is as follows:
There are three components of a batch operation:
Job: This is the basic unit of work in Batch Operations. It contains all of the information necessary to run the specified operation on a list of objects.
Operation: This is the API action, such as replacing tags. A job and an operation have a one-to-one mapping.
Task: A task is a unit of execution for a job. The job creates a single task for each object specified in the manifest file.
Creating a Batch Job:
Prerequisites:
Manifest file: We can create the manifest in CSV format or from an S3 Inventory report. In this example, I will create it using CSV. Below is the content of my manifest file.
<bucket-name>,path/to/file/5849_scheduler.logs
<bucket-name>,path/to/file/5849_webserver.logs
<bucket-name>,path/to/file/5898-scheduler.logs
Note: The manifest file does not support regex or wildcards; object keys should be full paths. We can leverage AWS S3 Inventory to get the list of files.
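Since wildcards are not supported, one simple way to build the manifest is to list every object under the partition prefix and write one row per object. A minimal boto3 sketch (the bucket, prefix, and manifest key are placeholders):
import boto3

s3 = boto3.client("s3")
bucket = "<bucket-name>"          # placeholder bucket
prefix = "path/to/file/"          # assumed partition prefix whose objects we want to tag

# Write one "<bucket>,<full object key>" row per object
with open("manifest.csv", "w") as manifest:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            manifest.write(f"{bucket},{obj['Key']}\n")

# Upload the manifest so the batch job can read it
s3.upload_file("manifest.csv", bucket, "path/to/manifest.csv")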
IAM Role: A role with access to all the bucket locations where the manifest, the actual objects, and the report will reside. I created a role with the below policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/path/to/data/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/path/to/manifest.csv"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/path/to/report/*"
            ]
        }
    ]
}
Also, we add a trust policy to this role so that S3 Batch Operations can assume it:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "batchoperations.s3.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
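If we prefer scripting this step, a minimal boto3 sketch that creates the role with the trust policy and attaches the permissions policy above as an inline policy. The role name, policy name, and local file names are assumptions.
import boto3

iam = boto3.client("iam")
role_name = "s3-batch-tagging-role"              # assumed role name

# Trust policy and permissions policy saved locally from the JSON shown above
with open("trust-policy.json") as f:
    trust_policy = f.read()
with open("permissions-policy.json") as f:
    permissions_policy = f.read()

role = iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=trust_policy)
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="s3-batch-tagging-permissions",   # assumed inline policy name
    PolicyDocument=permissions_policy,
)
print(role["Role"]["Arn"])                       # RoleArn we will pass when creating the job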
Now we create the S3 batch job using the console. Remember that the role/identity creating this batch job should have the s3:CreateJob permission and the iam:PassRole permission on the role we created above.
I selected CSV as the manifest format and added the manifest file path. Then I clicked Next.
Under the operation, I chose “Replace all object tags” and provided the tag key-value pair.
Scroll down and fill in the permission details, including the IAM role we created above.
Once filled in, click the Next button. On the next page, review the details and click Create Job. The job will be created and will go into the “Awaiting your confirmation to run” state. Select the job and click Run Job to start it. The console shows the completion percentage and, once done, displays the following screen.
That's it. We can click on the job ID to see its details as well as the report location. In this blog, we saw how to create a batch job for an S3 operation. Although we did everything from the console, we can do it from the CLI/SDK as well. When using the SDK/CLI, we need the ETag of the manifest file, the AWS account ID, and the bucket ARN, and we need to pass the manifest format (S3BatchOperations_CSV_20180820) and set ConfirmationRequired to False. The last parameter allows the job to run immediately.
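For reference, here is a minimal boto3 sketch of the same tagging job created via the SDK. The account ID, role name, tag key/value, and paths are assumptions carried over from the examples above.
import boto3

account_id = "123456789012"                                   # assumption: your AWS account ID
bucket = "<bucket-name>"
role_arn = f"arn:aws:iam::{account_id}:role/s3-batch-tagging-role"  # assumed role name

s3 = boto3.client("s3")
s3control = boto3.client("s3control")

# The manifest ETag is required when creating the job
etag = s3.head_object(Bucket=bucket, Key="path/to/manifest.csv")["ETag"].strip('"')

job = s3control.create_job(
    AccountId=account_id,
    ConfirmationRequired=False,                               # run immediately, no manual "Run Job"
    Priority=10,
    RoleArn=role_arn,
    Operation={
        "S3PutObjectTagging": {
            "TagSet": [{"Key": "delete-after", "Value": "7"}]  # assumed tag key/value
        }
    },
    Manifest={
        "Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]},
        "Location": {
            "ObjectArn": f"arn:aws:s3:::{bucket}/path/to/manifest.csv",
            "ETag": etag,
        },
    },
    Report={
        "Bucket": f"arn:aws:s3:::{bucket}",                   # bucket ARN, not bucket name
        "Prefix": "path/to/report",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "AllTasks",
    },
)
print(job["JobId"])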
Batch operations take their own sweet time. A job goes through the following states before it is complete:
Preparing -> Suspended -> Ready -> Active -> Completing -> Complete
Other possible states are Failed, Cancelling, and Cancelled.
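If we just want to poll for the state instead of watching the console, a minimal boto3 sketch using describe_job; the account ID and job ID are placeholders.
import time
import boto3

s3control = boto3.client("s3control")
account_id = "123456789012"       # assumption: your AWS account ID
job_id = "<job-id>"               # the JobId returned by create_job / shown in the console

while True:
    status = s3control.describe_job(AccountId=account_id, JobId=job_id)["Job"]["Status"]
    print(status)
    if status in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(30)                # batch jobs can take a while; poll every 30 seconds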
We might want an event-driven way to get notified once the job is complete. Thanks to EventBridge, we can implement this using rules: we filter the status-change event from CloudTrail and notify the stakeholders. The filter pattern is as follows:
{
    "source": [
        "aws.s3"
    ],
    "detail-type": [
        "AWS Service Event via CloudTrail"
    ],
    "detail": {
        "eventSource": [
            "s3.amazonaws.com"
        ],
        "eventName": [
            "JobStatusChanged"
        ],
        "serviceEventDetails": {
            "status": [
                "Complete"
            ]
        }
    }
}
We can plug in SNS or Lambda as the rule target to receive the event message.
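A minimal boto3 sketch that creates the EventBridge rule with the pattern above and points it at an SNS topic; the rule name and topic ARN are assumptions.
import json
import boto3

events = boto3.client("events")

# Same filter pattern as shown above
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["JobStatusChanged"],
        "serviceEventDetails": {"status": ["Complete"]},
    },
}

events.put_rule(Name="s3-batch-job-complete", EventPattern=json.dumps(pattern))
events.put_targets(
    Rule="s3-batch-job-complete",
    Targets=[{
        "Id": "notify-stakeholders",
        "Arn": "arn:aws:sns:us-east-1:123456789012:s3-batch-alerts",  # assumed SNS topic ARN
    }],
)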
Happy cloud computing!!