Syncing S3 buckets from different providers/endpoints (part 2)

Bo Bracquez
3 min read · Nov 13, 2019


We previously investigated several methods for syncing S3 buckets between different endpoints, and my personal preference was the native AWS CLI tools. But how can we actually apply this? It is investigation time!

Poking the brain a bit

I have created a folder in /tmp named ‘temp-bucket’ (mkdir /tmp/temp-bucket). In this folder we will sync the contents of the live S3 bucket to our local filesystem, and then sync the local filesystem to the mocked S3 bucket.

$ aws s3 sync s3://live-bucket .
$ s3local sync . s3://mocked

This seems to work fairly well, but how should we go about doing a full sync? We know that we can list all the buckets with the AWS CLI tools, but how can we sync all of them? I am thinking about the following flow. (There is a text-based version below the image.)

Flow of the sync (text-based version below)

S3 to local filesystem

List all buckets: invoke the AWS CLI command (aws s3api list-buckets)

Local folder: create a folder per bucket from the bucket list in /tmp (we might end up using a UUID for the temp folders too, e.g. /tmp/$UUID/$BUCKET)

Sync: sync the contents of the S3 bucket to the local folder in /tmp (aws s3 sync s3://$BUCKET $BUCKET)

Local filesystem to S3

List all folders: list all the folders in /tmp

Force param: accept a parameter (bool) that determines whether a bucket should be created if it does not exist yet (a minimal sketch of this check is shown right after this list)

Sync to S3: Sync the local FS folder to the S3 bucket (aws s3 sync $BUCKET s3://$BUCKET)
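
For the force param step, here is a minimal sketch of what that check could look like, assuming a hypothetical $FORCE flag, the bucket name in a $BUCKET variable, and the mocked endpoint that shows up later in this post:

# Only create the bucket on the mocked endpoint when the force parameter is set
# ($FORCE and $BUCKET are placeholder variables)
if [ "$FORCE" = "true" ]; then
  # head-bucket exits non-zero when the bucket does not exist (or is not accessible)
  if ! aws --endpoint=http://192.168.99.100:4572 s3api head-bucket --bucket "$BUCKET" 2>/dev/null; then
    aws --endpoint=http://192.168.99.100:4572 s3 mb "s3://$BUCKET"
  fi
fi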

Scripting time!

Now that we have a basic flow and an idea of how we want to solve this issue, we need to start scripting. As I am way more ‘fluent’ with UNIX-based systems than with Windows (in terms of scripting), I will be using bash. So how do we make a ‘for loop’ over every S3 bucket?

As we can see below, if we list the buckets we get a JSON object back. This will be interesting to work with (as I have never done this)…

$ aws --endpoint=http://192.168.99.100:4572 s3api list-buckets
{
    "Buckets": [
        {
            "Name": "mocked",
            "CreationDate": "2006-02-03T16:45:09.000Z"
        },
        {
            "Name": "mybucket",
            "CreationDate": "2006-02-03T16:45:09.000Z"
        }
    ],
    "Owner": {
        "DisplayName": "webfile",
        "ID": "bcaf1ffd86f41161ca5fb16fd081034f"
    }
}
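
As a side note: the AWS CLI can also filter this JSON by itself through its --query option (a JMESPath expression), so something like the line below prints only the bucket names. I did not end up using it for the rest of this post, though.

$ aws --endpoint=http://192.168.99.100:4572 s3api list-buckets --query "Buckets[].Name" --output text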

However, I found another way to list all the S3 buckets. And this looks a lot more fun and workable to me than JSON in bash. (Note: this is just a personal preference.)

$ aws --endpoint=http://192.168.99.100:4572 s3 ls
2006-02-03 17:45:09 mocked
2006-02-03 17:45:09 mybucket

Next up we need to extract the names from the output (with a regex, but this time from the live environment and not the locally mocked one), create a folder for each bucket and sync it with the AWS CLI tools. Sounds pretty easy to do!

Before we start the script, we need some sort of UUID to create a unique-ish folder under /tmp. But how would we achieve this under bash? Apparently it is rather easy to do: while digging through the Linux kernel I found something rather useful for us here, something that generates UUIDs for us :o! We can simply invoke it by calling cat /proc/sys/kernel/random/uuid.
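
In the script that could look something like this (TMP_ROOT is just a placeholder name):

# Unique-ish working directory under /tmp, named after a freshly generated UUID
TMP_ROOT="/tmp/$(cat /proc/sys/kernel/random/uuid)"
mkdir -p "$TMP_ROOT"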

The regex for this is rather simple: ‘\s(\w+)$’. With this regex we match the last word on every line (in this case mocked and mybucket). Now we can use commands such as mkdir to create the folders locally, then aws s3 sync (live to local). Next up we list all the folders with ls (hint: we can use -d */ as parameters to only list directories), loop over them, call aws s3 sync again (local to mocked), and we are done!
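
Putting everything together, a rough sketch of the full script could look like the snippet below. The grep -P call is simply the regex from above adapted for GNU grep (note that \w+ only matches plain word characters, so bucket names containing dashes or dots would need a broader pattern), and the force/bucket-creation logic from the flow is left out to keep it short.

#!/bin/bash
# Rough sketch: the live environment uses the default AWS CLI configuration,
# the mocked environment is the endpoint used earlier in this post.
MOCKED_ENDPOINT="http://192.168.99.100:4572"

# Unique-ish working directory under /tmp
TMP_ROOT="/tmp/$(cat /proc/sys/kernel/random/uuid)"
mkdir -p "$TMP_ROOT"
cd "$TMP_ROOT" || exit 1

# Live S3 to local filesystem: one folder per bucket
for BUCKET in $(aws s3 ls | grep -oP '\s\K\w+$'); do
  mkdir -p "$BUCKET"
  aws s3 sync "s3://$BUCKET" "$BUCKET"
done

# Local filesystem to mocked S3
for DIR in $(ls -d */); do
  BUCKET="${DIR%/}"
  aws --endpoint="$MOCKED_ENDPOINT" s3 sync "$BUCKET" "s3://$BUCKET"
done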
