Recurring data delivery and ingestion with S3 bucket replication

Simplify data delivery and ingestion and set it up in 5 minutes with AWS CDK

Angely Philip
YipitData Engineering
3 min read · Jun 1, 2021


Photo by Franki Chamaki on Unsplash

What is bucket replication?

S3 can copy data automatically from one bucket to another. This feature is called bucket replication. Once bucket replication is configured, most new objects are copied into the destination bucket within 15 minutes. 🚀

What are the benefits?

Reliable and fast data delivery processes. Syncing data between buckets is entirely managed by AWS. Error-prone scripts that run on a schedule and manual syncing processes are eliminated.

Built-in auditing and monitoring. S3 can publish replication event notifications that record exactly which files were copied over and when, and CloudWatch metrics track data volume.

Granular control of data being copied. Replication rules can send copies of data under a specific prefix to one or more buckets.

What problems does it solve?

Simplifies data distribution between one or many AWS accounts. Replication supports many-to-many relationships, regardless of AWS account or region. For example, you could have one bucket with several replication rules copying data over to several destination buckets.
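As a rough illustration of one bucket fanning out to several destinations, here is a minimal sketch that builds a replication configuration in the shape the S3 PutBucketReplication API expects. The helper name, the (prefix, destination) input format, the role ARN, and the bucket names are all hypothetical and are not taken from our construct.

```python
def build_replication_configuration(role_arn, routes):
    """Build a dict shaped like an S3 PutBucketReplication request body.

    routes is a list of (prefix, destination_bucket_arn) pairs. This input
    format is a hypothetical convenience, not part of any AWS API.
    """
    rules = []
    for priority, (prefix, destination_arn) in enumerate(routes, start=1):
        rules.append(
            {
                "ID": f"replicate-{priority}",
                # Rules that use Filter also need a unique Priority
                # and an explicit DeleteMarkerReplication setting.
                "Priority": priority,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_arn},
            }
        )
    return {"Role": role_arn, "Rules": rules}


# Placeholder ARNs for illustration only.
config = build_replication_configuration(
    role_arn="arn:aws:iam::111111111111:role/example-replication-role",
    routes=[
        ("exports/partner-a/", "arn:aws:s3:::example-partner-a-bucket"),
        ("exports/partner-b/", "arn:aws:s3:::example-partner-b-bucket"),
    ],
)
```

A dict like this could be passed as the ReplicationConfiguration argument of boto3's put_bucket_replication, or translated into the equivalent CDK or CloudFormation properties.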

Eliminates object-level permission issues. With the owner override option enabled, S3 gives the account that owns the destination bucket full ownership of the replicated data.

Improves data security posture. The replication process uses role-based access to replicate data, removing the need to manage long-lived IAM access keys.

What are the limitations?

Difficult to sync existing data. By default, bucket replication applies only to data written after it is enabled. The easiest way to copy the data already in the bucket is to run the traditional aws s3 sync command. The S3 team can also sync existing data if you contact AWS support, but this can take weeks.

Does not integrate with other cloud providers. It is challenging to rely solely on bucket replication for data ingestion or delivery when working with non-AWS cloud providers.

Limited in mainland China regions. Replicating data between a mainland China region and a region outside of China will not work.

Custom IAM role for advanced setups. Since bucket replication supports copying over object-level tags and KMS encrypted objects, the IAM role used with this feature needs to be customized to have sufficient access.
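To make the role customization more concrete, here is a minimal sketch of the trust policy and permissions policy a replication role typically needs. The function name, bucket names, and key ARNs are hypothetical, and the KMS statements are simplified (production policies usually add conditions to the KMS permissions), so treat this as a starting point and not as our exact policy.

```python
def build_replication_role_policies(
    source_bucket,
    destination_bucket,
    source_kms_key_arn=None,
    destination_kms_key_arn=None,
):
    """Return (trust_policy, permissions_policy) as plain dicts."""
    # S3 assumes the role on your behalf, so the service is the principal.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    statements = [
        {   # Read the replication setup and list the source bucket.
            "Effect": "Allow",
            "Action": ["s3:GetReplicationConfiguration", "s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{source_bucket}",
        },
        {   # Read object versions, including their ACLs and tags.
            "Effect": "Allow",
            "Action": [
                "s3:GetObjectVersionForReplication",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectVersionTagging",
            ],
            "Resource": f"arn:aws:s3:::{source_bucket}/*",
        },
        {   # Write replicas, delete markers, and tags to the destination.
            "Effect": "Allow",
            "Action": ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"],
            "Resource": f"arn:aws:s3:::{destination_bucket}/*",
        },
    ]
    # Only needed when objects are encrypted with KMS keys (simplified).
    if source_kms_key_arn:
        statements.append(
            {"Effect": "Allow", "Action": "kms:Decrypt", "Resource": source_kms_key_arn}
        )
    if destination_kms_key_arn:
        statements.append(
            {"Effect": "Allow", "Action": "kms:Encrypt", "Resource": destination_kms_key_arn}
        )
    return trust_policy, {"Version": "2012-10-17", "Statement": statements}


# Placeholder bucket names for illustration only.
trust, permissions = build_replication_role_policies(
    "example-source-bucket", "example-destination-bucket"
)
```

Keeping the KMS statements optional means the basic setup stays small, and the extra access is only granted when encrypted objects are actually in play.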

Buckets need to be versioned. Bucket replication will not work unless both the source and destination buckets have versioning enabled. Over time, keeping multiple versions of objects could lead to unexpected storage costs.
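One common way to keep version-related costs in check is a lifecycle rule that expires noncurrent versions. The sketch below builds the versioning and lifecycle configurations as plain dicts in the shape the S3 API uses. The 30-day retention period and the helper names are assumptions chosen for illustration, not a recommendation from our setup.

```python
def build_versioning_configuration():
    """Versioning must be enabled before replication can be configured."""
    return {"Status": "Enabled"}


def build_noncurrent_version_expiration(days, prefix=""):
    """Lifecycle configuration that deletes noncurrent versions after `days`.

    An empty prefix applies the rule to every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


# 30 days is an arbitrary example value.
versioning = build_versioning_configuration()
lifecycle = build_noncurrent_version_expiration(days=30)
```

These dicts match the VersioningConfiguration and LifecycleConfiguration arguments of boto3's put_bucket_versioning and put_bucket_lifecycle_configuration, and the same settings can be expressed as bucket properties in CDK.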

How to set it up with AWS CDK

CDK codifies AWS resources and provides an interface to generate and deploy these resources into an AWS account.

We standardize our infrastructure using custom constructs that are fit for our business use-cases. In this case, we set up a construct to implement an S3 bucket with replication.

Use this construct when you need to transfer data to another bucket, or when you want to allow data to be replicated into your bucket.

Create a bucket that copies over data to another bucket using CDK.
Create a bucket that allows another bucket to replicate data into it using CDK.
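For readers without access to the embedded snippets, here is a rough sketch of what the two sides boil down to. It is written as plain Python dicts in CloudFormation and IAM policy shape, which is roughly what a CDK app (for example via the low-level CfnBucket resource) would need to emit. It is not our construct: the function names, bucket names, account ID, and role ARN are hypothetical placeholders.

```python
def source_bucket_resource(bucket_name, role_arn, destination_bucket_arn,
                           destination_account_id, prefix=""):
    """CloudFormation-style AWS::S3::Bucket that replicates to another bucket."""
    return {
        "Type": "AWS::S3::Bucket",
        "Properties": {
            "BucketName": bucket_name,
            # Replication requires versioning on the source bucket.
            "VersioningConfiguration": {"Status": "Enabled"},
            "ReplicationConfiguration": {
                "Role": role_arn,
                "Rules": [
                    {
                        "Id": "replicate-to-destination",
                        "Status": "Enabled",
                        "Prefix": prefix,  # empty prefix replicates everything
                        "Destination": {
                            "Bucket": destination_bucket_arn,
                            "Account": destination_account_id,
                            # Owner override: replicas belong to the destination account.
                            "AccessControlTranslation": {"Owner": "Destination"},
                        },
                    }
                ],
            },
        },
    }


def destination_bucket_policy(destination_bucket, source_role_arn):
    """Bucket policy letting the source account's role replicate into the bucket."""
    principal = {"AWS": source_role_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReplicationWrites",
                "Effect": "Allow",
                "Principal": principal,
                "Action": [
                    "s3:ReplicateObject",
                    "s3:ReplicateDelete",
                    "s3:ReplicateTags",
                    "s3:ObjectOwnerOverrideToBucketOwner",
                ],
                "Resource": f"arn:aws:s3:::{destination_bucket}/*",
            },
            {
                "Sid": "AllowVersioningChecks",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetBucketVersioning", "s3:PutBucketVersioning"],
                "Resource": f"arn:aws:s3:::{destination_bucket}",
            },
        ],
    }


# Placeholder identifiers for illustration only.
ROLE_ARN = "arn:aws:iam::111111111111:role/example-replication-role"
source = source_bucket_resource(
    "example-source-bucket", ROLE_ARN,
    "arn:aws:s3:::example-destination-bucket", "222222222222",
)
policy = destination_bucket_policy("example-destination-bucket", ROLE_ARN)
```

The destination bucket also needs versioning enabled. In a cross-account setup, each side is deployed by the team that owns the corresponding account.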

Here are the full details of how we implemented the construct.

CDK construct that standardizes bucket replication setup.

Bucket replication’s impact at YipitData

Reduced processing time and costs of data ingestion pipelines because new data lands in our bucket as soon as it is written by the upstream service. No more nightly cronjobs running aws s3 sync 😃.

Allows us to work with new data as it’s available by dynamically starting transformations as soon as new data arrives. Pipelines are no longer concerned with loading the data into our lake, and are instead focused on shaping the data as it lands. This improves the velocity at which we can derive insights.
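One way to start work as data lands is to react to S3 event notifications on the destination bucket. The sketch below assumes that mechanism (this post does not describe how our transformations are actually triggered) and only shows how to extract the new object locations from an event payload. The function name and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def new_object_locations(event):
    """Return s3:// URIs for objects created in an S3 event notification."""
    locations = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces encoded as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        locations.append(f"s3://{bucket}/{key}")
    return locations


# Hand-written sample payload, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-destination-bucket"},
                "object": {"key": "exports/daily+report.csv"},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-destination-bucket"},
                "object": {"key": "exports/old.csv"},
            },
        },
    ]
}
locations = new_object_locations(sample_event)
```

A handler like this could hand each location to whatever starts the transformation, so the pipeline code never has to poll the bucket.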

Adopted the technology at a fast pace by configuring buckets using AWS CDK.

Huge thanks to Bobby Muldoon, Jim Shields, Anup Segu, Annie Holladay and Hugo Lopes Tavares for their thoughtful reviews.

Lastly, we are hiring! Check out our open roles.
