Making Room for Zero-ETL in Your Data Engineering Toolbelt

The benefits of automating data pipeline maintenance

Rob Koch
Slalom Build
6 min read · Jun 11, 2024

Scenario—your data pipeline is fragile. It breaks every so often. Something changed, but what? It times out occasionally. Why is today’s run taking so long? Whether you are using ELT (extract, load, transform) or ETL (extract, transform, load) within a data-driven organization, these are issues that come up often.

What Is Zero-ETL?

We have a new kid on the block in the cloud data world: zero-ETL, a data replication service from Amazon Web Services (AWS), and I want to figure out if it has staying power. In this article, I will walk through a typical data flow, identify the drawbacks at each step, and introduce the benefits of the AWS zero-ETL service. Zero-ETL benefits those who have hundreds of data pipelines and want to simplify their processes, whittling that number down to a few, or to zero.

Let us start by decomposing the service name itself. The ETL portion of zero-ETL stands for “extract, transform, load.” ETL is a typical data flow: taking data from its raw state, transforming it into a more polished form, and storing it in a way that is usable for consumption. The ETL pattern is often seen in relational database shops. Flipping the letters T and L around gives us the ELT pattern, which differs from ETL in that we extract the data, load it into secondary storage, and then transform it as needed for consumption; ELT is commonly used in data warehouse or data lake architectures. The “zero” portion is the magic of the service: we are eliminating the pipelines you would otherwise have to maintain. You set it up once and it is done.

The AWS flavor of zero-ETL (as of H1 2024) uses Aurora (MySQL or PostgreSQL) and DynamoDB as the source and Redshift Serverless as the destination. Using this tutorial, you can follow along to see what is involved in implementing a zero-ETL integration. It makes the entire process feel more like zero-EL (not a misprint; no T), since no transformations are included. Because there is no transform capability [yet?], we rely on Redshift’s massively parallel processing (MPP) to generate analytics on your data quickly, without too much ceremony. Regardless of whether zero-ETL should be named zero-ELT or zero-EL, let us see how a data flow (Collection > Augment > Transform > Load) would work.
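To make that concrete, here is a minimal sketch of what creating an integration can look like programmatically, assuming boto3's RDS CreateIntegration API. The ARNs, names, and region are placeholders, and the prerequisite steps from the tutorial (such as authorizing the target Redshift namespace) still apply.

```python
# A minimal sketch, assuming boto3's RDS CreateIntegration API.
# All ARNs and names below are placeholders, not values from the article.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_integration(
    IntegrationName="orders-to-analytics",  # hypothetical name
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:orders-aurora",
    TargetArn="arn:aws:redshift-serverless:us-east-1:123456789012:namespace/example-namespace-id",
)

# The integration starts out in a "creating" state and becomes active once the
# initial seed to Redshift finishes.
print(response["Status"])
```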

The data gets extracted using a scraping system that runs on some proprietary platform. As the data gets processed and transformed into the structure we want, it is stored in the relational database (in our case, a MySQL server). Next, we need to do some serious aggregations (read: heavy compute), and that kind of workload could bring MySQL to its knees. This is where zero-ETL comes in: it moves the data from MySQL to Redshift, and we can use the Redshift engine to do the analysis.
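Once the tables have replicated, the heavy aggregation can run on Redshift rather than MySQL. Here is a hedged sketch using the Redshift Data API; the workgroup, database, and table names are hypothetical.

```python
# A sketch of offloading the aggregation to Redshift Serverless via the
# Redshift Data API. Workgroup, database, table, and column names are
# hypothetical stand-ins for whatever the Aurora source replicated over.
import boto3

redshift_data = boto3.client("redshift-data")

resp = redshift_data.execute_statement(
    WorkgroupName="analytics-wg",   # Redshift Serverless workgroup (placeholder)
    Database="dev",
    Sql="""
        SELECT product_id,
               DATE_TRUNC('day', scraped_at) AS day,
               COUNT(*)   AS listings,
               AVG(price) AS avg_price
        FROM scraped_listings        -- table replicated from Aurora MySQL
        GROUP BY 1, 2
        ORDER BY day DESC;
    """,
)

# The call is asynchronous; poll describe_statement and fetch rows with
# get_statement_result using the returned statement id.
print(resp["Id"])
```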

When we do our data transforms the traditional ETL way, using external systems or vendor products, there is always that little security factor to consider. Vendor products might be sending telemetry home (some say they use the information to make improvements!), and for the risk-averse, or for anyone in a regulated industry, that can be a big no-no. With zero-ETL, no data passes through any external system, which is reassuring for those with security concerns. The entire time the zero-ETL process runs, just as with Aurora itself, the data is encrypted at rest and in transit to Redshift.

Use Cases for Zero-ETL

We can use zero-ETL to pull data from multiple databases into one Redshift instance. This approach gives you near-instant analytics against transactions that are coming in hot. The complexity of managing multiple ETL processes is replaced by one simple dashboard showing all the integrations and their status. There are, however, some cases where AWS will reseed a table, i.e., rewrite the entire table's data on Redshift (a sketch of such schema changes follows the list):

  • Running a statement that adds and renames columns at the same time
  • Adding a column with a default of CURRENT_TIMESTAMP
  • Adding a column that is not at the end of the table
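To illustrate the difference, here is a hypothetical sketch (using the pymysql driver against the Aurora MySQL source) of schema changes that, per the list above, would force a full reseed versus one that should replicate incrementally. Connection details, table, and column names are placeholders.

```python
# Hypothetical schema changes on the Aurora MySQL source. Which ones force a
# reseed follows the list above; connection details are placeholders.
import pymysql

conn = pymysql.connect(
    host="orders-aurora.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin", password="change-me", database="shop",
)
with conn.cursor() as cur:
    # Plain column append at the end of the table: replicates incrementally.
    cur.execute("ALTER TABLE orders ADD COLUMN note VARCHAR(255)")

    # Adds and renames columns in the same statement: expect a full reseed.
    cur.execute(
        "ALTER TABLE orders ADD COLUMN region VARCHAR(32), "
        "RENAME COLUMN note TO customer_note"
    )

    # New column with DEFAULT CURRENT_TIMESTAMP: expect a full reseed.
    cur.execute(
        "ALTER TABLE orders ADD COLUMN updated_at TIMESTAMP "
        "DEFAULT CURRENT_TIMESTAMP"
    )

    # New column placed in the middle of the table: expect a full reseed.
    cur.execute("ALTER TABLE orders ADD COLUMN sku VARCHAR(64) AFTER order_id")
conn.commit()
```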

Another plus of the new zero-ETL service is that AWS absorbs the cost of transferring the data from point A to point B, in this case from Aurora to Redshift. No compute cycles on your side are needed to move the data over; whatever Aurora has will appear on Redshift. Because of multi-tenancy, and because every change is processed via CDC (change data capture), it takes some time for the data to appear, ranging from milliseconds to just under a minute. And when there are no more data pipelines to maintain and troubleshoot, data engineers are freed up to focus on other tasks, such as writing more optimized Redshift queries.
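With no scheduler or orchestration UI left to refresh, checking on replication health mostly comes down to the integration's status. A small sketch, assuming boto3's RDS DescribeIntegrations API:

```python
# A small sketch, assuming boto3's RDS DescribeIntegrations API, for checking
# each zero-ETL integration's status instead of watching a pipeline scheduler.
import boto3

rds = boto3.client("rds")

for integration in rds.describe_integrations()["Integrations"]:
    # Status moves through states such as "creating" and "active"; anything
    # unhealthy surfaces here rather than in a failed cron job.
    print(integration["IntegrationName"], integration["Status"])
```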

The benefits of zero-ETL are a tremendous boon for any data shop. It is easy and reliable to run, AWS maintains the data transfers, the transfers are low latency, and the insights behind your data visualizations show up almost instantly. There is no scheduler to maintain, no cron jobs, no refreshing some UI (user interface) wondering when your orchestration job will kick off. This is automation at its finest, and it is a common objective for any shop maintaining data pipelines.

In talking with the zero-ETL team, I learned they have more things up their sleeves, though much has not been disclosed yet. I can see this tool expanding to incorporate more database systems over the next few years, such as DocumentDB, S3, and whatever the AWS DMS (Database Migration Service) product supports. I also anticipate gaining the ability to customize how replication works, and I expect the team to keep minimizing the scenarios in which reseeds happen. I am excited to see zero-ETL used more out in the wild, getting us out of the data pipeline maintenance business.

As one example of a pain point with the more traditional approach: if we were to copy data from DynamoDB to Redshift, we would need to set up DynamoDB Streams or Kinesis Data Streams plus a Lambda to parse the JSON and copy the contents to S3, and then use Redshift Spectrum to read the S3 data. It is a given that we jump through a few hoops during the setup (a rough sketch follows). And if the flow were disrupted, even with dead-letter queues set up, there is no guarantee we could recover seamlessly. This is one of many ETL pain points I am hoping zero-ETL solves.
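For a sense of those hoops, here is a rough sketch of just the Lambda piece: a handler that parses DynamoDB stream records and lands them in S3 as JSON lines for Redshift Spectrum to read. The bucket and key prefix are hypothetical, and real code would still need batching, retries, and the dead-letter queue mentioned above.

```python
# A rough sketch of the "traditional" middle step: a Lambda handler that parses
# DynamoDB stream records and writes them to S3 as JSON lines. Bucket and key
# prefix are placeholders; error handling and batching are omitted.
import json
import uuid

import boto3
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
deserializer = TypeDeserializer()

def handler(event, context):
    rows = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Convert DynamoDB's typed attributes into plain Python values.
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            rows.append(json.dumps(item, default=str))  # Decimal -> str
    if rows:
        s3.put_object(
            Bucket="analytics-landing-bucket",            # placeholder bucket
            Key=f"dynamodb-export/{uuid.uuid4()}.json",
            Body="\n".join(rows).encode("utf-8"),
        )
    return {"processed": len(rows)}
```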

Looking Ahead for Zero-ETL

In the future, I would also like to see zero-ETL provide the ability to filter certain records out of the replication to Redshift. In some scenarios, it does not make sense to have massive historical data duplicated into various systems. The cost impact is negligible, since Redshift separates compute and storage, so storage costs stay low; but as much as I would like to delete some unnecessary data on Redshift, that is not possible without deleting the source data and resyncing, which would bring the entire table over again. It might be strategic to filter and copy data from one Aurora table to another before setting up the integration (a rough sketch follows); however, that would require another pipeline, which is exactly what we want to avoid!
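For completeness, here is a hedged sketch of that workaround: pre-filtering rows into a separate Aurora table so only the trimmed copy ends up replicated. Database, table, and column names are hypothetical, and as noted, this reintroduces a small pipeline of its own.

```python
# A hedged sketch of the filter-then-copy workaround using pymysql. The
# databases, tables, and retention window are hypothetical; the target table
# is assumed to have the same structure as the source.
import pymysql

conn = pymysql.connect(
    host="orders-aurora.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin", password="change-me", database="shop",
)
with conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO shop_analytics.orders_recent
        SELECT * FROM shop.orders
        WHERE created_at >= NOW() - INTERVAL 90 DAY
        """
    )
conn.commit()
```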

Update: The Aurora MySQL and Aurora PostgreSQL to Redshift zero-ETL integrations are considered generally available. Zero-ETL from DynamoDB to Redshift is in limited preview. There is also a zero-ETL integration between DynamoDB and Amazon OpenSearch Service.

If you are looking for help implementing AWS zero-ETL or have a database migration need, please feel free to reach out to Slalom or to me on LinkedIn and we can get a conversation going.

