So What’s All This Talk about Zero-ETL Integration?

Anjan Banerjee
HCLTech-Starschema Blog
6 min read · Dec 13, 2022


Those who followed AWS’s re:Invent 2022 event may have caught multiple mentions of zero-ETL integration, but you’d be forgiven for not knowing what zero-ETL is and, more importantly, what it means for your organization. If zero-ETL happens to be what your data pipeline needs right now, it’s best to get up to speed as soon as possible, so join me for a crash course on this concept, complete with its benefits and drawbacks.


What is Zero-ETL?

Zero-ETL integration is a type of data integration that doesn’t involve the use of the conventional extract, transform and load (ETL) processes. In a zero-ETL setup, data is transferred directly from one system to another without the need for any intermediary steps to transform or clean the data. This approach can be useful in situations where data needs to be transferred quickly and efficiently between systems, without the need for complex data transformation or manipulation.

In this sense, a data replication tool can be considered a zero-ETL tool: it transfers data in near real time without any intermediate processing or manipulation.

As I mentioned in my post on data engineering trends for 2023, we at Starschema have been looking forward to how Snowflake’s Unistore or multi-model databases will evolve. With the introduction of zero-ETL integration between Amazon Aurora and Amazon Redshift, AWS is taking steps in the same direction — so let’s take a deep dive into what exactly they announced at this year’s re:Invent.

How is AWS Implementing Its Flavor of Unistore?

Amazon Aurora is a managed relational database service designed to provide the performance and availability that OLTP systems need. It is compatible with MySQL and PostgreSQL and excels at storing and managing data for applications that handle large volumes of transactions. Aurora is built to scale with workloads of any size, making it a solid choice for applications that need to process and store large amounts of data.

Amazon Redshift, on the other hand, is a fully managed data warehouse solution. It’s designed to make it easy and cost-effective to store and analyze large amounts of data using SQL and to serve common business intelligence (BI) tools, which makes it well suited for applications that require a scalable, high-performance warehouse for data analysis and reporting (OLAP).

To bridge the gap between these two systems, AWS introduced zero-ETL integration between Aurora and Redshift. Users can now run complex analytical queries in Redshift seconds after the data is written to the Aurora database, which drastically reduces the time it takes to process the data and ready it for the presentation layer.

Near-real-time replication without manual intervention
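
To make the flow concrete, here is a minimal sketch, assuming a zero-ETL integration between an Aurora MySQL cluster and a Redshift Serverless workgroup has already been configured in the AWS console. All host names, credentials, database, table and column names are hypothetical placeholders; the point is simply that a record written to Aurora becomes queryable from Redshift without any ETL job in between.

```python
# Minimal sketch: write to Aurora MySQL, then query the replicated data in
# Redshift via the Redshift Data API. The zero-ETL integration itself is
# assumed to already exist; all identifiers below are hypothetical.
import time

import boto3
import pymysql

# 1) Write a transactional record to Aurora (the OLTP side).
aurora = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="app_user",
    password="app_password",
    database="sales",
)
with aurora.cursor() as cur:
    cur.execute(
        "INSERT INTO orders (order_id, customer_id, amount) VALUES (%s, %s, %s)",
        (1001, 42, 99.90),
    )
aurora.commit()

# 2) Seconds later, run an analytical query against the replicated table in
#    Redshift Serverless -- no ETL job or staging area in between.
time.sleep(15)  # give the zero-ETL replication a moment to catch up
rsd = boto3.client("redshift-data", region_name="us-east-1")
stmt = rsd.execute_statement(
    WorkgroupName="analytics-wg",  # hypothetical Redshift Serverless workgroup
    Database="dev",
    Sql="SELECT customer_id, SUM(amount) FROM sales.orders GROUP BY customer_id;",
)

# Poll until the statement finishes, then fetch the result set.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED"):
    time.sleep(1)
for row in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print(row)
```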

Another integration announced at re:Invent is between Apache Spark and Amazon Redshift, and it promises to let data engineers and data scientists run complex Spark jobs directly against Redshift data.

Traditionally, developers had to unload data from Redshift into S3 before they could work with it in Spark on EMR or SageMaker. With this integration, no hand-managed intermediary data stage is necessary: developers can connect to Redshift directly to build their machine-learning applications and work with near-real-time data.
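
As a rough illustration, the PySpark sketch below reads a Redshift table straight into a DataFrame using the community spark-redshift connector that the new integration builds on. The JDBC URL, IAM role, table name and S3 scratch location are hypothetical, and the connector handles staging transparently instead of requiring a hand-rolled unload step.

```python
# Hedged sketch: pull a Redshift table into Spark for ML feature prep.
# Assumes the spark-redshift connector is on the classpath; all
# identifiers (URL, role, bucket, table) are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-feature-prep").getOrCreate()

orders = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://analytics-wg.xxxx.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "sales.orders")
    .option("tempdir", "s3://my-temp-bucket/redshift-spark/")  # scratch space managed by the connector
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .load()
)

# Feature engineering for a downstream ML job, entirely in Spark.
features = orders.groupBy("customer_id").agg({"amount": "avg", "order_id": "count"})
features.show()
```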

Exciting propositions for sure, and we can’t wait to test out these solutions with real-world data sets.

The Competition

Across the board, we’re observing all major cloud warehousing vendors providing their own flavor of zero-ETL and multi-model solutions.

In addition to Snowflake Unistore and Amazon’s zero-ETL, Microsoft is pushing towards a unified cloud platform offering with Azure Synapse Analytics, while Google Cloud Platform announced its own version of zero-ETL earlier this year. Mirroring Aurora (OLTP) and Redshift (OLAP), GCP offers Cloud Bigtable (OLTP, NoSQL) and BigQuery, its data warehousing solution (OLAP). Google released the ability to run federated queries on BigQuery against data sitting in Bigtable via external tables. Querying the data directly with BigQuery SQL removes the need for ETL tooling such as Google Cloud Dataflow to migrate data from Bigtable into BigQuery.
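
Here is a minimal sketch of that GCP pattern, assuming an external table backed by a Bigtable source has already been defined in BigQuery; the project, dataset and table names are hypothetical.

```python
# Sketch: query Bigtable data from BigQuery through an external table, so no
# Dataflow job has to move the data first. The external table
# `analytics.events_external` over Bigtable is assumed to exist already.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Standard BigQuery SQL; the external table's source is Bigtable, so the
# query is federated at run time.
sql = """
    SELECT rowkey, COUNT(*) AS event_count
    FROM `my-gcp-project.analytics.events_external`
    GROUP BY rowkey
    ORDER BY event_count DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.rowkey, row.event_count)
```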

The Good and Bad of Zero-ETL

Now that we’ve established that zero-ETL is having a moment among industry leaders, let’s discuss its benefits and drawbacks.

The Good:

  • Speed: Since zero-ETL integration does not involve any data transformation or manipulation, it can be faster than traditional ETL processes. This can be especially useful in situations where real-time data transfer is important.
  • Simplicity: Zero-ETL integration is often simpler to implement and maintain than traditional ETL processes. This is because it does not require any complex data transformation or manipulation and can be set up quickly and easily.
  • Cost savings: Because zero-ETL integration is typically faster and simpler to implement than traditional ETL processes, it can help to reduce the overall cost of data integration. This can be especially important for organizations with limited budgets or resources.
  • Improved data quality: By eliminating the need for data transformation and manipulation, zero-ETL integration can help to improve the overall quality of the transferred data. This can be a big plus in situations where data accuracy is critical.

The Bad:

  • Limited data transformation capabilities: Because zero-ETL integration involves transferring data directly from one system to another without any intermediate steps, it can be difficult to perform complex data transformations. This can be a problem in situations where the data needs to be cleaned, standardized or otherwise manipulated before it gets transferred.
  • Lack of data governance: Traditional ETL processes often include built-in controls and safeguards to ensure the quality and integrity of the data being transferred. Zero-ETL integration, on the other hand, relies on the systems involved in the transfer to handle these tasks. This can make it more difficult to ensure that the transferred data is accurate and reliable.
  • Inability to integrate with other systems: Because zero-ETL integration relies on direct data transfer, it’s not possible to use it to integrate with other systems outside the ecosystem. This limits the flexibility and versatility of the integration process.

Clearly, zero-ETL is a major step within the industry-wide effort to provide a universal data platform to solve both transactional speed requirements and analytical requirements. But its advantages still carry a set of tradeoffs that make its overall effectiveness largely dependent on the user’s environment and needs.

About the author

Anjan Banerjee is the Field CTO of Starschema. He has extensive experience in building data orchestration pipelines, designing multiple cloud-native solutions and solving business-critical problems for multinational companies. Anjan applies the concept of infrastructure as code as a means to increase the speed, consistency, and accuracy of cloud deployments. Connect with Anjan on LinkedIn.
