Building a Scalable Multi-DataSource ETL Platform with AWS Glue, DuckDB+WASM

Explore a high-level system design for a multi-data source ETL platform using AWS Glue, DuckDB and WebAssembly.

Tuhin Banerjee
Cloud Native Daily
6 min readJul 30, 2023

--

Introduction

In the age of big data, organizations are dealing with vast amounts of information coming from various sources, ranging from databases to data lakes. Extract, Transform, and Load (ETL) processes play a critical role in consolidating and preparing this data for analytics and business intelligence purposes. AWS Glue, a serverless ETL service offered by Amazon Web Services (AWS), simplifies the process of data integration and transformation.

In this blog post, we’ll explore a high-level system design for a multi-data source ETL platform using AWS Glue. This architecture enables you to efficiently extract data from diverse sources, transform it, and load it into a central data store for further analysis.

System Architecture Overview

1. Data Sources:
The ETL process starts by ingesting data from multiple sources. These sources can include Amazon S3, databases (like PostgreSQL or others hosted on AWS RDS), data warehouses, APIs, or any other data storage systems.

2. AWS Glue:
AWS Glue is a fully managed and serverless ETL service provided by AWS. It allows you to discover, catalog, and transform data from various sources without the need to manage infrastructure. Glue supports a variety of data formats and provides tools to handle schema changes, making it suitable for dynamic data environments.

3. Data Transformation:
Once data is ingested, AWS Glue performs the data transformation process. This step involves cleaning, enriching, and aggregating data as required by the target data model or analytics needs. Glue provides a visual ETL workflow that allows you to build, schedule, and monitor ETL jobs efficiently.

4. Intermediate Data Storage:
During the transformation process, it’s common to store intermediate results. AWS Glue can store these interim datasets in Amazon S3, ensuring data integrity and facilitating auditing and debugging.

5. Data Loading:
After data transformation, AWS Glue loads the processed data into the target data store. Depending on your use case, this data store can be another S3 bucket, an AWS Redshift data warehouse, Amazon RDS, or any other destination supported by AWS Glue.

6. Cataloging with AWS Glue Data Catalog:
AWS Glue provides a Data Catalog service, which automatically crawls and catalogs metadata about data assets in various data stores. The Data Catalog stores table definitions, statistics, and schema information, providing a centralized location to query and discover data assets.

7. Querying with AWS Athena:
AWS Athena is a serverless query service that allows you to perform ad-hoc SQL queries directly on data stored in Amazon S3. By utilizing the metadata cataloged by AWS Glue, data scientists and analysts can easily query and analyze the transformed data using familiar SQL statements without the need to move data or provision servers.

Benefits of the Architecture

1. Scalability and Flexibility:
The use of AWS Glue, a serverless ETL service, ensures automatic scalability and flexibility to handle varying data workloads. As your data sources grow or change, Glue can seamlessly adapt to accommodate these changes.

2. Cost Efficiency:
By adopting serverless services like AWS Glue and Athena, you only pay for the resources consumed during the data processing and querying operations, eliminating the need for upfront infrastructure investment.

3. Simplified Maintenance:
AWS Glue abstracts away much of the underlying infrastructure management, reducing the operational burden on your team. Additionally, the automatic cataloging of data assets simplifies the data discovery process for your data scientists.

4. Real-time Data Availability:
With periodic triggers, the ETL pipeline can run at predefined intervals, ensuring that the data in the target data store is always up-to-date.

Once you have successfully reached the GOLD data stage in your ETL pipeline, the curated and optimized dataset in the form of Parquet files is now ready for further analysis and exploration. In this section, we will introduce two powerful technologies, DuckDB and WebAssembly (WASM), that can be leveraged as part of your data analytics stack to dynamically transform, slice, and dice your data efficiently.

What is DuckDB?

DuckDB is an open-source analytical database management system that excels at analytical queries on large datasets. It is designed to deliver high-performance and low-latency responses, making it an ideal candidate for interactive data analysis tasks. DuckDB is characterized by its ability to efficiently handle complex analytical queries, including joins, aggregations, and window functions while consuming minimal resources.

Key features of DuckDB include:

1. Columnar Storage: DuckDB stores data in a columnar format, which is highly optimized for analytical workloads. This storage layout allows for efficient compression and faster query performance.

2. Vectorized Execution: DuckDB employs vectorized query execution, processing data in batches, which significantly reduces CPU overhead and enhances query performance.

3. In-Memory Processing: By keeping frequently accessed data in memory, DuckDB minimizes disk I/O, resulting in faster query processing times.

4. SQL Compatibility: DuckDB supports standard SQL syntax, making it easy for data analysts and SQL-savvy users to interact with the database.

By utilizing DuckDB as part of your data analytics stack, you can perform complex analytical queries on your GOLD data with remarkable speed and efficiency, enabling quick exploration and analysis of large datasets.

What is WebAssembly (WASM)?

WebAssembly is a binary instruction format designed to enable high-performance execution on web browsers. However, its capabilities extend beyond the web domain, and it can be utilized in various other contexts, including data analytics and processing. WebAssembly allows developers to run code written in multiple programming languages at near-native speeds, making it an attractive option for compute-intensive tasks, such as data analysis.

Key features of WebAssembly include:

1. Portability: Web assembly code can run on multiple platforms, including web browsers, servers, and edge devices, enabling seamless execution across various environments.

2. Language Agnostic: WebAssembly supports multiple programming languages, including C, C++, Rust, and more, giving developers the flexibility to choose their preferred language for computation.

3. Security: WebAssembly runs in a sandboxed environment, ensuring code safety and mitigating potential security risks associated with executing untrusted code.

When combined with DuckDB, WebAssembly can be used to create data analytics applications that perform complex computations efficiently, even in resource-constrained environments. WebAssembly-based applications can leverage the power of DuckDB for querying data stored in Parquet files, enabling dynamic transformation, slicing, and dicing of data in real-time, right in the user’s web browser or other runtime environments.

Sharing some inspirations to build the app

Conclusion

Building a robust multi-data source ETL platform using AWS Glue empowers organizations to efficiently manage their data integration and transformation needs. The serverless nature of Glue ensures scalability, flexibility, and cost efficiency, while the AWS Glue Data Catalog and Athena enable data scientists and analysts to explore and analyze data without the complexities of managing the underlying infrastructure.

By adopting this high-level system design, organizations can focus on deriving valuable insights from their data, enhancing decision-making processes, and staying ahead in today’s data-driven world.

Furthermore, Integrating DuckDB and WebAssembly into your data analytics stack enhances your ability to perform interactive and complex data analysis tasks on the GOLD data stage. DuckDB’s speed and efficiency in handling analytical queries combined with WebAssembly’s portability and performance make for a powerful duo. With this technology stack in place, your data analysts and users can dynamically explore, transform, and analyze data with ease, empowering them to derive valuable insights and make data-driven decisions effectively.

I would love to hear from you if you are building in the DuckDB ecosystem.

--

--

Tuhin Banerjee
Cloud Native Daily

Product Manager with a history of building large-scale enterprise applications.