A “Less Server” Data Infrastructure Solution for Ingestion and Transformation Pipelines — Part 2
Written by Michael Triska — Data Engineer at AMARO
How Serverless Architecture Patterns and Services Like Snowflake, AWS Glue, and AWS Fargate Ushered in a New Period in ETL Development.
Development, debugging, deployment, delivery, and decommissioning make up the major portion of any ETL process. In other words, it takes operational skills to set up and sustain optimized infrastructure for deploying and running data pipelines. This can be especially straining for start-ups and small teams with limited manpower. Setup and maintenance are particularly hard on data specialists, because they take time away from their core objectives.
In the first post of this series, we introduced our new data infrastructure built on serverless services with AWS Step Functions to get rid of the problems mentioned above. This post discusses how to integrate long-running data ingestion and transformation pipelines while remaining serverless, and how Snowflake, the AWS ECS Fargate launch type, and AWS Glue fit into your data lake environment.
Snowflake Does Not Require Significant Engineering
The concept of a data lake is frequently misused and overused. As Fowler (2015) explains:
“The idea is to have a single store for all of the raw data that anyone in an organization might need to analyze. […] There are no assumptions about the schema of the data, each data source can use whatever schema it likes. It’s up to the consumers of that data to make sense of that data for their own purposes.”
Picture 1: Data Warehouse vs. Data Lake. Fowler (2015).
A widely practiced solution is to use a cloud-based object store like Amazon S3 as the basis for a data lake, as it has many advantages in terms of separation of compute from storage, scale, reliability, and cost-effectiveness. Using S3 as your centralized data repository seems simple, but during our data architecture design process we learned that building a data lake on S3 comes with many management tasks: replication, disaster recovery, partitioning, clustering, file naming conventions, security, and the problem of storing small files¹. These led to many discussions within the team to reach alignment. S3 proved even more challenging for incremental ETL processes that update and delete data at the record level. AWS EMR release 5.28.0, which includes Apache Hudi, might be a possible solution for this issue, but it requires a highly technical skill set.
However, we learned that Snowflake, used as a data lake solution, can provide a level of abstraction over our data in terms of scalability and separation of compute and storage, which avoids over-provisioning. Unlike a Hadoop or Redshift solution, Snowflake keeps data storage entirely separate from compute, which means it’s possible to dynamically increase or reduce cluster size. Moreover, Snowflake’s storage capacity is unlimited, it supports a wide variety of data formats such as JSON, Parquet, and Avro, and it can query both structured and semi-structured data within SQL. Another key feature is its ingestion technique called Snowpipe.
But most importantly, unlike common technologies such as S3 or HDFS, Snowflake does not require significant engineering effort or a specialized skill set to roll out and use.
So, How Does Fargate Fit Into All This?
“AWS Fargate is a compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters. With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers.” AWS (2019).
The serverless architecture design pattern recommends a kind of infrastructure that hides the concept of servers. We wanted to use Docker containers as the deployment artifact of ETL jobs in order to decompose monolithic applications into many interoperable microservices. Much has been said and written about modern microservice infrastructure design patterns, but the beauty of using AWS Fargate for ETL jobs is that services can be implemented using different environments, programming languages, and frameworks, which can play a big role in ETL workflows. For example, an initial objective of the project was to build SQL models on top of raw data sources, which are then referenced by further models. These workflows can get complex quite easily, especially if you have complex business logic, but SQL as a common language ensures interoperability within a modular BI architecture: business users can also write SQL code and, with the help of a data engineer, put it into production very easily. Python jobs, on the other hand, are mostly suited to ingesting data from external APIs.
To achieve the aforementioned benefits, the application has to be decomposed into microservices very carefully. A useful guideline for the ETL data lake world is the single responsibility principle (SRP): separate workflow responsibility into data loading jobs and data transformation jobs, so that each job has only one reason to change, either a data source or a table.
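To make the idea of SQL models that build on one another more concrete, here is a minimal Python sketch of how a run order could be derived from model dependencies. The model names and the dependency map are purely hypothetical and not taken from our project; the sketch only assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical SQL models: each name maps to the models or raw sources it selects from.
MODEL_DEPENDENCIES = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_sales": {"stg_orders", "stg_customers"},
    "rpt_daily_revenue": {"fct_sales"},
}


def build_order(dependencies):
    """Return the models in an order where every model runs after its inputs."""
    sorter = TopologicalSorter(dependencies)
    # Raw sources appear only as inputs; they are loaded by ingestion jobs, not built here.
    raw_sources = {d for deps in dependencies.values() for d in deps} - set(dependencies)
    # static_order() raises graphlib.CycleError if models reference each other in a loop.
    return [name for name in sorter.static_order() if name not in raw_sources]


order = build_order(MODEL_DEPENDENCIES)
```

Each entry in the resulting list could then be executed as its own transformation step, so that a model never runs before the models it reads from.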
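As an illustration of this split, the following Python sketch keeps loading and transformation in two separate job types. The class names, the write and execute callables, and the table names are hypothetical placeholders rather than our production code; in practice they would wrap a warehouse client.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LoadJob:
    """Moves rows from one source into one raw table; changes only when the source changes."""
    source: str
    target_table: str
    extract: Callable[[], Iterable[dict]]

    def run(self, write: Callable[[str, list], None]) -> int:
        rows = list(self.extract())
        write(self.target_table, rows)
        return len(rows)


@dataclass(frozen=True)
class TransformJob:
    """Builds one table from SQL; changes only when that table's logic changes."""
    target_table: str
    sql: str

    def run(self, execute: Callable[[str], None]) -> None:
        execute(f"CREATE OR REPLACE TABLE {self.target_table} AS {self.sql}")


# Usage with in-memory stand-ins for the warehouse (assumptions for the example).
written, executed = {}, []
load = LoadJob("orders_api", "raw_orders", lambda: [{"id": 1}, {"id": 2}])
transform = TransformJob("stg_orders", "SELECT id FROM raw_orders")

loaded_rows = load.run(lambda table, rows: written.update({table: rows}))
transform.run(executed.append)
```

Because neither job knows about the other, each one could be packaged into its own container and scheduled independently.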
Picture 2: A High-Level Sample Data Pipeline with AWS Fargate and Snowflake.
Picture 2 shows a high-level data pipeline in which AWS Fargate tasks execute extraction and transformation jobs separately. We wanted a codebase that improves maintainability, so that developers can functionally extend the existing code and avoid completely rewriting parts after a change in the specification or a redesign, as we faced, for example, with Apache Airflow. This change gives us the flexibility to deploy and execute services in a reliable and portable way and to work with many different inputs (data sources) and outputs (data warehouses).
One major drawback of using AWS Fargate in the ETL context is its resource limits. If your workload exceeds 30 GB, e.g. for ML retraining processes, or you cannot use, say, Snowflake’s computing power to process large tables, you will have to consider services like AWS EMR or Glue as part of your solution’s tech stack, depending on your use case.
Why We Didn’t Choose AWS Glue (for Now)
You might be wondering why we did not choose AWS Glue to ingest, clean, transform, and structure data in our data lake, as it is widely recommended and also integrates with Snowflake.
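A rough sketch of how one such job might be started as a Fargate task from Python is shown below. It only builds the arguments for the ECS RunTask call; the cluster, task definition, container name, subnet, and security group values are placeholders, and the helper function itself is a hypothetical convenience, not part of any AWS library.

```python
def fargate_run_task_kwargs(cluster, task_definition, container_name,
                            command, subnets, security_groups):
    """Build keyword arguments for the ECS RunTask API using the Fargate launch type."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        # Fargate tasks use awsvpc networking, so subnets must be supplied.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        # Override the container command so one image can serve several jobs.
        "overrides": {
            "containerOverrides": [
                {"name": container_name, "command": list(command)}
            ]
        },
    }


# All identifiers below are placeholders for the example.
kwargs = fargate_run_task_kwargs(
    cluster="etl-cluster",
    task_definition="extract-orders",
    container_name="etl",
    command=["python", "extract.py"],
    subnets=["subnet-placeholder"],
    security_groups=["sg-placeholder"],
)

# With boto3 installed and AWS credentials configured, the task could be started with:
#   import boto3
#   boto3.client("ecs").run_task(**kwargs)
```

Keeping the argument construction separate from the actual API call makes it easy to check the configuration without touching AWS.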
“AWS Glue is a serverless, cloud-optimized, and fully managed ETL service that provides automatic schema inference for your structured and semi-structured datasets. AWS Glue helps you understand your data, suggests transformations, and generates ETL scripts so that you don’t need to do any ETL development.” Gupta (2018).
I raise the question of why we did not use Glue for ETL jobs in this blog post because I could predict with 100% certainty that it would come up after presenting our new data architecture, e.g. at our meet-up (note that it’s in Portuguese²).
Previous articles on serverless ETL data infrastructures based on AWS Glue have not explained its specific use cases. AWS itself notes: “AWS Glue provides a managed ETL service that runs on a serverless Apache Spark environment”, which already shows that it should generally be used to transform large data sets such as machine learning data sets or browsing history.
As this cannot serve all kinds of ETL pipelines, AWS launched a feature in January 2019 to run Python scripts for small to medium-sized ETL tasks. However, small to medium-sized ETL tasks often need a more complex project structure than a simple ETL script. Hands-on development also showed that Glue is developed by and for the AWS Console. It can get slow and painful to match all kinds of development data, environments, and frameworks to your application, as you have to zip and load Python libraries into your development endpoint; moreover, libraries that rely on C extensions, such as pandas, are not supported at present. You should also carefully understand AWS Glue pricing and default setups so that you do not run into unexpected costs: spinning up a default 10 DPU cluster is slow and has a minimum billing period of 10 minutes, which results in a minimum of 0.73 US$ per started ETL job [$0.44/60 * 10 mins * 10 DPU = 0.73 US$].
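The minimum-cost arithmetic can be reproduced with a few lines of Python. This is a sketch that uses only the figures quoted above ($0.44 per DPU-hour, 10 DPUs, a 10-minute minimum billing period); the function name and the per-minute billing model are simplifying assumptions, and current AWS pricing may differ.

```python
def glue_job_cost(runtime_minutes, dpus=10, dpu_hour_price=0.44, minimum_minutes=10):
    """Estimate the cost in US$ of one Glue job run, applying the minimum billing period."""
    billed_minutes = max(runtime_minutes, minimum_minutes)
    return dpu_hour_price / 60 * billed_minutes * dpus


# A 2-minute job is still billed for the 10-minute minimum.
short_job = round(glue_job_cost(2), 2)
# A 30-minute job is billed for its actual runtime.
long_job = round(glue_job_cost(30), 2)
```

Under these assumptions, a job that finishes in two minutes costs the same as one that runs for ten, which is why many small jobs can add up quickly.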
Avoid Falling Into the Hype of New Technology
In general, we must avoid falling for the hype and marketing campaigns of new technologies. It is challenging to decide which tools and frameworks your organization needs, but be honest with yourself: if your organization does not have “big data”, do not follow the hype and introduce complexity with services like AWS EMR or frameworks like Spark just because hands-on practice will boost your career. Keep things simple and bring a “can-do” mentality once that issue is real or requirements expand; this mindset is what makes us great data specialists. As Kupp (2017) already put it:
“With very few exceptions, you don’t need to build infrastructure or tools from scratch in-house these days, and you probably don’t need to manage physical servers. The skyscraper is already there, you just need to choose your paint colors.”
While there are many noteworthy features of AWS Glue, there are some serious limitations as well. Its strategy and features cannot escape criticism when it comes to cost savings, development, compatibility, maintainability, and portability, considering modern ETL and fast-moving industry needs (never forget that hiring good, passionate, and highly skilled data professionals is hard). The learning curve for Glue is steep. You will need to ensure that your team includes engineers with strong knowledge of Spark and of development concepts in AWS Glue. At the same time, the data field will continue to gain prevalence. More people will work with data and gain insights from it, which should lead to easy-to-use data stacks for deploying data pipelines that collect, clean, manipulate, label, analyze, and visualize data, even by people with limited engineering experience. AWS Glue is here to usher in a new paradigm of creating ETL pipelines, where the majority of steps will be drag, drop, swipe, point, and click. In the future, the benefits of big data will be available to many individuals and companies without the formation of highly professional teams and the involvement of consulting firms.
Our findings in the study of modern ETL architecture patterns have several important implications for future data practices at AMARO. Modern ETL architecture abstractions must be powerful enough to allow ingestion and transformation pipelines to be built quickly. Serverless architecture patterns and services have brought huge relief from the massive operational burden of provisioning, configuring, monitoring, and remediating infrastructure. Services like AWS Glue still suffer from (in our opinion) serious limitations in the common real-world ETL use cases the industry needs, but they have opened an exciting path in the data engineering discipline where even non-technical people will be able to release ETL pipelines quickly with little training.
Our last blog post of this series will deal with the design of an AWS Fargate Task External Networking Infrastructure with Snowflake. Stay tuned.
Gupta, Mohit (2018). Serverless Architectures with AWS: Discover how you can migrate from traditional deployments to serverless architectures with AWS. Packt Publishing Ltd. ISBN 1789802237.
¹ There might be real-time processes, such as streaming data, that accumulate many small files on S3; querying that data will kill your performance and budget. Therefore, you will need another pipeline that merges the small files. It sounds easy at first, but it introduces more processes to take care of and requires knowledge and experience to handle in a “best practice” way. #dontlike