If you need to build an ETL pipeline for a big data system, AWS Glue looks very promising at first glance. Glue is a fully managed service that prepares and loads data for analytics using Apache Spark. It also provides automatic metadata discovery and data cataloging. Its serverless architecture makes it attractive and cost-effective for ETL pipelines that run infrequently, and it drastically reduces the management overhead and deployment time of building such a pipeline.
Let’s say we are building a weather forecasting app. To forecast accurately we need to collect weather data, so we have subscribed to weather report providers on different continents. Every provider has its own style of reporting, so the datasets differ in schema, file format, and level of granularity. We want to transform all those reports into a standard format so that our downstream systems can use the data.
The architecture will look something like the above. On the left-hand side, service providers supply reports, which are uploaded into a central repository. Glue then picks up those reports and transforms them into a standard format that weather prediction systems and other downstream applications understand. Let’s go step by step to see where and how Glue fits into this data pipeline.
While considering Glue, the first question to ask is whether your application is real-time. If the answer is yes, you should reconsider using Glue, because it is meant for batch processing only. The reasons will become clear over the course of this article. Let’s go through Glue’s features one by one.
In our scenario, there are different data sources sending reports to our centralized data repository. Even though the main contents of the reports are the same (weather report), they might differ in the following ways.
- Schema of the files (column name, column order)
- File Format (Avro, CSV, Parquet, JSON, etc.)
- Structured data, semi-structured data, or both
- Partitioned Data
In a scenario like this, a Glue Crawler can be very helpful. Crawlers can identify file types, schemas, and data types, along with partition information, row counts and more. They let you discover data in S3 or a JDBC source and populate the Data Catalog from it. A crawler automatically creates a new catalog table if one doesn’t exist. It uses Classifiers to identify the schema (column names and data types) from the underlying data. It also understands data partitions and creates columns for them. Crawlers can detect a change in the schema of the data and update the Glue tables accordingly. If your underlying data changes frequently, you can schedule a crawler to run whenever new data is uploaded.
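As a rough sketch of the crawler setup described above, the helper below only builds the keyword arguments that could be passed to boto3’s `create_crawler` call on a Glue client. The crawler name, IAM role, database name, S3 path and cron schedule are all hypothetical placeholders, not values from our pipeline.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for boto3's glue.create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update catalog tables when the schema changes; only log deletions.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        request["Schedule"] = schedule
    return request


# All values below are made up for illustration.
request = build_crawler_request(
    name="weather-reports-crawler",
    role_arn="arn:aws:iam::123456789012:role/HypotheticalGlueRole",
    database="weather_raw",
    s3_path="s3://hypothetical-weather-bucket/reports/",
    schedule="cron(0 2 * * ? *)",
)
# With AWS credentials configured, this could be sent as:
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as a plain dictionary makes it easy to review or unit-test the crawler definition before anything is created in AWS.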
We have noticed that the crawler does not treat the header row as column names when all the columns in a CSV file are of string type. You can work around this by writing a custom Classifier, but then you have to specify the columns yourself, which defeats the purpose of the Crawler.
The timestamp data type is another thing we have noticed that does not work well with Glue. A Grok pattern can be used to work around this, but it requires you to write a pattern for every column in the CSV, which is not ideal.
My recommendation would be to use Parquet or Avro files to avoid most of these shortcomings.
Even if data in different folders shares a common schema, the Crawler still creates a separate table for each folder path. This is because one catalog table can refer to only one directory and its subdirectories.
Once the reports land in the data repository, we want to do some exploratory work to understand the data. For the schema and metadata information, we can refer to the Glue tables created by the crawler. But that might not be enough to understand the incoming dataset. Similar information might be represented differently across datasets and need to be normalized. A few column names and types might need fixing. All of this can be done easily in the Data Catalog.
The Glue Data Catalog lets you find the data you need and use it in the tools of your choice. Your data stays where you want it, and the Data Catalog helps you discover and work with it. Your data can move between different S3 paths or even JDBC data sources as long as the catalog tables are kept in sync. This way your code need not worry about where the data is stored. The Glue Catalog can be used as a replacement for the Hive Metastore and integrates with existing Spark code. Once you have the Data Catalog ready, you can query the data from any tool that supports external schemas, such as Athena and Redshift. This lets you make your data available for querying in a matter of seconds, regardless of the data store and data type. Catalogs are generally aware of data partitioning, row counts, skew information and more, which can help determine how to read the data optimally. However, there is no API to access that information from ordinary Spark jobs.
Glue catalogs are organized into Databases and Tables. Each table maintains 3 main pieces of information: where the data is stored, which SerDe (Serializer/Deserializer) is to be used, and what the schema of the data is. The data can be stored in subdirectories of the S3 path provided. All data under the path needs to be of the same type because it shares a common SerDe.
The AWS Glue Catalog maintains a column index for each column in the data. If the ordering of the columns in a CSV differs across files, Glue will start picking up the wrong column data without any warning. This problem does not occur with Avro, Parquet and JSON data, because the field names travel with the data itself rather than being inferred from position.
Glue catalog tables are static metadata. They are not automatically updated to reflect the latest state of the storage system. For example, if you write a new partition to an existing data location, Glue does not make it available for querying until you run the Crawler again. This extra step delays the readiness of the data for consumption.
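To make those three pieces concrete, here is a minimal sketch of a table definition shaped like the `TableInput` structure that the Glue API accepts, plus a small helper that pulls out the location, the SerDe and the column names. The table name, bucket and columns are hypothetical.

```python
def describe_table(table_input):
    """Return (location, SerDe library, column names) for a Glue-style table definition."""
    storage = table_input["StorageDescriptor"]
    return (
        storage["Location"],
        storage["SerdeInfo"]["SerializationLibrary"],
        [column["Name"] for column in storage["Columns"]],
    )


# Hypothetical table definition for Parquet reports from one provider.
table_input = {
    "Name": "europe_reports",
    "StorageDescriptor": {
        # 1. Where the data is stored
        "Location": "s3://hypothetical-weather-bucket/reports/europe/",
        # 2. Which SerDe reads and writes it
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
        # 3. The schema of the data
        "Columns": [
            {"Name": "station_id", "Type": "string"},
            {"Name": "temperature_c", "Type": "double"},
        ],
    },
    "PartitionKeys": [{"Name": "report_date", "Type": "string"}],
}

location, serde, columns = describe_table(table_input)
```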
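The column-order pitfall can be reproduced in plain Python. This is only an illustration of position-based versus name-based mapping, not Glue’s actual reader, and the two tiny CSV files are made up.

```python
import csv
import io

file_a = "station,temperature,humidity\nS1,21.5,40\n"
file_b = "station,humidity,temperature\nS2,55,18.0\n"  # same columns, different order


def read_by_position(text, columns):
    """Mimic index-based mapping: skip the header and assign names by position."""
    rows = list(csv.reader(io.StringIO(text)))[1:]
    return [dict(zip(columns, row)) for row in rows]


def read_by_header(text):
    """Use each file's own header, so column order does not matter."""
    return list(csv.DictReader(io.StringIO(text)))


catalog_columns = ["station", "temperature", "humidity"]  # order taken from file_a

wrong = read_by_position(file_b, catalog_columns)[0]
right = read_by_header(file_b)[0]

print(wrong["temperature"])  # 55 (this is really the humidity value)
print(right["temperature"])  # 18.0
```

No error is raised in the position-based case, which is exactly why this kind of mismatch is easy to miss.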
Once we understand the data, the next step is to create a Glue Job to transform it into the standard format. Glue gives you a platform to run Spark jobs written in either Scala or Python. AWS provides GlueContext and DynamicFrame, abstractions on top of SparkContext and DataFrame respectively, to connect easily with the Glue Catalog and do ETL transformations. The code can be written in the AWS console UI or, even better, generated with a few clicks. Though DynamicFrame has some useful ETL features, it is missing a lot of basic functionality.
GlueContext and DynamicFrame are not easy to use in your local development lifecycle. The binaries of these libraries are not available through Maven or pip. They are not even available to download as a jar from the AWS website. This makes it harder to code and compile locally.
Here the Glue dev endpoint comes to the rescue. You can create a dev endpoint and set up some configuration in your IDE to develop code locally and run it inside Glue. The endpoint is charged for the amount of time it is open to receive requests. A dev endpoint is a very good way to test your code in the build pipeline, but it is not worth the cost during development. So we downloaded the glue-assembly.jar from the dev endpoint using the scp command and used it to compile and build the jar of our code. If you are using Python, it is relatively simpler: you can copy-paste the code from GitHub.
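A minimal sketch of such a transformation in Python is shown below. It assumes a `GlueContext` is passed in (inside a Glue job it would typically be created with `GlueContext(SparkContext.getOrCreate())`), and it uses `create_dynamic_frame.from_catalog`, `DynamicFrame.apply_mapping` and `write_dynamic_frame.from_options` as I understand them from the awsglue library. The database, table, column names and output path are hypothetical.

```python
def standardize_reports(glue_context, database, table, output_path):
    """Read a catalog table, rename and cast columns, and write the result as Parquet."""
    reports = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table
    )

    # Each tuple is (source column, source type, target column, target type).
    # These column names are invented for the example.
    standardized = reports.apply_mapping(
        [
            ("station", "string", "station_id", "string"),
            ("temp", "string", "temperature_c", "double"),
        ]
    )

    glue_context.write_dynamic_frame.from_options(
        frame=standardized,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
    return standardized
```

Passing the context in as an argument, rather than creating it inside the function, keeps the function importable on a machine where the Glue libraries are not installed.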
We have found that each Glue Job has a cold start time of 10 to 12 minutes. Cold start time is the time between when the job is submitted and when it gets the resources to start executing. It generally depends on the amount of resources you request and their availability on the AWS side. This is the major drawback we have found. Assuming you have 5 Jobs in your pipeline, the pipeline will spend 50 to 60 minutes just waiting for resources. Cold starts are very common in the serverless world, where the first run takes time to set up the environment. Subsequent runs start pretty fast because the environment is reused. But once the environment has been idle for some time (10–12 minutes in the case of Glue), AWS tears it down, and you have to bear the cost of setting it up again. Environments are allocated per Glue Job, so different Glue Jobs cannot reuse the same environment.
Glue gives you very little control over the job environment. Glue uses the DPU (Data Processing Unit) as its unit of processing. 1 DPU is 4 vCPUs and 16 GB of RAM, and each DPU runs 2 executors. This configuration might not be suitable for many use cases. Providing Spark configuration to the Job is not easy to manage. The Spark UI was made available only recently. To me, it feels like the tool is not yet ready for a hassle-free development experience. It will need some more time to mature.
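The arithmetic behind that estimate is simple. The sketch below assumes the jobs run one after another and that each one pays a full cold start in the 10 to 12 minute range we observed.

```python
def cold_start_overhead_minutes(job_count, cold_start_range=(10, 12)):
    """Return the (low, high) total wait in minutes for sequential jobs."""
    low, high = cold_start_range
    return job_count * low, job_count * high


low, high = cold_start_overhead_minutes(5)
print(f"{low} to {high} minutes")  # 50 to 60 minutes
```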
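To see what a given DPU count translates to, the figures above can be multiplied out. This is a simple division of the quoted numbers. It ignores the driver and any memory overhead, so the memory actually usable by each executor is likely to be lower.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16
EXECUTORS_PER_DPU = 2


def dpu_resources(dpus):
    """Nominal resources for a number of DPUs, using the figures quoted in the text."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
        "executors": dpus * EXECUTORS_PER_DPU,
        "vcpus_per_executor": VCPUS_PER_DPU // EXECUTORS_PER_DPU,
        "memory_gb_per_executor": MEMORY_GB_PER_DPU // EXECUTORS_PER_DPU,
    }


resources = dpu_resources(10)
print(resources["executors"], resources["vcpus"], resources["memory_gb"])  # 20 40 160
```

The per-executor shape is fixed by the DPU definition, so a job that needs a few large-memory executors cannot get them simply by asking for more DPUs.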
Now that we have a Crawler and a Job, we need to integrate the two. Once the reports become available in S3, we can fire a trigger that runs the Crawler first. When the Crawler succeeds, it should trigger the Spark Job to do the transformation.
If you have multiple jobs and you want to tie them together, you can use a Glue Workflow to build your data pipeline. Building a Workflow in the AWS console takes only a few minutes.
We have found that the workflow UI is not very intuitive, and editing is hard. But generally you will be editing it via code, so that is not a problem. The problematic part is that once you trigger the workflow, you can’t stop it. You have to go to the currently running job and stop that instead. That is the only way to stop the workflow, which is annoying. Passing and managing pipeline and job parameters is also a bit harder than it should be.
I’m going to compare Glue with EMR, as that is most likely what you would use in the absence of Glue. Development cost is the major thing you will save when using Glue. The time to production will be much shorter because you get many things for free, such as the Data Catalog, Crawlers and Workflows. You will be saved from the maintenance overhead of the infrastructure and can focus on the actual tasks.
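Glue triggers can express this dependency natively, but the order of operations is easier to see when written out by hand. The sketch below takes a boto3-style Glue client as an argument and uses its `start_crawler`, `get_crawler` and `start_job_run` calls. It assumes the crawler reports a non-`READY` state as soon as it has been started, and the polling interval is an arbitrary choice.

```python
import time


def run_crawler_then_job(glue, crawler_name, job_name, poll_seconds=30):
    """Start a crawler, wait until it is idle again, then start the job if the crawl succeeded."""
    glue.start_crawler(Name=crawler_name)

    # Poll until the crawler goes back to READY (it passes through RUNNING and STOPPING).
    while True:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            break
        time.sleep(poll_seconds)

    status = crawler.get("LastCrawl", {}).get("Status")
    if status != "SUCCEEDED":
        raise RuntimeError(f"Crawler {crawler_name} finished with status {status}")

    return glue.start_job_run(JobName=job_name)["JobRunId"]


# With AWS credentials configured, this could be called as:
#   run_crawler_then_job(boto3.client("glue"), "weather-reports-crawler", "standardize-reports")
# Both names are hypothetical.
```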
The Glue Catalog lets you store up to 1 million objects for free, which is good enough for most users. Running Glue Jobs and Crawlers, on the other hand, is expensive, although you pay only for the time your ETL job takes to run. The cost is based on the number of DPUs used. The cost of 1 DPU ($0.44/hr) can be compared with that of an m5.xlarge Linux instance ($0.192/hr), which also has 4 vCPUs and 16 GB of RAM.
Be aware of Glue’s limits as well when choosing it as a solution.
If your ETL pipeline is fairly simple, you can quickly create a Glue ETL pipeline to present something working. But I wouldn’t recommend it yet for complex use cases. Managing the software development lifecycle is harder with Glue Jobs.
I feel that Glue has all the potential to give you a complete, serverless, managed, end-to-end ETL platform for running Spark batch jobs. But it does not yet offer a seamless experience. It has to become more developer-friendly so that developers can code locally with all the required libraries. The cold start and many other areas need a lot of improvement. As of now, you have to put in a lot of hacks here and there to make things work.
If you are relatively new to Spark and don’t want to invest much time in deploying and managing a cluster, go ahead with Glue. There will be some hiccups, but if your ETL pipeline is not too complex, Glue will be a good fit for you. Glue supports automatic code generation from the AWS console, which will be handy for novice users who need to change and manage the code.
There might be more pros and cons. I have tried to share my experience with Glue as it stands now. Please do your own research to check whether Glue fits your needs…
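A quick back-of-the-envelope comparison using the two hourly rates above is sketched below. The 10-DPU, 30-minute job is a made-up example, and the calculation ignores billing minimums and any additional charges on the EC2 side, so treat it as a rough ratio rather than a quote.

```python
DPU_HOURLY_USD = 0.44
M5_XLARGE_HOURLY_USD = 0.192


def job_cost(units, runtime_minutes, hourly_rate=DPU_HOURLY_USD):
    """Cost of running `units` DPUs (or instances) for `runtime_minutes`."""
    return units * hourly_rate * runtime_minutes / 60


glue_cost = job_cost(10, 30)
ec2_cost = job_cost(10, 30, M5_XLARGE_HOURLY_USD)
ratio = DPU_HOURLY_USD / M5_XLARGE_HOURLY_USD

print(f"Glue: ${glue_cost:.2f}, EC2 only: ${ec2_cost:.2f}, ratio: {ratio:.2f}x")
# Glue: $2.20, EC2 only: $0.96, ratio: 2.29x
```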