Data Engineering — Part 1 — Implementation of Data Lake using AWS Glue
As data grows to a large scale, a platform is required to organize it into a consumable form so you can make the best use of it. As the saying goes, "Data is the new oil." This is why every company needs a data lake, to ensure valuable insights and trends are not lost across tons of different reporting systems and log files.
In this article, we discuss a straightforward way to implement a robust, scalable Enterprise Data Lake (EDL) using AWS Glue and other related services.
1. ETL tool (Informatica or Pentaho) — Pushes the on-premise data to S3
2. S3 — Data storage layer
3. AWS Glue — PySpark — Distributed data processing
4. AWS Lambda — Trigger
5. AWS Redshift — MPP database
6. AWS CloudWatch
7. AWS SNS
8. AWS SQS
Readers are encouraged to read about AWS Glue on the Amazon website. This article focuses only on the implementation of a data lake using AWS Glue.
As you can see from the architecture diagram, data is captured from PostgreSQL, SQL Server, Oracle and file formats including, but not limited to, CSV and JSON. A number of data sources are managed by external vendors and are accessed through API calls.
Although there are many tools available for onboarding sources from on-premise to the cloud, we recommend the open-source Pentaho ETL tool, as it is easy to install and lets you write your own plugins. If there are no budget constraints, you can also explore commercial tools such as Informatica, Talend, etc. :)
AWS Glue is one of the best serverless services for large-scale data processing. Unlike EMR, you do not need to specify Spark configuration such as executor and driver memory for each job unless it is really required. Instead, you simply choose the machine (worker) type and the number of machines required to process each job.
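For instance, starting a job run with boto3 only requires picking a worker type and count. This is a minimal sketch, not a full job definition, and the job name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Start a run of an existing Glue job, choosing only the machine type and count
# (no executor/driver memory tuning). "edl-cleansing-job" is a placeholder name.
response = glue.start_job_run(
    JobName="edl-cleansing-job",
    WorkerType="G.1X",        # machine type for each worker
    NumberOfWorkers=10,       # number of machines allocated to this run
)
print(response["JobRunId"])
```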
S3 Storage Layer
All incoming heterogeneous source data is initially stored in the S3 landing layer without any cleansing or transformation. Data from API, FTP and SFTP sources is copied directly to the S3 landing layer, whereas the database sources are connected through the open-source Pentaho ETL tool for data movement to S3. The data is then processed in its corresponding buckets by applying cleansing, transformation and enrichment logic with the AWS Glue (PySpark) service.
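As an illustration, a minimal Glue (PySpark) job that moves data from the landing bucket to a curated bucket with a trivial cleansing step might look like the sketch below; the bucket names and paths are placeholders:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV files from the landing layer (placeholder path)
df = spark.read.option("header", "true").csv("s3://edl-landing/sales/")

# Trivial cleansing: drop fully empty rows and exact duplicates
cleaned = df.dropna(how="all").dropDuplicates()

# Write the cleansed data to the next layer as Parquet (placeholder path)
cleaned.write.mode("overwrite").parquet("s3://edl-curated/sales/")

job.commit()
```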
AWS Lambda Trigger
For every batch run created from the different sources through Pentaho, a manifest file is written to the S3 raw bucket layer and is used as an S3 event trigger for AWS Lambda. In some cases you can instead use SQS as the event source for AWS Lambda whenever custom messages come in.
The AWS Lambda function listens to S3 or SQS for the incoming trigger file or message and starts the corresponding Glue jobs.
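A minimal Lambda handler for the S3 (manifest file) case could look like the sketch below; the Glue job name and the argument key are placeholders:

```python
import json
import urllib.parse
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each S3 event record corresponds to one manifest file landing in the raw bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Start the first Glue job of the pipeline, passing the manifest location
        response = glue.start_job_run(
            JobName="edl-raw-load-job",  # placeholder job name
            Arguments={"--manifest_path": f"s3://{bucket}/{key}"},
        )
        print(json.dumps({"manifest": key, "jobRunId": response["JobRunId"]}))
```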
AWS Glue Trigger
An AWS Glue trigger is a Glue component that makes it easy to schedule a Glue job or start it based on dependencies on other jobs or AWS services.
For example, if a source has to be loaded all the way from SQL Server to Redshift:
SQL Server (Pentaho) -> S3 -> AWS Lambda -> Glue Job1 -> Glue Job2 -> Glue Job3 -> Final Glue Job(Redshift Load)
A separate Glue trigger is created for each Glue job based on its dependencies. You can also set up more than one dependency between jobs, as shown in the sketch below.
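For example, a conditional trigger that starts Glue Job2 only after Glue Job1 succeeds can be created with boto3 roughly as follows; the job and trigger names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Glue Job2 starts only when Glue Job1 finishes in the SUCCEEDED state
glue.create_trigger(
    Name="job1-to-job2",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "glue-job1", "State": "SUCCEEDED"}
        ],
    },
    Actions=[{"JobName": "glue-job2"}],
)
```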
A Glue trigger can be scheduled, on-demand or event-driven. In this case, the manifest file created by Pentaho triggers Glue Job1 through Lambda, and each successive job is picked up on the successful completion of the previous one. The same flow can be accomplished more cleanly with an orchestration tool such as Airflow.
AWS Glue Job
The AWS Glue job is another key component; it is where the custom code carrying the business logic is placed for the run. It supports Python and Scala at the moment, but Python with Spark (PySpark) is predominantly used for ETL pipelines in Glue.
The best way to implement custom code is to create your own framework module in Python or Scala, deploy the package to S3, and invoke the module for the relevant source from the Glue job with a simple command. This is the proper way to run your code in a production environment, rather than coding directly in the Glue job console.
You can reference your framework ZIP (Python) package through the Python library path (--extra-py-files) parameter in the job console.
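The Glue job script then stays very thin: it only parses its parameters and delegates to the framework. The module and function names below (edl_framework, run_pipeline) are hypothetical and stand in for whatever your package exposes:

```python
import sys
from awsglue.utils import getResolvedOptions

# Hypothetical framework package shipped as a ZIP on S3 and referenced
# through the job's Python library path (--extra-py-files)
from edl_framework.pipeline import run_pipeline

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_name", "layer"])
run_pipeline(source=args["source_name"], layer=args["layer"])
```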
AWS Glue Crawler
The beauty of AWS Glue is that the Glue Crawler automatically crawls the data stored underneath S3 and creates the table, with its columns and data types, on its own.
You just need to specify the file path when creating the crawler and start it on every successful completion of a Glue job to update the catalog. It creates a Hive Metastore entry with data types inferred from data sampling. One or multiple sources can be crawled at the same time (the crawler can point to the root folder of multiple sources). The service automatically detects new files, so there is no need to re-crawl petabytes of data every time the crawler runs.
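Starting the crawler after a successful job run is a single API call; the crawler name below is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Kick off the crawler so the catalog picks up the files the job just wrote
glue.start_crawler(Name="edl-curated-sales-crawler")

# Optionally check its state (READY, RUNNING or STOPPING)
state = glue.get_crawler(Name="edl-curated-sales-crawler")["Crawler"]["State"]
print(state)
```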
Another great feature is schema evolution, which allows schema changes to be onboarded into your lake seamlessly.
The shared data catalog, which is a Hive Metastore, is one of the key components. It enables all AWS services to discover and consume your data sources and allows easy integration with third-party vendors such as Databricks. It can also be integrated with a data governance platform such as Collibra. Moreover, most cloud providers have similar catalogs.
You can even specify your own schema in the form of a custom Grok pattern when creating the crawler.
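For example, a custom Grok classifier for a simple application log format could be registered as sketched below; the classifier name, classification and pattern are illustrative only:

```python
import boto3

glue = boto3.client("glue")

# Custom Grok classifier: timestamp, log level and free-text message
glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "application-log",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)
```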
AWS Cloudwatch & SNS
SNS can be enabled for each Glue job so that any failure is notified in real time to the email addresses subscribed to the SNS topics.
A Glue job writes all of its logs to CloudWatch. You can define CloudWatch rules that listen for Glue job run states (FAILED, SUCCEEDED, STOPPED) and send an SNS notification, optionally through a Lambda event.
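A sketch of such a rule, created with boto3 and targeting an SNS topic directly; the rule name and topic ARN are placeholders, and the topic policy must allow events.amazonaws.com to publish:

```python
import json
import boto3

events = boto3.client("events")

# Fire whenever any Glue job run ends in FAILED or STOPPED
events.put_rule(
    Name="glue-job-failure-rule",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "STOPPED"]},
    }),
    State="ENABLED",
)

# Route matched events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule="glue-job-failure-rule",
    Targets=[{"Id": "sns-alerts", "Arn": "arn:aws:sns:us-east-1:111111111111:glue-alerts"}],
)
```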
In some cases, instead of a manifest file, you can rely on custom messages arriving in SQS from different integrators for S3 file postings or other similar processes.
Lambda listens to SQS for incoming messages and kicks off the Glue job when one is received.
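A minimal SQS-driven Lambda handler might look like the sketch below; the message field (s3_path) and the Glue job name are assumptions about the integrator's message format:

```python
import json
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each SQS record body is expected to carry the S3 path posted by an integrator
    for record in event["Records"]:
        message = json.loads(record["body"])
        glue.start_job_run(
            JobName="edl-sqs-driven-job",                    # placeholder job name
            Arguments={"--input_path": message["s3_path"]},  # assumed message field
        )
```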
You can also leverage this option to post messages from Glue upon completion of a specific layer or Redshift source load, for any downstream process integrations.
Migration to Airflow Scheduler
Even though a Glue trigger solves the scheduling problem at the Glue level, it does not support dependencies across different services or in a hybrid platform.
As many industries have heavy dependencies spanning on-premise and cloud, the Airflow scheduler can be chosen for all orchestration (write your own Glue hooks/operators and bundle them with the Airflow package before installation).
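A custom Glue operator can be as simple as the sketch below, which starts a job run with boto3 and polls it until completion; the class and argument names are illustrative, not a finished implementation:

```python
import time
import boto3
from airflow.models.baseoperator import BaseOperator

class GlueJobOperator(BaseOperator):
    """Starts an AWS Glue job run and waits for it to finish."""

    def __init__(self, job_name, arguments=None, poll_interval=30, **kwargs):
        super().__init__(**kwargs)
        self.job_name = job_name
        self.arguments = arguments or {}
        self.poll_interval = poll_interval

    def execute(self, context):
        glue = boto3.client("glue")
        run_id = glue.start_job_run(
            JobName=self.job_name, Arguments=self.arguments
        )["JobRunId"]

        # Poll the run until it reaches a terminal state
        while True:
            state = glue.get_job_run(
                JobName=self.job_name, RunId=run_id
            )["JobRun"]["JobRunState"]
            if state == "SUCCEEDED":
                return run_id
            if state in ("FAILED", "STOPPED", "TIMEOUT"):
                raise RuntimeError(f"Glue job {self.job_name} ended in state {state}")
            time.sleep(self.poll_interval)
```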
I will explain this migration and the Airflow implementation in detail in the upcoming articles, which will cover:
1. An extended version of the current architecture that supports both real-time and batch processing. It involves some extra components such as Kinesis, Spark Streaming, DynamoDB, RDS (PostgreSQL) and a Redis cache.
2. How to develop an Enterprise Data Platform on top of the EDL and serve customers in different ways, including an AWS API Gateway RESTful API implementation.
3. Connecting to the Enterprise Data Lake and applying AI algorithms to build a model using AWS SageMaker.
AWS Glue is one of the most promising serverless distributed data processing platforms for daily data lake batch loads across many industries, and it helps process petabytes of data with minimal setup.
If you really enjoyed reading this post, please do like it and share it with your connections :)
If you have any questions regarding the architecture or its implementation, DM me (https://www.linkedin.com/in/mohamed-imran-70077b62/) or Seva Konoplich (https://www.linkedin.com/in/skonoplich/).
Authors: Seva Konoplich, Mohamed Imran