All You Need To Know About AWS Glue Studio

Jay Jain · Published in Geek Culture · Aug 31, 2021

Source: AWS Blog

Introduction To AWS Glue

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.
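As a quick, hedged illustration (not part of the original walkthrough), the Data Catalog can also be browsed programmatically with boto3; the database name below is a hypothetical placeholder.

import boto3

# Glue Data Catalog client (the region is an assumption for the example)
glue = boto3.client("glue", region_name="us-east-1")

# List the databases registered in the Data Catalog
for db in glue.get_databases()["DatabaseList"]:
    print("database:", db["Name"])

# List the tables in one (hypothetical) database and where their data lives
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print("table:", table["Name"], "->", table["StorageDescriptor"]["Location"])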

Introduction to AWS Glue Studio

AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface allows those who don’t know Apache Spark to design jobs without coding experience and accelerates the process for those who do.

AWS Glue Studio was designed to help you create ETL jobs easily. After you design a job in the graphical interface, it generates Apache Spark code for you, abstracting users from the challenges of coding. When the job is ready, you can run it and monitor the job status using the integrated UI.
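For readers curious about that generated code, a Glue Studio job script typically starts from a boilerplate along these lines (a rough sketch, not the exact output of any particular job):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup: resolve arguments, build contexts, initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... the source, transform, and target nodes you draw on the canvas are
# emitted here as PySpark calls ...

job.commit()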

Features of AWS Glue Studio

  1. Visual Job editor
  2. Job script editor
  3. Job performance dashboard
  4. Support for dataset partitioning

Illustration

Step 1: Create a job by choosing one of the available options.
In this article, we go ahead with “Visual with a blank canvas”.

Step 2: After you click Create, a new blank canvas appears, and you can fill in the necessary details in the “Job Details” section. For example, the job name, description, IAM role to be used, and all of the Glue job parameters can be configured from here.

Job Details
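For completeness, the same job details can also be defined programmatically with boto3’s create_job; the job name, role ARN, and script path below are hypothetical placeholders rather than values from this walkthrough.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-etl-job",                                   # job name from the Job Details form
    Description="Demo job authored outside the console",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # IAM role the job assumes
    Command={
        "Name": "glueetl",                               # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/my-etl-job.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={"--TempDir": "s3://my-bucket/temp/"},
)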

Step 3: The Visual section has three main sub-sections: Source, Transform, and Target.
Step 3.1: In the Source section, you will find the different sources that can be used to build a Glue job, such as Amazon S3, Redshift, relational databases, Kinesis, PostgreSQL, and many more.

Source
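In the generated script, a source node corresponds to a create_dynamic_frame call. A minimal sketch, assuming the glueContext from the boilerplate above and hypothetical bucket, database, and table names:

# Read raw files directly from S3
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/orders/"]},
    format="json",
    transformation_ctx="source_dyf",
)

# Or read a table already registered in the Glue Data Catalog
catalog_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    transformation_ctx="catalog_dyf",
)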

Step 3.2: In the Transform section, standard transformations such as Drop Fields, Rename Field, Apply Mapping, and Select Fields are available directly.
You can also use the Join, Split Fields, and fill-missing-values options available on the console (UI).
Apart from this, you can transform your data with Spark SQL or by writing your own code, using the Custom Transform or Spark SQL options.

Transform 1

Once you select any of the above options, a new properties panel appears on the side, where you can review the properties of the selected node and even preview the data from the preview tab.

Transform 2
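Under the hood, nodes such as Apply Mapping and Spark SQL compile down to PySpark calls in the generated script. A rough, hypothetical equivalent, reusing the glueContext, spark, and source_dyf objects from the sketches above (the column names are made up):

from awsglue.transforms import ApplyMapping
from awsglue.dynamicframe import DynamicFrame

# "Apply mapping" node: rename/cast fields (fields not listed are dropped)
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount",   "string", "order_amount", "double"),
    ],
    transformation_ctx="mapped_dyf",
)

# "Spark SQL" node: expose the data as a temp view and query it
mapped_dyf.toDF().createOrReplaceTempView("orders")
filtered_df = spark.sql("SELECT order_id, order_amount FROM orders WHERE order_amount > 0")

# Convert back to a DynamicFrame for downstream Glue nodes
filtered_dyf = DynamicFrame.fromDF(filtered_df, glueContext, "filtered_dyf")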

Step 3.3: After applying the relevant transformations, the last step is to write the refined, transformed data to the relevant destination.
Just as Glue Studio provides multiple source options, it provides multiple target options, such as Amazon S3, Redshift, RDS, PostgreSQL, and many more.

Target 1

Whenever you select one of the above nodes for your use case, another properties panel appears on the side.
For this example, we go ahead with Amazon S3 as our target location. Here, you can also select the format in which the data should be written.

Target 2
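An Amazon S3 target node with Parquet output maps to a write_dynamic_frame call in the generated script. A minimal sketch, with a hypothetical output path and partition column, writing the filtered_dyf frame from the transform sketch above:

glueContext.write_dynamic_frame.from_options(
    frame=filtered_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/curated/orders/",
        "partitionKeys": ["order_date"],   # optional dataset partitioning
    },
    format="parquet",
    transformation_ctx="target_sink",
)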

Step 4: After selecting the target, the next step is to save the job; you will see “Successfully created job”. Start the job by clicking Run. Wait a few moments and you should see your ETL job’s run status change to “Succeeded”.
The best part of AWS Glue Studio is that you can view the PySpark code Glue Studio has generated and reuse it for other purposes if needed.
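If you prefer to trigger and watch the job outside the console, the Run step can also be done with boto3 (a hedged sketch; the job name is a placeholder):

import time
import boto3

glue = boto3.client("glue")

# Equivalent of clicking Run in Glue Studio
run = glue.start_job_run(JobName="my-etl-job")
run_id = run["JobRunId"]

# Poll until the run reaches a terminal state
while True:
    state = glue.get_job_run(JobName="my-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    print("run state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)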

Step 5: AWS Glue Studio offers a job monitoring dashboard that provides comprehensive information about your jobs. You can get job statistics and see detailed information about each job and its status while it is running.
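The same run-level statistics shown in the dashboard can also be pulled programmatically, for example with boto3’s get_job_runs (the job name is again a hypothetical placeholder):

import boto3

glue = boto3.client("glue")

# Summarize recent runs of the job: id, state, start time, runtime in seconds
for job_run in glue.get_job_runs(JobName="my-etl-job")["JobRuns"]:
    print(
        job_run["Id"],
        job_run["JobRunState"],
        job_run.get("StartedOn"),
        job_run.get("ExecutionTime"),
    )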

Conclusion:
ETL is a core part of any data project, especially in today’s world where we say that “data is the new oil”. To reduce the development effort spent on heavy coding and to provide accessible, manageable serverless infrastructure, AWS has come up with this feature inside AWS Glue, known as AWS Glue Studio.

In this article, I have tried my best to cover all the features and functionality of AWS Glue Studio. I hope this helps with your projects; if you find any points worth mentioning that I have missed, please put them in the comments below.
