AWS Glue Development Environment

SquareShift Technologies
SquareShift
Oct 8, 2020

We built a complete ETL pipeline and data warehouse for EdCast using AWS Glue and Amazon S3. In this post, we share the key learnings from that experience.

Glue Development Environment

Glue ETL scripts can be developed and tested in several ways. The more prominent options are:

  1. Using a development endpoint and a notebook (AWS hosted)
  2. Using a development endpoint and a Zeppelin notebook server in the local environment
  3. Local development using the AWS Glue ETL library

Using a development endpoint and notebooks (remote)

An AWS Glue development endpoint is a managed (paid) Glue environment for developing and testing ETL scripts. The environment includes Apache Spark and the Glue libraries, along with the network configuration needed to access it securely from a Jupyter notebook.

AWS supports launching an EC2 instance with a Jupyter Notebook server. The notebook can be used to author and test ETL scripts interactively; the same scripts are then used in Glue jobs.

For more information on development endpoints:

Pros

  • Easy to launch and use
  • Since the development endpoint is similar to the actual AWS Glue environment, it is easy to develop and test in a production-like environment.

Cons

  • Expensive: in our experience, a dev endpoint costs about $1500 per month
  • Slower development, since the dev endpoint runs remotely
  • No easy way to write unit test cases
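To make the cost point concrete, here is a rough back-of-the-envelope sketch of what an always-on development endpoint costs per month. The DPU count (5) and the hourly rate ($0.44 per DPU-hour) are assumptions for illustration and are not taken from our bill; check the current AWS Glue pricing page for your region before relying on them.

```python
# Rough monthly cost of a development endpoint that is left running.
# ASSUMPTIONS (illustrative only): 5 DPUs, $0.44 per DPU-hour, 30-day month.
HOURS_PER_MONTH = 24 * 30


def monthly_endpoint_cost(dpus: int, usd_per_dpu_hour: float,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Return the cost in USD of running `dpus` DPUs for `hours` hours."""
    return dpus * usd_per_dpu_hour * hours


if __name__ == "__main__":
    # Endpoint left running around the clock.
    always_on = monthly_endpoint_cost(5, 0.44)
    # Endpoint started and deleted each working day (8 hours x 22 days).
    work_hours = monthly_endpoint_cost(5, 0.44, hours=8 * 22)
    print(f"always on:          ${always_on:,.2f}")
    print(f"working hours only: ${work_hours:,.2f}")
```

Under these assumptions the always-on figure lands in the same range as the monthly cost quoted in the list above. Deleting the endpoint outside working hours cuts the bill considerably, at the price of recreating it every day.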

Local development using the ETL library

AWS has recently released the AWS Glue libraries, which can be used to set up a local development environment. This also makes it possible to integrate Glue ETL jobs with the Maven build system for building and testing.

ETL development can be done using a Zeppelin server, or with an IDE such as PyCharm (Professional 2019.3) or Visual Studio Code. We write our ETL scripts in PySpark, so we use PyCharm to develop them. PyCharm lets us run and debug the job scripts locally. It also lets us debug issues remotely.

Pros

  • Easy to use, with faster development and testing
  • Cheaper
  • Unit testing can be done
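As an illustration of the unit testing point, one possible layout is to keep the row-level logic in a plain Python function and confine the Glue wiring to a separate entry point. The sketch below is not our production code: `normalize_record`, `run_job`, and the `email` field are hypothetical names chosen for the example. The Glue imports sit inside `run_job` so that the tests run even on a machine without the Glue libraries installed.

```python
import unittest


def normalize_record(record):
    """Hypothetical row-level transformation.

    Drops keys whose value is None and trims/lower-cases the `email`
    field. The record is modified in place and returned.
    """
    for key in [k for k, v in record.items() if v is None]:
        del record[key]
    if "email" in record:
        record["email"] = record["email"].strip().lower()
    return record


def run_job(database, table_name, output_path):
    """Glue wiring (not exercised by the unit tests below).

    All arguments are placeholders to be supplied by the caller.
    """
    # Imported here so the module loads without pyspark/awsglue installed.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import Map

    glue_context = GlueContext(SparkContext.getOrCreate())
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name)
    # Apply the plain Python function to every record.
    cleaned = Map.apply(frame=source, f=normalize_record)
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet")


class NormalizeRecordTest(unittest.TestCase):
    def test_email_is_trimmed_and_lowercased(self):
        result = normalize_record({"email": "  A@B.COM "})
        self.assertEqual(result, {"email": "a@b.com"})

    def test_none_values_are_dropped(self):
        result = normalize_record({"id": 1, "name": None})
        self.assertEqual(result, {"id": 1})
```

Saved as a module, the tests can be run with `python -m unittest` from PyCharm or the command line. Only `run_job` needs the local Glue libraries or a real Glue environment.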

For more information on the steps to set up and run Glue jobs locally:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

In the next post, we will explain the steps required to set up PyCharm with the Glue ETL library for local debugging.

We’re a proud AWS partner. Read all about our AWS practices


We aim to be the best & trusted partner for enterprise cloud adoption in the markets we operate.