Transform Your Data Like a Pro with AWS Glue, Serverless Framework — Part 2

Jagveer Singh
EXSQ Engineering Hub
6 min read · Feb 27, 2023

In this next section, we’ll dive into creating a Glue job using the serverless framework. Our goal is to convert the CSV file we uploaded to our S3 bucket into a Parquet file format, which is better optimized for analytical queries. By using the serverless framework, we can easily create, deploy, and manage our Glue job with minimal overhead. Follow along with the steps in this post to learn how to efficiently convert your data from CSV to Parquet using Glue.

Now that you’ve created tables in the Glue Catalog in part 1 of this blog, it’s time to move on to the next step of creating a Glue Job.

Creating AWS Glue Job using Serverless Framework

To create the Glue Job, you’ll be using the Serverless Framework. The Serverless Framework is a popular open-source framework that helps you easily build and deploy serverless applications on AWS.

If you’re not already familiar with the Serverless Framework, you can learn more about it in the official documentation.

To start building your Glue Job with the Serverless Framework, you’ll need to first install the Serverless Framework using npm. Open up a terminal window and enter the following command to install the Serverless Framework globally on your system:

npm install -g serverless
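
If the installation succeeds, you can confirm the CLI is available by checking its version:

serverless --version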

Once you have the Serverless Framework installed, it’s time to create a new Serverless project. To do this, enter the following command:

serverless

This will prompt you to choose what type of project you want to create. For this demo, we’ll be creating a project that uses Node.js and AWS. Choose the first option:

AWS - Node.js - Starter

After you’ve selected your project type, you’ll be asked to name your project. For this demo, we’ll name it “glue-serverless-demo”. Enter this name when prompted.

Once you’ve named your project, the Serverless Framework will create a new project in a directory with the same name. In the terminal, navigate to this directory and open it in your favorite code editor.
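
Assuming you kept the project name from the previous step, that looks like:

cd glue-serverless-demo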

Once inside the project directory, you’ll need to initialize the project as a Node.js project. To do this, enter the following command:

npm init

This will prompt you to answer a series of questions, including the project name, version number, and a brief description of the project. You can choose to either answer these questions or simply press “Enter” to accept the default values.
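
If you'd rather skip the prompts entirely, npm accepts a -y flag that applies all of the default answers:

npm init -y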

Open the serverless.yml file and you will see the following contents:

service: glue-serverless-demo
frameworkVersion: '3'

provider:
  name: aws
  runtime: nodejs18.x

functions:
  function1:
    handler: index.handler

Now, we need to install some npm packages for the project, specifically: serverless-glue, serverless-prune-plugin, and serverless-python-requirements. You can install all three packages at once by running the following command in your terminal:

npm install --save-dev serverless-glue serverless-prune-plugin serverless-python-requirements

This will add these packages to your project’s package.json file and install them in your node_modules directory.

Next, create a folder called “jobs” inside your project directory, and create a new Python file inside it called “demo_etl_job.py”. Then, replace the contents of the “serverless.yml” file in your project with the following code:

service: glue-serverless-demo
frameworkVersion: '3'

provider:
  name: aws
  region: us-east-1

custom:
  prune:
    automatic: true
    number: 5
  pythonRequirements:
    dockerizePip: non-linux

Glue:
  bucketDeploy: <bucket-name> # Required
  createBucket: true
  s3Prefix: glue_resource/etl_scripts/
  jobs:
    - name: demo_etl_job
      scriptPath: jobs/demo_etl_job.py
      type: spark
      glueVersion: python3-4.0
      role: arn:aws:iam::017102671753:role/service-role/AWSGlueServiceRole-glue-demo
      MaxCapacity: 1
      MaxRetries: 0

plugins:
  - serverless-glue
  - serverless-prune-plugin

package:
  include:
    - "jobs/**"
  exclude:
    - "node_modules/**"

This configuration file sets up the Serverless Framework to deploy the Glue job you’ll create in the next step. The bucketDeploy property is required: replace <bucket-name> with the name of the S3 bucket the plugin should upload the job script to (with createBucket: true, the plugin will create the bucket if it doesn’t exist), while s3Prefix controls the key prefix the script is stored under. You’ll also need to replace the role value with the ARN of the IAM role you created earlier. The scriptPath property should match the name and location of the Python file you created. Finally, note that the include property packages everything under the "jobs" folder, and the exclude property keeps the "node_modules" folder out of the deployment package.
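
Before deploying, it can be useful to sanity-check that the YAML is valid and that any variables resolve the way you expect. The Serverless Framework can print the fully resolved configuration:

serverless print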

In this Glue job, we are using Spark as the job type. Spark is an open-source distributed computing system used for processing large data sets. It is designed to perform data processing tasks in-memory, making it very fast and efficient for both batch and real-time data processing. Spark is a popular tool for data transformation, machine learning, and other data-intensive tasks.

In the context of AWS Glue, a serverless data processing service from AWS, Spark is one of the job types available for Glue Jobs and is used to perform the ETL (Extract, Transform, Load) operations on the data in the Glue Catalog. A Spark job can target different Glue versions (such as Glue 2.0, 3.0, or 4.0, each paired with a Python version) and can be tuned with settings such as the job type, role, capacity, and retries.

For this specific Glue job, we are running a Python 3 Spark job on Glue version 4.0, indicated by the glueVersion: python3-4.0 parameter. The role parameter specifies the AWS Identity and Access Management (IAM) role that AWS Glue assumes to run the job. The MaxCapacity parameter specifies the maximum number of AWS Glue data processing units (DPUs) that can be allocated to the job. Finally, the MaxRetries parameter specifies the maximum number of times AWS Glue retries the job if it fails.

For more details on the serverless-glue npm package, please refer to the following link: https://github.com/toryas/serverless-glue.

In the demo_etl_job.py file inside the jobs folder, the following code can be added to perform data transformation on the input data and save the transformed data to an S3 bucket:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job


def transform_and_save(database_name, table_name):
    # Create the Spark and Glue contexts
    sc = SparkContext()
    gluecontext = GlueContext(sc)
    spark = gluecontext.spark_session
    job = Job(gluecontext)
    print("Job Execution Started...")

    # Create a Glue DynamicFrame from the catalog table defined in part 1
    input_dyf = gluecontext.create_dynamic_frame.from_catalog(
        database=database_name, table_name=table_name
    )

    # Write the frame back to S3 in Parquet format
    s3_output_path = "s3://glue-serverless-demo/output-data/"
    gluecontext.write_dynamic_frame.from_options(
        frame=input_dyf,
        connection_type="s3",
        connection_options={"path": s3_output_path},
        format="parquet",
    )
    job.commit()


transform_and_save("glue-serverless-demo-db", "input_data")

This code creates a SparkContext and a GlueContext, reads the input data from the Glue Data Catalog as a DynamicFrame using the database_name and table_name parameters, and writes it to S3 in Parquet format at the specified s3_output_path using write_dynamic_frame.from_options. Once the write has finished, the job is committed with job.commit(), and the final line invokes the function with the database and table created in part 1.

To deploy your Glue job using the Serverless Framework, make sure that you have already set up your AWS credentials in the terminal. Then, navigate to the root directory of your Serverless project and run the following command in the terminal:

serverless deploy

This will deploy your Glue job to your AWS account according to the configuration in your serverless.yml file. Don't forget to run serverless remove when you no longer need your deployment.
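
Once the deployment finishes, you can confirm that the job was created, either in the AWS Glue console or with the AWS CLI (assuming your credentials and region are configured):

aws glue get-job --job-name demo_etl_job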

After deployment, you can see your Glue job listed in the AWS Glue console under ETL jobs.

Click on the job name to open the job details page. From there, you can click on the “Actions” dropdown menu and select “Run job” to initiate a job run.
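
If you prefer the command line over the console, you can start a run with the AWS CLI and then check its status; the job name below matches the one defined in serverless.yml:

aws glue start-job-run --job-name demo_etl_job
aws glue get-job-runs --job-name demo_etl_job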

During the job run, Glue will process the data and transform it according to the ETL job you defined in the code. The output data will be stored in the specified output location, which in this case is the “output-data” folder inside your S3 bucket.

Once the job completes successfully, you can navigate to your S3 bucket and check the contents of the “output-data” folder to see the transformed data.

Once the run finishes, the job will show a Succeeded status in the console.

After the Glue job has finished running, you can verify that the data has been transformed and saved by navigating to the S3 bucket where the output is stored. Go to the ‘output-data’ folder and you should see the transformed data saved in Parquet format.
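
You can also list the output from the command line; the bucket and prefix below are the ones hard-coded in the demo script, so adjust them if you changed s3_output_path:

aws s3 ls s3://glue-serverless-demo/output-data/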
