Apache Hudi on AWS Glue: A Step-by-Step Guide

Dev Jain
8 min read · Aug 3, 2023


Introduction

As data volumes grow exponentially, organizations seek efficient ways to manage and analyze their data lakes. Apache Hudi (Hadoop Upserts Deletes and Incrementals) has emerged as a powerful solution to simplify incremental data processing and enable real-time analytics on large datasets. This article explores how to set up Apache Hudi on AWS Glue, a fully managed extract, transform, load (ETL) service, to streamline your data lake management and enhance data processing capabilities.

Prerequisites

  • AWS account: Ensure that you have an active AWS account with appropriate permissions to access and create AWS Glue resources.
  • S3 Bucket: Create an S3 bucket where your data will be stored. Take note of the bucket name and region.
  • AWS Glue Interactive Notebook: Set up an AWS Glue interactive notebook in your preferred AWS region.
  • Amazon Athena: Amazon Athena is an interactive query service that allows you to analyze Hudi table data directly in S3 using standard SQL. Ensure you have access to Amazon Athena in the same AWS region as your data.

Note: For this demo we are going to use Glue 3.0, which supports Spark 3.1, Scala 2, and Python 3.

Process flow diagram of Apache Hudi in AWS

Now it’s time to get hands-on with some code in the AWS Glue notebook.

%etl
%session_id_prefix native-hudi-sql-
%glue_version 3.0
%idle_timeout 60
%worker_type G.2X
%number_of_workers 20
%connections demo_mesh_db
%%configure
{
  "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
  "--datalake-formats": "hudi",
  "--enable-metrics": "true",
  "--enable-spark-ui": "true",
  "--enable-job-insights": "true",
  "--enable-glue-datacatalog": "true",
  "--enable-continuous-cloudwatch-log": "true",
  "--job-bookmark-option": "job-bookmark-disable",
  "--spark.dynamicAllocation.enabled": "true"
}

This configuration sets up an interactive session for running ETL (extract, transform, load) jobs with Apache Spark on AWS Glue 3.0. Let’s break down the config step by step:

  1. %etl: This is a magic command used in some notebook environments to indicate that the following configuration is meant for an ETL job.
  2. %session_id_prefix native-hudi-sql-: This specifies the prefix for the session ID. The session ID uniquely identifies the Glue interactive session that runs the ETL code.
  3. %glue_version 3.0: Specifies the version of AWS Glue to be used (in this case, version 3.0).
  4. %idle_timeout 60: Sets the idle timeout in minutes for the interactive session. If there is no activity for the specified time (in this case, 60 minutes), the session is stopped.
  5. %worker_type G.2X: Specifies the worker type for the session. G.2X workers each provide 8 vCPUs and 32 GB of memory.
  6. %number_of_workers 20: Sets the number of worker nodes for the Glue ETL job. The job will run with 20 worker nodes.
  7. %connections demo_mesh_db: This specifies the connection name for the data source or data destination. The Glue job will use the specified connection to access the data.
  8. %%configure: This is a special command to configure the Spark environment for the Glue ETL job. The configuration options are provided in JSON format.

Now let’s look at the configuration options provided in the %%configure section:

  • "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension": This line sets three Spark configuration options:
  • spark.serializer=org.apache.spark.serializer.KryoSerializer: Specifies the serializer to be used by Spark for efficient data serialization.
  • spark.sql.hive.convertMetastoreParquet=false: Tells Spark not to use its built-in Parquet reader for Hive metastore tables, so that Hudi’s custom input format is respected when reading Hudi tables registered in the metastore.
  • spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension: Enables the Hoodie Spark Session Extension, which adds Hudi-specific capabilities to Spark SQL.
  • "--datalake-formats": "hudi": Specifies the data lake formats to be used. In this case, it's set to "hudi", which indicates that Hudi (Hadoop Upserts anD Incrementals) is the primary data format for the ETL job.
  • "--enable-metrics": "true": Enables job metrics, allowing you to monitor various metrics related to the ETL job's performance and resource utilization.
  • "--enable-spark-ui": "true": Enables the Spark UI, which provides a web-based user interface to monitor the Spark application's progress and performance.
  • "--enable-job-insights": "true": Enables job insights, which provide detailed information and recommendations to optimize the Glue ETL job.
  • "--enable-glue-datacatalog": "true": Enables integration with the AWS Glue Data Catalog, which is a central metadata repository for storing and managing metadata for various data sources.
  • "--enable-continuous-cloudwatch-log": "true": Enables continuous logging to CloudWatch, allowing you to monitor and analyze logs from the ETL job.
  • "--job-bookmark-option": "job-bookmark-disable": Disables job bookmarks, which are used to track the state of the ETL job for incremental data processing.
  • "--spark.dynamicAllocation.enabled":"true": Enables dynamic resource allocation for Spark, allowing it to adjust the number of executors based on the workload.

Overall, this configuration sets up a Glue ETL job that uses Spark with specific settings and enables various features like Hudi data format, job metrics, Spark UI, job insights, Glue Data Catalog integration, CloudWatch logging, and dynamic resource allocation. The job is intended to process data stored in Hudi format and connect to the data source or destination through the specified connection name.
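As a side note, the same three Spark settings from the "--conf" string can also be applied programmatically, for example when prototyping the job logic in a plain PySpark session outside Glue. The sketch below is only a local-testing aid under assumptions (it presumes the Hudi Spark bundle jar is already on the classpath) and is not part of the Glue notebook flow.

from pyspark.sql import SparkSession

# Local prototyping sketch: mirror the Glue session's Spark configuration.
# Assumes the Hudi Spark bundle jar is available on the local classpath.
spark_local = (
    SparkSession.builder
    .appName("hudi-local-prototype")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

Back in the Glue notebook, the first code cell sets up the Glue and Spark contexts: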

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

In the above code, we set up an AWS Glue job using PySpark, which lets us work with AWS Glue’s data processing capabilities. We import the essential modules from the awsglue package along with the PySpark-related modules, obtain a SparkContext, and wrap it in a GlueContext. We then create a Spark session (spark) by accessing the spark_session attribute of the GlueContext. Finally, we create a Glue Job (job) using the GlueContext; the Job object represents the ETL job executed by AWS Glue.
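One aside: when the same logic runs as a scheduled Glue job script rather than an interactive notebook, the job is usually initialized with the JOB_NAME argument passed by Glue and committed at the end, roughly like this:

# Sketch for a scripted (non-notebook) Glue job; JOB_NAME is supplied by Glue at runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job.init(args["JOB_NAME"], args)

# ... ETL logic goes here ...

job.commit()  # signals successful completion (and advances bookmarks, if enabled)

Back in the notebook flow, we read the source data: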

data = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<SOURCE-BUCKET-NAME>/<FOLDER-NAME>/"]
    },
    format="parquet",
    format_options={
        "withHeader": True,
        "optimizePerformance": True
    }
)

Here we read the data from our source bucket. This sets up a DynamicFrame in AWS Glue that reads Parquet files from the specified S3 path. Because Parquet is a self-describing format, the schema comes from the files themselves; the withHeader and optimizePerformance format options are primarily relevant for CSV sources.
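If the source data set is already registered in the Glue Data Catalog, an alternative (sketched here with hypothetical database and table names) is to read it from the catalog instead of raw S3 paths:

# Hypothetical catalog-based read; "source_db" and "source_table" are placeholder names.
data = glueContext.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="source_table"
)

Either way, the next step converts the DynamicFrame into a Spark DataFrame: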

data = data.toDF()

toDF(): This is a method available in AWS Glue dynamic frames, which converts the dynamic frame into a Spark DataFrame. The toDF() method essentially maps the dynamic frame's schema to a Spark DataFrame schema.
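At this point it can be worth running a quick sanity check on the DataFrame before writing it to Hudi, for example:

# Optional checks before the Hudi write
data.printSchema()             # confirm column names and types
data.show(5, truncate=False)   # preview a few rows
print(data.count())            # row count (note: triggers a full scan)

Next, we define the target database, table, and storage location: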

database_name = "<DATABASE-NAME>"
table_name = "<TABLE-NAME>"
table_location = "s3://<TARGET-BUCKET>/<FOLDER-NAME>/"

Once we have a Spark DataFrame, it's time to configure the Hudi table properties used to ingest the data. We provide the database_name, the table_name, and, most importantly, the table_location where we want to store the Hudi data.
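Hive sync (enabled below) expects the target database to exist in the Glue Data Catalog. As an optional safeguard, a sketch like the following, using boto3 (available in Glue jobs by default), creates the database if it is missing; treat it as an assumption rather than a required step:

import boto3
from botocore.exceptions import ClientError

# Create the Glue database if it does not exist yet.
glue_client = boto3.client("glue")
try:
    glue_client.get_database(Name=database_name)
except ClientError as e:
    if e.response["Error"]["Code"] == "EntityNotFoundException":
        glue_client.create_database(DatabaseInput={"Name": database_name})
    else:
        raise

With the target database in place, we define the Hudi write options: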

hudi_options = {
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': '<PRIMARY-KEY-COLUMN-NAME>',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.partitionpath.field': '<PARTITION-COLUMN-NAME>',
    'hoodie.datasource.write.precombine.field': '<PRECOMBINE-FIELD-NAME>',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hudi.metadata-listing-enabled': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'path': table_location,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.table.cdc.enabled': 'true',
    'hoodie.table.cdc.supplemental.logging.mode': 'data_before_after',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

Hudi properties in the hudi_options dictionary:

  1. hoodie.datasource.write.storage.type: The storage type for writing data to the Hudi table. In this case, it's set to 'COPY_ON_WRITE', which means Hudi uses Copy-On-Write storage: each write rewrites new versions of the affected data files.
  2. hoodie.datasource.write.recordkey.field: The field used as the record key for identifying unique records during upsert operations. Here it's the '<PRIMARY-KEY-COLUMN-NAME>' placeholder; replace it with a column whose value is unique per record (for example, an id column), since it determines which record to update during an upsert operation.
  3. hoodie.datasource.write.table.name: The name of the table where data will be written. It should be set to the desired name for the target table.
  4. hoodie.datasource.write.operation: The write operation type. It's set to 'upsert', meaning that data will be updated if it already exists or inserted if it's a new record. 'upsert' is a common operation for handling both inserts and updates in a single write.
  5. hoodie.datasource.write.partitionpath.field: The field that determines the partition path for the data in the Hudi table. Here it's the '<PARTITION-COLUMN-NAME>' placeholder (for example, a batch_name column). Partitioning is useful for organizing data within the table and can improve query performance.
  6. hoodie.datasource.write.precombine.field: The field used for pre-combining records with the same record key during upsert operations. Here it's the '<PRECOMBINE-FIELD-NAME>' placeholder (for example, a last_updated_at timestamp). Pre-combining picks the latest version of a record when multiple versions with the same record key exist.
  7. hoodie.datasource.write.hive_style_partitioning: Specifies whether to use Hive-style partitioning. It's set to 'true', meaning Hive-style partitioning is enabled. Hive-style partitioning means that data will be physically stored in directories based on the partition keys, making it more efficient for querying data based on those keys.
  8. hudi.metadata-listing-enabled: Enables listing metadata for Hudi tables. It's set to 'true'. When enabled, it allows for listing and accessing metadata information about the table.
  9. hoodie.upsert.shuffle.parallelism: The level of parallelism for upsert operations. It's set to 2. This property determines the number of parallel tasks that Hudi will use for upsert operations, potentially improving performance for large datasets.
  10. hoodie.insert.shuffle.parallelism: The level of parallelism for insert operations. It's set to 2. This property determines the number of parallel tasks that Hudi will use for insert operations, potentially improving performance for large datasets.
  11. path: The location where the Hudi table is stored. It's set to an S3 bucket location with the folder path specified.
  12. hoodie.datasource.hive_sync.enable: Enables syncing data between Hudi and Hive. It's set to 'true'. When enabled, changes made to the Hudi table will be automatically synced to the corresponding Hive table.
  13. hoodie.datasource.hive_sync.database: The name of the database in Hive where the data will be synced. It should be set to the desired Hive database name.
  14. hoodie.datasource.hive_sync.table: The name of the Hive table where the data will be synced. It should be set to the desired Hive table name.
  15. hoodie.datasource.hive_sync.partition_extractor_class: The class used to extract partition keys during Hive sync. It uses the class 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. This class is responsible for extracting the partition keys from Hudi records and populating the corresponding Hive partitions.
  16. hoodie.table.cdc.enabled: Enables Change Data Capture (CDC) for the Hudi table. It's set to 'true'. When CDC is enabled, Hudi captures and logs changes to the data, which can be useful for data synchronization and incremental processing.
  17. hoodie.table.cdc.supplemental.logging.mode: Specifies the CDC supplemental logging mode. It's set to 'data_before_after', meaning both old and new values are logged. This mode logs the old and new values of changed records, providing more information for CDC processing.
  18. hoodie.datasource.hive_sync.use_jdbc: Specifies whether to use a JDBC connection for Hive sync. It's set to 'false', so Hudi syncs through the Hive Metastore client instead (in Glue, this is backed by the AWS Glue Data Catalog).
  19. hoodie.datasource.hive_sync.mode: Specifies the mode for Hive sync. It's set to 'hms', meaning the sync will be done using the Hive Metastore service. The HMS mode is one of the options available for syncing Hudi data with Hive.

These properties configure various aspects of Hudi behavior, such as data storage, write operations, partitioning, metadata handling, and Hive synchronization. The specific values you choose for these properties depend on your use case and requirements.
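Before writing, a small sanity check can confirm that the columns referenced in hudi_options actually exist in the DataFrame. The column names below (id, batch_name, last_updated_at) are purely illustrative placeholders, not values from the configuration above:

# Illustrative check: replace the names with your actual key, partition, and precombine columns.
required_columns = {"id", "batch_name", "last_updated_at"}
missing = required_columns - set(data.columns)
if missing:
    raise ValueError(f"Columns referenced in hudi_options are missing: {missing}")

With the options in place, the write itself is a single line: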

data.write.format("hudi").options(**hudi_options).mode("append").save()

Now we finally commit our Hudi transaction by providing all the necessary configuration and data. The mode("append") specifies that the data will be appended to an existing table if it exists, or a new table will be created otherwise. Finally, the save() function executes the write operation and stores the data in the Hudi table, allowing for efficient upserts and incremental processing on big data sets.
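To confirm the write from within the notebook, one option is to read the table back with the Hudi datasource and inspect the metadata columns Hudi adds to every record, along the lines of:

# Read the Hudi table back directly from its S3 location.
hudi_df = spark.read.format("hudi").load(table_location)
hudi_df.select("_hoodie_commit_time", "_hoodie_record_key").show(5, truncate=False)
print(hudi_df.count())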

Once the data has been successfully written to the Hudi table in S3, you can verify the contents of the table using Amazon Athena. Amazon Athena is an interactive query service that allows you to analyze data directly from S3 using standard SQL queries.
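Because Hive sync is enabled, the table is also registered in the Glue Data Catalog, so a quick check can be run from the notebook with Spark SQL before switching to Athena; the equivalent SELECT can then be run in the Athena console. This is a sketch using the placeholders defined earlier:

# Query the synced catalog table; the same SQL also works in Athena.
spark.sql(f"SELECT * FROM {database_name}.{table_name} LIMIT 10").show(truncate=False)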

