Data Engineering: Bootstrapping a Data Lake with Apache Hudi

Krishna Prasad
5 min read · Jan 20, 2024

In an era where data reigns supreme, handling diverse data types at petabyte scale poses a formidable challenge. In this post, I will bootstrap a unified data lake that addresses the following:

  1. Data volume growing into petabytes.
  2. The variety of data.
  3. Accessibility from various sources and for various stakeholders.
  4. Update and delete events on already ingested data.

Prerequisites:

  1. Knowledge of Apache Spark.
  2. Scala/Python.
  3. Familiarity with data lake terminology and ETL.
  4. Basic knowledge of Apache Hudi.

Utilizing the open-source Apache Hudi, one can effortlessly construct a data lake capable of maintaining ACID properties on extensive datasets, scaling seamlessly to petabyte sizes without the need for intricate infrastructure management, such as large database clusters or servers. This data lake is primarily designed for Online Analytical Processing (OLAP) purposes, catering to downstream users including Data Scientists, Data Analysts, and Business Intelligence (BI) teams.

Here, I will demonstrate building a unified data lake with Apache Hudi as the table engine, utilizing the serverless technology of AWS Glue for running the ETL pipeline. The simple design flow is as follows:

Extract (E)

The data source may take the form of CSV, JSON, MySQL, AWS Aurora, MongoDB, or DynamoDB; there are no specific constraints. Our main focus lies solely in the extraction of this data: once we understand the extraction process and successfully land the data in an Apache Spark DataFrame, the task is accomplished. Here are a few guidelines on the data extraction process:

  1. As the size of the source data is unknown, it is advisable to read incrementally in batches and to maintain checkpoints that track the point up to which the data has already been read.
  2. AWS Glue provides job bookmarks, which control how Glue handles state across job runs: it can track previously processed data and update the state (Enable), process data without updating the bookmark state (Pause), or ignore the state entirely (Disable).
  3. Spark Structured Streaming (`spark.readStream`) can read data in micro-batches; see the Apache Spark documentation for details, and the sketch below for a minimal example.
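
To make the micro-batch idea concrete, here is a minimal sketch of an incremental CSV read with Spark Structured Streaming. The bucket names, schema, and paths are assumptions for illustration; progress is tracked in the checkpoint location, so each run picks up only files that have not been processed yet.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("incremental-extract").getOrCreate()

// Hypothetical source and checkpoint locations.
val sourcePath     = "s3://my-raw-bucket/orders/"
val checkpointPath = "s3://my-raw-bucket/checkpoints/orders/"

// Streaming file sources need an explicit schema.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("salary", DoubleType),
  StructField("updated_at", TimestampType)
))

// Read only the files that have not been processed yet.
val incremental = spark.readStream
  .schema(schema)
  .option("header", "true")
  .csv(sourcePath)

// Process everything that is pending as micro-batches, then stop.
incremental.writeStream
  .trigger(Trigger.AvailableNow())
  .option("checkpointLocation", checkpointPath)
  .format("parquet")
  .option("path", "s3://my-staging-bucket/orders/")
  .start()
  .awaitTermination()

In a Hudi pipeline you would typically swap the parquet sink for foreachBatch and hand each micro-batch DataFrame to the Hudi writer shown later in this post.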

Transformation (T)

With Apache Spark you can transform your DataFrame in ways that range from complex transformations down to something as simple as casting columns or deriving new columns from existing ones, e.g.:

  1. Change a column's data type with withColumn:
df.withColumn("salary", col("salary").cast("Double"))

  2. Derive a new column from an existing one with withColumn:

df.withColumn("newColumn", col("salary"))
df = df.withColumn(PartitionLabel, date_format(col(partitionColumn), partitionFormat))
# Assign the partition column value to a new column using the defined partition format.
# This transformation has a big impact on your storage and retrieval,
# as well as your ingestion cost on AWS Glue.

You can perform a wide variety of transformations with Spark's withColumn.

An essential aspect of this straightforward transformation that I'd like to highlight: take a single column of type DATE, DATETIME, or TIMESTAMP from the existing DataFrame and derive a partition column from it. This column will act as the partition path field of your Apache Hudi table, i.e. `hoodie.datasource.write.partitionpath.field`. The approach offers the following benefits, illustrated with a short sketch after the list:

  1. Read scalability. AWS S3 scales read throughput per bucket prefix, so it is now within our control to decide how to partition the data, whether by yyyy, yyyy-MM, or yyyy-MM-dd. This not only lets you manage read throughput effectively but also saves on S3 data-retrieval costs when querying via AWS Athena, Presto, etc.
df = df.withColumn(PartitionLabel, date_format(col(partitionColumn), partitionFormat))

  2. Proper partitioning of your data can help to manage petabytes of data efficiently in a single table.
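
As a small illustration of how the choice of partition format shapes the S3 prefixes, here is a sketch with assumed names; `partitionColumn`, `PartitionLabel`, and `partitionFormat` are placeholders you would set for your own table, and `df` is the DataFrame produced by the Extract step.

import org.apache.spark.sql.functions.{col, date_format}

// Placeholders for illustration.
val partitionColumn = "updated_at"      // a DATE/TIMESTAMP column in the source
val PartitionLabel  = "partition_label" // becomes the Hudi partition path field
val partitionFormat = "yyyy-MM"         // or "yyyy", "yyyy-MM-dd", "yyyy-MM-dd-HH"

// e.g. updated_at = 2024-01-20 10:15:00 -> partition_label = "2024-01", so records
// land under s3://<bucket>/<table>/partition_label=2024-01/ with Hive-style partitioning.
val partitioned = df.withColumn(PartitionLabel, date_format(col(partitionColumn), partitionFormat))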

Load (L)

Now, loading into Apache Hudi on AWS S3.

Credit goes to the Apache Hudi open-source community for offering a simplified way to write data into cloud storage, be it AWS S3 or Google Cloud Storage (GCS), with minimal code. See the example below:

val options = Map(
  "hoodie.table.name" -> destinationTableName,
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE", // or "MERGE_ON_READ"
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> recordkey.toLowerCase,
  "hoodie.datasource.write.precombine.field" -> recordkey.toLowerCase,
  "hoodie.datasource.write.partitionpath.field" -> PartitionLabel,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.database" -> destinationDatabaseName,
  "hoodie.datasource.hive_sync.table" -> destinationTableName,
  "hoodie.datasource.hive_sync.partition_fields" -> PartitionLabel,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.mode" -> "hms"
)

import org.apache.spark.sql.SaveMode

// Initial bootstrap load; see the note after this block on switching to
// SaveMode.Append for subsequent incremental runs.
df.write.format("hudi")
  .options(options)
  .mode(SaveMode.Overwrite)
  .save(destinationS3Path)
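
One nuance worth calling out: SaveMode.Overwrite recreates the table, which is fine for the initial bootstrap. For subsequent incremental runs you would typically switch to SaveMode.Append so that the upsert operation merges new and changed records into the existing table. A sketch, where `incrementalDf` stands for the newly extracted batch and `options` is the same map as above:

// Incremental runs: append upserts into the existing Hudi table.
incrementalDf.write.format("hudi")
  .options(options)
  .mode(SaveMode.Append)
  .save(destinationS3Path)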

The key challenge is to identify a compatible Apache Hudi JAR that aligns with both the Apache Spark core and SQL libraries. I experimented with a few libraries and successfully established a basic code setup in AWS Glue using a pre-built bundle of the Apache Hudi + Spark libraries. The dependent bundle library is:

hudi-spark3.3-bundle_2.12-0.14.1.jar

And my AWS Glue setup needs nothing more than the Dependent JARs path pointing at this bundle; that's the beauty of open source.

Visit my GitHub account for the simple AWS Glue code that reads from CSV, transforms the data to add the partition column field for Apache Hudi, ingests it into S3 with Apache Hudi 0.14.0, and enables the AWS Glue Data Catalog for data visibility to end users. It is the minimum code required to bootstrap the data lake, and it can be replicated for any source, transactional or NoSQL, with only the Extract part needing changes.
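
For orientation, here is a rough skeleton of what such a Glue (Scala) job can look like. This is my own sketch rather than the repository code; the bucket paths, column names, partition format, and trimmed options map are placeholders.

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, date_format}
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    val glueContext = new GlueContext(new SparkContext())
    val spark = glueContext.getSparkSession
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Extract: read the raw CSV files (placeholder path).
    val df = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("s3://my-raw-bucket/orders/")

    // Transform: derive the Hudi partition path column (placeholder names/format).
    val partitioned = df.withColumn("partition_label",
      date_format(col("updated_at"), "yyyy-MM"))

    // Load: write to S3 as a Hudi table (trimmed options; see the full map above).
    val options = Map(
      "hoodie.table.name" -> "orders",
      "hoodie.datasource.write.recordkey.field" -> "id",
      "hoodie.datasource.write.precombine.field" -> "updated_at",
      "hoodie.datasource.write.partitionpath.field" -> "partition_label",
      "hoodie.datasource.write.operation" -> "upsert"
    )
    partitioned.write.format("hudi")
      .options(options)
      .mode(SaveMode.Overwrite)
      .save("s3://my-datalake-bucket/orders/")

    Job.commit()
  }
}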

Following the steps outlined above and executing them in the AWS Glue job, you will have your data in S3 stored in the Apache Hudi table format, and it will be accessible through the Glue Data Catalog, allowing end users to query from services like Athena, Presto, and others.
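
As a quick sanity check from Spark itself (outside Athena), a snapshot read of the table looks like the sketch below; `destinationS3Path` is the path used in the write above, and the `partition_label` column with the value "2024-01" is just an example filter.

import org.apache.spark.sql.functions.col

// Snapshot query of the Hudi table directly from its S3 path.
val snapshot = spark.read.format("hudi").load(destinationS3Path)
snapshot.where(col("partition_label") === "2024-01").show(10)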

You can then access the table from AWS Athena, for example.

Here, I'd like to highlight the primary inspiration behind this write-up: showcasing how a straightforward transformation can distribute data across multiple S3 prefixes, addressing significant scalability challenges and minimizing data-retrieval costs on AWS S3. The partition format is configurable to yyyy, yyyy-MM, or yyyy-MM-dd, and even yyyy-MM-dd-HH is possible if your input data carries values at that granularity.

Woohoo!!! We have completed bootstrapping a data lake on AWS with minimal code.

Follow me on LinkedIn: https://www.linkedin.com/in/krishna-prasad-b8a86223/


Krishna Prasad

Data Engineer in a FinTech company, Skills: Apache Spark, Apache Hudi, Delta Lake, Scala, Core Java. Email: kprasad.iitd@gmail.com