AWS: Hudi: Upgrade to 0.12.2 from 0.10.1 ( EMR on EKS )

Life-is-short--so--enjoy-it
6 min read · Jan 30, 2024

I upgraded Apache Hudi from 0.10.1 to 0.12.2.

Intro

I upgraded Apache Hudi so that I could use its enhanced feature for exporting Hudi Table data to a different AWS S3 bucket.

I confirmed that Apache Hudi v0.14.1 has the feature, so I first decided to upgrade to v0.14.1. ( Later, I decided to stay on v0.12.2 instead. )

Surprisingly, up to v0.12.1 it wasn’t possible to export Hudi Table data from one AWS S3 bucket to another bucket. As of v0.12.2, it is possible.

In this post, I document what I had to consider when upgrading the Hudi version in my environment.

Is it possible to upgrade directly from v0.10.1 to v0.14.0?

NOPE. Unfortunately, it is NOT possible to jump from v0.10.1 to v0.14.0 in one shot. When I tried it, Hudi failed with an error related to “Table Version”.

In Apache Hudi, there is a notion called “Table Version”, which is stored in the “hoodie.properties” file alongside the Hudi Table data.

hoodie.table.version in hoodie.properties file in the storage
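
To make the "Table Version" lookup concrete, here is a minimal sketch that reads hoodie.table.version from the text of a hoodie.properties file. The helper name read_table_version and the sample content are hypothetical, and the parser only handles simple key=value lines, not the full Java properties syntax.

```python
def read_table_version(properties_text: str) -> int:
    """Return hoodie.table.version from the text of a hoodie.properties file.

    Simplified parser: handles plain `key=value` lines and skips blank
    lines and comment lines starting with `#` or `!`.
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "hoodie.table.version":
            return int(value.strip())
    raise KeyError("hoodie.table.version not found")


# Hypothetical file content, for illustration only.
sample = """#Properties saved by an earlier writer
hoodie.table.name=my_table
hoodie.table.version=3
"""

print(read_table_version(sample))  # prints 3
```

In practice the file content would first be downloaded from the table's storage location, which is left out here.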

Apache Hudi supports auto-migration when the “Table Version” differs by exactly +1. ( Only upgrading works this way. Downgrading requires the Hudi CLI tools. )

This means that, to use the auto-migration feature, an existing Hudi Table has to be upgraded through every Hudi version that introduces a change in “Table Version”.
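
The +1 rule described in this post can be sketched as a small planning helper. This is only an illustration: the function names are hypothetical, and the release names and Table Version numbers in the example are placeholders, not real Hudi data. Check the real values against hoodie.properties and the Hudi release notes.

```python
from __future__ import annotations


def table_version_steps(current: int, target: int) -> list[tuple[int, int]]:
    """List the single +1 Table Version hops needed to go from current to target."""
    if target < current:
        raise ValueError("downgrades are not covered by auto-migration")
    return [(version, version + 1) for version in range(current, target)]


def pick_releases(releases: dict[str, int], current: int, target: int) -> list[str]:
    """For each Table Version after `current` up to `target`, pick one release.

    `releases` maps a release name to the Table Version it writes, listed
    from oldest to newest. The last listed release per Table Version wins.
    """
    path = []
    for version in range(current + 1, target + 1):
        candidates = [name for name, tv in releases.items() if tv == version]
        if not candidates:
            raise ValueError(f"no release available for table version {version}")
        path.append(candidates[-1])
    return path


# Placeholder data (assumption): not real Hudi releases or Table Versions.
releases = {"release-a": 4, "release-b": 4, "release-c": 5, "release-d": 6}

print(table_version_steps(3, 6))
print(pick_releases(releases, current=3, target=6))
```

Picking the last release per Table Version mirrors the idea of going through as few intermediate versions as possible.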

I guessed that the simplest way to go from 0.10.1 to 0.14.0 would be to re-write the existing data with Apache Hudi 0.14.0. It might be the faster approach, but it’s very resource-intensive. ( Depending on the cost, this can still be an option. )

Upgrading to every version is time-consuming, so I thought about why I had to upgrade the version.

Basically, it’s not possible to upgrade directly from v0.10.1 to v0.14.0, and upgrading through every version is time-consuming.

So, I went back to the main reason why I needed to upgrade Hudi in the first place: I wanted to use the feature that exports a Hudi Table to another AWS S3 bucket.

So, I decided to find the lowest Hudi version that supports the feature, to reduce the time spent on version upgrades.

To test it, I had to build custom EMR Images.

Researched EMR Image, Spark, Scala, Hudi, and Hudi Table versions

I use Amazon EMR on EKS ( EMR Containers ). To create a custom EMR Image, one of the provided EMR Container Images has to be used as the Base Image.

Each EMR Container Image version ships different Application versions. I researched them so that I could build the custom EMR Image with the required ( or the best ) versions.

Here is the list of versions that I checked.

  • EMR Image version ( determines the Spark version )
  • Scala version ( determines which Hudi binaries to download from Maven )
  • Hudi version
  • Hudi Table version
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html
Table I built: available application versions on each EMR Container Image

Built and Tested custom EMR Images

After researching the application versions, I put together the version combinations that I wanted to test, as listed below.

The important part was to pick the latest Hudi release of each minor version that has a Table Version change.

The Hudi versions were:

  • Hudi 0.11.1
  • Hudi 0.12.2 ( 0.12.3 had an AWS-related bug, so 0.12.2 was picked )
  • Hudi 0.13.1
  • Hudi 0.14.0

I also chose the EMR Image version for each combination. This was mostly based on the Spark versions supported by each Apache Hudi version: I tried to pick the latest ( highest ) Spark version that each Hudi version supports.

Unfortunately, the latest supported Spark version didn’t always work well with Hudi. For example, Hudi 0.11.1 didn’t work well with Spark 3.3.2 although it’s a supported version. ( Some Spark versions even have a critical bug. )

After I built the list of version combinations, I built the custom EMR Images and ran functionality tests. I focused on finding out which Hudi version supported the feature I was looking for.

Luckily, I found that Hudi 0.12.2 had the feature I wanted.

Regression Impacts

I found the Hudi version that I wanted to upgrade to. It determined the Hudi versions I had to go through.

  • 0.11.1
  • 0.12.2

Since each version upgrade could introduce breaking changes, I went through each version’s changes and checked how they could affect my existing ETL. ( I tested and updated the ETL where necessary. )

Deploying new Hudi version to Prod

I was ready to deploy the new Hudi version with the new custom EMR Image.

I could have deployed the new version straight to Prod, but I didn’t do that.

I took a slightly safer approach that ran two duplicated pipelines with different versions, like below.

Since any kind of regression could affect all users and services, I decided to be conservative.

Setup dual ETL pipeline to deploy new version safely

Data Validation by comparing data from two duplicated pipelines

There is a rather long story behind this, but I will skip it here.

In short, I used AWS Glue to run the data validation.
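
The idea of comparing the outputs of the two pipelines can be sketched in plain Python. This is not the AWS Glue job I used: the function name diff_records, the key column "id", and the sample rows are all hypothetical. It assumes that columns starting with "_hoodie_" are Hudi metadata columns ( such as _hoodie_commit_time ) that are expected to differ between pipelines, so they are ignored.

```python
def diff_records(baseline, candidate, key, ignore_prefix="_hoodie_"):
    """Compare two lists of row dicts by record key.

    Columns whose names start with `ignore_prefix` are dropped before
    comparing, since they can legitimately differ between two pipelines.
    """
    def index(rows):
        # Map record key -> row without the ignored columns.
        return {
            row[key]: {k: v for k, v in row.items() if not k.startswith(ignore_prefix)}
            for row in rows
        }

    base, cand = index(baseline), index(candidate)
    return {
        "missing_in_candidate": sorted(base.keys() - cand.keys()),
        "extra_in_candidate": sorted(cand.keys() - base.keys()),
        "mismatched": sorted(k for k in base.keys() & cand.keys() if base[k] != cand[k]),
    }


# Hypothetical rows from the old-version and new-version pipelines.
old_rows = [
    {"id": 1, "value": "a", "_hoodie_commit_time": "20240101"},
    {"id": 2, "value": "b", "_hoodie_commit_time": "20240101"},
]
new_rows = [
    {"id": 1, "value": "a", "_hoodie_commit_time": "20240102"},
    {"id": 2, "value": "B", "_hoodie_commit_time": "20240102"},
]

print(diff_records(old_rows, new_rows, key="id"))
```

For real table sizes the same comparison would be done with a distributed engine rather than in-memory dictionaries.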

EXTRA: Source of Binary Files for Apache Hudi ( JARs )

In the custom EMR Image, I don’t include the Hudi binary files ( JARs ). I prefer to keep them outside of the custom EMR Image and store them in AWS S3.

Those binary files ( JARs ) can be downloaded from Maven ( OSS ) or taken from the EMR Base Image.

There are two core binary files needed to use Apache Hudi. The file names differ slightly between Apache Hudi versions, depending on which Apache Spark version each one supports.

  • hudi-spark3-bundle_2.12-0.11.1.jar
  • hudi-utilities-bundle_2.12-0.11.1.jar
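
The file names above follow the pattern artifact, Scala version, Hudi version. As a rough sketch, the names and their Maven Central download URLs can be built like this. The helper names are hypothetical, and the URL assumes the standard Maven repository layout with the org.apache.hudi group. The artifact name itself ( for example, which Spark-specific bundle exists ) still has to be checked per Hudi version.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def bundle_jar_name(artifact: str, scala_version: str, hudi_version: str) -> str:
    """Build a JAR file name such as hudi-spark3-bundle_2.12-0.11.1.jar."""
    return f"{artifact}_{scala_version}-{hudi_version}.jar"


def maven_url(artifact: str, scala_version: str, hudi_version: str,
              group: str = "org.apache.hudi") -> str:
    """Build the download URL using the standard Maven repository layout."""
    artifact_id = f"{artifact}_{scala_version}"
    group_path = group.replace(".", "/")
    jar = bundle_jar_name(artifact, scala_version, hudi_version)
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{hudi_version}/{jar}"


for artifact in ("hudi-spark3-bundle", "hudi-utilities-bundle"):
    print(bundle_jar_name(artifact, "2.12", "0.11.1"))
    print(maven_url(artifact, "2.12", "0.11.1"))
```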

EXTRA: EMR Image releases differ between EMR and EMR on EKS

The list of Amazon EMR Image versions can differ depending on which EMR service is used. I had assumed that the Amazon EMR Image versions were the same across all EMR services.

Amazon EMR Image release: the minor version doesn’t exist in EMR on EKS

EXTRA: As of Amazon EMR Image 6.9, Amazon ECR Public Gallery is available

The Amazon EMR Images older than 6.9 are hosted in an account-specific Amazon ECR. This requires logging in to ECR to download the Amazon EMR Images ( even when building a custom Image ).

As of Amazon EMR Image 6.9, the Amazon EMR Images are hosted in the public ECR, so the ECR login is no longer required.

docker pull public.ecr.aws/emr-on-eks/spark/emr-6.15.0:latest

Gatsby Lee | Data Engineer | City Farmer | Philosopher | Lexus GX460 Owner | Overlander