EMR on EKS: Apache Hudi Schema Automation

Syeda Marium Faheem
Published in Bazaar Engineering
Mar 20, 2023

One of the biggest obstacles organizations face after building their data platform is automation: making processes easy and empowering DataOps for smooth onboarding of new data sources. At Bazaar Technologies, we constantly strive to prioritise the demands of stakeholders, and more stakeholders mean more feature and automation needs. Engineers are one of our stakeholder groups; they are in charge of adding new data sources to our platform so that data personnel can access them for their daily needs.

Bazaar’s data platform, Buraq, operates in a plug-in format, so all effort beyond creating the schema is minimal.

Buraq is a real-time analytics platform built on EKS and open-source components. It is powered by a continuously running Hudi Streamer, which reads data from both data buckets and Kafka topics and maintains MOR (merge-on-read) tables. Running these jobs on EMR on EKS is a good solution in terms of efficiency, operability, and observability cost, but it’s time to switch gears to something more interesting. The main goal is to onboard data sources painlessly, because our data is massive; we have:

  • 1000+ Tables
  • 100+ microservices
  • 25+ Kafka topics
  • 40+ TB of data processed per hour

For more background on the platform itself, see our earlier post, Building Bazaar’s Data Platform.

Recently, Hudi introduced the concept of schema evolution, so we need to figure out how it can work in our favour, especially in the case of EMR on EKS.

Metastore metadata is usually stored in a MySQL database, so if the Amazon RDS instance (or MySQL on EC2) and the EKS cluster are not in the same VPC, Spark jobs attempting to connect to RDS will fail.

Apache Hudi Job Configuration on EMR on EKS

For the Hudi config, we need to add a couple more settings. The point to note here is the Hudi configuration in the job’s entry point.

  • Use the specified Hudi utilities bundle. It is always best to store jar files externally, on S3 or another cloud storage system. Unlike other setups, here we use hudi-utilities-slim-bundle.jar.
  • The table type is MERGE_ON_READ; you can use COPY_ON_WRITE instead, depending entirely on the use case.
  • The following configuration also stores the schema in the lakehouse: the org.apache.hudi.hive.HiveSyncTool configuration automatically detects the schema and creates it in the Hive metastore.
  • There are Hive-specific configurations that should be enabled in order to onboard schemas automatically.
  • Note that “hoodie.datasource.hive_sync.mode=jdbc” can instead be hms if your metastore is exposed through an EMR Thrift service.
  • In the spark-submit configuration, I added a few important settings for the RDS-to-EKS-cluster connection, or for a Thrift service; a sketch of the full job driver follows this list.
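
Since the original gist is not reproduced here, below is a minimal sketch of what the StartJobRun request for such a job can look like, assuming the HoodieDeltaStreamer utility (the streamer’s class name in Hudi 0.12). Everything in angle brackets (bucket names, virtual cluster ID, role ARN, table and database names, the properties file) is a placeholder, and Kafka- and schema-provider-specific settings are assumed to live in the referenced properties file:

    {
      "name": "hudi-streamer-<table>",
      "virtualClusterId": "<virtual-cluster-id>",
      "executionRoleArn": "<execution-role-arn>",
      "releaseLabel": "emr-6.9.0-latest",
      "jobDriver": {
        "sparkSubmitJobDriver": {
          "entryPoint": "s3://<artifacts-bucket>/jars/hudi-utilities-slim-bundle.jar",
          "entryPointArguments": [
            "--table-type", "MERGE_ON_READ",
            "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
            "--target-base-path", "s3://<lake-bucket>/<table>",
            "--target-table", "<table>",
            "--props", "s3://<artifacts-bucket>/conf/<table>.properties",
            "--continuous",
            "--enable-sync",
            "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",
            "--hoodie-conf", "hoodie.datasource.hive_sync.mode=jdbc",
            "--hoodie-conf", "hoodie.datasource.hive_sync.database=<db>",
            "--hoodie-conf", "hoodie.datasource.hive_sync.table=<table>"
          ],
          "sparkSubmitParameters": "--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars s3://<artifacts-bucket>/jars/hudi-spark-bundle.jar"
        }
      }
    }

The slim utilities bundle is meant to be paired with the matching hudi-spark bundle on the classpath, which is why that bundle appears under --jars.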

1. Using a JDBC connection to the Hive metastore database

  • As we are connecting to a MySQL database in this example, mariadb-connector-java.jar must be supplied with the --jars option. The relevant connector jar has to be included if you’re using Postgres, Oracle, or any other database. Note that connecting to RDS directly with password credentials is not always best practice. A sketch follows below.
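
As a rough sketch, the JDBC approach adds the connector jar and the metastore connection properties to sparkSubmitParameters. The endpoint, database name, and credentials are placeholders, and line breaks are added for readability; in the actual request this is a single string:

    "sparkSubmitParameters": "--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
      --jars s3://<artifacts-bucket>/jars/hudi-spark-bundle.jar,s3://<artifacts-bucket>/jars/mariadb-connector-java.jar
      --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver
      --conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<rds-endpoint>:3306/<metastore-db>
      --conf spark.hadoop.javax.jdo.option.ConnectionUserName=<user>
      --conf spark.hadoop.javax.jdo.option.ConnectionPassword=<password>"

Fetching the username and password from a secrets store rather than embedding them in the job request is the safer variant of this setup.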

2. Using a Thrift service to reach the Hive metastore

  • The Hive metastore’s database is Amazon Aurora on RDS, and the Thrift server runs on the master node of an EMR on EC2 cluster. Using an EMR on EC2 cluster as the Thrift server makes setting up and configuring the application simpler.
  • Use the above config, except make sure use_jdbc is set to false and the hive_sync mode is hms.
  • Instead of adding the Thrift config to spark-submit, just set it in the applicationConfiguration block and it will work like a charm; see the sketch after this list.
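
A sketch of those two fragments of the same StartJobRun request, assuming the Thrift service listens on the default port 9083 (the host name is a placeholder):

    "entryPointArguments": [
      "--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms",
      "--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false"
    ]

    "configurationOverrides": {
      "applicationConfiguration": [
        {
          "classification": "spark-defaults",
          "properties": {
            "spark.hadoop.hive.metastore.uris": "thrift://<emr-master-private-dns>:9083"
          }
        }
      ]
    }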

Final Configs

  • Note that schema evolution is supported from Hudi 0.12.0 onwards.
  • For emr-containers, choose a release label of 6.9 or later. A sketch of the relevant flags follows.
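
Putting the version notes together, a minimal sketch of the pieces involved: the release label on the job run, plus the schema-evolution flags passed as --hoodie-conf entries in entryPointArguments. The property names follow the Hudi 0.12 docs (hoodie.schema.on.read.enable turns on comprehensive schema evolution), and the exact set of flags is use-case dependent:

    "releaseLabel": "emr-6.9.0-latest",

    "--hoodie-conf", "hoodie.schema.on.read.enable=true",
    "--hoodie-conf", "hoodie.datasource.write.reconcile.schema=true"

EMR 6.9 is the release line that ships a Hudi 0.12.x build, which is what makes it the cutoff.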

After the job has completed, you will find your data, with its schema, in your metastore. Isn’t this super exciting? From now on, no CREATE, ALTER, or DROP commands are needed on Hive :)

As always, feel free to share your thoughts and feedback.

Syeda Marium Faheem.

Disclaimer:

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on this Medium publication mostly describes and highlights the opinions of the authors, which might or might not be the actual and official perspective of Bazaar Technologies.
