How We Migrated Our Data Lake to Apache Iceberg

Deniz Parmaksız
Insider Engineering
Oct 5, 2022 · 5 min read

We recently migrated our production data lake, containing tens of terabytes of data in Amazon S3, from Apache Hive to Apache Iceberg and achieved a 90% saving on our Amazon S3 costs. If you are interested in the cost analysis and the reasons behind it, you should also read our other post about Iceberg. This post is about the migration process itself rather than the reasons and outcomes.

How to Migrate Existing Data Tables?

The first question we had to answer was how to migrate the existing data lake tables to Iceberg. There are two main strategies for migrating data tables: an in-place migration, which keeps the data files as-is and only adds Iceberg metadata on top of them, and a full migration, which rewrites the data files as well.

The in-place migration is easier because no data is rewritten; only pointers to the existing data files are added to the Iceberg metadata. What makes it even easier is Iceberg’s migrate procedure, which handles everything for you: you just call the procedure with your table name, as sketched below. Such simplicity comes with drawbacks, though. You have to stop all write operations until the migration is completed; otherwise you have to sync the data that arrived in the meantime, then the data that arrived while you were syncing, and so on, like in the Achilles paradox. So there is some write downtime involved.
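For reference, the in-place path boils down to a single procedure call. Here is a minimal PySpark sketch, assuming the Iceberg Spark runtime is on the classpath and the session catalog is wrapped with Iceberg’s SparkSessionCatalog; the database and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch of an in-place migration (placeholder names throughout).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# migrate converts the Hive table in place: the existing data files are kept
# and only Iceberg metadata is written on top of them.
spark.sql("CALL spark_catalog.system.migrate('analytics.events')")

# To trial the conversion without replacing the source table, Iceberg also
# offers the snapshot procedure:
# spark.sql("CALL spark_catalog.system.snapshot('analytics.events', 'analytics.events_iceberg')")
```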

On the other hand, the full migration option makes it possible to migrate the data without any downtime. However, there are more operations to execute, and the migration is more costly: you have to run both systems in parallel for a while, provision compute resources for reading and rewriting the data, and store twice the amount of data during the transition period. A great advantage of the full migration is that you can transform the data while migrating it, and that is exactly what we needed, as the existing Hive tables were using the ORC file format.

Our Requirements

We decided to use the Parquet file format with zstd compression for our Iceberg tables after running several benchmarks on our datasets with different file formats. Additionally, we wanted to alter the partitioning strategy of the legacy tables. Therefore, a full migration was our only choice. Our main tables were huge, and it was not feasible to read the entire table content and insert it into Iceberg in one go, so we wanted to migrate the tables in chunks of partitions and parallelize the migration jobs to accelerate the process. We also needed to monitor the migration progress. As a result, we architected a migration plan that satisfies these requirements.
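To make the target layout concrete, here is a minimal sketch of what such a table definition can look like, assuming an Iceberg catalog named iceberg is already configured on the Spark session; the schema, partition spec, and table name are illustrative placeholders rather than our actual tables.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg extensions and an "iceberg" catalog are configured
# (e.g. in spark-defaults.conf); all names below are placeholders.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.events (
        partner_id  BIGINT,
        event_name  STRING,
        event_time  TIMESTAMP,
        payload     STRING
    )
    USING iceberg
    PARTITIONED BY (partner_id, days(event_time))
    TBLPROPERTIES (
        'write.format.default' = 'parquet',
        'write.parquet.compression-codec' = 'zstd'
    )
""")
```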

Our architecture for migrating data from Hive to Iceberg while providing observability.

Our Migration Plan and Architecture

The migration plan had several steps. The first step is, of course, migrating all of the existing data in a big-bang fashion. The second step is enabling write operations so that new data goes into both the Hive and Iceberg tables. The third step is switching read operations to the Iceberg tables, and the last step is to stop writing into the Hive tables and finally delete them after a monitoring period.
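The dual-write step is conceptually simple: every new batch is appended to both tables until the Hive side is retired. A minimal sketch, with placeholder table names and assuming an Iceberg-enabled Spark session:

```python
from pyspark.sql import DataFrame

def write_new_data(new_events_df: DataFrame) -> None:
    # Keep feeding the legacy Hive table until the monitoring period ends.
    new_events_df.write.mode("append").insertInto("hive_db.events")
    # Append the same batch to the Iceberg table (Spark 3 DataFrameWriterV2).
    new_events_df.writeTo("iceberg.analytics.events").append()
```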

We built a temporary pipeline for the migration process. In order to transform the data tables, we used a Spark job that reads data from the Hive table, applies the data transformations, and writes into the Iceberg table.
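In spirit, the per-partition copy looks like the sketch below; the table names, partition column, and transformation are placeholders, and the real job applies our own schema and partitioning changes.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes an Iceberg-enabled Spark session; all names are placeholders.
spark = SparkSession.builder.getOrCreate()

def migrate_partition(partition_value: str) -> None:
    # Read one partition of the legacy, ORC-backed Hive table.
    source_df = spark.table("hive_db.events").where(F.col("dt") == partition_value)

    # Apply the transformations the new layout needs, e.g. deriving the
    # column used by the new partition spec.
    transformed_df = source_df.withColumn("event_time", F.to_timestamp("event_time"))

    # Append into the Iceberg table; files land as zstd-compressed Parquet
    # thanks to the table properties set at creation time.
    transformed_df.writeTo("iceberg.analytics.events").append()
```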

One of the most important requirements for us was to parallelize the migration of the same table across multiple Spark jobs. However, the jobs must not migrate the same partitions independently, to avoid redundant operations and costs. At the same time, we wanted to monitor each table’s migration progress. To achieve both, we used an existing Redis instance to store the migrated and non-migrated partition metadata in a Set data structure.
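A minimal sketch of that bookkeeping, assuming the redis-py client; the key names, endpoint, and TTL are placeholders. The pending partitions of a table are listed from Hive once and pushed into a Redis Set, alongside a total count used later for progress tracking.

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
r = redis.Redis(host="redis.internal", port=6379)  # placeholder endpoint

TABLE = "hive_db.events"
PENDING_KEY = f"migration:{TABLE}:pending"
TOTAL_KEY = f"migration:{TABLE}:total"

# SHOW PARTITIONS returns one row per partition, e.g. "dt=2022-09-01".
partitions = [row[0] for row in spark.sql(f"SHOW PARTITIONS {TABLE}").collect()]

if partitions:
    r.sadd(PENDING_KEY, *partitions)   # the shared work queue
    r.set(TOTAL_KEY, len(partitions))  # denominator for the progress metric

# Let the bookkeeping keys clean themselves up after the migration.
for key in (PENDING_KEY, TOTAL_KEY):
    r.expire(key, 60 * 60 * 24 * 30)  # 30-day TTL
```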

The Spark job accepts the table name as a parameter and migrates data until there are no partitions left to migrate. Each job pops a batch of partitions from the Redis set, so no two jobs work on the same partitions. When the set for that table is empty, the Spark job completes. The Spark job runs as a step on an Amazon EMR cluster, and the cluster is configured to shut down after its steps are completed, so when the migration finishes, the cluster goes down automatically. You can also run a swarm of clusters to accelerate the migration, or use EMR Serverless if you do not want to provision and manage clusters.
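Conceptually, each worker’s loop looks like the following sketch, which reuses the hypothetical migrate_partition() helper and key names from the sketches above. Because SPOP removes members atomically, adding more clusters only increases throughput and never causes duplicate work.

```python
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder endpoint
PENDING_KEY = "migration:hive_db.events:pending"

while True:
    # SPOP atomically claims a batch, so parallel jobs never pick the
    # same partitions.
    batch = r.spop(PENDING_KEY, 10)
    if not batch:
        break  # set is empty: the table is fully migrated, the EMR step exits
    for partition in batch:
        # migrate_partition() is the helper sketched earlier.
        migrate_partition(partition.decode())
```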

A part of the Grafana dashboard that shows the progress of tables in percentage.

So we summoned a swarm of clusters and Spark jobs that migrated the data rapidly. However, we also wanted to see the progress and pace of the migration, both to estimate the required time and to add resources if we wanted to increase the pace. We created a Grafana dashboard to monitor the migration progress: the migrated-partition data is fetched from Redis via the Redis data source plugin for each table key, and the progress of every table is shown on a single dashboard. We visualized the number of partitions and the completion rate, which was enough for our case. The dashboard was very helpful for tracking progress during the migration, and we definitely recommend one for huge migrations. You can use the Amazon Managed Grafana service if you do not have a Grafana deployment and do not want to manage one.
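The metric behind the dashboard is simple arithmetic on the Redis keys. Whether it is computed through the Redis data source plugin in Grafana or by a small exporter, it boils down to something like this sketch (key names follow the earlier ones):

```python
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder endpoint

def migration_progress(table: str) -> float:
    # total was stored while seeding the set; the pending set shrinks as
    # workers pop partitions from it.
    total = int(r.get(f"migration:{table}:total") or 0)
    remaining = r.scard(f"migration:{table}:pending")
    return 100.0 * (total - remaining) / total if total else 0.0

print(f"hive_db.events: {migration_progress('hive_db.events'):.1f}% migrated")
```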

We kept the Hive data tables for a while after the migration was completed, as a backup. Once we had confirmed the success of the migration, we deleted all of the Hive tables’ data in Amazon S3, deleted the RDS instance that was serving the Hive Metastore, and removed the migration job and services to clean up the code base. The Redis keys had a TTL, so we did not need to delete them explicitly.

A nice clean-up of the S3 objects.

The migration process was tailored to our needs, and it worked well for us. The transition was smooth and transparent: we did not experience any issues during or after the migration to Iceberg and are very satisfied with the results. If you do not need to alter the table schema or file format as we did, you can probably migrate your tables faster and more easily with the in-place migration option, and you may not need to automate the parallel execution of jobs either. Otherwise, a full migration like ours is required, but the outcome is worth the effort.
