Four Steps for Migrating from Hive 2.x to 3.x
Get Hive prepped and ready if you’re moving to Cloudera Data Platform
by Shekhar Parnerkar and Pooja Sankpal
If you are a current HDP 2.6.5 user, you are probably already contemplating a move to Cloudera Data Platform (CDP), since Cloudera has announced that support for this and older versions of HDP will be discontinued at the end of 2020 in favor of CDP. Breathe easy, you are not alone!
At Hashmap, we have many clients who are in the same boat. To make their journey safer and easier, we have developed a framework that provides a step-by-step approach to migrating to CDP. This framework is documented in a previous blog post written by technical experts at Hashmap.
This blog post covers the migration of Hive tables and data from version 2.x to 3.x (the target version supported by CDP). Other related articles are listed at the end of this post. Future posts will cover the lift-and-shift migration from HDP 2.6.5 to CDP and the migration of Spark and Oozie jobs to CDP.
What’s Different from Hive 2.x to 3.x
There are many structural changes between Hive 2.x and 3.x, which makes this migration quite different from routine upgrades. To summarize a few of these changes:
- All managed (non-transactional) tables need to be converted to external tables. This includes both native tables (data stored in the Hive warehouse directory in HDFS) and non-native tables (data stored externally, for example in HBase or Druid).
- In addition, for the native tables above, the table property external.table.purge needs to be set to true (see the example after this list).
- The use of MapReduce (MR) as the Hive execution engine has been discontinued in favor of Tez.
- The default Hive warehouse directory has changed; in CDP, managed tables are stored under /warehouse/tablespace/managed/hive and external tables under /warehouse/tablespace/external/hive by default.
- Many Hive configuration properties have changed.
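For example, here is a minimal sketch of one way an existing managed, non-transactional table could be converted in place; the database and table names are hypothetical, and if you choose an in-place upgrade the Cloudera-provided script makes equivalent changes automatically:

    -- Convert a managed, non-transactional table to an external table
    ALTER TABLE sales_db.orders SET TBLPROPERTIES ('EXTERNAL'='TRUE');
    -- Keep the old managed-table behaviour of removing data on DROP TABLE
    ALTER TABLE sales_db.orders SET TBLPROPERTIES ('external.table.purge'='true');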
While these appear to be simple changes at first glance, there are many challenges that may arise:
- The latest or most current DDL scripts may not be available to make the changes against.
- The tables that need to be changed must be identified.
- Each change must be made and tested.
Given that Hive is often the de facto enterprise data warehouse, it will usually have thousands of tables spread across dozens of databases. Making these changes manually is error-prone and time-consuming.
This is where the Hashmap framework helps. Here is a step-by-step workflow to achieve these goals.
Hive 2.x to 3.x in Four Steps
Step 0: Prerequisites
This blog assumes that you have either stood up a brand-new CDP cluster or upgraded your existing cluster to Hive 3.x, fully configured. This upgrade should have covered the following:
- All CDP services are up and running, including Hive.
- Hive configuration properties have been set to their new values, either through hive-site.xml or by an upgrade script.
- If you have stood up a brand-new CDP cluster, all Ranger policies have been imported from HDP and applied to the new cluster.
Step 1: Get the current DDL scripts
If you have them handy in a Git repo, download all such repos to the local file system for auto-correction by the framework. Our script will look for .hql files in each repo and make the required changes directly to those files. If you have chosen an in-place upgrade, the required changes to the Hive Metastore and data will be made automatically by the script provided by Cloudera; however, you can still use our script to update your repo with the desired changes for future use.
If you do not have the current DDL scripts, or are not sure whether they are current, another option is to create fresh DDL scripts from the Hive metadata. We have another script that generates a .hql file containing the CREATE TABLE statements for all the tables in a given database. However, please bear in mind that this script will add a LOCATION clause to each CREATE TABLE statement, pointing to the location of the table in the current cluster. This location will need to be removed or changed to the new Hive warehouse root used by Hive 3.x.
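The generated DDL comes from Hive's SHOW CREATE TABLE statement (this is the output the script in this step captures, as noted in Step 2), which you can also run manually to see what will be produced for a single table; the database and table names here are hypothetical:

    -- Emits the full DDL for one table, including the LOCATION clause
    -- that will later need to be removed or rewritten for Hive 3.x
    SHOW CREATE TABLE sales_db.orders;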
Once you have the DDL scripts, arrange them in one or more directories; the auto-correction processes one directory at a time.
Step 2: Run the Script
Download our DDL correction script from this location: https://github.com/hashmapinc/Hashmap-Hive-Migrator
This script takes a local directory as input, recursively traverses it looking for .hql files, and applies the core changes described earlier: CREATE TABLE statements for managed (non-transactional) tables are rewritten as CREATE EXTERNAL TABLE, and the table property external.table.purge is set to true for the converted native tables. A before-and-after sketch follows the list below.
In addition to the above, the script also makes the following changes:
- If you have a CTAS statement, the script will still make the required changes to the created table, subject to the conditions above.
- If you have a CREATE TABLE ... LIKE statement, it will be modified according to the same criteria. However, manual changes are sometimes required. For example, the statement CREATE EXTERNAL TABLE B LIKE A is correct in Hive 3.x only if A is also external, which the script may not be able to determine. Such instances are specifically logged for manual correction.
- If your CREATE TABLE statements have a LOCATION clause for a managed table, the location will be changed to the new Hive root directory. (This will be the case if you created the DDL scripts using SHOW CREATE TABLE through our script in Step 1.)
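To illustrate, here is a minimal before-and-after sketch of the kind of rewrite the script performs; the table is hypothetical, and the target path assumes the default CDP external warehouse location:

    -- Hive 2.x DDL: a managed, non-transactional table
    CREATE TABLE sales_db.orders (
      order_id BIGINT,
      amount   DECIMAL(10,2)
    )
    STORED AS ORC
    LOCATION '/apps/hive/warehouse/sales_db.db/orders';

    -- Hive 3.x DDL after correction: external, with purge set so that
    -- DROP TABLE still removes the underlying data
    CREATE EXTERNAL TABLE sales_db.orders (
      order_id BIGINT,
      amount   DECIMAL(10,2)
    )
    STORED AS ORC
    LOCATION '/warehouse/tablespace/external/hive/sales_db.db/orders'
    TBLPROPERTIES ('external.table.purge'='true');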
After a successful run, the script will have made all of the above changes to the .hql files in the input directory. All changes are logged to a log file, which should be reviewed for other issues or exceptions before applying the modified scripts in the next step.
Step 3: Create/Update Hive Metadata
Next, we apply the modified DDL scripts to the target environment to create the new databases and tables (this is not required for an in-place upgrade). This will need to be done database-by-database or repo-by-repo, depending on how the DDL was created. If you have a CI/CD pipeline set up to execute your scripts on a new CDP cluster, you can check in the modified repo and trigger the deployment.
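After the scripts have run, it is worth spot-checking a few tables to confirm that the conversion took effect; a minimal check against a hypothetical table:

    -- Verify the table type and purge property after applying the corrected DDL
    DESCRIBE FORMATTED sales_db.orders;
    -- In the output, expect Table Type: EXTERNAL_TABLE and
    -- external.table.purge=true among the table parameters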
There is another approach: moving the Hive Metastore itself. If your current Metastore database can produce a dump that is directly readable by the new Metastore database, you can move the Metastore directly to CDP. However, after the import, you will need to upgrade the Metastore schema to Hive 3.x using a Cloudera-supplied script, which is available as part of the AM2CM upgrade tool.
Step 4: Migrate Data
Once the Hive 3.x Metastore is updated in CDP, we are ready to move the data from 2.x to 3.x.
When moving data from Hive 2.x to 3.x, the following approach is recommended:
- The default Hive warehouse directory has changed, so the data for every table that remains managed should be moved to the new managed warehouse location (by default, /warehouse/tablespace/managed/hive/<db>.db/<table_name>). In the case of an in-place upgrade, the table locations should be changed accordingly. If this is not possible, change the table to an external table.
- Compact all transactional tables before they are moved to or used by Hive 3.x. This is due to changes in the compaction logic: if tables are not compacted prior to the upgrade, Hive 3.x will not allow further changes to those tables. You can use the Hive Pre-Upgrade Tool provided by Hortonworks/Cloudera to achieve this (see the compaction sketch after this list).
- Since we are loading the data into the tables without using an HQL query, the table statistics (information about partitions, buckets, files, their sizes, and so on) are not updated in the Hive metadata. Hive will update the metadata the first time queries are run against a table, which can degrade query performance depending on the volume of data in the table (see the statistics sketch after this list).
- Move native Hive data to the CDP cluster using ‘distcp’. Please note that moving data for non-native tables backed by Apache HBase, Impala, Kudu, and Druid will require different approaches; these will be discussed in an upcoming blog.
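To illustrate the compaction requirement, here is a minimal sketch of triggering a major compaction on the Hive 2.x cluster before migration; the table name is hypothetical, and the Hive Pre-Upgrade Tool automates identifying and compacting the tables that need it:

    -- Run a major compaction on a transactional table before migrating it
    -- (for a partitioned table, add a PARTITION (...) clause)
    ALTER TABLE sales_db.txn_orders COMPACT 'major';
    -- Monitor progress; the compaction must complete before the data is moved
    SHOW COMPACTIONS;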
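To avoid the first-query penalty described above, partition metadata and statistics can also be refreshed explicitly once the data has landed; a minimal sketch against a hypothetical table:

    -- Re-register partitions copied in outside of Hive (partitioned tables only)
    MSCK REPAIR TABLE sales_db.orders;
    -- Recompute table and column statistics so the optimizer has fresh information
    ANALYZE TABLE sales_db.orders COMPUTE STATISTICS;
    ANALYZE TABLE sales_db.orders COMPUTE STATISTICS FOR COLUMNS;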
Each item in the list above can be broken down into a series of detailed steps. A complete description of these steps is beyond the scope of this blog; however, they can be made available upon request.
Where Do You Go From Here?
If you’d like assistance along the way, then please contact us. We’ve been working with Cloudera and Hortonworks since 2012 and would be glad to partner with you on anything from assessment and strategy to upgrade implementation.
Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our cloud migration and modernization service offerings.
To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap, available on Spotify, Apple, Google, and other popular streaming apps.
Other Tools and Content You Might Like
Shekhar Parnerkar is a Solution Architect and Delivery Manager at Hashmap. He specializes in building modern cloud data warehouses for Hashmap’s global clients. Pooja Sankpal is a Data Engineer at Hashmap.