Migrating To A Glue Metastore

Matt Weingarten
4 min read · Apr 15, 2022


This was a bad week to stop sniffing glue

Introduction

A few months ago, our greater umbrella team decided to migrate from Qubole to Databricks (yes, the person who convinced us to use Qubole deserves the Salesperson of the Year award). With that also came the need to migrate from our external Hive metastore over to Glue, so that every component of our architecture would be based in AWS. While you’d think this would be a relatively straightforward process, it didn’t turn out to be such a walk in the park.

Creating Tables In Terraform

The move to Glue gave us an opportunity to store all of our table definitions in version control, since we could use Terraform to actually create the tables (yes, we could have used version control before, but hindsight is 20/20). In terms of what permissions we gave to Databricks and what we wanted our IaC to handle, we established the following core principles:

  1. A single Glue metastore will be used for all environments (suffixes like _dev and _prod differentiate environments)
  2. All tables, views, and databases are created in IaC
  3. A default temp database will exist where tables/views can be created in Databricks
  4. No deletes or updates to databases in the higher environments (QA and Prod) are allowed through Databricks

The way we do this is that the tables are actually created through Athena, which registers them in Glue. This works for external tables because table DDL in Athena lets you use the Hive connectors and associated serdes.
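At its core, this flow is just an Athena DDL statement that registers an external table in the Glue catalog. As a rough illustration of that flow outside of Terraform, here is a minimal boto3 sketch; the database, table, serde, and S3 locations are all hypothetical:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hive-style DDL for an external table; Athena registers it in the Glue catalog.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS example_db_dev.events (
  event_id string,
  event_ts timestamp
)
PARTITIONED BY (event_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3://example-bucket/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "example_db_dev"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)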

Views, however, can only be created in Athena using Presto (Athena’s default engine), and that isn’t compatible with SQL in Databricks, which uses Hive. You’ll end up getting an error like this if you try to query such a view over an external table in Databricks:

IllegalArgumentException: Can not create a Path from an empty string

To solve this, we create views through Databricks itself by submitting a job to a Databricks cluster that runs the view DDL. We can still store the view definitions in version control, and we don’t have to bend the permissions we want Databricks to have in order to make this work, which is a big plus.
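The job itself boils down to running the view DDL with Spark SQL on the cluster. A minimal sketch of what such a job step might run (the view and table names here are hypothetical; the real definitions live in version control and are passed into the job):

# `spark` is the SparkSession that Databricks provides on the cluster.
view_ddl = """
CREATE OR REPLACE VIEW example_db_prod.active_users_vw AS
SELECT user_id, last_seen_date
FROM example_db_prod.users
WHERE is_active = true
"""

spark.sql(view_ddl)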

IAM Changes

Another thing to note is that the IAM role you use in Databricks needs the proper permissions for its Glue-related operations. We do something like this:

{
  sid    = "GlueQaProd"
  effect = "Allow"
  actions = [
    "glue:CreatePartition",
    "glue:BatchCreatePartition",
    "glue:BatchDeletePartition",
    "glue:BatchGetPartition",
    "glue:DeletePartition",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetUserDefinedFunction",
    "glue:GetUserDefinedFunctions",
    "glue:UpdatePartition",
    "glue:UpdateTable",
    "glue:UpdateUserDefinedFunction"
  ]
  resources = [
    "arn:aws:glue:us-east-1:accountId:database/default",
    "arn:aws:glue:us-east-1:accountId:database/*_qa",
    "arn:aws:glue:us-east-1:accountId:database/*_prod",
    "arn:aws:glue:us-east-1:accountId:table/*_qa/*",
    "arn:aws:glue:us-east-1:accountId:table/*_prod/*",
    "arn:aws:glue:us-east-1:accountId:catalog"
  ]
},
{
  sid    = "GlueDev"
  effect = "Allow"
  actions = [
    "glue:BatchCreatePartition",
    "glue:BatchDeletePartition",
    "glue:BatchGetPartition",
    "glue:CreateDatabase",
    "glue:CreatePartition",
    "glue:CreateTable",
    "glue:CreateUserDefinedFunction",
    "glue:DeleteDatabase",
    "glue:DeletePartition",
    "glue:DeleteTable",
    "glue:DeleteUserDefinedFunction",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetUserDefinedFunction",
    "glue:GetUserDefinedFunctions",
    "glue:UpdateDatabase",
    "glue:UpdatePartition",
    "glue:UpdateTable",
    "glue:UpdateUserDefinedFunction"
  ]
  resources = [
    "arn:aws:glue:us-east-1:accountId:database/*_dev",
    "arn:aws:glue:us-east-1:accountId:table/*_dev/*",
    "arn:aws:glue:us-east-1:accountId:catalog",
    "arn:aws:glue:us-east-1:accountId:database/default"
  ]
},

We differentiate between Dev and QA/Prod because we want the ability in Dev to create and drop tables/databases; in the higher environments, this should all be handled programmatically. Note that including the default database in the resources is important so AWS doesn’t complain.

Likewise, we also have to update any IAM roles we use in EMR that connect to the external Hive metastore and give them the proper permissions to do the same in Glue. This will look similar to what we have above, but potentially with more permissions (based on what’s involved) and with more entries in the resources array (since the Glue metastore lives in another account, the ARNs need to allow cross-account access).

Airflow Changes

In our Airflow setup, we store all our job-related properties in a YAML file that gets parsed in the DAG itself for all stepwise configurations. Any EMR-related configurations that reference the external Hive metastore need to be modified to connect to Glue instead. For example, the hive-site and spark-hive-site configurations will need to look something like this:

hive.metastore.glue.catalogid: 'accountId'
hive.metastore.client.factory.class: 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'

The first property is only needed if you’re connecting to a Glue catalog in a different account. This is also covered in AWS’s documentation on using the Glue Data Catalog as the metastore for EMR.
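Once the YAML is parsed, these properties end up in the EMR cluster’s configuration list. A sketch of what that might look like; the classification names are the standard EMR ones, and "accountId" remains a placeholder:

# Roughly how the Glue settings land in the EMR cluster configuration after the
# YAML is parsed in the DAG.
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)

emr_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.glue.catalogid": "accountId",  # cross-account only
            "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY,
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.glue.catalogid": "accountId",  # cross-account only
            "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY,
        },
    },
]

# This list would then be passed as the Configurations argument when the cluster
# is created, e.g. via boto3's run_job_flow or Airflow's EMR operators.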

Databricks Changes

Various Spark properties will also need to be added to the Databricks cluster for it to connect to Glue, as shown here:

fs.s3.canned.acl BucketOwnerFullControl
spark.databricks.hive.metastore.glueCatalog.isolation.enabled false
spark.databricks.hive.metastore.glueCatalog.enabled true
fs.s3a.acl.default BucketOwnerFullControl
spark.hadoop.hive.metastore.glue.catalogid accountId
mapreduce.input.fileinputformat.input.dir.recursive true
mapred.input.dir.recursive true
spark.sql.hive.convertMetastoreParquet false

Now Databricks should be able to connect properly to the Glue metastore assuming the permissions are in place on the Glue side.
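If the cluster is defined through the Databricks Clusters API or similar automation rather than the UI, the same properties can be captured in the spark_conf section of the cluster spec. A minimal sketch, with the cluster name, runtime version, node type, and worker count all hypothetical and the configuration keys mirroring the list above:

# Hypothetical cluster spec carrying the Glue-related Spark configuration.
cluster_spec = {
    "cluster_name": "glue-metastore-cluster",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "spark_conf": {
        "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
        "spark.databricks.hive.metastore.glueCatalog.isolation.enabled": "false",
        "spark.hadoop.hive.metastore.glue.catalogid": "accountId",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "fs.s3.canned.acl": "BucketOwnerFullControl",
        "fs.s3a.acl.default": "BucketOwnerFullControl",
        "mapreduce.input.fileinputformat.input.dir.recursive": "true",
        "mapred.input.dir.recursive": "true",
    },
}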

Supporting Hive and Glue in Parallel

Unfortunately, this can’t be done easily, as Hive and Glue share some configuration properties that step on each other’s toes. We decided to create child DAGs that would handle any Hive needs while the “parent” DAGs would handle the Glue needs. That way, once we support Glue exclusively, we just turn off the child DAGs and go from there. The coordination between parent and child DAGs can be done using ExternalTaskSensors.
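As a rough sketch of that wiring, one DAG can gate its work on a task in the other with an ExternalTaskSensor; the DAG and task IDs below are hypothetical, and which side waits depends on how the processing is split:

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="hive_child_dag",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # With the default settings, the sensor waits for the named task in the parent
    # DAG to succeed for the same logical date before the Hive-side steps run.
    wait_for_glue_processing = ExternalTaskSensor(
        task_id="wait_for_glue_processing",
        external_dag_id="glue_parent_dag",
        external_task_id="load_to_glue",
    )

    # ...Hive-specific EMR steps would follow here...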

The other option is to add another cluster in the same DAG that handles the Hive processing while the other supports Glue, but then when the time comes to support just Glue, you’d have to make changes to the DAG again.

Conclusion

It’s not such an easy shift to migrate from an external Hive metastore to using Glue. Hopefully, the struggles we encountered here will make it a smoother process for others (trial by fire, right?).
