Custom JDBC dialect for Hive

Recently I’ve answered a JIRA ticket SPARK-22113. It is reported an issue when using JDBC to connect Hive in SparkSQL.

Usually we will use a session with hive support to connect with Hive:

val session = SparkSession.builder().enableHiveSupport().getOrCreate()
session.sql("select dw_date from tfdw.dwd_dim_date limit 10").show()

With JDBC, in theory you are able to connect Hive as like:

val ds =
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://iZ11syxr6afZ:21050/;auth=noSasl")
.option("dbtable", "tfdw.dwd_dim_date")

However, the returned results are not correct and looks like you select a string literal.

After few tests, I found that default JDBC dialect in SparkSQL doesn’t work with Hive at all. The default way in the default dialect to quote identifiers are using double quotes. A SQL query like SELECT “dw_date” FROM table… will be parsed by Hive to select a string literal, instead of a column named “dw_date”. By replacing quotes with backticks, seems the issue is resolved. However, in my test, the column names get from Hive are all prefixed with the table name like table.dw_date. But you can’t directly wrap backticks around it like `table.dw_date`. Alternatively, we need to wrap each part individually:

private case object HiveDialect extends JdbcDialect {
  override def canHandle(url : String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = {
colName.split(‘.’).map(part => s”`$part`”).mkString(“.”)

After registering this Hive JDBC dialect with SparkSQL, you can now connect with Hive via JDBC.

Because JDBC seems not a recommended way to use Hive under SparkSQL, currently this patch isn’t submitted as a PR. But this can be a reference for adding custom JDBC dialect into SparkSQL.

Like what you read? Give Liang-Chi Hsieh a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.