Custom JDBC dialect for Hive

Liang-Chi Hsieh
Sep 27, 2017


Recently I answered a JIRA ticket, SPARK-22113, which reports an issue when using JDBC to connect to Hive in Spark SQL.

Usually we use a session with Hive support enabled to connect to Hive:

val session = SparkSession.builder().enableHiveSupport().getOrCreate()
session.sql("select dw_date from tfdw.dwd_dim_date limit 10").show()

With JDBC, in theory you should be able to connect to Hive like this:

val ds = session.read
.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://iZ11syxr6afZ:21050/;auth=noSasl")
.option("dbtable", "tfdw.dwd_dim_date")
.load()
ds.select("dw_date").limit(10).show()

However, the returned results are not correct and look as if you had selected a string literal.

After a few tests, I found that the default JDBC dialect in Spark SQL doesn't work with Hive at all. The default dialect quotes identifiers with double quotes, so a query like SELECT "dw_date" FROM table… is parsed by Hive as selecting a string literal instead of a column named dw_date. Replacing the double quotes with backticks seems to resolve the issue. However, in my test, the column names returned from Hive are all prefixed with the table name, like table.dw_date, and you can't simply wrap the whole thing in backticks as `table.dw_date`. Instead, we need to wrap each dot-separated part individually:

import org.apache.spark.sql.jdbc.JdbcDialect

private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")

  override def quoteIdentifier(colName: String): String = {
    // Wrap each dot-separated part in backticks, so Hive sees column
    // references rather than string literals.
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}
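For example, a prefixed column name (a made-up value here, just to illustrate the quoting) gets each part wrapped separately:

// Hypothetical prefixed column name, as Hive returned in my test.
HiveDialect.quoteIdentifier("dwd_dim_date.dw_date")
// returns: `dwd_dim_date`.`dw_date`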

After registering this Hive JDBC dialect with Spark SQL, you can connect to Hive via JDBC.
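Registration itself is a one-liner; a minimal sketch, assuming the HiveDialect object defined above is on the classpath:

import org.apache.spark.sql.jdbc.JdbcDialects

// Register before creating the JDBC DataFrame; Spark SQL will pick this
// dialect for any URL starting with "jdbc:hive2".
JdbcDialects.registerDialect(HiveDialect)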

Because JDBC doesn't seem to be a recommended way to use Hive with Spark SQL, this patch hasn't been submitted as a PR for now. But it can serve as a reference for adding a custom JDBC dialect to Spark SQL.
