Custom JDBC dialect for Hive
Recently I’ve answered a JIRA ticket SPARK-22113. It is reported an issue when using JDBC to connect Hive in SparkSQL.
Usually we will use a session with hive support to connect with Hive:
val session = SparkSession.builder().enableHiveSupport().getOrCreate()
session.sql("select dw_date from tfdw.dwd_dim_date limit 10").show()
With JDBC, in theory you are able to connect Hive as like:
val ds = session.read()
.format("jdbc")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.option("url", "jdbc:hive2://iZ11syxr6afZ:21050/;auth=noSasl")
.option("dbtable", "tfdw.dwd_dim_date")
.load()
ds.select("dw_date").limit(10).show()
However, the returned results are not correct and looks like you select a string literal.
After few tests, I found that default JDBC dialect in SparkSQL doesn’t work with Hive at all. The default way in the default dialect to quote identifiers are using double quotes. A SQL query like SELECT “dw_date” FROM table… will be parsed by Hive to select a string literal, instead of a column named “dw_date”. By replacing quotes with backticks, seems the issue is resolved. However, in my test, the column names get from Hive are all prefixed with the table name like table.dw_date. But you can’t directly wrap backticks around it like `table.dw_date`. Alternatively, we need to wrap each part individually:
private case object HiveDialect extends JdbcDialect { override def canHandle(url : String): Boolean = url.startsWith("jdbc:hive2") override def quoteIdentifier(colName: String): String = {
colName.split(‘.’).map(part => s”`$part`”).mkString(“.”)
}
}
After registering this Hive JDBC dialect with SparkSQL, you can now connect with Hive via JDBC.
Because JDBC seems not a recommended way to use Hive under SparkSQL, currently this patch isn’t submitted as a PR. But this can be a reference for adding custom JDBC dialect into SparkSQL.