Access Kerberized Hive from Spark job over JDBC
JDBC is a popular data access technology supported by many databases; database vendors provide drivers implementing the JDBC specification, and applications place those drivers on the classpath to communicate with the underlying database. For Spark-based application development, the widely used authentication mechanism is Kerberos, in which a client authenticates against a Key Distribution Center (KDC), comprising an Authentication Server (AS) and a Ticket Granting Server (TGS), before talking to a protected service. Here, both the Hadoop cluster and the Hive database are protected using Kerberos. As part of this story, we will find out how to access a Kerberized Hive from a Spark job over JDBC.
The motivation behind this story is to run a Spark job in a multi-node Hadoop cluster, where the application queries Hive and performs PySpark-based DataFrame operations on the result. In a non-Kerberized Hadoop cluster, where no authentication contexts are involved, having a Spark job communicate with Hive over JDBC is straightforward. But when the cluster is Kerberized, there is a specific security requirement to handle: if the DataFrame operations execute across multiple nodes, we need to make sure the Kerberos ticket is established on each node. This is what we will specify as part of the JDBC URL in the overall JDBC configuration.
Let’s get going!
Environment
Apache Hadoop 2.10.0 with Apache Spark 2.4.7 and Apache Hive 2.3.7
JDBC Libraries
Make sure the following Hive JDBC libraries are present in the Spark jars folder on all nodes of the cluster, along with the edge node.
JDBC related libraries
/home/hadoop/spark/jars.all/hive-jdbc-2.3.7.jar
/home/hadoop/spark/jars.all/spark-hive_2.11-2.4.7.jar
/home/hadoop/spark/jars.all/libthrift-0.9.3.jar
/home/hadoop/spark/jars.all/hive-metastore-2.3.7.jar
/home/hadoop/spark/jars.all/hive-service-2.3.7.jar
/home/hadoop/spark/jars.all/hive-exec-2.3.7.jar
Kerberos and common libraries
/home/hadoop/spark/jars.all/apacheds-kerberos-codec-2.0.0-M15.jar
/home/hadoop/spark/jars.all/commons-logging-1.1.3.jar
/home/hadoop/spark/jars.all/guava-14.0.1.jar
/home/hadoop/spark/jars.all/hadoop-auth-2.7.3.jar
/home/hadoop/spark/jars.all/hadoop-common-2.7.3.jar
/home/hadoop/spark/jars.all/htrace-core-3.1.0-incubating.jar
/home/hadoop/spark/jars.all/httpclient-4.5.6.jar
/home/hadoop/spark/jars.all/httpcore-4.4.10.jar
/home/hadoop/spark/jars.all/libfb303-0.9.3.jar
Spark Classpath configuration
Update SPARK_DIST_CLASSPATH accordingly on all nodes of the cluster, along with the edge node.
SPARK_DIST_CLASSPATH from one of the cluster nodes
export SPARK_DIST_CLASSPATH=/home/hadoop/spark/jars.all/hive-jdbc-2.3.7.jar:/home/hadoop/spark/jars.all/spark-hive_2.11-2.4.7.jar:/home/hadoop/spark/jars.all/libthrift-0.9.3.jar:/home/hadoop/spark/jars.all/hive-metastore-2.3.7.jar:/home/hadoop/spark/jars.all/hive-service-2.3.7.jar:/home/hadoop/spark/jars.all/hive-exec-2.3.7.jar:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/share/hadoop/yarn:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/share/hadoop/yarn/*:/home/hadoop/hadoop/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop/share/hadoop/mapreduce/*:/home/hadoop/hadoop/contrib/capacity-scheduler/*.jar:/home/hadoop/spark/jars.all/apacheds-kerberos-codec-2.0.0-M15.jar:/home/hadoop/spark/jars.all/commons-logging-1.1.3.jar:/home/hadoop/spark/jars.all/guava-14.0.1.jar:/home/hadoop/spark/jars.all/hadoop-auth-2.7.3.jar:/home/hadoop/spark/jars.all/hadoop-common-2.7.3.jar:/home/hadoop/spark/jars.all/htrace-core-3.1.0-incubating.jar:/home/hadoop/spark/jars.all/httpclient-4.5.6.jar:/home/hadoop/spark/jars.all/httpcore-4.4.10.jar:/home/hadoop/spark/jars.all/libfb303-0.9.3.jar
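Keeping this long export line identical across every node is error-prone. As a minimal sketch, a small helper script can generate it from the jar list above (the paths are the ones used in this setup; adjust them for your environment, and extend the Hadoop entries as in the full export line):

```python
import os

# Spark jars folder used throughout this setup.
SPARK_JARS_DIR = "/home/hadoop/spark/jars.all"

# Hive JDBC jars from the list above; the Kerberos/common jars
# (hadoop-auth, httpclient, libfb303, ...) would be appended the same way.
JARS = [
    "hive-jdbc-2.3.7.jar",
    "spark-hive_2.11-2.4.7.jar",
    "libthrift-0.9.3.jar",
    "hive-metastore-2.3.7.jar",
    "hive-service-2.3.7.jar",
    "hive-exec-2.3.7.jar",
]

def build_classpath(jars_dir, jars, hadoop_entries):
    """Join absolute jar paths and Hadoop classpath entries with ':'."""
    parts = [os.path.join(jars_dir, j) for j in jars] + list(hadoop_entries)
    return ":".join(parts)

if __name__ == "__main__":
    hadoop_entries = [
        "/home/hadoop/hadoop/etc/hadoop",
        "/home/hadoop/hadoop/share/hadoop/common/lib/*",
        "/home/hadoop/hadoop/share/hadoop/common/*",
    ]
    print("export SPARK_DIST_CLASSPATH=" +
          build_classpath(SPARK_JARS_DIR, JARS, hadoop_entries))
```

Running the script prints an export line that can be pasted into the Spark environment file on each node.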
Restart the Hadoop cluster.
Login to Resource Manager Node
$ ssh root@cornell1.bluemountains.com
root@cornell1.bluemountains.com's password: <root password>
Check if anything is even running at all.
[root@cornell1 ~]# jps
3997 Jps
Stop NameNode — run as the Hadoop user
[root@cornell1 ~]# su - hadoop
[hadoop@cornell1 ~]$ stop-dfs.sh
...
[hadoop@cornell1 ~]$ exit
Stop Data Nodes — run as root/secured user
[root@cornell1 ~]# stop-secure-dns.sh
...
Stop YARN — run as root/secured user
[root@cornell1 ~]# stop-yarn.sh
...
Start NameNode — run as Hadoop user
[root@cornell1 ~]# su - hadoop
[hadoop@cornell1 ~]$ start-dfs.sh
...
Start secured data nodes — run as root/secured user
[root@cornell1 ~]# start-secure-dns.sh
...
Start YARN — run as root/secured user
[root@cornell1 ~]# start-yarn.sh
...
Sample Spark job communicating with Kerberos Hive
Please note the JDBC URL below, with the auth parameter set to Kerberos, kerberosAuthType set to fromSubject, and the Hive service principal. These attributes are what is needed for the job to authenticate in a multi-node cluster setup.
jdbc:hive2://florida1.bluemountains.com:10000/testing_data;principal=hive/florida1.bluemountains.com@HADOOPCLUSTER.LOCAL;auth=Kerberos;kerberosAuthType=fromSubject
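Putting it together, a minimal PySpark sketch using this URL might look like the following. Only the JDBC URL attributes come from the setup above; the table name world is a hypothetical placeholder, and build_jdbc_url is an illustrative helper:

```python
def build_jdbc_url(host, port, database, principal):
    """Assemble the Kerberized Hive JDBC URL described above.

    auth=Kerberos tells the Hive driver to authenticate via Kerberos, and
    kerberosAuthType=fromSubject makes it pick up the ticket from the
    current JAAS subject, so executors on other nodes can authenticate too.
    """
    return (f"jdbc:hive2://{host}:{port}/{database}"
            f";principal={principal};auth=Kerberos;kerberosAuthType=fromSubject")

if __name__ == "__main__":
    # pyspark is available on the cluster nodes where this job runs.
    from pyspark.sql import SparkSession

    url = build_jdbc_url("florida1.bluemountains.com", 10000, "testing_data",
                         "hive/florida1.bluemountains.com@HADOOPCLUSTER.LOCAL")
    spark = SparkSession.builder.appName("khive-jdbc").getOrCreate()

    # 'world' is a hypothetical table in the testing_data database.
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("driver", "org.apache.hive.jdbc.HiveDriver")
          .option("dbtable", "world")
          .load())

    print("Printing the dataframe schema...")
    df.printSchema()
    print("Printing the dataframe columns...")
    print(df.columns)
    df.show()
    print("Printing the length of dataframe...")
    print(df.count())
    print("Stopping the sparksession...")
    spark.stop()
```

The print statements mirror the sample job output shown further below.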
cURL to submit the above Spark job to the Livy Server
[hadoop@caltech1 ~]$ curl --negotiate -u : -X POST -d '{"file":"hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py", "conf": {"spark.yarn.principal": "yarn/cornell1.bluemountains.com@HADOOPCLUSTER.LOCAL", "spark.yarn.keytab": "/home/hadoop/hadoop/etc/hadoop/yarn.keytab"} }' -H "Content-Type: application/json" "http://caltech1.bluemountains.com:8999/batches"
{"id":16,"name":null,"owner":"hdfs","proxyUser":null,"state":"starting","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: ","\nYARN Diagnostics: "]}
- The URL hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py is where the above Spark job is located in HDFS.
- spark.yarn.principal and spark.yarn.keytab are the YARN Kerberos principal and keytab file.
- The URL http://caltech1.bluemountains.com:8999/batches is the Livy batches API endpoint to which we are submitting the Spark job request.
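The same submission can be scripted. The sketch below only assembles the JSON body used in the curl command above; the POST itself needs SPNEGO/Kerberos negotiation (curl's --negotiate), which in Python would typically come from an extra package such as requests-kerberos, so it is left as a comment:

```python
import json

def livy_batch_payload(file_url, principal, keytab):
    """Build the JSON body for Livy's POST /batches, as in the curl above."""
    return {
        "file": file_url,
        "conf": {
            "spark.yarn.principal": principal,
            "spark.yarn.keytab": keytab,
        },
    }

if __name__ == "__main__":
    body = livy_batch_payload(
        "hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py",
        "yarn/cornell1.bluemountains.com@HADOOPCLUSTER.LOCAL",
        "/home/hadoop/hadoop/etc/hadoop/yarn.keytab",
    )
    print(json.dumps(body))
    # With requests and requests-kerberos installed, the submission would be:
    # requests.post("http://caltech1.bluemountains.com:8999/batches",
    #               json=body, auth=HTTPKerberosAuth())
```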
Sample Job response
[root@cornell1 container_1612880548752_0008_01_000001]# tail -n 100 -f stdout
Printing the dataframe schema...
root
 |-- spark_gen_subq_0.country: string (nullable = true)
Printing the dataframe columns...
['spark_gen_subq_0.country']
+------------------------+
|spark_gen_subq_0.country|
+------------------------+
|    spark_gen_subq_0....|
|    spark_gen_subq_0....|
...
+------------------------+
only showing top 20 rows
Printing the length of dataframe...
717492
Stopping the sparksession...
Conclusion
As part of this story we saw that JDBC is a popular data access technology for application development, and we listed the steps needed to communicate with a Kerberos-protected Hive from a Spark job over JDBC.