Access Kerberized Hive from Spark job over JDBC
JDBC is a popular data access technology supported by many databases; database vendors provide drivers implementing the JDBC specification, and applications place those drivers on the classpath to communicate with the underlying database. For Spark-based application development, the widely used authentication mechanism is Kerberos, in which a client authenticates against a Key Distribution Center (KDC), comprising an Authentication Server (AS) and a Ticket Granting Server (TGS), before talking to a protected service. Here, both the Hadoop cluster and the Hive database are protected using Kerberos. As part of this story, we will find out how to access a Kerberized Hive from a Spark job over JDBC.
The motivation behind this story is to run a Spark job in a multi-node Hadoop cluster, where the application queries Hive and performs PySpark-based DataFrame operations on the result. In a non-Kerberized Hadoop cluster, where no authentication contexts are involved, having a Spark job communicate with Hive over JDBC is straightforward. But when the cluster is Kerberized, there is a specific security requirement to handle: if the DataFrame operations execute across multiple nodes, we need to make sure the Kerberos ticket is established on each node. This is what we will specify as part of the JDBC URL in the overall JDBC configuration.
Let’s get going!
Environment
Apache Hadoop 2.10.0 with Apache Spark 2.4.7 and Apache Hive 2.3.7
JDBC Libraries
Make sure the following Hive JDBC libraries are present in the Spark jars folder on all nodes of the cluster, along with the edge node.
JDBC related libraries
/home/hadoop/spark/jars.all/hive-jdbc-2.3.7.jar
/home/hadoop/spark/jars.all/spark-hive_2.11-2.4.7.jar
/home/hadoop/spark/jars.all/libthrift-0.9.3.jar
/home/hadoop/spark/jars.all/hive-metastore-2.3.7.jar
/home/hadoop/spark/jars.all/hive-service-2.3.7.jar
/home/hadoop/spark/jars.all/hive-exec-2.3.7.jar
Kerberos and common libraries
/home/hadoop/spark/jars.all/apacheds-kerberos-codec-2.0.0-M15.jar
/home/hadoop/spark/jars.all/commons-logging-1.1.3.jar
/home/hadoop/spark/jars.all/guava-14.0.1.jar
/home/hadoop/spark/jars.all/hadoop-auth-2.7.3.jar
/home/hadoop/spark/jars.all/hadoop-common-2.7.3.jar
/home/hadoop/spark/jars.all/htrace-core-3.1.0-incubating.jar
/home/hadoop/spark/jars.all/httpclient-4.5.6.jar
/home/hadoop/spark/jars.all/httpcore-4.4.10.jar
/home/hadoop/spark/jars.all/libfb303-0.9.3.jar
Spark Classpath configuration
Update SPARK_DIST_CLASSPATH accordingly on all nodes of the cluster, along with the edge node.
SPARK_DIST_CLASSPATH from one of the cluster nodes
export SPARK_DIST_CLASSPATH=/home/hadoop/spark/jars.all/hive-jdbc-2.3.7.jar:/home/hadoop/spark/jars.all/spark-hive_2.11-2.4.7.jar:/home/hadoop/spark/jars.all/libthrift-0.9.3.jar:/home/hadoop/spark/jars.all/hive-metastore-2.3.7.jar:/home/hadoop/spark/jars.all/hive-service-2.3.7.jar:/home/hadoop/spark/jars.all/hive-exec-2.3.7.jar:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/share/hadoop/yarn:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/share/hadoop/yarn/*:/home/hadoop/hadoop/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop/share/hadoop/mapreduce/*:/home/hadoop/hadoop/contrib/capacity-scheduler/*.jar:/home/hadoop/spark/jars.all/apacheds-kerberos-codec-2.0.0-M15.jar:/home/hadoop/spark/jars.all/commons-logging-1.1.3.jar:/home/hadoop/spark/jars.all/guava-14.0.1.jar:/home/hadoop/spark/jars.all/hadoop-auth-2.7.3.jar:/home/hadoop/spark/jars.all/hadoop-common-2.7.3.jar:/home/hadoop/spark/jars.all/htrace-core-3.1.0-incubating.jar:/home/hadoop/spark/jars.all/httpclient-4.5.6.jar:/home/hadoop/spark/jars.all/httpcore-4.4.10.jar:/home/hadoop/spark/jars.all/libfb303-0.9.3.jar
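Keeping this long export line identical across every node is error-prone. As a minimal sketch, a small helper script can generate it from the jar list above (the paths are the ones used in this setup; adjust them for your environment, and extend the Hadoop entries as in the full export line):

```python
import os

# Spark jars folder used throughout this setup.
SPARK_JARS_DIR = "/home/hadoop/spark/jars.all"

# Hive JDBC jars from the list above; the Kerberos/common jars
# (hadoop-auth, httpclient, libfb303, ...) would be appended the same way.
JARS = [
    "hive-jdbc-2.3.7.jar",
    "spark-hive_2.11-2.4.7.jar",
    "libthrift-0.9.3.jar",
    "hive-metastore-2.3.7.jar",
    "hive-service-2.3.7.jar",
    "hive-exec-2.3.7.jar",
]

def build_classpath(jars_dir, jars, hadoop_entries):
    """Join absolute jar paths and Hadoop classpath entries with ':'."""
    parts = [os.path.join(jars_dir, j) for j in jars] + list(hadoop_entries)
    return ":".join(parts)

if __name__ == "__main__":
    hadoop_entries = [
        "/home/hadoop/hadoop/etc/hadoop",
        "/home/hadoop/hadoop/share/hadoop/common/lib/*",
        "/home/hadoop/hadoop/share/hadoop/common/*",
    ]
    print("export SPARK_DIST_CLASSPATH=" +
          build_classpath(SPARK_JARS_DIR, JARS, hadoop_entries))
```

Running the script prints an export line that can be pasted into the Spark environment file on each node.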
Restart the Hadoop cluster.
Login to Resource Manager Node
$ ssh root@cornell1.bluemountains.com
root@cornell1.bluemountains.com's password: <root password>
Check if anything is even running at all.
[root@cornell1 ~]# jps
3997 Jps
Stop NameNode — run as the Hadoop user
[root@cornell1 ~]# su - hadoop
[hadoop@cornell1 ~]$ stop-dfs.sh
...
[hadoop@cornell1 ~]$ exit
Stop Data Nodes — run as root/secured user
[root@cornell1 ~]# stop-secure-dns.sh
...
Stop YARN — run as root/secured user
[root@cornell1 ~]# stop-yarn.sh
...
Start NameNode — run as Hadoop user
[root@cornell1 ~]# su - hadoop
[hadoop@cornell1 ~]$ start-dfs.sh
...
Start secured data nodes — run as root/secured user
[root@cornell1 ~]# start-secure-dns.sh
...
Start YARN — run as root/secured user
[root@cornell1 ~]# start-yarn.sh
...
Sample Spark job communicating with Kerberos Hive
Please note the JDBC URL below, with the auth parameter set to Kerberos, kerberosAuthType set to fromSubject, and the Hive service principal. These attributes are what is needed for the job to authenticate in a multi-node cluster setup.
jdbc:hive2://florida1.bluemountains.com:10000/testing_data;principal=hive/florida1.bluemountains.com@HADOOPCLUSTER.LOCAL;auth=Kerberos;kerberosAuthType=fromSubject
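Putting it together, a minimal PySpark sketch using this URL might look like the following. Only the JDBC URL attributes come from the setup above; the table name world is a hypothetical placeholder, and build_jdbc_url is an illustrative helper:

```python
def build_jdbc_url(host, port, database, principal):
    """Assemble the Kerberized Hive JDBC URL described above.

    auth=Kerberos tells the Hive driver to authenticate via Kerberos, and
    kerberosAuthType=fromSubject makes it pick up the ticket from the
    current JAAS subject, so executors on other nodes can authenticate too.
    """
    return (f"jdbc:hive2://{host}:{port}/{database}"
            f";principal={principal};auth=Kerberos;kerberosAuthType=fromSubject")

if __name__ == "__main__":
    # pyspark is available on the cluster nodes where this job runs.
    from pyspark.sql import SparkSession

    url = build_jdbc_url("florida1.bluemountains.com", 10000, "testing_data",
                         "hive/florida1.bluemountains.com@HADOOPCLUSTER.LOCAL")
    spark = SparkSession.builder.appName("khive-jdbc").getOrCreate()

    # 'world' is a hypothetical table in the testing_data database.
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("driver", "org.apache.hive.jdbc.HiveDriver")
          .option("dbtable", "world")
          .load())

    print("Printing the dataframe schema...")
    df.printSchema()
    print("Printing the dataframe columns...")
    print(df.columns)
    df.show()
    print("Printing the length of dataframe...")
    print(df.count())
    print("Stopping the sparksession...")
    spark.stop()
```

The print statements mirror the sample job output shown further below.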
cURL to submit the above Spark job to the Livy Server
[hadoop@caltech1 ~]$ curl --negotiate -u : -X POST -d '{"file":"hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py", "conf": {"spark.yarn.principal": "yarn/cornell1.bluemountains.com@HADOOPCLUSTER.LOCAL", "spark.yarn.keytab": "/home/hadoop/hadoop/etc/hadoop/yarn.keytab"} }' -H "Content-Type: application/json" "http://caltech1.bluemountains.com:8999/batches"
{"id":16,"name":null,"owner":"hdfs","proxyUser":null,"state":"starting","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: ","\nYARN Diagnostics: "]}
- The URL hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py is where the above Spark job is located in HDFS.
- spark.yarn.principal and spark.yarn.keytab are the YARN Kerberos principal and keytab file.
- The URL http://caltech1.bluemountains.com:8999/batches is the Livy batches API endpoint to which we are submitting the Spark job request.
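The same submission can be scripted. The sketch below only assembles the JSON body used in the curl command above; the POST itself needs SPNEGO/Kerberos negotiation (curl's --negotiate), which in Python would typically come from an extra package such as requests-kerberos, so it is left as a comment:

```python
import json

def livy_batch_payload(file_url, principal, keytab):
    """Build the JSON body for Livy's POST /batches, as in the curl above."""
    return {
        "file": file_url,
        "conf": {
            "spark.yarn.principal": principal,
            "spark.yarn.keytab": keytab,
        },
    }

if __name__ == "__main__":
    body = livy_batch_payload(
        "hdfs://cornell1.bluemountains.com:9000/testing_data/spark.khive.jdbc.py",
        "yarn/cornell1.bluemountains.com@HADOOPCLUSTER.LOCAL",
        "/home/hadoop/hadoop/etc/hadoop/yarn.keytab",
    )
    print(json.dumps(body))
    # With requests and requests-kerberos installed, the submission would be:
    # requests.post("http://caltech1.bluemountains.com:8999/batches",
    #               json=body, auth=HTTPKerberosAuth())
```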
Sample Job response
[root@cornell1 container_1612880548752_0008_01_000001]# tail -n 100 -f stdout
Printing the dataframe schema...
root
 |-- spark_gen_subq_0.country: string (nullable = true)
Printing the dataframe columns...
['spark_gen_subq_0.country']
+------------------------+
|spark_gen_subq_0.country|
+------------------------+
|    spark_gen_subq_0....|
|    spark_gen_subq_0....|
...
+------------------------+
only showing top 20 rows
Printing the length of dataframe...
717492
Stopping the sparksession...
Conclusion
As part of this story we saw that JDBC is a popular data access technology for application development, and we listed the steps needed to communicate with a Kerberos-protected Hive from a Spark job over JDBC.