Working with different versions of Apache Hive with Spark service in IBM Cloud Pak for Data

Rachit Arora
IBM Data Science in Practice
Jan 25, 2021 · 4 min read

Apache Hive is a very popular data warehouse software. It is widely used for reading, writing, and managing large datasets that reside in distributed storage, such as HDFS or an S3 bucket, using SQL. Structure can be projected onto data already in storage.

Typically you will have multiple use cases where you will need to connect to Apache Hive from the Spark service running in IBM Cloud Pak for Data.

You may be running a different version of Apache Hive, and the client libraries required to connect to it may not be present on the Spark runtimes running in IBM Cloud Pak for Data. The purpose of this story is to share the different ways in which you can add the client libraries required to connect to different versions of Hive.

You can either add the Apache Hive client jars inline, i.e. download the jars in your Spark notebook and use them there, or persist the jars permanently so that you do not have to download them in each notebook or Spark application run.

In this story we will use the example of connecting to Apache Hive version 2.3.6 through the JDBC connector. For this use case you will need the following client jars: hive-jdbc, hive-service, and hive-exec (version 2.3.6).

Adding Apache Hive client jars for permanent persistence

Using the dbdrivers folder

Whether you are using a Spark notebook or a Spark application, you can upload your jars to the dbdrivers folder to connect to Apache Hive. Here are the steps:

1. Download the Hive client jars and copy them to the CPD bastion host.

wget https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.3.6/hive-jdbc-2.3.6.jar

wget https://repo1.maven.org/maven2/org/apache/hive/hive-service/2.3.6/hive-service-2.3.6.jar

wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.6/hive-exec-2.3.6.jar

2. List the ibm-nginx pods.

$ oc get pods | grep ibm-nginx

Example output:

ibm-nginx-78b5fc89c7-dnv9t 1/1 Running 0 25d
ibm-nginx-78b5fc89c7-ftfs7 1/1 Running 0 25d

3. Choose any one ibm-nginx pod, and copy the Hive jars to the /user-home/_global_/dbdrivers path of the pod.

$ oc cp hive-exec-2.3.6.jar ibm-nginx-78b5fc89c7-dnv9t:/user-home/_global_/dbdrivers/hive-exec-2.3.6.jar

$ oc cp hive-service-2.3.6.jar ibm-nginx-78b5fc89c7-dnv9t:/user-home/_global_/dbdrivers/hive-service-2.3.6.jar

$ oc cp hive-jdbc-2.3.6.jar ibm-nginx-78b5fc89c7-dnv9t:/user-home/_global_/dbdrivers/hive-jdbc-2.3.6.jar

Verify that the jars were copied to the path correctly:

$ oc exec ibm-nginx-78b5fc89c7-dnv9t -- ls -lrt /user-home/_global_/dbdrivers

Example output:

total 66104
drwxr-xr-x. 3 1000321000 dsx        21 Dec 17 19:17 jdbc
-rw-r--r--. 1 1000321000 root 34210528 Jan 14 09:56 hive-exec-2.3.6.jar
-rw-r--r--. 1 1000321000 root   526169 Jan 14 09:56 hive-service-2.3.6.jar
-rw-r--r--. 1 1000321000 root   115878 Jan 14 09:56 hive-jdbc-2.3.6.jar

Now you can connect to Hive from your Spark notebook or Spark application as shown below:

readJDBC = spark.read.format("jdbc") \
    .option("url", "jdbc:hive2://<hive_server2>:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("fetchsize", "100") \
    .option("query", "select bar, foo from pokes") \
    .load()
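If your HiveServer2 requires authentication, you can additionally pass credentials through the standard Spark JDBC options and then preview the result. A minimal sketch; <hive_user> and <hive_password> are placeholders, not values from this article:

readJDBC = spark.read.format("jdbc") \
    .option("url", "jdbc:hive2://<hive_server2>:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("user", "<hive_user>") \
    .option("password", "<hive_password>") \
    .option("query", "select bar, foo from pokes") \
    .load()

# preview a few rows to confirm the connection works
readJDBC.show(5)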

Using a volume service instance (applicable only to Spark applications, not Spark notebooks)

  • Generate a token so that you can upload the client jars to the storage volume:

curl -i -k -X GET https://<CloudPakforData_URL>/v1/preauth/validateAuth -H 'password: <password>' -H 'username: <user>'
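One way to capture the token directly into the TOKEN variable used below is shown in this minimal sketch, assuming jq is available on the host (the -i flag is dropped so that only the JSON body is piped):

TOKEN=$(curl -s -k -X GET https://<CloudPakforData_URL>/v1/preauth/validateAuth -H 'password: <password>' -H 'username: <user>' | jq -r .accessToken)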

  • Save the accessToken returned in the cURL response to a variable called TOKEN, as sketched above, and substitute it for <ACCESS_TOKEN> in the commands that follow.
  • Create a new volume named appvol by running the following cURL command:

curl -vk -iv -X POST "https://<CloudPakforData_URL>/zen-data/v2/serviceInstance" -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'Content-Type: application/json' -d '{"createArguments": {"metadata": {"storageClass": "nfs-client", "storageSize": "2Gi"}, "resources": {}, "serviceInstanceDescription": "volume 1"}, "preExistingOwner": false, "serviceInstanceDisplayName": "appvol", "serviceInstanceType": "volumes", "serviceInstanceVersion": "-", "transientFields": {}}'

  • Start the file server on the volume appvol, to which you will upload the jars, by using the following cURL command:

curl -v -ik -X POST 'https://<CloudPakforData_URL>/zen-data/v1/volumes/volume_services/appvol' -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{}' -H 'Content-Type: application/json' -H 'cache-control: no-cache'

  • Upload the Apache Hive client jars from your local workstation to the volume appvol under the packages location:

curl -v -ik -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/packages%2Fhive-jdbc-2.3.6.jar' -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/packages/hive-jdbc-2.3.6.jar'
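If you prefer, you can repeat the same upload for all three client jars in a single loop. A minimal sketch, assuming all the jars sit in /root/packages on your workstation:

for jar in hive-jdbc-2.3.6.jar hive-service-2.3.6.jar hive-exec-2.3.6.jar; do
  curl -k -X PUT "https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/packages%2F${jar}" -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'content-type: multipart/form-data' -F "upFile=@/root/packages/${jar}"
done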

  • In the sample code, /root/packages/hive-jdbc-2.3.6.jar is the location of the jar on your local workstation that you want to upload. Repeat this step (or use a loop like the one above) to upload the other jars.
  • Now you can run your Spark application with a payload like the one below and connect to Apache Hive from your application. Include the path /myapp/packages in the Spark configuration so that the jars are preloaded into the Spark runtime. If you are looking for more options for running your Spark application, please check this link.
{ "engine": { 
"type": "spark",
"conf": { "spark.driver.extraClassPath": "/myapp/packages" ,"spark.executor.extraClassPath": "/myapp/packages"
},
"volumes": [{ "volume_name": "appvol", "source_path": "", "mount_path": "/myapp" }]},
"application_arguments": ["<your_application_arguments>"],
"application_jar": "/myapp/customApps/example.py", "main_class": "org.apache.spark.deploy.SparkSubmit"
}
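You can then submit this payload to the Spark jobs API of your Spark service instance. A minimal sketch, assuming the JSON above is saved as payload.json; <spark_jobs_endpoint> is a placeholder for the jobs endpoint of your instance, since the exact URL depends on your Cloud Pak for Data release (see the link above for details):

curl -k -X POST "<spark_jobs_endpoint>" -H "Authorization: Bearer <ACCESS_TOKEN>" -H 'Content-Type: application/json' -d @payload.json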

Adding Apache Hive client jars with temporary persistence for a Spark notebook

If you just want to connect to Apache Hive from your notebook and do not want to download the jars and keep them persisted, you can download them in the notebook itself and use them within your notebook.
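For example, you can download the jars in a notebook cell straight into the folder that Spark scans for user libraries. A minimal sketch, assuming wget is available in the notebook runtime:

!wget -P /home/spark/shared/user-libs/spark2 https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.3.6/hive-jdbc-2.3.6.jar
!wget -P /home/spark/shared/user-libs/spark2 https://repo1.maven.org/maven2/org/apache/hive/hive-service/2.3.6/hive-service-2.3.6.jar
!wget -P /home/spark/shared/user-libs/spark2 https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.6/hive-exec-2.3.6.jar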

Once you have added the jars to the /home/spark/shared/user-libs/spark2 folder, restart the kernel for Spark to load them.

Summary

The purpose of this story was to present the various options you have to add the required jars to Spark runtimes in order to connect to different versions of Apache Hive. If you have any feedback or questions, please reach out to me via LinkedIn.

This blog was written in collaboration with Sunil Ganatra.
