Connect to a Remote Kerberized Hive from a Local Jupyter Notebook to Run SQL Queries

Ravi Chamarthy
IBM Data Science in Practice
4 min read · Dec 5, 2020

Let’s keep it simple. Your data is stored in a Kerberized Hive that is part of your Kerberized Hadoop cluster, and from your own system you want to connect to this Hive through a Jupyter notebook to, let’s say, run some SQL queries.

If it is a regular Hive, it is pretty straightforward. If it is a Kerberized Hive, it is a bit tricky; hence this post.

Instead of assuming that your system already has the required packages installed, I have captured all the steps starting from a freshly brewed RHEL 8.2 VM, with the end goal of connecting to a Kerberized Hive from a Jupyter notebook. So …

  • I have created a new RHEL 8.2 VM and am running JupyterLab on it.
  • From my MacBook, I am port forwarding to the port where JupyterLab is running, so that I can use JupyterLab from my MacBook.

(Having said that, you could run the steps below on your own system as well.)

Here we go!

Log in to the system where you want to run JupyterLab, and make sure Python, the Kerberos client, and Java are installed as listed below.

[root@visconti1 ~]# yum module install python36
[root@visconti1 ~]# yum install python3-devel
[root@visconti1 ~]# yum install krb5-workstation krb5-libs
[root@visconti1 ~]# yum install krb5-devel
[root@visconti1 ~]# yum -y install java-1.8.0-openjdk
[root@visconti1 ~]# yum install java-1.8.0-openjdk-devel
[root@visconti1 ~]# yum install npm

You could run JupyterLab as root, but to be more realistic, where one would run it as a specific user on their system, I am creating a user called “hadoop” (please feel free to name the user accordingly!).

[root@visconti1 ~]# useradd hadoop
[root@visconti1 ~]# passwd hadoop
Changing password for user hadoop.
New password: NewPassw0rd1024
Retype new password: NewPassw0rd1024
passwd: all authentication tokens updated successfully.

Download Spark (pre-built for Hadoop 2.7), and then untar it.

[root@visconti1 ~]# su - hadoop
[hadoop@visconti1 ~]$ pwd
/home/hadoop
[hadoop@visconti1 ~]$ wget https://mirrors.estointernet.in/apache/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz

[hadoop@visconti1 ~]$ tar xvzf spark-2.4.7-bin-hadoop2.7.tgz

[hadoop@visconti1 ~]$ mv spark-2.4.7-bin-hadoop2.7 spark

Set the environment variables pointing to Python, the Spark home, Java, and the Kerberos ticket cache location. I am pasting the content from my VM’s .bashrc. After editing, make sure you “source” the .bashrc file.

[hadoop@visconti1 ~]$ cat ~/.bashrc

alias python=python3
export PYSPARK_PYTHON=/usr/bin/python3
export KRB5CCNAME=/tmp/krb5cc1
export JAVA_HOME=/usr/lib/jvm/java
export SPARK_HOME=/home/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PYSPARK_PIN_THREAD=true
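
If you want to double-check the setup after sourcing the file, here is a quick optional check from Python; a minimal sketch, using the variable names exported above:

# Confirm the variables from ~/.bashrc are visible to Python.
import os

for var in ("JAVA_HOME", "SPARK_HOME", "KRB5CCNAME", "PYSPARK_PYTHON", "PYTHONPATH"):
    print(var, "=", os.environ.get(var))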

Install the required Python libraries:

[root@visconti1 ~]# python3 -m pip install --upgrade pip
[root@visconti1 ~]# python3 -m pip install wheel
[root@visconti1 ~]# python3 -m pip install pandas
[root@visconti1 ~]# python3 -m pip install requests
[root@visconti1 ~]# python3 -m pip install requests-kerberos
[root@visconti1 ~]# python3 -m pip install pyspark
[root@visconti1 ~]# python3 -m pip install py4j
[root@visconti1 ~]# python3 -m pip install nodejs
[root@visconti1 ~]# python3 -m pip install jupyter
[root@visconti1 ~]# python3 -m pip install jupyterlab
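
As an optional sanity check, you can confirm that the key libraries import cleanly. A small sketch; note that with PYTHONPATH pointing at SPARK_HOME, the bundled PySpark (2.4.7) should shadow any pip-installed copy, and the printed version tells you which one actually wins:

# Confirm the key libraries import, and check which PySpark is picked up.
import pandas
import pyspark

print("pandas:", pandas.__version__)
print("pyspark:", pyspark.__version__)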

Now we need to copy the krb5.conf file and the necessary Kerberos keytab file from the Kerberos admin server to this VM.

Before that, make sure you add the cluster’s Resource Manager host name and the Hive host name to the /etc/hosts file. In the snippet below, lamy1 is the Resource Manager host name, sheaffer1 is the Hive host name, and visconti1 is the current VM from which JupyterLab is running.

[root@visconti1 ~]# cat /etc/hosts

10.11.xx.xx2 visconti1.blue.mountains.com visconti1
10.11.xx.xx8 lamy1.blue.mountains.com lamy1
10.11.xx.xx9 sheaffer1.blue.mountains.com sheaffer1

Create a keytabs folder. The keytab file from the Kerberos admin server (in our case, lamy1) will be copied to this keytabs folder.

[root@visconti1 ~]# su - hadoop
[hadoop@visconti1 ~]$ mkdir keytabs
[hadoop@visconti1 ~]$ exit
logout

Log in to the Kerberos admin server VM and copy the /etc/krb5.conf file and the yarn.keytab file to the current VM where JupyterLab is running.

[root@visconti1 ~]# ssh root@lamy1
root@lamy1's password: <password>
[root@lamy1 ~]# scp /etc/krb5.conf root@visconti1:/etc/krb5.conf
[root@lamy1 ~]# su - hadoop
[hadoop@lamy1 ~]$ scp /home/hadoop/keytabs/yarn.keytab hadoop@visconti1:/home/hadoop/keytabs
hadoop@visconti1's password: <password>
..

Under the spark/conf folder, create a core-site.xml file with the following single property, “hadoop.security.authentication=kerberos”:

[hadoop@visconti1 ~]$ cat spark/conf/core-site.xml

<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
Just run a sample Spark job to make sure Spark is working:

[hadoop@visconti1 ~]$ spark-submit /home/hadoop/spark/examples/src/main/python/pi.py 10
..
Pi is roughly 3.141640
..

Generate a Kerberos ticket before starting Jupyter.

[hadoop@visconti1 ~]$ kinit -kt /home/hadoop/keytabs/yarn.keytab yarn/lamy1.blue.mountains.com@HADOOPCLUSTER.LOCAL
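
As a side note, the same kinit can also be issued from inside a notebook cell, which is handy if the ticket expires mid-session. A minimal sketch using Python’s subprocess module, reusing the keytab path and principal from the command above:

import subprocess

# Re-acquire the Kerberos ticket with the same keytab and principal as above.
subprocess.run(
    ["kinit", "-kt", "/home/hadoop/keytabs/yarn.keytab",
     "yarn/lamy1.blue.mountains.com@HADOOPCLUSTER.LOCAL"],
    check=True,
)

# Print the ticket cache to verify the ticket was granted.
subprocess.run(["klist"], check=True)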

Start Jupyter, and please make a note of the token (as shown in the output below).

[hadoop@visconti1 ~]$ jupyter lab --no-browser --port=8888
[I 10:17:45.783 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.6/site-packages/jupyterlab
[I 10:17:45.784 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 10:17:45.788 LabApp] Serving notebooks from local directory: /home/hadoop
[I 10:17:45.788 LabApp] Jupyter Notebook 6.1.5 is running at:
[I 10:17:45.788 LabApp] http://localhost:8888/?token=e2a595892389453a880b9c0b5e7f102b0206283991e6def5
[I 10:17:45.788 LabApp] or http://127.0.0.1:8888/?token=e2a595892389453a880b9c0b5e7f102b0206283991e6def5
[I 10:17:45.788 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 10:17:45.795 LabApp]
To access the notebook, open this file in a browser:
file:///home/hadoop/.local/share/jupyter/runtime/nbserver-7967-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=e2a595892389453a880b9c0b5e7f102b0206283991e6def5
or http://127.0.0.1:8888/?token=e2a595892389453a880b9c0b5e7f102b0206283991e6def5
[I 10:18:29.621 LabApp] Build is up to date

Now come back to your system and set up port forwarding as below. Here, we open port 9999 on the local machine, which forwards to port 8888 on visconti1.blue.mountains.com (the VM) where Jupyter is running.

Ravis-MacBook-Pro:Downloads ravi$ ssh -N -f -L localhost:9999:localhost:8888 root@visconti1.blue.mountains.com
root@visconti1.blue.mountains.com's password: <password>

Open JupyterLab from your system and run a sample notebook that connects to the Kerberized Hive. The notebook is attached.

http://localhost:9999/lab

And finally, the moment of truth: a notebook that connects to the remote Hive. In the notebook I could create multiple cells, but to keep it simple, I have put all the content in a single cell.
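
In case the attached notebook does not render for you, here is a minimal sketch of what that single cell can look like. The metastore URI, port, service principal, and table name are assumptions (sheaffer1 is the Hive host from /etc/hosts; 9083 is the default Hive metastore port); take the real values from your cluster’s hive-site.xml.

from pyspark.sql import SparkSession

# Assumes a valid Kerberos ticket from the kinit step, KRB5CCNAME set as in
# ~/.bashrc, and hadoop.security.authentication=kerberos in spark/conf/core-site.xml.
spark = (
    SparkSession.builder
    .appName("kerberized-hive-from-jupyter")
    # Hypothetical metastore endpoint: sheaffer1 is the Hive host, 9083 the
    # default metastore port; confirm both for your cluster.
    .config("hive.metastore.uris", "thrift://sheaffer1.blue.mountains.com:9083")
    .config("hive.metastore.sasl.enabled", "true")
    # Hypothetical service principal; take the real one from hive-site.xml.
    .config("hive.metastore.kerberos.principal", "hive/_HOST@HADOOPCLUSTER.LOCAL")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()

# Replace default.employees with a real database.table from your Hive.
df = spark.sql("SELECT * FROM default.employees LIMIT 10")
df.show()

With a valid ticket in the cache, Spark authenticates to the metastore using the kinit credentials; no password appears anywhere in the notebook.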

Conclusion

In this post, we walked through how to run a Jupyter notebook that connects to a remote Kerberized Hive and thereby run SQL queries.

Thank You!
