Enable Python/Scala Notebooks in Hue

Brandon Kvarda
5 min read · Jan 25, 2016


As I scavenged the internet for guides on enabling Spark Notebooks in Hue, I realized that there were many great sources but none that I would consider to be complete. As such, I decided to document my experience so that others could complete the task in a much shorter period of time.

Prerequisites (things assumed to be installed and deployed already). I used Cloudera Manager with CDH 5.5.1:

  • Hue
  • Spark
  • YARN
  • HDFS w/ HttpFS
  • Cloudera Manager 5 & CDH 5.x

Cloudera Manager/Hue Configuration

If you are using Cloudera Manager, go to Hue → Configuration → Advanced → Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini

Enter the following:

[desktop]
app_blacklist=zookeeper,hbase,search,indexer,sqoop,security
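
If your Livy server will not be running on the same host as Hue (localhost:8998 is the default it looks for), you may also need to point Hue at it. In Hue builds from this era the relevant options lived in a [spark] section of hue.ini, so they can go into the same safety valve; the exact option names can vary between versions, so verify them against your hue.ini:

[spark]
livy_server_host=your-livy-host
livy_server_port=8998
livy_server_session_kind=yarn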

Be sure to point Hue at HttpFS instead of the HDFS NameNode by changing “webhdfs_url” to your HttpFS instance. Go to Hue → Configuration and search for “webhdfs_url” to do this.
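
For reference, HttpFS listens on port 14000 by default (versus 50070 for the NameNode’s WebHDFS), so the value should end up looking something like this (hypothetical hostname):

webhdfs_url=http://httpfs-host.example.com:14000/webhdfs/v1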

For user impersonation to work, you’ll need to add this to core-site.xml (the advanced snippet/safety valve under HDFS → Configuration). IMPORTANT: if this is not set properly, impersonation will not work and you will get errors in Hue when you go to create a notebook. The user in the property names below (“root” in my case) should be the user that the Livy server runs as.

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
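
As an example, if you were to run Livy as a dedicated “livy” user instead of root (a hypothetical setup, not the one I used), the property names would change accordingly:

<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>

Either way, redeploy the client configuration and restart the affected HDFS services after changing this so the new proxyuser settings take effect.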

Installation of Maven (Only do this if you don’t already have mvn)

Download Maven, create a symlink in /usr/bin, and create a script to set the environment variables:

wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar -zxvf apache-maven-3.3.9-bin.tar.gz
ln -s /root/apache-maven-3.3.9/bin/mvn /usr/bin/mvn

Here is the maven.sh I created:

#!/bin/bash
MAVEN_HOME=/root/apache-maven-3.3.9 # this is where I untarred Maven
PATH=$MAVEN_HOME/bin:$PATH
export PATH MAVEN_HOME
export CLASSPATH=.

Make the script executable and copy it to /etc/profile.d:

chmod +x maven.sh
cp maven.sh /etc/profile.d/

Load the variables into your current shell (new sessions will pick them up automatically from /etc/profile.d):

source /etc/profile.d/maven.sh

If running “mvn -version” gives an error complaining about incompatibilities, the culprit may be your version of Java or your JAVA_HOME setting. I had to do this:

JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
java -version
java version "1.7.0_67"
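
If you want JAVA_HOME to stick across new shells as well, the same /etc/profile.d trick used for maven.sh works here too. A minimal sketch, assuming the Cloudera-bundled JDK path above:

#!/bin/bash
# /etc/profile.d/java.sh - path assumes the JDK that ships with CDH
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH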

Building the latest Livy server

Clone the Hue repo:

git clone https://github.com/cloudera/hue.git

Build using Maven:

cd hue/apps/spark/java
mvn -DskipTests clean package

You should see some resulting output that looks like this:

[INFO] Reactor Summary:
[INFO]
[INFO] livy-main .......................................... SUCCESS [ 17.672 s]
[INFO] livy-core_2.10 ..................................... SUCCESS [01:03 min]
[INFO] livy-repl_2.10 ..................................... SUCCESS [04:08 min]
[INFO] livy-yarn_2.10 ..................................... SUCCESS [ 17.347 s]
[INFO] livy-spark_2.10 .................................... SUCCESS [ 41.058 s]
[INFO] livy-server_2.10 ................................... SUCCESS [ 41.944 s]
[INFO] livy-assembly_2.10 ................................. SUCCESS [ 31.118 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:41 min
[INFO] Finished at: 2016-01-25T12:21:34-08:00
[INFO] Final Memory: 48M/602M
[INFO] ------------------------------------------------------------------------
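
The build leaves the livy-server launcher under hue/apps/spark/java (the directory the build was run from). The YARN commands below reference $LIVY_HOME, so it is convenient to point it there; adjust the path to wherever you cloned the repo (I am assuming /root here, to match the Maven install above):

export LIVY_HOME=/root/hue/apps/spark/java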

Configuring Livy

If you are going to run in local mode, do this:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/usr/lib/spark
./bin/livy-server

If you are going to run using YARN, do this:

export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'
env \
LIVY_SERVER_JAVA_OPTS="-Dlivy.server.session.factory=yarn" \
CLASSPATH=`hadoop classpath` \
$LIVY_HOME/bin/livy-server

You could also run it in the background using something like screen or nohup:

nohup env LIVY_SERVER_JAVA_OPTS="-Dlivy.server.session.factory=yarn" CLASSPATH=`hadoop classpath` $LIVY_HOME/bin/livy-server &
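
Whichever way you start it, a quick sanity check is to hit Livy’s REST API (8998 is the default port). A freshly started server should return something like an empty session list:

curl http://localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}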

Once everything is set up properly, you should see all of the configuration checks passing when you enter Hue.

If you go to Query Editors → Spark → Notebooks and add a PySpark notebook, you will see a little box with a spinning icon while the session starts up.

Eventually that spinner will disappear, and you will be good to go. If you then go to Cloudera Manager → YARN → Applications, you will see the notebook’s session running as a YARN application.

Notice that the application’s user is “admin”, which is the user I logged into Hue with, even though the Livy server is running as “root” and the Hue service is running as “hue”. If I log into Hue as a different user and run the same workflow, YARN shows the application running as that user instead; the second time around I logged in as “bkvarda”.

Finishing Up

Hopefully this was enough to get you started. If you see anything that is incorrect, please let me know! And if this helped you, it would be great to hear that as well.

