Soaring with Zeppelin
Three Easy Steps to Setting up an Interactive Environment for Big Data Exploration and Analysis
Our team at Accenture Technology Labs recently had the good fortune to land an enormous amount of data from one of our clients, with a very limited timeframe in which to work through it. Doing so was greatly accelerated by a great big bare-metal cluster we had lying around and a host of containers and applications that we had pre-canned and ready to go, including a containerised Spark Notebooks environment that became indispensable for prototyping jobs (which we then industrialised into jar files and ran on the cluster using spark-submit).
The exercise left me wondering about how we might be able to make the process even more seamless in future, and while reviewing my Pocket list I rediscovered Apache Zeppelin and got very excited about the possibilities.
Step 1: Cluster-in-a-box
Although Spark works quite well running locally on a single machine, to simulate more production-like environments I have been working on a set of Docker images and shell scripts which produce a completely containerised Spark cluster that can run on a single machine.
Cluster-in-a-box comprises a nine-node Spark cluster (one node of which also runs the HDFS NameNode and YARN ResourceManager), a Cassandra database, a Kafka message broker and a SkyDNS-powered service discovery mechanism. You can get started with cluster-in-a-box by cloning my Git repository, or downloading a release.
git clone -b no-zeppelin https://github.com/chrishawkins/cluster-in-a-box.git
Note that, in this instance, we are checking out the repository with the tag no-zeppelin. This is because cluster-in-a-box already contains Zeppelin, and cheating is bad.
Assuming you have Docker running, you can start the cluster by using the start-all.sh script. This will pull all the required images from my Docker Hub and boot them up.
./start-all.sh
If you don’t have a Docker environment available to you, I highly recommend CoreOS, although you can do this in most Docker environments (one exception is Boot2Docker, which unfortunately does not play nice with the way the script configures SkyDock).
If you would like to give the cluster a quick test to make sure the nodes are finding each other, you can docker exec yourself a Spark shell for that purpose:
docker exec -ti spark-1 bin/spark-shell --master yarn-client
And then once Spark is ready:
scala> sc.parallelize(Array(1, 2, 3, 4)).sum
res0: Double = 10.0
Step 2: Build the Zeppelin Docker Image
There are two ways to build an image in Docker. The first is to start from a base image with docker run, make any desired changes using the container’s shell, and commit them back with docker commit when desired. I don’t like that method (I am not alone) and prefer the cleanness of Dockerfiles. Dockerfiles declaratively specify exactly which commands to execute to get an image to its desired state. This also usually leads to leaner Docker images without too many redundant layers.
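To make the contrast concrete, here is a sketch of the two workflows (the image and file names are illustrative only, not taken from the repository):

```shell
# The imperative approach (not used here): start a container, change it by
# hand, then snapshot the result -- hard to reproduce later.
#   docker run -ti ubuntu:14.04 /bin/bash   # ...apt-get install things...
#   docker commit <container-id> my/image

# The declarative approach: record the same steps in a Dockerfile so the
# image can be rebuilt from scratch at any time. (Illustrative file only.)
cat > Dockerfile.example <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y wget curl
EOF

# Each RUN/ADD/COPY instruction becomes one image layer; combining commands
# into a single RUN keeps the image lean. Count the layers this file adds:
grep -cE '^(RUN|ADD|COPY)' Dockerfile.example
```

Because the two `apt-get` commands share a single `RUN`, the count printed is 1 rather than 2, which is exactly the layer-thrift the declarative style encourages.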
So without further ado, I present the start of the Dockerfile for our Zeppelin instance:
FROM chrishawkins/spark-slave
MAINTAINER Chris Hawkins <chris.hawkins@accenture.com>

# Updates & Install Prerequisites
RUN apt-get update && apt-get upgrade -y && apt-get install -y wget curl npm git

WORKDIR /tmp/

# Maven 3.3 Install
RUN wget http://download.nextag.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz && \
    tar -xzvf apache-maven-3.3.3-bin.tar.gz -C /usr/local && \
    ln -s /usr/local/apache-maven-3.3.3 /usr/local/apache-maven && \
    ln -s /usr/local/apache-maven/bin/mvn /usr/local/bin/mvn && \
    echo "export M2_HOME=/usr/local/apache-maven" >> /etc/profile.d/apache-maven.sh
We start off by using the spark-slave image that already contains Hadoop, Spark and the configuration necessary for everything to find each other. We then run some system updates and install wget, curl, npm and git. We use wget to pull down Maven 3.3 (which is required by Zeppelin but isn’t yet available in the Ubuntu software repositories).
RUN mkdir zeppelin
RUN git clone --branch branch-0.5 https://github.com/apache/incubator-zeppelin.git zeppelin

WORKDIR /tmp/zeppelin
RUN mvn clean package -Dspark.version=1.3.0 -Pspark-1.3 -Dhadoop.version=2.6.0 -Phadoop-2.4 -Pyarn -DskipTests
Here we make a directory for Zeppelin under /tmp and grab the 0.5 branch from GitHub. We then use Maven to build it, passing in the specific versions relevant for our cluster-in-a-box environment.
Zeppelin runs on port 8080 with a WebSocket connection to port 8081, so in the Dockerfile we expose those ports and add some additional configuration files.
EXPOSE 8080 8081

ADD zeppelin-env.sh conf/zeppelin-env.sh
ADD zeppelin-site.xml conf/zeppelin-site.xml
ADD startup.sh startup.sh

CMD ./startup.sh
Finally we set the startup.sh script to run when the container is launched. The configuration files can be found below:
zeppelin-env.sh
export SPARK_HOME=/usr/local/spark
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export MASTER=yarn-client
zeppelin-env.sh provides some environment variables that tell Zeppelin where to find Spark and Hadoop. We also set the default Spark master to YARN in client mode. You can find a template here.
zeppelin-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>zeppelin.server.addr</name>
    <value>0.0.0.0</value>
    <description>Server address</description>
  </property>
</configuration>
In zeppelin-site.xml we set the server bind address to 0.0.0.0 so that we can map our internal ports to external ports when we run the container without having to worry about hostname mismatches. There is a template here if you want to explore your other options.
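The startup.sh script itself isn’t reproduced in this post; a minimal version, assuming Zeppelin’s stock launcher script, might look like this (the real script lives in the cluster-in-a-box repository):

```shell
#!/bin/bash
# Hypothetical minimal startup.sh -- an assumption, not the repository's
# actual script. Running Zeppelin in the foreground keeps the container
# alive; a daemonised process would cause the container to exit immediately.
exec bin/zeppelin.sh
```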
Finally we’re ready to build our image and run our container:
./build.sh
./run.sh
A Word on Running Zeppelin
As I mentioned briefly earlier in the piece, cluster-in-a-box uses SkyDNS for service discovery. Therefore, you’ll notice that run.sh does a little bit more than just running the container. First it locates the IP address of the Docker host, which SkyDNS and SkyDock are bound to, then it runs the container passing in the DNS server on the way. For more information on the ins-and-outs of using SkyDock for service discovery check out its README here.
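In rough terms, run.sh does something like the following sketch. The container name, image name and the 9124 WebSocket mapping are assumptions for illustration; the exact commands are in the repository:

```shell
# 1. Find the address of the Docker bridge, which SkyDNS/SkyDock listen on.
#    172.17.42.1 was the classic default docker0 address at the time.
HOST_IP=$(ip -4 addr show docker0 2>/dev/null \
  | awk '/inet /{sub(/\/.*/, "", $2); print $2; exit}')
HOST_IP=${HOST_IP:-172.17.42.1}

# 2. Launch the container, pointing its resolver at SkyDNS and mapping the
#    web UI and WebSocket ports to the host. Printed rather than executed
#    here so the sketch is safe to run anywhere:
echo docker run -d --name zeppelin --dns "$HOST_IP" \
  -p 9123:8080 -p 9124:8081 chrishawkins/zeppelin
```

Passing `--dns` means the Zeppelin container resolves the other cluster nodes through SkyDNS, just as the Spark containers do.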
Step 3: Try Zeppelin
Once Zeppelin is running you can navigate to port 9123 on your host machine to check it out.
To test things out, click Create new note, then open the newly created note. You can check that everything is working by writing a simple Spark program (the parallelize-and-sum snippet from Step 1 works well here too) and pressing Shift+Enter to run it. Note that the first time Zeppelin runs code it may take a while as it creates its Spark context.
That’s all very well, but how about something slightly more complex? You can see an example of such a program by checking out Zeppelin’s built-in tutorial.
The tutorial showcases much more of Zeppelin’s functionality, including the ability to visualise the results of SparkSQL queries.
Once you’re comfortable with the environment you can start to do a lot more. The notebook environment is great for exploring some of the peripheries of Spark, such as MLlib or GraphX if you feel inclined, and I’ll be following this post up with some more concrete examples. Feel free to extend the cluster-in-a-box solution as well — there are definitely improvements to be made.
Appendix
More Information:
For more information check out the repository on GitHub.
Complete Dockerfile for Apache Zeppelin:
FROM chrishawkins/spark-slave
MAINTAINER Chris Hawkins <chris.hawkins@accenture.com>

# Updates & Install Prerequisites
RUN apt-get update && apt-get upgrade -y && apt-get install -y wget curl npm git

WORKDIR /tmp/

# Maven 3.3 Install
RUN wget http://download.nextag.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz && \
    tar -xzvf apache-maven-3.3.3-bin.tar.gz -C /usr/local && \
    ln -s /usr/local/apache-maven-3.3.3 /usr/local/apache-maven && \
    ln -s /usr/local/apache-maven/bin/mvn /usr/local/bin/mvn && \
    echo "export M2_HOME=/usr/local/apache-maven" >> /etc/profile.d/apache-maven.sh

RUN mkdir zeppelin
RUN git clone --branch branch-0.5 https://github.com/apache/incubator-zeppelin.git zeppelin

WORKDIR /tmp/zeppelin

RUN mvn clean package -Dspark.version=1.3.0 -Pspark-1.3 -Dhadoop.version=2.6.0 -Phadoop-2.4 -Pyarn -DskipTests

EXPOSE 8080 8081

ADD zeppelin-env.sh conf/zeppelin-env.sh
ADD zeppelin-site.xml conf/zeppelin-site.xml
ADD startup.sh startup.sh

CMD ./startup.sh
Footnote
At Accenture Technology Labs we’ve been using Spark and much of the rest of the stack discussed here to build the future of customer personalisation. We call it the Customer Genome. If you really enjoy working with next generation technology in interesting business contexts you might even want to join us (search for Tech Labs).
One of my colleagues in Labs, Hyon Chu, is a Data Scientist and has also posted about his use of Docker to create a Python-based data exploration environment. Give it a read!