Soaring with Zeppelin

Chris Hawkins
Apache Zeppelin Stories
7 min read · Sep 7, 2015


Three Easy Steps to Setting up an Interactive Environment for Big Data Exploration and Analysis

Our team at Accenture Technology Labs recently had the good fortune to land an enormous amount of data from one of our clients, with a very limited timeframe in which to work through it. Working through it was greatly accelerated by a great big bare-metal cluster we had lying around and a host of containers and applications we had pre-canned and ready to go, including a containerised Spark Notebooks environment that became indispensable for prototyping jobs (which we then industrialised into jar files and ran on the cluster using spark-submit).

The exercise left me wondering about how we might be able to make the process even more seamless in future, and while reviewing my Pocket list I rediscovered Apache Zeppelin and got very excited about the possibilities.

Step 1: Cluster-in-a-box

Although Spark works quite well running locally on a single machine, to simulate more production-like environments I have been working on a set of Docker images and shell scripts that produce a completely containerised Spark cluster that can run on a single machine.

Cluster-in-a-box comprises a nine-node Spark cluster (one node of which also runs the HDFS NameNode and YARN ResourceManager), a Cassandra database, a Kafka message broker and a SkyDNS-powered service discovery mechanism. You can get started with cluster-in-a-box by cloning my Git repository, or downloading a release.

git clone -b no-zeppelin https://github.com/chrishawkins/cluster-in-a-box.git

Note that, in this instance, we are checking out the repository with the tag no-zeppelin. This is because cluster-in-a-box already contains Zeppelin, and cheating is bad.

Assuming you have Docker running, you can start the cluster using the start-all.sh script. This will pull all the required images from my Docker Hub repositories and boot them up.

./start-all.sh

If you don’t have a Docker environment available to you, I highly recommend CoreOS, although you can do this in most Docker environments (one exception is Boot2Docker, which unfortunately does not play nice with the way the script configures SkyDock).

If you would like to give the cluster a quick test to make sure the nodes are finding each other, you can docker exec yourself a Spark shell for that purpose:

docker exec -ti spark-1 bin/spark-shell --master yarn-client

And then once Spark is ready:

scala> sc.parallelize(Array(1, 2, 3, 4)).sum
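If the cluster is healthy, the shell should come back with res0: Double = 10.0. For a slightly more distributed sanity check, something along these lines also works (a minimal sketch; any small job spread over several partitions will do):

scala> sc.parallelize(1 to 1000, 8).map(_ * 2).reduce(_ + _)
res1: Int = 1001000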

Step 2: Build the Zeppelin Docker Image

There are two ways to build an image in Docker. The first is to start from a base image with docker run, make any desired changes using the container’s shell, and commit them back when desired. I don’t like that method (I am not alone) and prefer the cleanness of Dockerfiles. Dockerfiles declaratively specify exactly which commands to execute to get an image to its desired state. This also usually leads to leaner Docker images without too many redundant layers.

So without further ado, I present the start of the Dockerfile for our Zeppelin instance:

FROM chrishawkins/spark-slave
MAINTAINER Chris Hawkins <chris.hawkins@accenture.com>

# Updates & Install Prerequisites
RUN apt-get update && apt-get upgrade -y && apt-get install -y wget curl npm git

WORKDIR /tmp/

# Maven 3.3 Install
RUN wget http://download.nextag.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz && \
    tar -xzvf apache-maven-3.3.3-bin.tar.gz -C /usr/local && \
    ln -s /usr/local/apache-maven-3.3.3 /usr/local/apache-maven && \
    ln -s /usr/local/apache-maven/bin/mvn /usr/local/bin/mvn && \
    echo "export M2_HOME=/usr/local/apache-maven" >> /etc/profile.d/apache-maven.sh

We start off by using the spark-slave image that already contains Hadoop, Spark and the configuration necessary for everything to find each other. We then run some system updates and install wget, curl, npm and git. We use wget to pull down Maven 3.3, which is required by Zeppelin but isn’t yet available in the Ubuntu software repositories.

RUN mkdir zeppelin
RUN git clone --branch branch-0.5 https://github.com/apache/incubator-zeppelin.git zeppelin
WORKDIR /tmp/zeppelin
RUN mvn clean package -Dspark.version=1.3.0 -Pspark-1.3 -Dhadoop.version=2.6.0 -Phadoop-2.4 -Pyarn -DskipTests

Here we make a directory for Zeppelin under /tmp and grab the 0.5 branch from GitHub. We then use Maven to build it, passing in the specific versions relevant for our cluster-in-a-box environment.

Zeppelin runs on port 8080 with a WebSocket connection to port 8081, so in the Dockerfile we expose those ports and add some additional configuration files.

EXPOSE 8080 8081
ADD zeppelin-env.sh conf/zeppelin-env.sh
ADD zeppelin-site.xml conf/zeppelin-site.xml
ADD startup.sh startup.sh
CMD ./startup.sh

Finally we set the startup.sh script to run when the container is launched. The configuration files can be found below:

zeppelin-env.sh

export SPARK_HOME=/usr/local/spark
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export MASTER=yarn-client

zeppelin-env.sh provides some environment variables that tell Zeppelin where to find Spark and Hadoop. We also set the default Spark master to YARN in client mode. You can find a template here.

zeppelin-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>zeppelin.server.addr</name>
    <value>0.0.0.0</value>
    <description>Server address</description>
  </property>
</configuration>

In zeppelin-site.xml we set the server address to 0.0.0.0 so that we can bind our internal ports to external ports when we run the container, without having to worry about hostname mismatches. There is a template here if you want to explore your other options.

Finally we’re ready to build our image and run our container:

./build.sh
./run.sh

A Word on Running Zeppelin

As I mentioned briefly earlier in the piece, cluster-in-a-box uses SkyDNS for service discovery. Therefore, you’ll notice that run.sh does a little more than just run the container: first it locates the IP address of the Docker host, which SkyDNS and SkyDock are bound to, then it runs the container, passing in the DNS server along the way. For more information on the ins and outs of using SkyDock for service discovery, check out its README here.

Step 3: Try Zeppelin

Once Zeppelin is running you can navigate to port 9123 on your host machine to check it out.

Zeppelin’s home screen.

To test things out, click Create new note and then open the newly created note. You can check that everything is working by writing a simple Spark program and pressing Shift+Enter to run it. Note that the first time Zeppelin runs code it may take a while, as it has to create its Spark context.

A simple Spark program executed on Zeppelin.
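If you want something concrete to type in, a paragraph along these lines works well (a hypothetical example, not necessarily the one pictured; any small RDD job will do):

// A tiny word count: sc is the SparkContext Zeppelin creates for you
val words = sc.parallelize(Seq("zeppelin", "spark", "yarn", "spark", "hdfs"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)

Because Zeppelin injects sc for you, there is no setup boilerplate to write.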

That’s all very well, but how about something slightly more complex? You can see an example of such a program by checking out Zeppelin’s built-in tutorial.

Zeppelin’s tutorial, showing some of the graphical capabilities of the user interface.

The tutorial showcases much more of Zeppelin’s functionality, including the ability to visualise the results of SparkSQL queries.
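The pattern behind those visualisations is straightforward: register a DataFrame as a table from a Scala paragraph, then query it from a %sql paragraph and pick a chart type. Here is a minimal sketch with made-up data (the built-in tutorial uses its own sample dataset), assuming the sqlContext that Zeppelin injects:

// Scala paragraph: build a DataFrame and register it as a table
case class Person(name: String, age: Int)
import sqlContext.implicits._
val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35), Person("Carol", 35)))
people.toDF().registerTempTable("people")

And then, in a separate paragraph:

%sql
select age, count(1) as total from people group by age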

Once you’re comfortable with the environment you can start to do a lot more. The notebook environment is great for exploring some of the peripheries of Spark, such as MLlib or GraphX if you feel so inclined, and I’ll be following this post up with some more concrete examples. Feel free to extend the cluster-in-a-box solution as well; there are definitely improvements to be made.
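As a taste of the MLlib side of that, here is a minimal k-means sketch (hypothetical data, just to confirm the library is on the classpath) you can paste into a notebook paragraph:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters, around (0, 0) and (9, 9)
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(points, 2, 20) // k = 2, up to 20 iterations
model.clusterCenters.foreach(println)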

Appendix

More Information:
For more information, check out the repository on GitHub.

Complete Dockerfile for Apache Zeppelin:

FROM chrishawkins/spark-slave
MAINTAINER Chris Hawkins <chris.hawkins@accenture.com>

# Updates & Install Prerequisites
RUN apt-get update && apt-get upgrade -y && apt-get install -y wget curl npm git

WORKDIR /tmp/

# Maven 3.3 Install
RUN wget http://download.nextag.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz && \
    tar -xzvf apache-maven-3.3.3-bin.tar.gz -C /usr/local && \
    ln -s /usr/local/apache-maven-3.3.3 /usr/local/apache-maven && \
    ln -s /usr/local/apache-maven/bin/mvn /usr/local/bin/mvn && \
    echo "export M2_HOME=/usr/local/apache-maven" >> /etc/profile.d/apache-maven.sh

RUN mkdir zeppelin
RUN git clone --branch branch-0.5 https://github.com/apache/incubator-zeppelin.git zeppelin
WORKDIR /tmp/zeppelin
RUN mvn clean package -Dspark.version=1.3.0 -Pspark-1.3 -Dhadoop.version=2.6.0 -Phadoop-2.4 -Pyarn -DskipTests

EXPOSE 8080 8081
ADD zeppelin-env.sh conf/zeppelin-env.sh
ADD zeppelin-site.xml conf/zeppelin-site.xml
ADD startup.sh startup.sh
CMD ./startup.sh

Footnote

At Accenture Technology Labs we’ve been using Spark and much of the rest of the stack discussed here to build the future of customer personalisation. We call it the Customer Genome. If you really enjoy working with next-generation technology in interesting business contexts, you might even want to join us (search for Tech Labs).

One of my colleagues in Labs, Hyon Chu, is a Data Scientist and has also posted about his use of Docker to create a Python-based data exploration environment. Give it a read!
