Zoom Into Apache Zeppelin

Everything you Need to Get Started and More …

Published in

SFU Professional Computer Science

8 min read · Feb 4, 2020

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit here.

Contributors: Aditi Shrivastava, Deeksha Kamath, Akshat Bhargava

Photo by Glenn Carstens-Peters on Unsplash

Did you know? Over 2.5 quintillion bytes of data are created every single day, from the toothpaste we use every morning to the routine coffee we drink, and that volume keeps growing. With the evolution of Big Data and its applications, handling the large amounts of data generated every day effectively and efficiently has become imperative. This has led to an explosion of open-source applications and frameworks for handling Big Data. One such extremely versatile tool is Apache Zeppelin.

Apache Zeppelin is an interactive, web-based data analytics notebook that is making the everyday lives of Data Engineers, Analysts and Data Scientists smoother. It increases productivity by letting you develop, execute, organize and share code, and visualize the results, on a single platform, with no need to invoke different shells or recall cluster details.

There’s more. With Zeppelin, you can:

Integrate a wide variety of interpreters from NoSQL to Relational Databases within a single notebook.
Use multiple interactive cells to execute scripts in languages like Python and R, with built-in version control for your notebooks.
Perform one-click visualization for almost everything with the flexibility of choosing what comes on the axes and what needs to be aggregated.

Here’s how you install Zeppelin

There are multiple ways of running Zeppelin on your system.

Let’s start with Docker

Zeppelin can be installed effortlessly with Docker. We created our own Docker image, which you can use to run Zeppelin.

First and foremost, install Docker.

To install Docker on Mac refer to this quick tutorial: https://docs.docker.com/docker-for-mac/install/

To install Docker on Linux:

sudo apt install docker.io
sudo systemctl start docker
sudo systemctl enable docker
docker --version

Now that Docker is set up, just run this command (use sudo if required):

docker run -it --rm -p 8181:8080 akshat4916/basic_ml_zeppelin:latest

Once the server has started successfully, go to http://localhost:8181 in your web browser. And Done!

If you are having trouble accessing the main page, clear your browser cache.

By default, the Docker container doesn’t persist any files. As a result, you will lose all the notebooks you were working on once the container is removed. To persist notes and logs, mount Docker volumes and give the container a name (here, zeppelin) before the image name:

docker run -p 8181:8080 --rm -v $PWD/logs:/logs -v $PWD/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin akshat4916/basic_ml_zeppelin:latest

Installation through Zeppelin Binaries

Even without Docker, you can install Zeppelin with minimal effort. Follow these steps and you’ll be good to go!

Download the all-interpreter binary package of the latest release of Apache Zeppelin from this page.
Extract all files from the compressed package into a folder at your desired path, say ‘zeppelin’.
On Unix-based platforms, run:

zeppelin/bin/zeppelin-daemon.sh start

On Windows, run:

zeppelin\bin\zeppelin.cmd

Once the server has started successfully, go to http://localhost:8080 in your web browser. And Done!
To stop the Zeppelin server, run:

zeppelin/bin/zeppelin-daemon.sh stop

For more details about the download instructions and for other ways of installing Zeppelin, refer to this page.

PS: You may run into issues with basic Python libraries (pandas, numpy, etc.) in Zeppelin notebooks if you installed Zeppelin from the binary package or built it with Maven. Use our Docker image for a smoother installation and experience!

Zeppelin Zones: The multi-language back-end Zeppelin Interpreter

Apache Zeppelin comes with a default set of interpreters, which lets users choose their desired language or data-processing backend. At present, the latest version of Zeppelin supports interpreters such as Scala and Python (with Apache Spark), SparkSQL, CQL, Hive, Shell, Markdown and plenty more. For more information on Supported Interpreters, refer to this page.

To select an interpreter for a paragraph, start the paragraph with % followed by the interpreter name (for example, %python or %md). To change the font size and other visual properties, click on the gear at the right corner of a cell and make changes as required. To run the code, hit Shift+Enter.

Apart from the above-mentioned interpreters, Zeppelin lets you add a custom interpreter without much hassle. For example, if you want to use the document-search platform Apache Solr in Zeppelin, you can add a Solr interpreter and you are ready to roll!
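As a rough illustration, a Python paragraph might look like the sketch below. The %python directive is shown as a comment so the snippet also runs outside Zeppelin; inside a notebook it would be the uncommented first line of the paragraph. The sample data and variable names are hypothetical.

```python
# %python   <- in a Zeppelin paragraph, this directive would be the first line (uncommented)
from statistics import mean

# Hypothetical sample data, used only for illustration
daily_coffee_cups = [2, 3, 1, 4, 2]

# Everything after the directive is ordinary Python
average_cups = mean(daily_coffee_cups)
print(f"Average cups per day: {average_cups}")
```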

For step-by-step instructions on how to add a Solr interpreter to Zeppelin, refer to this page.

Features of Zeppelin

Zeppelin’s main weapon is its ability to run multiple interpreters concurrently. You can perform exploratory data analysis (EDA) using Spark in one paragraph and produce visualizations in another, all without switching between windows.

Again, Zeppelin is a web-based interactive data analytics tool, so we make the most of the features available. One such remarkable feature is its built-in tutorials, which showcase Zeppelin’s visualizations. The pre-loaded visualizations include line, scatter, bar and pie charts, and other types of visualization can be added as well. Here you can see how the embedded tutorials are accessed and executed. Notice the ease of visualization!

Handy Zeppelin Visualization. Beautiful too!

This sort of automated sense-making from columnar data is a quintessential feature of tools such as Microsoft’s Power BI or Tableau. While these tools and Zeppelin provide similar functionality, Zeppelin additionally lets you run the analysis code interactively in the same notebook.

As mentioned above, Zeppelin allows you to add visualizations apart from the default ones. Let’s see how we can add a new visualization, say geographical maps, to Zeppelin.
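To make the columnar-data idea concrete, here is a minimal sketch that relies on Zeppelin’s display system: output that starts with %table is parsed as tab-separated columns, one row per line, and rendered as a table that the chart buttons can then work with. The helper name to_zeppelin_table and the sample rows are hypothetical, chosen only for illustration.

```python
def to_zeppelin_table(header, rows):
    """Build a string that Zeppelin's display system can render as a table.

    The text after the %table marker is tab-separated, one row per line,
    with the first line used as the column header.
    """
    lines = ["\t".join(header)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "%table " + "\n".join(lines)


# Hypothetical sample data
header = ["city", "visitors"]
rows = [("Vancouver", 120), ("Burnaby", 45), ("Surrey", 80)]

# Printing this from a Python paragraph should show a table with chart options
print(to_zeppelin_table(header, rows))
```

Outside Zeppelin the same code simply prints the raw tab-separated text, which makes it easy to check before pasting it into a notebook.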

At the top right corner of the Zeppelin home page, click on ‘anonymous’.
Select Helium
Choose the ‘Zeppelin Leaflet’ package and click on the green ‘enable’ button.
You might have to restart the notebook for the Visualization button to appear.

For this example, we imported data stored in a Cassandra table containing latitude and longitude values for different locations. We used the Zeppelin Leaflet plugin, which asks for the columns that contain the latitude, longitude and tooltip values. If you want to use the same dataset, download the data from here and upload it to Cassandra.

After running the Cassandra (CQL) query, you’ll see the result data in tabular format.
Change the visualization type using the buttons below the query.
Select the one with the globe icon.
Now, drag the latitude and longitude columns to the corresponding fields and specify the tooltip column.
You’ll be able to see the map with tooltips at the specified latitudes and longitudes, like the one below.

Now let’s try some Machine Learning with Zeppelin

Let’s walk through some prominent Machine Learning algorithms and how to use them with Zeppelin.

Supervised machine learning can be broadly classified into two types: Regression and Classification. Both use known data in a dataset to make predictions on unknown data. While the output of a regression algorithm is continuous (a numerical value), the output of a classification algorithm is discrete (a categorical value). The algorithms below illustrate this in more detail, along with examples for building these machine learning models on Zeppelin:

Regression algorithms

Linear/Polynomial Regression: Linear Regression is used to predict the value of a dependent variable from one or more independent variables when the relationship between them is linear. If only one independent variable affects the dependent variable, it is called Simple Linear Regression; if more than one independent variable affects it, it is called Multiple Linear Regression. If the relationship between the dependent and the independent variable is not linear but can be represented as a polynomial equation, it is called Polynomial Regression.

For more details on Linear/Polynomial Regression, refer to this page.
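As a minimal sketch of Simple Linear Regression, the snippet below fits a line with the closed-form least-squares formulas using only the Python standard library. It is not taken from the sample notebooks; the function name fit_simple_linear_regression and the data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope={slope}, intercept={intercept}")

# Predict the dependent variable for a new, unseen value
print(slope * 5.0 + intercept)
```

In a notebook you would more likely reach for a library such as Spark MLlib or scikit-learn; the hand-written version is only meant to show what the fit computes.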

Support Vector Regression: The ultimate goal of a machine learning algorithm is to make the best predictions on unknown data. In simple regression models, we try to minimize the error in predictions on our training data, whereas in Support Vector Regression we try to keep the error within a certain threshold (often denoted ε): deviations smaller than the threshold are not penalized.

For more details on SVR, refer to this page.
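The threshold idea can be sketched with the epsilon-insensitive loss, shown here next to the ordinary squared error for comparison. This covers only the loss term, not a full SVR solver (which also includes a regularization term), and the function names and sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Mean epsilon-insensitive loss: errors within +/- epsilon cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


def mean_squared_error(y_true, y_pred):
    """Mean squared error: every deviation is penalized, however small."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions
y_true = [1.0, 2.0, 3.0]
y_pred = [1.2, 2.0, 4.0]

# The 0.2 deviation falls inside the threshold and is ignored;
# only the 1.0 deviation contributes (1.0 - 0.5 = 0.5).
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5))
print(mean_squared_error(y_true, y_pred))
```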

Classification algorithms

Logistic Regression: Although the name suggests regression, Logistic Regression is one of the most widely recognized classification algorithms. Based on the concept of probability, it estimates the probability that an observation belongs to a class and uses that probability to assign the dependent variable to one of a discrete set of values.

For more details on Logistic Regression, refer to this page.
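To illustrate, the sketch below trains a single-feature logistic regression with plain batch gradient descent and then thresholds the predicted probability at 0.5. It is a simplified stand-in for a library implementation; the function names, hyperparameters and data are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, labels, learning_rate=0.1, epochs=2000):
    """Fit a weight and bias for one feature using batch gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, label in zip(xs, labels):
            error = sigmoid(weight * x + bias) - label  # gradient of the log loss
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


def predict_class(x, weight, bias, threshold=0.5):
    """Turn the predicted probability into a discrete class label."""
    return 1 if sigmoid(weight * x + bias) >= threshold else 0


# Hypothetical data: hours studied -> failed (0) or passed (1)
hours = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 1, 1, 1]

weight, bias = train_logistic_regression(hours, passed)
print([predict_class(h, weight, bias) for h in hours])
```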

Random Forest Classification: The Random Forest Classification algorithm draws random subsets of the training set and builds a Decision Tree on each of them. The final class of the dependent variable is decided by aggregating the votes from all the decision trees.

For more details on Random Forest Classifier, refer to this page.
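The two ingredients described above, random subsets and vote aggregation, can be sketched as follows. To keep the example short, each "tree" is a depth-1 decision stump on a single feature, so this is a simplification rather than a real Random Forest (which grows deeper trees and also samples features). All names and data are hypothetical.

```python
import random
from collections import Counter


def majority_vote(labels):
    """Return the most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(points, labels):
    """Fit a depth-1 tree on one feature; return (threshold, left_label, right_label)."""
    fallback = majority_vote(labels)
    best = (float("inf"), fallback, fallback)  # predicts the majority label everywhere
    best_errors = sum(label != fallback for label in labels)
    values = sorted(set(points))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for p, label in zip(points, labels) if p <= threshold]
        right = [label for p, label in zip(points, labels) if p > threshold]
        left_label, right_label = majority_vote(left), majority_vote(right)
        errors = sum(label != left_label for label in left)
        errors += sum(label != right_label for label in right)
        if errors < best_errors:
            best, best_errors = (threshold, left_label, right_label), errors
    return best


def fit_forest(points, labels, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(points)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(n) for _ in range(n)]
        sample_points = [points[i] for i in indices]
        sample_labels = [labels[i] for i in indices]
        forest.append(fit_stump(sample_points, sample_labels))
    return forest


def predict(forest, x):
    """Aggregate the votes of all stumps into a final class."""
    votes = [left if x <= threshold else right for threshold, left, right in forest]
    return majority_vote(votes)


# Hypothetical, well-separated one-dimensional data
points = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(points, labels)
print(predict(forest, 1.0), predict(forest, 9.0))
```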

You can find sample Zeppelin notebooks for each of the above algorithms here. Simply import these notebooks into your Zeppelin instance and you are all set! Here is a quick tutorial on how to import these notebooks.

Is this all that Zeppelin can offer?

One of the key features of Zeppelin is real-time notebook sharing with your team. This makes Zeppelin a highly collaborative tool, well suited to corporate use. For detailed instructions on how to share your notebook, refer to this article.

Our Experience with Zeppelin

After spending a considerable amount of time exploring and understanding the features of Zeppelin, we realized there are a few areas for improvement. As of now, the most noticeable drawback is stability. When multiple users work in parallel with the pyspark interpreter, it sometimes hangs or stops working with seemingly random errors. In separate interpreter mode, it is also unpredictable how long the interpreter process stays alive after code was last executed. This means you cannot tell whether the objects you created in the interpreter’s context are still available after a period of inactivity.

While these are just minor issues you might come across, it is likely only a matter of time before Zeppelin resolves these drawbacks and becomes one of the most powerful tools for Big Data Analytics.

We hope this blog helps. Let us know your feedback.

Cheers!