Zeppelin on a Spark cluster

Shehan Fernando
3 min read · Jun 8, 2020


Photo by Campaign Creators on Unsplash

To create your own Spark/YARN cluster, see the previous articles.

Install Software

Download the Zeppelin distribution
wget https://downloads.apache.org/zeppelin/zeppelin-0.9.0-preview1/zeppelin-0.9.0-preview1-bin-all.tgz
Extract it to any location.
My location: /home/shehan/zeepelin
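Assuming the archive was downloaded to the home directory, the extract step could look like this (the zeepelin directory name follows the article; adjust to taste):

tar -xzf zeppelin-0.9.0-preview1-bin-all.tgz -C /home/shehan
mv /home/shehan/zeppelin-0.9.0-preview1-bin-all /home/shehan/zeepelin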

Install Python 3 (do this on all nodes; it is needed for PySpark)

yum install -y python3
Update environment properties
nano .bash_profile
export PYSPARK_PYTHON=python3
source .bash_profile
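A quick sanity check that Python 3 is present on every node; the worker hostnames below are assumptions, substitute your own:

# hostnames are assumptions; replace with your actual nodes
for host in master.hadoop.smf worker1.hadoop.smf worker2.hadoop.smf; do
  ssh "$host" 'python3 --version'
done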

Configure Zeppelin

Open zeppelin-env.sh & add properties
nano zeepelin/conf/zeppelin-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export SPARK_HOME=/usr/local/spark/
export ZEPPELIN_PORT=9010
export MASTER=spark://master.hadoop.smf:7077
export ZEPPELIN_INTERPRETER_OUTPUT_LIMIT=1048576
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
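One caveat: the JAVA_HOME above is tied to a specific OpenJDK build, so it changes with JDK updates. A hedged way to discover the current path on a yum-based system:

# resolves the default java binary to its JRE directory
readlink -f "$(which java)" | sed 's|/bin/java||'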

Open zeppelin-site.xml & add the property below
<property>
<name>zeppelin.server.addr</name>
<value>master.hadoop.smf</value>
<description>Server binding address</description>
</property>
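If these config files don't exist yet, the Zeppelin distribution ships templates you can copy before editing:

cp zeepelin/conf/zeppelin-env.sh.template zeepelin/conf/zeppelin-env.sh
cp zeepelin/conf/zeppelin-site.xml.template zeepelin/conf/zeppelin-site.xml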

OK, we are done.

Start the Spark cluster & Zeppelin (make sure the HDFS daemons are running)

Start the Spark cluster
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
Start the Zeppelin daemon
./zeepelin/bin/zeppelin-daemon.sh start
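To verify everything is up, jps should list the daemons (what you see depends on which node you run it on):

jps
# expect Master/Worker (Spark), NameNode/DataNode (HDFS), and ZeppelinServer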

Now you should be able to see the Zeppelin UI at http://master.hadoop.smf:9010.
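A headless check from the terminal, using the address and port configured above:

curl -s http://master.hadoop.smf:9010 | head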

Configure Spark interpreter

Go to the Spark interpreter settings and update the configuration with the master URL (spark://master.hadoop.smf:7077). Once updated, it will ask you to restart the interpreter.

My Spark interpreter

Once the interpreter starts, it creates a new application in Spark that keeps running for as long as the interpreter is active. So let's start the interpreter.

Type sc.version and then execute it.
The Zeppelin application will start in the Spark cluster.

Let's create a new note and write some Scala code.

Here I have some files in the Hadoop cluster.

Data set: https://grouplens.org/datasets/movielens/100k/
u.item: this file contains information about the movies, in pipe-separated (|) columns.
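If you still need to load the data into HDFS, a minimal sketch (the download URL is GroupLens's; the HDFS path matches the one used in the code below):

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
hdfs dfs -mkdir -p /user/shehan/ml-100k
hdfs dfs -put ml-100k/u.item /user/shehan/ml-100k/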

Print 2000 rows

final case class Movie(name: String, year: String)

// u.item columns are pipe-separated: movie id | movie title | release date | ...
val movieFile = sc.textFile("/user/shehan/ml-100k/u.item").map(line => {
  val attr = line.split('|')
  Movie(attr(1), attr(2)) // attr(1) = title, attr(2) = release date
})

// false = do not truncate long column values
movieFile.toDF().show(2000, false)

Each action invoked on the RDD creates Spark jobs, and we can see these jobs by clicking the SPARK JOB icon.

Zeppelin on a YARN cluster

Stop the Spark cluster
/usr/local/spark/sbin/stop-slaves.sh
/usr/local/spark/sbin/stop-master.sh
Start the YARN daemons
start-yarn.sh
Add/update these properties in the Zeppelin Spark interpreter
master yarn
spark.submit.deployMode client
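After restarting the interpreter and running a paragraph, the Zeppelin application should appear in YARN; a quick check from the shell:

yarn application -list
# look for a RUNNING application submitted by Zeppelin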
