Zeppelin on a Spark cluster

Shehan Fernando
3 min read · Jun 8, 2020


Photo by Campaign Creators on Unsplash

To create your own Spark/YARN cluster, see the previous articles.

Install Software

Download the Zeppelin distribution
wget https://downloads.apache.org/zeppelin/zeppelin-0.9.0-preview1/zeppelin-0.9.0-preview1-bin-all.tgz
Extract it to any location.
My location: /home/shehan/zeepelin
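Assuming the archive was downloaded to the home directory, the extract step could look like this (the zeepelin directory name follows the article; adjust to taste):

tar -xzf zeppelin-0.9.0-preview1-bin-all.tgz -C /home/shehan
mv /home/shehan/zeppelin-0.9.0-preview1-bin-all /home/shehan/zeepelin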

Install Python 3 (do this on all nodes; it is needed for PySpark)

yum install -y python3
Update environment properties
nano .bash_profile
export PYSPARK_PYTHON=python3
source .bash_profile
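A quick sanity check that Python 3 is present on every node; the worker hostnames below are assumptions, substitute your own:

# hostnames are assumptions; replace with your actual nodes
for host in master.hadoop.smf worker1.hadoop.smf worker2.hadoop.smf; do
  ssh "$host" 'python3 --version'
done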

Configure Zeppelin

Open zeppelin-env.sh & add properties
nano zeepelin/conf/zeppelin-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/
export HADOOP_CONF_DIR=/home/shehan/hadoop/etc/hadoop
export SPARK_HOME=/usr/local/spark/
export ZEPPELIN_PORT=9010
export MASTER=spark://master.hadoop.smf:7077
export ZEPPELIN_INTERPRETER_OUTPUT_LIMIT=1048576
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
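One caveat: the JAVA_HOME above is tied to a specific OpenJDK build, so it changes with JDK updates. A hedged way to discover the current path on a yum-based system:

# resolves the default java binary to its JRE directory
readlink -f "$(which java)" | sed 's|/bin/java||'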

Open zeppelin-site.xml & add the property below
<property>
<name>zeppelin.server.addr</name>
<value>master.hadoop.smf</value>
<description>Server binding address</description>
</property>
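If these config files don't exist yet, the Zeppelin distribution ships templates you can copy before editing:

cp zeepelin/conf/zeppelin-env.sh.template zeepelin/conf/zeppelin-env.sh
cp zeepelin/conf/zeppelin-site.xml.template zeepelin/conf/zeppelin-site.xml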

OK, we are done.

Start the Spark cluster & Zeppelin (make sure the HDFS daemons are running)

Start the Spark cluster
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slaves.sh
Start the Zeppelin daemon
./zeepelin/bin/zeppelin-daemon.sh start
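To verify everything is up, jps should list the daemons (what you see depends on which node you run it on):

jps
# expect Master/Worker (Spark), NameNode/DataNode (HDFS), and ZeppelinServer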

Now you should be able to see the Zeppelin UI at http://master.hadoop.smf:9010.
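A headless check from the terminal, using the address and port configured above:

curl -s http://master.hadoop.smf:9010 | head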

Configure Spark interpreter

Go to the Spark interpreter settings and update the configuration with the master URL (spark://master.hadoop.smf:7077). Once updated, it will ask you to restart the interpreter.

My Spark interpreter

Once the interpreter starts, it creates a new application in Spark that keeps running for as long as the interpreter is active. So let's start the interpreter.

Type sc.version and then execute it.
The Zeppelin application will start in the Spark cluster.

Let's create a new note and write some Scala code.

Here I have some files in the Hadoop cluster.

Data set: https://grouplens.org/datasets/movielens/100k/
u.item: this file contains information about the movies, in pipe-separated (|) columns.
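If you still need to load the data into HDFS, a minimal sketch (the download URL is GroupLens's; the HDFS path matches the one used in the code below):

wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
hdfs dfs -mkdir -p /user/shehan/ml-100k
hdfs dfs -put ml-100k/u.item /user/shehan/ml-100k/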

Print 2000 rows

final case class Movie(name: String, year: String)

// u.item columns are pipe-separated: movie id | movie title | release date | ...
val movieFile = sc.textFile("/user/shehan/ml-100k/u.item").map(line => {
  val attr = line.split('|')
  Movie(attr(1), attr(2)) // attr(1) = title, attr(2) = release date
})

// false = do not truncate long column values
movieFile.toDF().show(2000, false)

Each action invoked on the RDD creates Spark jobs, and we can see these jobs by clicking the SPARK JOB icon.

Zeppelin on a YARN cluster

Stop the Spark cluster
/usr/local/spark/sbin/stop-slaves.sh
/usr/local/spark/sbin/stop-master.sh
Start the YARN daemons
start-yarn.sh
Add/update these properties in the Zeppelin Spark interpreter
master yarn
spark.submit.deployMode client
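After restarting the interpreter and running a paragraph, the Zeppelin application should appear in YARN; a quick check from the shell:

yarn application -list
# look for a RUNNING application submitted by Zeppelin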
