Your Notebook, Your Way: Integrating Open-source Notebooks with Dataproc for Flexible Data Exploration

Ayush Jain
Google Cloud - Community
Nov 27, 2024

This article is co-authored with Prashant Dagar, Data & Analytics Leader at Google Cloud.

The world of data analytics is evolving rapidly. While traditional platforms like Databricks and Google Cloud Dataproc conveniently bundle notebooks with their compute offerings, they often limit your flexibility and control. Analysts increasingly want to tap into diverse compute resources: Dataproc for one task, a different engine such as a Trino cluster for another, or even an on-premises Hadoop cluster, all by simply switching the compute behind the notebook environment.

This demand for a “cluster-agnostic” notebook experience is driven by the desire for greater control, cost optimization, and a unified data exploration environment. Imagine a data scientist seamlessly transitioning from prototyping a machine learning model on Dataproc to validating it against a massive dataset on a separate Spark cluster, all within the same familiar notebook interface.

To address this need, we’ll explore how to set up a self-managed Apache Zeppelin instance with Dataproc as the backend compute engine. This approach provides the flexibility to connect to various compute platforms as needed, offering a cost-effective and adaptable solution for data professionals. And the best part? This same setup can be easily replicated with Jupyter notebooks, giving you the freedom to choose your preferred environment.

This guide will empower you to break free from cluster-bound notebooks and embrace a more versatile and efficient approach to data exploration. Let’s dive in and unlock the potential of this powerful combination!

Integrating Self-Managed Apache Zeppelin with Dataproc as the Backend Compute

This guide explains how to set up a self-managed Apache Zeppelin instance on a Google Cloud VM and connect it to a Dataproc cluster for computation. This lets you use Zeppelin's interactive notebooks for data exploration and visualization while the computation runs on the Dataproc cluster.

Functional Architecture

Step 1: Create a Dataproc cluster with the configuration used in this guide

gcloud dataproc clusters create <cluster name> \
  --region asia-south1 \
  --service-account=<service account> \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --image-version 2.2-debian12 \
  --master-machine-type n2-standard-2 \
  --worker-machine-type n2-standard-2 \
  --num-workers 2 \
  --project <project name> \
  --enable-component-gateway \
  --properties=spark:spark.jars.packages=io.delta:delta-spark_2.12:3.2.0,spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true \
  --subnet=projects/pdagarproj1/regions/asia-south1/subnetworks/subnet1
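
To confirm the cluster came up before moving on, you can describe it and check its state (the --format filter is optional):

gcloud dataproc clusters describe <cluster name> --region asia-south1 --project <project name> --format="value(status.state)"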

Step 2: Install self-managed Apache Zeppelin

1. Create a VM on which Zeppelin will be installed

gcloud compute instances create zeppelin-vm \
  --zone=asia-south1-b \
  --machine-type=e2-standard-2 \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --network=network1

2. SSH into the VM and install Java and Zeppelin 0.10.1
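
A convenient way to connect is the gcloud SSH helper (assuming your account has SSH access to the VM):

gcloud compute ssh zeppelin-vm --zone=asia-south1-b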

sudo apt-get update
sudo apt-get install openjdk-17-jdk-headless -y
wget https://dlcdn.apache.org/zeppelin/zeppelin-0.10.1/zeppelin-0.10.1-bin-all.tgz
tar -xzf zeppelin-0.10.1-bin-all.tgz
cd zeppelin-0.10.1-bin-all

Note: The Dataproc cluster ships with Zeppelin 0.10.1, so we tested this setup with the same version.

3. Create required directories on Zeppelin VM

mkdir -p ~/.ssh
sudo mkdir -p /hadoop/spark/tmp
sudo chown -R $(whoami):$(whoami) /hadoop/spark/tmp
mkdir ~/.gcp

4. Copy the JSON key associated with the service account used in this setup to ~/.gcp
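
If you do not have the key file yet, one way to create and copy it (key.json is a placeholder name, and the commands assume gcloud access from your workstation) is:

gcloud iam service-accounts keys create key.json --iam-account=<service account>
gcloud compute scp key.json zeppelin-vm:~/.gcp/ --zone=asia-south1-b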

Configure SSH between the Dataproc cluster master node and the Zeppelin VM. This is needed to scp the required Hadoop and Spark libraries and configs from Dataproc to the Zeppelin VM.

On the Dataproc master node:

ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -N ""
chmod 600 ~/.ssh/id_rsa
chmod 644 ~/.ssh/id_rsa.pub
cat ~/.ssh/id_rsa.pub # display the public key and copy its contents

On the Zeppelin VM:

cd .ssh
vi authorized_keys # paste the content of id_rsa.pub from the dataproc master node
cd ..
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
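
Before copying anything, you can verify from the Dataproc master node that key-based SSH to the Zeppelin VM works without a password prompt:

ssh <zeppelin vm user>@<zeppelin vm ip> hostname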

On the Dataproc master node, copy the Hadoop and Spark libraries from Dataproc to the Zeppelin VM:

scp -r /usr/lib/hadoop <zeppelin vm user>@<zeppelin vm ip>:<home on zeppelin vm>/hadoop
scp -r /usr/lib/hadoop/etc/hadoop <zeppelin vm user>@<zeppelin vm ip>:<home on zeppelin vm>/hadoop/etc/hadoop
scp -r /usr/lib/spark <zeppelin vm user>@<zeppelin vm ip>:<home on zeppelin vm>/spark
scp -r /usr/lib/spark/conf <zeppelin vm user>@<zeppelin vm ip>:<home on zeppelin vm>/spark/conf
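
A quick listing on the Zeppelin VM confirms the libraries and configs landed where the environment variables in the next step expect them:

ls ~/hadoop/bin ~/hadoop/etc/hadoop ~/spark/bin ~/spark/conf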

5. Set up environment variables on the Zeppelin VM (adjust the /home/admin_pdagar_altostrat_com paths below to match your own home directory on the VM)

export HADOOP_YARN_HOME=/home/admin_pdagar_altostrat_com/hadoop
export HADOOP_MAPRED_HOME=/home/admin_pdagar_altostrat_com/hadoop
export HADOOP_COMMON_HOME=/home/admin_pdagar_altostrat_com/hadoop
export HADOOP_CONF_DIR=/home/admin_pdagar_altostrat_com/hadoop/etc/hadoop
export HADOOP_HDFS_HOME=/home/admin_pdagar_altostrat_com/hadoop
export SPARK_HOME=/home/admin_pdagar_altostrat_com/spark
export HADOOP_HOME=/home/admin_pdagar_altostrat_com/hadoop
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/
export PATH=$PATH:/home/admin_pdagar_altostrat_com/hadoop/bin
export PATH=/home/admin_pdagar_altostrat_com/spark/bin:$PATH
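
These exports only last for the current shell session, so you may also want to append them to ~/.bashrc. A quick check that the copied binaries resolve with the variables in place:

hadoop version          # should print the Hadoop version copied from Dataproc
spark-submit --version  # should print the Spark version copied from Dataproc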

6. Start Zeppelin and make the following Spark interpreter settings in the Zeppelin UI

Binding address and port: set zeppelin.server.addr to 0.0.0.0 and zeppelin.server.port to the desired port (e.g., 8080) in $HOME/zeppelin-0.10.1-bin-all/conf/zeppelin-site.xml (the file can be created from conf/zeppelin-site.xml.template if it does not exist yet).

<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
  <description>Server binding address</description>
</property>
<property>
  <name>zeppelin.server.host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8080</value>
  <description>Server port.</description>
</property>

Start Zeppelin

$HOME/zeppelin-0.10.1-bin-all/bin/zeppelin-daemon.sh start
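
The same script can report whether the daemon came up, which is worth checking before opening the UI:

$HOME/zeppelin-0.10.1-bin-all/bin/zeppelin-daemon.sh status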

Zeppelin Configuration:

Open the Zeppelin UI

http://<public ip address of zeppelin vm>:8080

Go to the Spark interpreter settings and add the following properties:

Note: The JSON key associated with the service account should already be on the Zeppelin VM (step 4), and its path should be assigned to "spark.hadoop.fs.gs.auth.service.account.json.keyfile".
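
The exact values depend on your environment, but as a rough sketch, the interpreter properties involved typically look like this (the property names are standard Spark and Zeppelin settings; the paths are placeholders matching the directories created earlier):

SPARK_HOME                                            /home/<zeppelin vm user>/spark
spark.master                                          yarn
spark.submit.deployMode                               client
spark.hadoop.fs.gs.auth.service.account.json.keyfile  /home/<zeppelin vm user>/.gcp/<key file>.json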

Restart Zeppelin

$HOME/zeppelin-0.10.1-bin-all/bin/zeppelin-daemon.sh restart
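
After the restart, open a notebook and run a Spark paragraph. You can confirm the work is actually landing on the Dataproc cluster by listing the running YARN applications on the master node (the yarn CLI is already on the PATH there):

yarn application -list -appStates RUNNING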

Firewall Settings

Note: All these firewall settings were done in a test environment. They should not be applied to any production setup without proper review.
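
As a rough sketch (the rule name and source range below are placeholders), a firewall rule of this shape opens the Zeppelin UI port to your own IP on the VM's network:

gcloud compute firewall-rules create allow-zeppelin-ui \
  --network=network1 \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8080 \
  --source-ranges=<your public ip>/32

The Zeppelin VM and the Dataproc nodes also need to be able to reach each other: SSH for the scp steps above, and the YARN and Spark ports once jobs are submitted from the notebook.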

Your setup is complete. Enjoy the experience in your notebook environment!

Disclaimer: The views and opinions expressed in this blog are solely my own and do not necessarily reflect the views of my employer. Any errors or omissions are my responsibility.
