A Detailed Walkthrough

Build Your Own Big Data Ecosystem — Part 2

Setting up Jupyter Notebook with Spark on Kubernetes

Ali Abbas
Geek Culture


Jupyter Notebook is the go-to IDE for data science work and the tool of choice for the majority of Data Scientists and Data Engineers across the globe. It also has very good support for running against Spark, so it will no doubt be the IDE for our ecosystem as well.

In the first part of this series, we looked at how to set up a running instance of Spark on Kubernetes. In this post, let us explore setting up Jupyter notebook in the same Kubernetes cluster and hook it up with our Spark instance.

This blog in a nutshell

We will build upon the previous post (Part 1) and will perform the following actions to have Jupyter Notebook integrated with our cluster.

  • Create a storage class for Azure Files to store our notebooks
  • Create a Persistent Volume Claim backed by this storage class and attach it to our driver pod
  • Create a deployment that spins up a replica set, the driver pod, and a service that facilitates communication between the driver and executor pods.
  • Create a Spark context from the notebook UI and see the executor pods running in real-time.

Pre-requisites

I have used Azure’s Cloud Shell to run most of my commands and am using an AKS cluster. I will be using Azure Files as the persistent data store, but you can choose whichever store suits your needs.

Jupyter + Spark + Kubernetes — Some Fundamentals

The execution model of Jupyter + Spark on Kubernetes

Running a Jupyter notebook means that we will not be issuing spark-submit commands against the cluster directly; instead, we will create a Spark context from inside the k8s cluster and then issue our analytics queries through it. This operational model is known as running Spark in client mode.

Here we explicitly spin up our driver pod from a separate Docker image pre-installed with Jupyter Notebook. We then create a Spark context specifying the number of executor pods, their memory and compute requirements, the executor Docker image location, and other configurations. The Spark context talks to the Kubernetes scheduler to request the required pods and also communicates with those pods to execute individual commands on them.

A pod by nature is stateless and ephemeral. We cannot guarantee that the pod running our Spark driver and Jupyter notebook will always be available. Hence we need a mechanism to connect the pod to a store and save the state of our notebook, so that even if the pod restarts due to intermittent failures, we do not lose our notebooks and code. To achieve this, we create a Persistent Volume Claim in Kubernetes and connect it to an Azure file share.

Step 1 — Creating the Spark Driver image pre-installed with Jupyter

Use the following Dockerfile to build the Spark driver container image with Jupyter Notebook pre-installed.

The Python version on the driver and the executor Spark pods has to be the same. The Spark image that we built in Part 1 using Spark 3.0.1 comes with Python 3.7, hence we take Python 3.7 as the base image and then install all the required dependencies on it.

The default installation of Jupyter authenticates with a generated token that has to be supplied along with the UI endpoint. We modify that behavior to ask for a password instead and set a default password in the Dockerfile. This way, we don’t have to obtain the token every time the driver pod is created or restarted.
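
A minimal sketch of such a Dockerfile is shown below. The Java package, the pip-installed PySpark, and the Jupyter configuration approach are assumptions and may need adjusting for your environment.

# Minimal sketch of the driver image; package versions and paths are assumptions
FROM python:3.7-slim

# Java is required by the Spark driver that runs inside the notebook
RUN mkdir -p /usr/share/man/man1 && \
    apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jre-headless && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# PySpark must match the Spark version of the executor image from Part 1 (3.0.1)
RUN pip install --no-cache-dir pyspark==3.0.1

# Notebook dependencies listed in requirements.txt (created in the next step)
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Replace token authentication with a fixed password ("jupyter")
RUN mkdir -p /root/.jupyter && \
    python -c 'from notebook.auth import passwd; print("c.NotebookApp.password = u" + repr(passwd("jupyter")))' \
    > /root/.jupyter/jupyter_notebook_config.py

WORKDIR /notebooks
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]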

Dockerfile for deploying Jupyter Notebook with Spark and Python

Create a requirements.txt file in the same folder where you will save the Dockerfile, and list all the Python libraries that you want to install, as follows.

jupyter
jupyterlab
matplotlib
numpy
pandas

Build the Docker image from the above Dockerfile using the command below.

docker build --tag="<your_container_registry>/jupyter-spark:pybase" .
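
If your cluster pulls images from a remote registry (for example, Azure Container Registry), you will also need to push the image after logging in to that registry:

docker push <your_container_registry>/jupyter-spark:pybase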

Step 2 — Creating a Kubernetes Storage Class

Below is the YAML file (azure_sc.yaml) to create a Storage Class backed by the Azure file store.
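
A minimal sketch of this file follows; the class name, SKU, and mount options are assumptions that you can adapt to your cluster.

# azure_sc.yaml - a minimal sketch; class name, SKU, and mount options are assumptions
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-sc
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
parameters:
  skuName: Standard_LRS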

Run the following command to create the storage class.

kubectl apply -f azure_sc.yaml

Step 3 — Creating a Persistent Volume Claim

We create a persistent volume claim within the cluster using the YAML file below and reference the storage class created previously in it. We will then associate this PVC with the driver pod that we create as part of the overall deployment.
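
A sketch of this claim (here called azure_pvc.yaml) might look like the following; the claim name, namespace, and requested size are assumptions.

# azure_pvc.yaml - a sketch; claim name, namespace, and size are assumptions
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-pvc
  namespace: spark
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-sc   # the storage class created in Step 2
  resources:
    requests:
      storage: 5Gi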

Run the following command to create the persistent volume claim.

kubectl apply -f azure_pvc.yaml

Step 4 — Creating a deployment for the driver pod and the service

Finally, we create a Kubernetes deployment that spins up a driver pod running Jupyter, along with a headless service that acts as the communication layer between the driver and the executor pods.
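
A sketch of driver_deployment.yaml is shown below; the labels, ports, and service account are assumptions, chosen to match the names that appear in the kubectl output further down (my-notebook-deployment in the spark namespace, driver port 29413).

# driver_deployment.yaml - a sketch; labels, ports, and service account are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-notebook-deployment
  namespace: spark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-notebook
  template:
    metadata:
      labels:
        app: my-notebook
    spec:
      serviceAccountName: spark              # service account with rights to create executor pods (see Part 1)
      containers:
        - name: my-notebook
          image: <your_container_registry>/jupyter-spark:pybase
          ports:
            - containerPort: 8888            # Jupyter UI
          volumeMounts:
            - name: notebook-storage
              mountPath: /notebooks          # notebooks persist on the Azure file share
      volumes:
        - name: notebook-storage
          persistentVolumeClaim:
            claimName: notebook-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: my-notebook-deployment
  namespace: spark
spec:
  clusterIP: None                            # headless service so executors can reach the driver pod directly
  selector:
    app: my-notebook
  ports:
    - name: spark-driver
      port: 29413                            # must match spark.driver.port set in the notebook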

Run the deployment using the following command.

kubectl apply -f driver_deployment.yaml

Seeing it all come together

Once you create the deployment, you will see a pod created for our driver. Running kubectl get all -n spark should give the following output.

$ kubectl get all -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
pod/my-notebook-deployment-6677b6975-9dxxd   1/1     Running   0          3m30s

NAME                             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
service/my-notebook-deployment   ClusterIP   None         <none>        29413/TCP   3m30s

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-notebook-deployment   1/1     1            1           3m31s

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/my-notebook-deployment-6677b6975   1         1         1       3m32s

Run the port-forward command below in a new terminal to connect your local system to the deployed Jupyter Notebook.

kubectl port-forward -n spark deployment.apps/my-notebook-deployment 8888:8888

Now point your browser to http://localhost:8888 and you will be greeted with the Jupyter notebook login page. Enter the password “jupyter” (without the quotes) to log in.

Jupyter Login Page

Running a Spark Job from the notebook

Now that we have our notebook environment all set up, we proceed with the last and the most important piece of running a Spark job on the cluster through the notebook.

Create a new notebook in the UI and add the following code in the first cell.
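
A sketch of that first cell is shown below, assuming the namespace, service name, driver port, and image names used elsewhere in this post; adjust them to match your cluster.

from pyspark import SparkConf
from pyspark.sql import SparkSession

sparkConf = SparkConf()
# Talk to the Kubernetes API server from inside the cluster
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("jupyter-spark")

# Executor pods: image location (see Part 1), count, and sizing
sparkConf.set("spark.kubernetes.container.image", "<your_container_registry>/spark-py:3.0.1")
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "1")
sparkConf.set("spark.executor.memory", "1g")

# Namespace and service account allowed to create executor pods
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")

# Client mode: executors connect back to the driver through the headless service
sparkConf.set("spark.driver.host", "my-notebook-deployment.spark.svc.cluster.local")
sparkConf.set("spark.driver.port", "29413")

# Creating the session is what spins up the executor pods
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext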

Update the spark.kubernetes.container.image setting with the location of your Spark executor image (for more details, refer to Part 1) and run the cell.

Here we set up a Spark context on our cluster by providing the relevant configuration. The getOrCreate() call at the end of the cell is what actually creates the executor pods.

SparkSession.builder.config(conf=sparkConf).getOrCreate()

After setting up the Spark context, you should see that the executor pods have started running. We have 3 executor pods in total, as we set the executor instance count to 3 while configuring the Spark context.
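
You can watch them come up from another terminal (the namespace is an assumption based on the deployment above):

kubectl get pods -n spark -w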

Now our Spark cluster is ready to run jobs using this Spark context.
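
As a quick sanity check, you can run a small job in the next cell; this aggregation is just an illustrative example.

# Sum of the numbers 0..999, computed on the executors
spark.range(1000).selectExpr("sum(id)").show()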

To stop the executor pods, run the below command

spark.stop()

With this, we wrap up Part 2. We now have an up-and-running Spark cluster with a notebook interface to interactively run our jobs in the cluster.

In the next part, we will explore how to access data from a Data Lake and perform Data Analytics from our notebook.

