DIY MPP Platform: Integrate JupyterHub with Apache Spark on Kubernetes

Ziang Jia
9 min read · Jul 10, 2022

--

Having been a loyal user of Apache Spark for many years, I am excited to see the community keep improving the tools and ecosystem to make Spark developers' work easier.

I remember when I started developing Spark applications on Cloudera CDH, hardly anyone knew the concept of an "IDE": all I could do was package Scala code into a big fat jar, upload it to CDH, and schedule a job run, with no way to test my application on real data while developing. Now there are not only big players like Databricks and Snowflake, but also newer startup products such as Data Mechanics.

With all of these options at hand, being able to develop Spark applications interactively on large volumes of data is no longer a developer's primary concern. Instead, IT and DevOps teams now need to worry about going over budget by keeping a cluster available 24/7, or getting robbed by the management overhead costs charged by platform vendors (ever heard of DBUs?). Why? As a business owner, if my data scientists can't prove something meaningful in the short term, I at least don't want to spend too much money on supercomputers.

In fact, building an interactive Spark environment is not that difficult. If you have experienced developers and just want the bare bones of such a platform, the following guide can help you meet the need with ONLY OPEN SOURCE tools.

Our goal is to use open source tools to build a platform that allows analysts to run Spark DataFrame or Spark SQL workloads on the data in our data lake. At minimum, we need to guarantee the following requirements:

  1. Interactive development environment (IDE) and collaboration workspace.
  2. Develop analyses on large volumes of data: MPP, full picture, no sampling.
  3. Automatically scale resources up as needed and back down when they are not.
  4. Infrastructure cost only: minimal vendor management overhead charges.

Thanks to Project Jupyter, we have an ideal candidate that meets #1. The Jupyter community provides many pre-built notebook images to meet all kinds of data analysts' needs. A collection of these images can be found in the "docker-stacks" repository on GitHub.

Meanwhile, since we need a compute cluster that can autoscale to meet #3, a Kubernetes cluster can do the work. It is a popular open source container orchestrator, with a rich set of open source applications packaged as Helm charts and well supported by the community. In fact, Project Jupyter has already released a Kubernetes version of JupyterHub called "zero-to-jupyterhub-k8s". In the following sections, we will build our Spark environment on top of this project step by step.

Finally, our superstar Apache Spark will handle the Massively Parallel Processing (MPP) tasks. Specifically, we will focus on PySpark in this setup, as Python is popular among data analysts in today's market.

The Architecture

Nobody can build a house without a blueprint, so we need to decide how to architect this solution first. Integrating a bunch of open source tools can be problematic, but we are lucky in this case. Inspired by the architecture of zero-to-jupyterhub-k8s, I ended up with the following architecture for this Spark environment.

Cloud Infrastructure

Choose your favorite cloud provider and set up the following services. Note that you could also build all of this infrastructure on premises; we create it in the cloud for demonstration purposes, using Microsoft Azure as the example in this article.

1. Virtual network

2. Kubernetes cluster

3. Container registry

4. Storage account (use ADLS Gen2 if you need optimized performance)

5. Service principal or cloud service account

It is always good practice to segment pods by functionality and assign them to dedicated node pools to optimize performance and cost. In this case, I set up four node pools as follows:

  • systempool. This is where the Kubernetes scheduler and api-server pods are assigned. No other pods can be scheduled on this node pool, and it usually does not need to scale up.
  • apppool. This is the node pool for application pods such as JupyterHub, which are most likely deployed through Helm. This node pool can scale up automatically depending on the applications' resource usage.
  • jhubuserpool. This node pool mainly hosts the single-user pods for JupyterHub. It scales up when users log in and spawn their own workspaces, and scales back down to 0 nodes automatically after being idle for more than 30 minutes.
  • sparkpool. This pool is dedicated to Spark workers. In this architecture, the single-user pod acts as the driver, submitting Spark jobs to run on the worker nodes. This node pool automatically scales up whenever a Spark job is submitted by any user in the jhubuserpool.

For all of the azure-cli commands, refer to this bash script: https://github.com/just-modeling/jupyterhub-k8s-apache-spark/blob/main/azure.sh. It can easily be translated into Terraform code.

PySpark and Delta Lake

To guarantee that both the driver and the workers use the same versions of Spark, Hadoop, and Python, the driver container is built on top of the worker container.

To build the worker container, choose your desired Spark version, define it in BASE_IMAGE, and use the following command to build the worker image. For example, I chose justmodeling/spark-py39:v3.2.1 as the base image. You can build your own base image with your favorite Spark version and its dependencies; Apache offers many pre-built images on its Docker Hub.

docker pull justmodeling/spark-py39
docker build \
  --build-arg BASE_IMAGE=justmodeling/spark-py39 \
  -f pyspark-notebook/Dockerfile.spark \
  -t $ACR_NAME.azurecr.io/pyspark-worker:v3.2.1 ./pyspark-notebook

To build the driver container, refer to this Dockerfile and run the following command, where WORK_IMAGE is the worker image built in the step above.

docker build \
  --build-arg ADLS_ACCOUNT_NAME=$ADLS_ACCOUNT_NAME \
  --build-arg ADLS_ACCOUNT_KEY=$ADLS_ACCOUNT_KEY \
  --build-arg ACR_NAME=$ACR_NAME \
  --build-arg ACR_PULL_SECRET=$ACR_PULL_SECRET \
  --build-arg JHUB_NAMESPACE=$JHUB_NAMESPACE \
  --build-arg WORK_IMAGE=$ACR_NAME.azurecr.io/pyspark-worker:v3.2.1 \
  --build-arg SPARK_NDOE_POOL=$SPARK_NDOE_POOL \
  --build-arg SERVICE_ACCOUNT=$SERVICE_ACCOUNT \
  --build-arg USER_FS_PVC=pvc-$USER_FS \
  --build-arg PROJECT_FS_PVC=pvc-$PROJECT_FS \
  -f pyspark-notebook/Dockerfile \
  -t $ACR_NAME.azurecr.io/pyspark-notebook:v3.2.1 ./pyspark-notebook

Note that Delta Lake is already enabled in justmodeling/spark-py39:v3.2.1, although it is not necessary for most Spark jobs.
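
For reference, enabling Delta Lake on a Spark session generally comes down to two settings, plus having the delta-spark jars on the classpath. The following is a minimal sketch of what the base image is assumed to bake into its defaults; the exact mechanism inside justmodeling/spark-py39 may differ.

from pyspark.conf import SparkConf

# Assumed defaults baked into the base image; the delta-spark jars must already be on the classpath
conf = SparkConf()
conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")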

Spark UI

In many use cases, developers want to access the Spark UI to monitor and debug their applications. It is very convenient to have this UI proxied through JupyterHub, and we shouldn't need any features fancier than that.

To do this, simply install https://github.com/yuvipanda/jupyter-sparkui-proxy as a dependency in the driver container:

RUN pip3 install git+https://github.com/yuvipanda/jupyter-sparkui-proxy.git@master

JupyterHub on k8s

Now that both the infrastructure and the images are ready, we can integrate them together through JupyterHub. Here is the simple bash command to deploy JupyterHub with Helm:

helm upgrade --install spark-jhub jupyterhub/jupyterhub \
  --namespace $JHUB_NAMESPACE \
  --version=1.2.0 \
  --values config.yaml

Note that it loads its configuration from a file called config.yaml. The JupyterHub Helm chart provides many configurable options for customizing the deployment. I only provide a minimal configuration in https://github.com/just-modeling/jupyterhub-k8s-apache-spark/blob/main/config.yaml; replace <NOTEBOOK-IMAGE> with the driver image built in the steps above.

You might also notice a placeholder for <HUB-IMAGE>. This is because JupyterHub runs in its own pod, well segmented from all single-user pods. You may need to customize this hub image if third-party authentication such as Azure AD needs to be integrated; otherwise, you can just use jupyterhub/k8s-hub:1.2.0.

Test run with PySpark

Now for the most exciting moment: let's run a Spark job on the environment we just built! We'd like to see the cluster scale up and down automatically based on the resources we request for the job.

Once everything is deployed, you should be able to access the JupyterHub UI at http://<ip-address>/hub/spawn/<user-name>. By default we use "dummy authentication" for demonstration purposes; you should use PAM or LDAP in a production environment.

Multiple projects can be configured in the config.yaml file when deploying JupyterHub; for demonstration purposes, I have only one here. Assuming there is no active node in the jhubuserpool, clicking "Start" should make the Kubernetes cluster scale up automatically. This might take a few minutes.

Once the single-user pod is ready, the JupyterLab UI should be there waiting for you. To test Spark jobs, simply create a test application as follows. Note that MY_POD_IP is already an environment variable in the Jupyter notebook pod; we set it as the driver host here.

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
import os
# Configure the Spark session
conf = SparkConf()
conf.set('spark.app.name', 'spark-test')
conf.set('spark.driver.host', os.environ['MY_POD_IP'])
conf.set('spark.submit.deployMode', 'client')
conf.set('spark.executor.cores', '3')
conf.set('spark.executor.memory', '12g')
conf.set('spark.executor.instances', '6')
# Create the Spark session
# This step takes ~5-10 mins while the spark node pool scales up
spark = SparkSession.builder.config(conf=conf).getOrCreate()
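
The cell above assumes that the Kubernetes-specific settings (master URL, executor image, namespace, service account) are already baked into the image's spark-defaults.conf. If they are not, they can be added to the same SparkConf before getOrCreate() is called. A sketch follows, where the values in angle brackets are placeholders for the names used earlier in this guide, and the "agentpool" selector relies on AKS exposing the node pool name through that label.

# Only needed if these are not already set in spark-defaults.conf inside the image
conf.set('spark.master', 'k8s://https://kubernetes.default.svc')
conf.set('spark.kubernetes.namespace', '<JHUB_NAMESPACE>')                    # placeholder
conf.set('spark.kubernetes.container.image', '<ACR_NAME>.azurecr.io/pyspark-worker:v3.2.1')
conf.set('spark.kubernetes.authenticate.driver.serviceAccountName', '<SERVICE_ACCOUNT>')  # placeholder
# Pin executors to the dedicated Spark node pool
conf.set('spark.kubernetes.node.selector.agentpool', 'sparkpool')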

Run this cell, and you should see the sparkpool scaling up to provide the resources requested. The machines in this pool have 8 vCores and 32 GB each; since each executor asks for 3 cores and 12 GB, two executors fit per machine, so the 6 executors we requested result in 3 machines spinning up.

Once the resources are up and running, the Spark session is registered. Now we can access the Spark UI to monitor our jobs: with the jupyter-sparkui-proxy set up earlier, it is available at http://<ip-address>/hub/spawn/<user-name>/sparkui. We have successfully spun up a Spark cluster with ~40 GB of total memory.
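
This is also a convenient point to confirm that the driver and worker images really line up, which was the reason for building the driver container on top of the worker container. A quick sanity check, assuming the spark session from the cell above is active:

import sys
import pyspark

# Driver-side versions
print('Driver Python:', sys.version.split()[0], '| PySpark:', pyspark.__version__)

# Executor-side Python version, reported back by a tiny job
def report_python_version(_):
    import sys
    yield sys.version.split()[0]

print('Executor Python:', spark.sparkContext.parallelize([0], 1).mapPartitions(report_python_version).collect())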

Optionally, we can test whether Hadoop and Delta Lake are configured properly. Run a few simple examples that use the abfss driver to load data from Azure Data Lake and write it back as Delta tables.

from delta.tables import DeltaTable
from pyspark.sql.functions import expr

attom = spark.read.csv("abfss://bc-attom-us@jhubsparkadlsa.dfs.core.windows.net/attom_us_001", sep="\t", header=True)
attom.write.format("delta").mode("overwrite").save("abfss://bc-us-lakehouse@jhubsparkadlsa.dfs.core.windows.net/attom-us")
attom_table = DeltaTable.forPath(spark, "abfss://bc-us-lakehouse@jhubsparkadlsa.dfs.core.windows.net/attom-us")
attom_table.delete(condition=expr("type == 'Single Family'"))
attom_table.history().toPandas()
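
To confirm the round trip end to end, the table can also be read back as a regular DataFrame, and Delta's time travel lets you compare against the version written before the delete. A short check using the same illustrative paths as above:

# Read the Delta table back, plus the original version before the delete (time travel)
current = spark.read.format("delta").load("abfss://bc-us-lakehouse@jhubsparkadlsa.dfs.core.windows.net/attom-us")
original = spark.read.format("delta").option("versionAsOf", 0).load("abfss://bc-us-lakehouse@jhubsparkadlsa.dfs.core.windows.net/attom-us")
print(current.count(), original.count())  # the delete above should make these counts differ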

Conclusion

With this ground-up solution, our data analysts can now easily access and leverage a Spark cluster to do their work, and the IT/DevOps team will receive a smaller bill from their cloud vendor at the end of the month.

However, keep in mind that such a solution should remain simple; do not over-engineer it in an attempt to get ever closer to the experience offered by Databricks or Snowflake. Although it is possible to extend the functionality, the benefits and savings of using open source tools are quickly offset by spending resources and budget to develop fancy features from the ground up. If the team does need something like Unity Catalog, just set up a Databricks account for them. Refer to another of my blogs to see how Jupyter, Databricks, and Azure Synapse can share the same data catalog. At the end of the day, our goal is simply to save R&D costs.

If you like this blog, please clap and share your thoughts in the comments. If you have questions about the deployment, have suggestions, or want to collaborate on the solution, let me know at jiaza492@gmail.com.

Follow me on Medium for regular updates on similar topics.


Ziang Jia

Data Analytics Solution Architect | Cloud DevOps | AI & ML @Resultant partner with Microsoft, GCP, AWS, Databricks | Expertise in Kubernetes, Spark, Python, SQL