Running PySpark on Jupyter Notebook with Docker

Suci Lin
2 min read · Sep 12, 2017


Update 2020/09/13: added a docker run command with the volume (-v) option

It is now much easier to run PySpark with Docker, especially using an image from the Jupyter Docker Stacks repository.

When you just want to try or learn Python, it is very convenient to use Jupyter Notebook as an interactive development environment. The same reason makes me want to run Spark through PySpark in a Jupyter Notebook.

Spark + Python + Jupyter Notebook + Docker

In this article (yes, another "Running xxx on/with Docker" one), I will show you how to easily create an environment to run PySpark on a Jupyter notebook. Read on!

PRE-REQUISITES

  - Docker installed and able to run containers on your machine

Getting Started

1. Run a container

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
Run a container to start a Jupyter notebook server

You can also use -v to persist the data generated in the notebook outside the Docker container. I mounted my local host folder to the default notebook folder in the container, which is "/home/jovyan/work".

“/Users/shuhsi/github” <->“/home/jovyan/work”

docker run -it --rm -p 8888:8888 -v /Users/shuhsi/github:/home/jovyan/work jupyter/pyspark-notebook
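To quickly check that the mount works, you can run a small snippet like the one below in a notebook cell. It is only a minimal sketch: it assumes the container-side path /home/jovyan/work from the command above, and hello.txt is just a hypothetical file name. After running it, the file should appear in the mounted host folder.

from pathlib import Path

# Write a small file into the folder that was bound with -v.
# /home/jovyan/work is the container-side path; hello.txt is a hypothetical name.
work = Path("/home/jovyan/work")
(work / "hello.txt").write_text("hello from the container\n")

# Read it back to confirm the write succeeded inside the container.
print((work / "hello.txt").read_text())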

2. Connect to a Jupyter notebook

# Copy/paste this URL into your browser (when connecting for the first time)
http://localhost:8888/?token=e144d004f6652ae6406a78adf894621e62fdeb1fc57d02e8

3. Run some sample code

Run sample code
import pyspark

# 'local[*]' runs Spark in local mode, using as many worker threads as there are cores
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
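If you want to go one step further, you can also try the newer SparkSession entry point in the same notebook. This is only a minimal sketch, assuming no other SparkContext is still running (stop the one above with sc.stop() first); the app name "smoke-test" is just a placeholder.

from pyspark.sql import SparkSession

# Build a local SparkSession; 'local[*]' uses all available cores
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

# Create a tiny DataFrame and run a simple aggregation to prove Spark SQL works
df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["value", "bucket"])
df.groupBy("bucket").count().show()

spark.stop()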

Conclusion

As shown above, it is VERY easy to create an environment to run PySpark on a Jupyter notebook by following these steps:

  1. Check the PRE-REQUISITES first, especially the ability to run Docker.
  2. Execute the "docker run" command to pull the jupyter/pyspark-notebook image and start a container with a Jupyter notebook server and a Spark environment.
  3. Connect to the Jupyter notebook server and run sample code to verify the environment.

However, this environment only provides Spark local mode for testing simple Spark code. If you need cluster mode, you may check the reference reading for more advanced ways to run Spark.
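For what it's worth, pointing the same sample code at a cluster mostly comes down to changing the master URL. The snippet below is only a rough sketch that assumes a reachable standalone cluster at spark://master-host:7077 (a hypothetical address) and ignores details such as container networking and matching Spark versions between the notebook and the cluster.

import pyspark

# 'spark://master-host:7077' is a hypothetical standalone-cluster master URL;
# replace it with your own master (or a YARN/Mesos master as appropriate).
sc = pyspark.SparkContext(master="spark://master-host:7077", appName="cluster-test")
print(sc.parallelize(range(1000)).count())
sc.stop()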

This article describes the installation and setup steps you can take when (1) you want to run a Spark job with PySpark and (2) you want to develop and debug it in a Jupyter notebook environment.

The main steps are:

  1. First, confirm that the host machine can run Docker.
  2. Execute the "docker run" command with the appropriate arguments to pull the image and start the container: "docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook"
  3. Connect to the Jupyter notebook server and run the PySpark sample code to confirm that the installation and setup succeeded.

Note that this approach only provides Spark local mode for simple testing; see the reference reading for more advanced ways to set up a Spark cluster environment.

Reference reading:

  1. https://spark.apache.org/docs/latest/running-on-mesos.html
  2. Jupyter/docker-stacks/pyspark-notebook


Suci Lin

Data Engineer focused on stream processing and IoT. Passionate about data storytelling with data visualization and building an engineering culture.