It is now much easier to run PySpark with Docker, especially using an image from the Jupyter Docker Stacks repository.
When you just want to try or learn Python, it is very convenient to use Jupyter Notebook as an interactive development environment. For the same reason, I wanted to run Spark through PySpark in a Jupyter Notebook.
In this article (yes, another "Running xxx on/with Docker" post), I will show you how to easily create an environment to run PySpark in a Jupyter notebook. Read on!
Prerequisites
- Docker running successfully on your machine
- Basic knowledge of Jupyter Notebook and Docker
- Read the README.md in jupyter/docker-stacks/pyspark-notebook/
Getting Started
1. Run a container
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
You can also use -v to persist data generated in the notebook inside the Docker container. I mounted my local host folder to the container's default notebook folder, "/home/jovyan/work":
"/Users/shuhsi/github" <-> "/home/jovyan/work"
docker run -it --rm -p 8888:8888 -v /Users/shuhsi/github:/home/jovyan/work jupyter/pyspark-notebook
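Once the notebook is up (step 2 below), a minimal way to confirm the mount worked, assuming the paths above, is to run this in a notebook cell:
import os
# If the mount worked, this lists the files from the host folder
# (/Users/shuhsi/github on my machine).
print(os.listdir("/home/jovyan/work"))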
2. Connect to a Jupyter notebook
# Copy/paste this URL into your browser when you connect for the first time
http://localhost:8888/?token=e144d004f6652ae6406a78adf894621e62fdeb1fc57d02e8
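If you did not copy the URL in time, you can print it again (token included) from the container's logs; <container_name_or_id> is a placeholder for your own container, which you can find with docker ps:
docker logs <container_name_or_id>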
3. Run some sample code
import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
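If the cell returns five random numbers, Spark local mode is working. As an optional further check, here is a small sketch using the DataFrame API through SparkSession; getOrCreate() reuses the SparkContext started above:
from pyspark.sql import SparkSession

# Reuses the running local SparkContext rather than starting a new one
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Build a tiny DataFrame and show it to prove the DataFrame API works too
df = spark.createDataFrame([(i, i * 2) for i in range(5)], ["x", "doubled"])
df.show()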
Conclusion
As shown above, it is very easy to create an environment to run PySpark in a Jupyter notebook by following these steps:
- Check the prerequisites first, especially the ability to run Docker.
- Execute the "docker run" command to pull the jupyter/pyspark-notebook image and start a container running the Jupyter notebook and Spark environment.
- Connect to the Jupyter notebook server and run sample code to verify the environment.
However, this environment only provides Spark local mode for testing simple Spark code. If you need cluster mode, you may check the reference article for more advanced ways to run Spark.
This article shares the installation and setup steps you can take to build such an environment when ❶ you want to run a Spark job with PySpark and ❷ you want to debug it in a Jupyter notebook.
The main steps are:
- First, confirm that the host machine can run Docker
- Run the "docker run" command with the options below to pull the image and start the container
"docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook"
- Connect to the Jupyter notebook server and run the PySpark sample code to confirm that installation and setup went smoothly
Note that this method only provides Spark local mode for simple testing; see the extended references to build a more advanced Spark cluster environment.