Run PySpark and Jupyter Notebook using Docker

Balkaran Brar
Sep 20, 2019 · 3 min read

In this article, I’ll explain the basic toolset required to write standard data analysis programs in a containerized environment using Docker. As always, my approach is to make your programs portable and platform independent.
Let’s first look briefly at what I mean by toolset and what I’m going to package in the Docker container.

  • PySpark: the Python API for Apache Spark. Because you work in Python, you can combine Spark with other Python modules in the same program
  • Apache Spark: a very popular framework for large-scale data processing. The Spark project has claimed speedups of up to 100x over Hadoop MapReduce for in-memory workloads; actual gains depend on the job
  • Jupyter Notebook: an open source web application, used mostly by data analysts and engineers, for documents that combine code, mathematical equations, data visualizations, etc
  • NumPy: a Python library for working with multi-dimensional arrays and matrices, along with high-level mathematical functions that operate on them

Now let’s dig into the technical details and see how to set up a local environment that supports PySpark, Jupyter Notebook and NumPy. Here are the step-by-step instructions:

  1. Create a new folder on your system, e.g. c:\code\pyspark-jupyter or whatever name you want to give
  2. Create a file in that folder and call it docker-compose.yaml with the content given below:
version: "3"
services:
  pyspark:
    image: "jupyter/all-spark-notebook"
    volumes:
      - c:/code/pyspark-data:/home/jovyan
    ports:
      - 8888:8888
  3. In the above file, I’m pulling an official Jupyter Docker image, mapping a local folder to a folder inside the container, and publishing container port 8888 on host port 8888. Simple, isn’t it?
  4. Now run this file with the command docker-compose up and you’ll see output similar to that shown below
PS C:\code\pyspark-jupyter> docker-compose up
Recreating pyspark-jupyter_pyspark_1 ... done
Attaching to pyspark-jupyter_pyspark_1
pyspark_1 | Executing the command: jupyter notebook
pyspark_1 | [I 14:41:40.780 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
pyspark_1 | [I 14:41:41.010 NotebookApp] Loading IPython parallel extension
pyspark_1 | [I 14:41:41.049 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
pyspark_1 | [I 14:41:41.049 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
pyspark_1 | [I 14:41:41.054 NotebookApp] Serving notebooks from local directory: /home/jovyan
pyspark_1 | [I 14:41:41.054 NotebookApp] The Jupyter Notebook is running at:
pyspark_1 | [I 14:41:41.054 NotebookApp] http://3d475598a523:8888/?token=YOUR_TOKEN
pyspark_1 | [I 14:41:41.054 NotebookApp] or
pyspark_1 | [I 14:41:41.055 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
pyspark_1 | [C 14:41:41.073 NotebookApp]
pyspark_1 |
pyspark_1 | To access the notebook, open this file in a browser:
pyspark_1 | file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
pyspark_1 | Or copy and paste one of these URLs:
pyspark_1 | http://3d475598a523:8888/?token=YOUR_TOKEN
pyspark_1 | or
  5. Copy the URL and open it in the browser of your choice
  6. You’ll see an instance of Jupyter Notebook running in the container. As you might have noticed, your local folder is mapped inside the container
An instance of Jupyter Notebook
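
One thing to watch for: the URL in the log uses the container ID (3d475598a523 in my run) as its hostname, and that name usually does not resolve from the host machine. Since port 8888 is published to the host, swapping the hostname for localhost should work. Here is a minimal sketch of that rewrite; the helper name to_localhost_url and its host_port parameter are my own, not part of Jupyter or Docker.

```python
from urllib.parse import urlsplit, urlunsplit


def to_localhost_url(logged_url, host_port=8888):
    """Swap the container hostname in a logged Jupyter URL for localhost.

    host_port is the host side of the port mapping (8888 in this article).
    The path and the token query string are kept unchanged.
    """
    parts = urlsplit(logged_url)
    return urlunsplit(
        (parts.scheme, f"localhost:{host_port}", parts.path, parts.query, parts.fragment)
    )


# Example with the placeholder token from the log output above.
print(to_localhost_url("http://3d475598a523:8888/?token=YOUR_TOKEN"))
```

If you publish the container on a different host port, pass that port as host_port.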

Now let’s create our first notebook and work with PySpark. This is just a brief introduction, as I’ll be writing separate articles about PySpark and NumPy in detail.

PySpark program running in Docker
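
To give a rough idea of what a first PySpark notebook could contain, here is a minimal sketch. It is not the notebook from the screenshot: the sample rows, the function names and the app name "docker-demo" are all made up for illustration. It averages salaries per department with a Spark DataFrame and includes a plain-Python version of the same calculation to compare against.

```python
# Hypothetical sample data: (name, department, salary).
SAMPLE_ROWS = [
    ("Asha", "Sales", 3000),
    ("Ben", "Sales", 5000),
    ("Chen", "IT", 4000),
    ("Dee", "IT", 6000),
]


def average_salary_plain(rows):
    """Plain-Python reference: average salary per department."""
    totals = {}
    for _name, dept, salary in rows:
        total, count = totals.get(dept, (0, 0))
        totals[dept] = (total + salary, count + 1)
    return {dept: total / count for dept, (total, count) in totals.items()}


def average_salary_by_dept(rows):
    """Same aggregation with PySpark, run in local mode inside the container."""
    # Imported here so the rest of the file loads even where Spark is absent.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("docker-demo").getOrCreate()
    try:
        df = spark.createDataFrame(rows, ["name", "dept", "salary"])
        result = df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).collect()
        return {row["dept"]: row["avg_salary"] for row in result}
    finally:
        spark.stop()


print(average_salary_plain(SAMPLE_ROWS))
```

In a notebook cell inside the container you would call average_salary_by_dept(SAMPLE_ROWS), and it should agree with the plain-Python result.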

The second notebook gives a brief look at NumPy

NumPy program running in Docker
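
Again as a sketch rather than the exact notebook from the screenshot, a first NumPy cell might look something like this. The variable names are my own; the calls are standard NumPy.

```python
import numpy as np

# A 2x3 matrix holding the values 1..6.
matrix = np.arange(1, 7).reshape(2, 3)

# Mean of each column (axis=0 collapses the rows).
column_means = matrix.mean(axis=0)

# Element-wise arithmetic applies to the whole array at once.
scaled = matrix * 10

# Matrix product of the 2x3 matrix with its 3x2 transpose gives a 2x2 result.
product = matrix @ matrix.T

print(matrix)
print(column_means)
print(scaled)
print(product)
```

The point is that arithmetic is applied to whole arrays at once, so no explicit Python loops are needed.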

I hope this helps you start your data analysis journey and use Docker to make your programs portable.

Happy Learning!!!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Balkaran Brar

Written by

Cloud Architect | DevOps Professional | Passionate about ML and IoT | Python Developer
