A Simple Solution to Run Multiple PySpark Instances on Multiple Jupyter Notebooks Simultaneously.

LuckSpark
Aug 28, 2018

Jupyter Notebook is a great tool for coding PySpark. I have set up my PySpark environment for Jupyter Notebook with the following configuration, taken from here, in my ~/.bash_profile on my MacBook.

export PYSPARK_DRIVER_PYTHON="jupyter"        # use Jupyter as the PySpark driver front end
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"  # open the Notebook interface rather than the console

So when I run pyspark in my terminal, Jupyter Notebook opens up instead of the usual PySpark shell.
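
For completeness, reloading the profile and launching looks roughly like this (assuming pyspark is already on your PATH):

source ~/.bash_profile   # reload the profile so the new variables take effect
pyspark                  # now launches Jupyter Notebook as configured above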

I just recently found out that I cannot run two PySpark .ipynb files concurrently. The second notebook failed with an error about instantiating hive.ql.metadata.SessionHiveMetaStoreClient, caused by ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db, exactly as discussed here.

According to some sources, e.g., here and here, this happens because Spark's default metastore is backed by Derby, a lightweight embedded database that can serve only one PySpark instance at a time. In other words, once Derby is in use by one PySpark instance, it is locked and prevents another PySpark instance from sharing it. One possible solution suggested in those discussions is to use Postgres instead of Derby.
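
As a side note, the lock is visible on disk: while a session has the metastore open, Derby keeps lock files (typically db.lck and dbex.lck, as far as I know) inside the metastore_db folder.

# Run this from the folder where the first notebook created metastore_db;
# the presence of db.lck / dbex.lck shows Derby currently has the database booted
ls metastore_db/*.lck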

Good news: there is a much simpler way to run multiple PySpark notebooks simultaneously. Just put and run the .ipynb files in separate folders. It is that simple.

Here is my scenario. I have DF1.ipynb and Spark DF Tutorial E0.ipynb under the sparkpython folder. Instead of placing these two .ipynb files directly under this folder, I created a subfolder for each of them and put them there, as shown below. After each notebook runs for the first time, its folder gets its own metastore_db and derby.log. That, I guess, is what fixes the sharing problem.

My sparkpython folder has subfolders, one for each .ipynb file.
Each .ipynb file has its own metastore_db and derby.log
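
For reference, here is a rough shell sketch of the layout described above (the subfolder names df1 and tutorial_e0 are just placeholders I made up):

# Starting from the parent folder that holds both notebooks
cd ~/sparkpython                              # path assumed; adjust to your setup
mkdir df1 tutorial_e0                         # one subfolder per notebook
mv DF1.ipynb df1/
mv "Spark DF Tutorial E0.ipynb" tutorial_e0/
pyspark                                       # opens Jupyter; open and run each notebook from its own subfolder

Both notebooks can then run at the same time without fighting over the same Derby database.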

I am not sure whether this would be a good solution for a large-scale production environment. It just works for me in a small playground environment and saves me some time over installing Postgres.

I use Spark 2.3.1 on macOS High Sierra, with Python 3.6 and Jupyter Notebook installed through Anaconda 5.2.

Hope it helps.
