A Simple Solution to Run Multiple PySpark Instances on Multiple Jupyter Notebooks Simultaneously.
Jupyter Notebook is a great tool for coding PySpark. I have set up my PySpark environment on Jupyter Notebook with the following configuration, taken from here, in my ~/.bash_profile on my MacBook.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
So when I run 'pyspark' in my terminal, the Jupyter Notebook opens up.
I just recently found out that I cannot run two PySpark .ipynb files concurrently. The second one fails to instantiate hive.ql.metadata.SessionHiveMetaStoreClient, caused by ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db, exactly as discussed here.
According to some sources, e.g., here and here, this happens because Derby, the lightweight embedded database Spark uses for its metastore, serves only one PySpark instance at a time. In other words, once Derby is in use by one PySpark instance, it is locked and prevents another PySpark instance from sharing it. One solution suggested in those discussions is to use Postgres instead of Derby.
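Before going as far as Postgres, there is also a configuration-level workaround. Here is a minimal sketch, assuming the SparkSession has not been created yet in the notebook; the app name and the /tmp/derby-notebook-1 path are placeholders of my own. It points the driver at a notebook-specific Derby system directory, so each notebook writes its own metastore_db and derby.log:
from pyspark.sql import SparkSession

# derby.system.home tells Derby where to create its databases and its log,
# so two notebooks pointed at different directories never fight over the
# same metastore_db. It must take effect before the driver JVM starts,
# i.e., before the first SparkSession/SparkContext is created.
spark = (
    SparkSession.builder
    .appName("notebook-1")  # placeholder app name
    .config("spark.driver.extraJavaOptions",
            "-Dderby.system.home=/tmp/derby-notebook-1")  # placeholder path
    .getOrCreate()
)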
Good news: I found out that there is yet another, much simpler way to run multiple PySpark notebooks simultaneously: just put and run the .ipynb files in separate folders. It is that simple.
Here is my scenario. I have DF1.ipynb and Spark DF Tutorial E0.ipynb under the sparkpython folder. Instead of placing these two .ipynb files directly under that folder, I create a subfolder for each of them and put them there, as shown below. After the first run, each subfolder gets its own metastore_db directory and derby.log file. This fixes the sharing problem because Spark creates metastore_db in the driver's current working directory, so notebooks running from different folders never contend for the same Derby database.
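The resulting layout looks roughly like this (a sketch; the df1 and tutorial subfolder names are placeholders of my own):
sparkpython/
├── df1/
│   ├── DF1.ipynb
│   ├── derby.log
│   └── metastore_db/
└── tutorial/
    ├── Spark DF Tutorial E0.ipynb
    ├── derby.log
    └── metastore_db/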
I am not sure whether this would be a good solution for a large-scale production environment, but it works for me in a small playground environment and saves me the time of installing Postgres.
The Spark version I use is 2.3.1, on macOS High Sierra. Python 3.6 and Jupyter Notebook are installed through Anaconda 5.2.
Hope it helps.