Spark with Jupyter Notebook on macOS (Spark 2.0.0 and higher)

Roshini Johri
3 min read · Oct 30, 2018


Sometimes you don’t have access to Databricks and need to use a more restricted environment. Suddenly you find yourself not using Spark, where things like data manipulation are so easy, and you have to rely on the parts of your brain covered in cobwebs where the pandas data manipulation archives lie (and Stack Overflow.. we love Stack Overflow). Well, I tried that for a bit and decided I needed some Spark sanity in my life.

I looked around the internet for how to make this happen! Loads of people came to my rescue, and I concatenated their information and updated it for the newer versions of Spark. The following is what worked for me!

$ brew install apache-spark

Assuming you have Homebrew installed, this installs Spark 2.3.0 or higher. To find out where it is installed, run the following to get your path:

$ brew info apache-spark

That should give you something like this:

/usr/local/Cellar/apache-spark/2.3.1 is the path you are looking for (your version number may differ)

The next step involves setting your SPARK_HOME. Run the following in your terminal:

$ export SPARK_HOME="/usr/local/Cellar/apache-spark/2.3.1/libexec/"
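
Note that export only sets the variable for your current terminal session. If you want SPARK_HOME to persist across sessions, you can append the line to your shell profile. A quick sketch, assuming the default bash shell (adjust the version number in the path to match your install):

$ echo 'export SPARK_HOME="/usr/local/Cellar/apache-spark/2.3.1/libexec/"' >> ~/.bash_profile
$ source ~/.bash_profile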

Test whether you have pyspark installed correctly. To do this, type pyspark in your terminal.

Expected output for running pyspark
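
If the shell comes up, you can run a quick smoke test to confirm the context actually works (sc is the SparkContext that the pyspark shell creates for you):

>>> sc.parallelize(range(100)).sum()  # tiny job to confirm Spark runs
4950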

Hopefully that worked for you and you have Spark installed! Now to make this work with Jupyter. I assume you have Jupyter installed; if not, see here.

Once installed run your python notebook:

$ jupyter notebook

Hopefully that started a new notebook for you! Type the following in a new notebook:

import os

# shell.py bootstraps pyspark and creates the spark and sc objects for you
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())

Hopefully this is what it looked like for you as well:

Spark and jupyter notebook: https://gist.github.com/ololobus/4c221a0891775eaa86b0
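
Once that cell has run, the spark session and sc context objects are defined in your notebook, and you can sanity-check them with a tiny job:

# spark and sc were created by shell.py above
spark.range(5).count()  # should return 5
print(sc.version)       # prints your Spark version, e.g. 2.3.1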

I had guidance from a GitHub page, which needed a bit of tweaking for newer versions. For the original, see the gist linked above.

The alternative suggested on that page also worked for me. For that, type the following in your terminal:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
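
These variables also only apply to the current terminal session (add them to your shell profile if you want them to stick). If you later want pyspark to go back to launching the plain shell instead of a notebook, just unset them:

$ unset PYSPARK_DRIVER_PYTHON
$ unset PYSPARK_DRIVER_PYTHON_OPTS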

Run pyspark after this. It starts the notebook automatically, with the Spark context object already available to you. So, after running the two exports above, just run the following in your terminal:

$ pyspark

This should start the jupyter notebook and you should be able to do the following:

import pyspark
from pyspark.sql.session import SparkSession

# create (or reuse) a SparkSession
spark = SparkSession.builder.appName("spark test").getOrCreate()

columns = ['id', 'dogs', 'cats']
vals = [
    (1, 2, 0),
    (2, 0, 1)
]

# create a DataFrame from the rows and column names
df = spark.createDataFrame(vals, columns)
df.show()
# example from this link

This should give you the following:
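
Since df.show() prints the DataFrame as a plain-text table, the output for the two rows above should be:

+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
+---+----+----+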

Do let me know if that works for you. Please leave comments on the problems you face and I will try my best to help fix them!

For more Spark tutorials, including how to use Spark with machine learning and big data, look here:

PS: for writing code in your Medium blog, look at this:

https://help.medium.com/hc/en-us/articles/224550008-Code-blocks-inline-code

