How to get Spark running locally with notebooks

Wesley Batista
Sep 16, 2017 · 2 min read


https://spark.apache.org/

The goal here is to show you how to set up your machine to run Spark using notebooks.

My motivation is that, a while ago, I did not know much about this, and when I first started learning Spark I found it hard to set things up just to run a simple “hello world” example. I still don’t know that much, but now I feel comfortable enough to share this post, and hopefully I can help someone get started more smoothly than I did in my first days :)

Getting Spark working for Python

The first step is to install the pip packages ‘pyspark’ and ‘jupyter’ on the machine.

sudo pip install pyspark jupyter

After that, we need to set the `SPARK_HOME` environment variable in our `~/.bashrc`, pointing to the path where the Spark installation lives. In my case that is `/usr/local/lib/python2.7/dist-packages/pyspark`.
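For example, the line to add to `~/.bashrc` would look like this (the path is the one from my machine; adjust it to wherever pip installed pyspark for you):

export SPARK_HOME=/usr/local/lib/python2.7/dist-packages/pyspark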

Also, as a last step, we need to configure two other PySpark environment variables, so we can change the default behavior of the ‘pyspark’ command.

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

That way, when we type ‘pyspark’ on the command line and hit ENTER, Jupyter Notebook opens instead of the interactive shell (just like that). Remember to reload the shell configuration first (e.g. run `source ~/.bashrc` or open a new terminal) so the new variables take effect.
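From there, a quick way to check that everything is wired up is to run something like this in a new notebook. It is just a minimal sketch: the app name and the tiny word count are illustrative, and `getOrCreate()` will reuse the `spark` session if one was already created for you by the pyspark launcher.

from pyspark.sql import SparkSession

# Create a local SparkSession (or reuse the one pyspark already started)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("hello-world") \
    .getOrCreate()

# A tiny "hello world": count words in a small in-memory dataset
words = spark.sparkContext.parallelize(["hello", "world", "hello", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('hello', 2), ('world', 1), ('spark', 1)]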

Getting Spark working for Scala

The easiest way I’ve found was using Spark Notebook, via its Docker image.

First I tried the Docker image with the most recent version of all components:

docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.1.1-hadoop-2.8.0-with-hive

It didn’t work. So I tried another version and then, bang! It worked just fine.

docker pull andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive

Then, to use the image, you just need to run the following:

docker run -p 9001:9001 -v /home/wesley/notebooks/:/opt/docker/notebooks/host andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive

Quick tip: notice the `-v` flag in the command. Change the host path according to your user and preferences, or just remove it. It is not required for the image to work, but it is how the notebooks you build get synced to your machine; otherwise they will only exist inside the running Docker container. Once the container is up, the notebook UI should be reachable at http://localhost:9001 (the port published with `-p`).

Just for the record, my setup:

  • I am using Ubuntu 16.04.
  • The jupyter version installed was 1.0.0.
  • The pyspark version installed was 2.2.0.
  • My `docker -v` command shows "Docker version 17.05.0-ce, build 89658be".
  • `java -version` shows:
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

At the time of writing, I had just reinstalled the system after upgrading my laptop, so I needed to configure everything again. I simply took advantage of the fact that the steps were still fresh in my mind 😅

I tried to keep it simple, so that’s it :D

Thanks for your time reading.
