How to get Spark running locally with notebooks
The goal here is to show you how to set up your machine to run Spark using notebooks.
My motivation for writing this is that, a while ago, I didn’t know much about any of this, and when I first started learning Spark I found it hard to set things up just to run a simple “hello world” sample. I still don’t know that much, but now I feel comfortable enough to share this post, and hopefully I can help someone get started more smoothly than I did in those first days :)
Getting Spark working for Python
The first step is to install the pip packages ‘pyspark’ and ‘jupyter’ on the machine.
sudo pip install pyspark jupyter
After that, we need to set the `SPARK_HOME` environment variable in our `~/.bashrc` with the path where the Spark installation lives. In my case that is `/usr/local/lib/python2.7/dist-packages/pyspark`.
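For example, the line in `~/.bashrc` would look like the one below (adjust the path to wherever pip placed pyspark on your machine):
export SPARK_HOME=/usr/local/lib/python2.7/dist-packages/pyspark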
Also, as a last step, we need to configure two other PySpark environment variables, so we can change the default behavior of the ‘pyspark’ command.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That way, when we type ‘pyspark’ on the command line and hit ENTER, instead of opening the shell, a Jupyter Notebook will open (just like that).
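To check that everything is wired up, here is a minimal “hello world” cell you can paste into the notebook. It is just a sketch: I’m assuming the notebook was started through the ‘pyspark’ command, so getOrCreate() will reuse the SparkSession that PySpark already created (or build a local one otherwise).

from pyspark.sql import SparkSession

# Reuse the session created by the pyspark launcher (or build a local one)
spark = SparkSession.builder.master("local[*]").appName("hello-world").getOrCreate()

# A tiny two-row DataFrame, just to prove Spark is alive
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "word"])
df.show()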
Getting Spark working for Scala
The easiest way I’ve found was using Spark Notebook via Docker.
First I tried the Docker image with the most recent versions of all the components:
docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.1.1-hadoop-2.8.0-with-hive
It didn’t work. So I tried another version and then, bang! It worked just fine.
docker pull andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive
Then, to use the image, you just need to run the following:
docker run -p 9001:9001 -v /home/wesley/notebooks/:/opt/docker/notebooks/host andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive
Quick tip: notice the `-v` flag in the command. Change the host path according to your user and preferences, or just remove it. It is not required for the notebook to work, but it is how the notebooks you build get synced to your machine; otherwise they will only be available inside the running Docker container.
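Once the container is up, the notebook UI should be reachable at http://localhost:9001 (that is what the `-p 9001:9001` mapping exposes). As a quick sanity check, you can create a new notebook and run a cell like the sketch below. One assumption here: Spark Notebook pre-creates a SparkContext, which (if I recall correctly) is exposed as `sparkContext`; if the name differs in your image, adjust accordingly.

// Distribute the numbers 1 to 100 and sum them, just to confirm Spark responds
val rdd = sparkContext.parallelize(1 to 100)
println(rdd.sum())  // should print 5050.0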
Just to keep a note of my setup:
- I am using Ubuntu 16.04.
- The jupyter version installed was 1.0.0
- The pyspark version installed was 2.2.0
- My `docker -v` command shows “Docker version 17.05.0-ce, build 89658be”
- `java -version` shows:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
At the time of writing, I had just reinstalled the system due to an upgrade I made to my laptop, so I needed to configure everything again. I just took advantage of the fact that the steps were still fresh in my mind 😅
I tried to keep it simple, so that’s it :D
Thanks for your time reading.