Machine Learning with Jupyter using Scala, Spark and Python: The Setup
Why Jupyter Notebook?
Jupyter Notebook is a tool that helps you write readable ML code: it keeps code, comments (in Markdown), formulae, images and results (as graphs/plots) together in a very presentable way. It also lets you execute code interactively, a block at a time, like the Scala and Node.js REPLs do. And autocompletion is thrown into the goodness mix as well. The presentability and ease of use of notebooks make them an ideal environment for learning a new language as well as machine learning concepts.
Jupyter comes integrated with the Python runtime (or kernel). Adding Scala + Spark or R support is left to the developer. So now the question arises!
Why Apache Spark? Why not stick with python or R?
Well, the biggest reason is scaling horizontally. With R or Python (scikit-learn + pandas) you are limited to a single machine and its RAM. Datasets of 8–100 GB are manageable that way, but…
What if you had datasets of Terabytes? What if you need to analyse data as it comes in? What if your ML model learns continuously in production?
And that is where Spark shines. It allows you to scale by adding more machines to the cluster, just like Hadoop. Failure recovery and resilience are taken care of for you. You can also develop locally on a single machine and then deploy the same code to your production Spark cluster.
Now we will see how to set up Jupyter Notebook along with Scala and Spark.
Setting up Jupyter Notebook
The easiest way to set up Jupyter Notebook is to use Anaconda (the software). Head over to https://www.continuum.io/downloads to download and install Anaconda. Anaconda includes Python, Jupyter and some popular data science libraries.
Anaconda installs with the latest Python (3.5 currently). If you already have Python 2.x on your system PATH then you will get Jupyter with Python 2 as well. Why? Anaconda considers the python that is in your PATH variable. To get Python 3 in Jupyter follow this SO Question.
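To see which Python you actually ended up with, a small check like the one below can be run in a notebook cell or from the shell. It is only a sketch: the function name interpreter_report is made up for illustration, and it uses nothing beyond the standard library.

```python
import shutil
import sys


def interpreter_report():
    """Describe the interpreter running this code and the `python` found first on PATH."""
    return {
        "running": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        # None when no executable called `python` is on PATH
        "first_python_on_path": shutil.which("python"),
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(key, "=", value)
```

If "running" and "first_python_on_path" point at different installations, that mismatch is a likely reason for a notebook opening with an unexpected Python version.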
Starting Jupyter is as easy as clicking on the Navigator icon on the desktop and then launching Jupyter. You can also start Jupyter by running the below from a shell:
jupyter notebook
I also faced some issues with missing packages while starting. I had to install 2 packages manually; your case may vary.
pip install --upgrade jsonschema
pip install --upgrade jsonpointer
Once started, open http://localhost:8888/tree to check it out. You should see a file browser showing your home directory and its file tree. At the top right there is a New button, which you click to create a new notebook. Remember to browse to the folder where you want the notebook before you create it. Also, after setting up a new kernel I had to restart Jupyter. If you don't see a kernel listed under New, you will need to do the same.
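If a kernel does not show up under New, it can help to ask Jupyter directly which kernels it knows about. The sketch below assumes the jupyter_client package (installed alongside Jupyter) is importable from the Python you run it with; installed_kernels is just an illustrative name.

```python
def installed_kernels():
    """Return {kernel name: kernel spec directory} for the kernels Jupyter can see."""
    try:
        from jupyter_client.kernelspec import KernelSpecManager
    except ImportError:
        # Jupyter is not installed for this interpreter
        return {}
    return dict(KernelSpecManager().find_kernel_specs())


if __name__ == "__main__":
    for name, path in sorted(installed_kernels().items()):
        print(name, "->", path)
```

A kernel that is missing from this listing was never registered; one that is listed but absent from the New menu usually just needs the Jupyter restart mentioned above.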
Btw if you like dark theme for your notebooks then check this out: https://userstyles.org/styles/98208/jupyter-notebook-dark-originally-from-ipython
Setting up R kernel
If you installed Jupyter using Anaconda then R install is just a single command. Skip the rest of this section after this command.
conda install -c r r-essentials
If you installed Jupyter through pip then you need to follow the steps below.
Install R for your OS. Go to https://cran.rstudio.com/ and choose the appropriate package for your OS. Now you need to install the R kernel for Jupyter.
Just 2 steps; run them in the R console.
Step 1: Install needed packages
install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))
devtools::install_github('IRkernel/IRkernel')
Step 2: Register the kernel in Jupyter
IRkernel::installspec(user = FALSE)
More info or troubleshooting can be found on https://irkernel.github.io/installation/
Now restart your jupyter and then create a new notebook of type R. You will see that you can type R commands here and execute them line by line.
Note: To execute a block press Shift+Enter (a plain Enter just gives a newline).
If you plan to use only R or Python for Machine Learning then the rest of the article has nothing for you. Stuff ends here in that case.
Setting up Scala + Spark Kernel with Apache Toree
This one is the longest part of setup.
Getting Scala and Spark
I assume that you will want to be on the latest Scala as well. If you don't have Scala or Spark installed then get Scala from https://www.scala-lang.org/download/ and Spark from http://spark.apache.org/downloads.html. After you have downloaded Spark, extract it and move the folder to some convenient location. Add the Spark executables to your PATH.
# replace spark-home with the folder where you put spark
export PATH="spark-home/bin:$PATH"
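A quick way to confirm the PATH change took effect is to look up Spark's launcher scripts from a fresh shell. This is a sketch only: spark-submit, spark-shell and pyspark are the launchers shipped in Spark's bin folder, while the names SPARK_TOOLS and locate are invented for this example.

```python
import shutil

# Launcher scripts found in the bin folder of a Spark download
SPARK_TOOLS = ("spark-submit", "spark-shell", "pyspark")


def locate(tools=SPARK_TOOLS):
    """Map each tool name to its full path, or None when it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in locate().items():
        print("{:<14}{}".format(tool, path or "NOT FOUND on PATH"))
```

Any line reporting NOT FOUND means the export has not been picked up yet, for example because the profile file has not been re-read by the current shell.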
Caveat: If you are on a Mac do not use Homebrew. The Homebrew distribution is missing the Python packages from Spark.
The current Apache Toree distribution is compiled for Scala 2.10.x and Spark 1.6. A bug in Toree prevents that distribution from running on Scala 2.11 and 2.12 and Spark 2.1. As such I ended up recompiling from sources.
See this issue for reference: https://issues.apache.org/jira/browse/TOREE-336. Once a release is made with this fix you will no longer need Docker or gpg, and recompiling will not be needed either. A simple pip install as below will work, and then you can jump to the Final Step: Attaching Toree Kernel to Jupyter.
pip install toree
Before we compile Toree we need Docker and GnuPG (gpg). Install Docker from https://docs.docker.com/engine/installation/. You need to have admin rights on your machine for this. A potential issue with using Docker is described at https://github.com/docker/docker/issues/6476. Essentially it says: Cannot start container: Port has already been allocated. Just restart Docker to resolve this.
Go to the Compiling Toree section; if the Toree release fails because gpg is not present, come back here.
After Docker we need to install gpg, which is needed to sign the pip package that you are gonna compile. Install gpg as
# system-package-manager install gpg
# For Mac
brew install gpg
Go to the gpg configuration directory (usually ~/.gnupg), open gpg.conf in an editor and add the following:
# Add the below lines to the opened file
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 ZLIB BZIP2 ZIP Uncompressed
Based on the shell you are using, add the line below to your shell profile (.bash_profile) or rc file (.zshrc). gpg needs a tty to read input from the user, and since it does not know where your tty is, we need to tell it. The setting the GnuPG documentation recommends for this is:
export GPG_TTY=$(tty)
Now generate a key for gpg by running the command below.
gpg --gen-key
Compiling Toree
Clone the Toree sources:
git clone git@github.com:apache/incubator-toree.git
Find the two lines that set the Scala and Spark versions and change them to match the versions you installed.
Finally make the release; this step will take some time. Docker should be running before you start this step.
make release
The previous step will output where the release files are. You need to use pip to install from them.
pip install toree-cloned-dir/dist/toree-pip/toree-0.2.0.dev1.tar.gz
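The version in the file name may differ from the one shown above, depending on what you built. As a rough sketch, the helper below looks for the newest toree-*.tar.gz in the dist/toree-pip folder of the clone; it assumes the release lands in that folder as in the path above, and both newest_toree_sdist and the toree-cloned-dir placeholder are illustrative.

```python
import glob
import os


def newest_toree_sdist(clone_dir):
    """Return the most recently modified toree-*.tar.gz under dist/toree-pip, or None."""
    pattern = os.path.join(clone_dir, "dist", "toree-pip", "toree-*.tar.gz")
    candidates = glob.glob(pattern)
    return max(candidates, key=os.path.getmtime) if candidates else None


if __name__ == "__main__":
    # "toree-cloned-dir" stands for wherever you cloned the sources
    print(newest_toree_sdist("toree-cloned-dir"))
```

The printed path is what you hand to pip install; None means the release step did not produce a package in that folder.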
Final Step: Attaching Toree Kernel to Jupyter
jupyter toree install --spark_home=spark-home --interpreters=Scala,PySpark,SparkR,SQL
Finally, start a Scala + Spark notebook from the top right corner: New -> Apache Toree - Scala. The notebook can execute Scala + Spark code and has a Spark context initialised as well.
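Since the install command above also registers the PySpark interpreter, a tiny job makes a handy smoke test for the new kernel. The sketch below assumes the PySpark notebook exposes its Spark context under the name sc (an assumption, not something checked here); sum_of_squares is a made-up helper that only uses the standard RDD calls parallelize, map and sum.

```python
def sum_of_squares(sc, n):
    """Compute 1^2 + 2^2 + ... + n^2 as a Spark job on the context `sc`."""
    return sc.parallelize(range(1, n + 1)).map(lambda x: x * x).sum()


# In a PySpark notebook where a context named `sc` already exists (assumed):
# print(sum_of_squares(sc, 100))
```

For n = 100 the job should return 338350; getting that number back means the kernel, the Spark context and the executors are all talking to each other.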
Why this Blog?
Scalable machine learning is at the heart of some of the most used services we depend on. These have been enriching our lives in many ways by providing suggestions, filtering spam, preventing credit fraud etc. As such I decided to learn myself some ML. Besides, and more importantly, it's fun.
In the next post I will be covering a few basic operations with Scala and Spark. This blog post is the first in a series of posts that I am planning to write while learning machine learning. I am planning to focus mostly on the Spark and Python stacks, and not planning to cover the R stack. I will be putting all my Jupyter notebooks on GitHub as well, so I can reference them here.