Part 1: How to add a custom library to a Jupyter Scala notebook in IBM Data Science Experience (DSX)
I have been using IBM’s Data Science Experience platform for a few months now. It’s a great platform for performing data analyses using the latest tools like Jupyter Notebooks and Apache Spark. If you are at all familiar with Jupyter Notebooks, you know that they are great for sharing code and quick analyses. However, I have found that while a notebook is good for presenting high-level code, it doesn’t work well when you have too much low-level code, because you can quickly get lost within the notebook and lose the main point. With this in mind, I thought it would be a “fun” experiment to write a few custom Scala packages that could be used to de-clutter my notebooks.
This post covers the basic environment setup you need to create your own custom library and shows a simple hello world example of how to do this. A follow-up post will show a more advanced example of how you can use this pattern to write a simple Spark benchmark program.
Here is the list of things you need to set up in your local environment (note: I document what I did on my MacBook Pro; the same steps should apply to Linux environments, but your mileage may vary).
Environment setup outline
- Set up an account in IBM DSX (datascience.ibm.com)
- Deploy a project and set up a notebook in DSX
- Install Scala
- Install the Scala build tool (SBT)
- Download the Hello World example from my GitHub repo
- Compile your jar file and load it into DSX
- Test your Hello World library
1. Set up an account in IBM Data Science Experience
Browse to datascience.ibm.com and sign up for a free trial. This gives you a 30-day free trial of DSX that includes Jupyter, Spark, RStudio and other great tools. You will need to provide a personal email address to set up the account. This step should take about 10–15 minutes to complete.
2. Deploy a project and set up a notebook in DSX
Once logged into DSX, click the add new project icon located in the upper right-hand corner.
Fill in the name of your project, and select the defaults for the Spark and Object Storage instances. Click ‘create project’.
Next, set up a notebook in your project.
Once you have typed in the notebook name and selected the language and Spark service, click create notebook. We will now jump over to setting up the development environment on your local machine.
3. Install Scala
There are a few ways to install Scala on the Mac. You can either use brew or download the tarball directly from the Scala download page. I prefer to download the tarball and do a manual setup because I can pick the exact version of Scala I want. Here is a quick example using the terminal.
cd <YOUR_SCALA_DIR> # Path for your scala binaries
tar -zxvf scala-2.11.8.tgz
Once you have done this, add the unpacked Scala bin directory to the PATH in your shell environment. On my Mac, I configure this in my .bash_profile.
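The exact values depend on where you unpacked the tarball. As a minimal sketch, assuming a hypothetical install location of $HOME/scala (substitute your own <YOUR_SCALA_DIR>), the lines in .bash_profile might look like this:

```shell
# Assumption: the tarball was unpacked under $HOME/scala; adjust to your <YOUR_SCALA_DIR>
export SCALA_HOME="$HOME/scala/scala-2.11.8"
# Put the scala and scalac launchers at the front of the PATH
export PATH="$SCALA_HOME/bin:$PATH"
```

Run source ~/.bash_profile (or open a new terminal) for the change to take effect.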
To test your Scala install, simply type ‘scala’ at your command line, and you should see the Scala interpreter start.
4. Install SBT
SBT stands for simple build tool (which I think is a misnomer, but that will have to wait for a different post). We will use it to compile and package your Scala code. Follow the steps below on your machine to set up sbt.
tar -zxvf sbt-0.13.13.tgz
Modify your .bash_profile so that the sbt bin directory is on your PATH.
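A sketch of that change, assuming the sbt tarball was unpacked into a hypothetical $HOME/sbt directory (the top-level directory name inside the tarball can differ between sbt releases, so check what tar actually created):

```shell
# Assumption: SBT_HOME points at the directory that contains bin/sbt
export SBT_HOME="$HOME/sbt"
# Make the sbt launcher script available on the PATH
export PATH="$PATH:$SBT_HOME/bin"
```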
To test your SBT install, type “sbt” at the command line and SBT should load.
5. Clone Hello World git repo
Next we will grab a hello world example from GitHub. We will compile this example using sbt and create a simple jar file that we can upload into DSX. The advantage of doing it this way is that the directory structure and a simple build.sbt file are already defined, so compiling and assembling the jar should be easy.
cd <YOUR_CODE_DIR> # Path for your code
git clone https://github.com/dustinvanstee/dv-hw-scala.git
6. Compile the HelloWorld code and turn it into a skinny jar file
The Hello World code is exceedingly simple, but it’s important to review a few key aspects of the code so that we can reference them in our Jupyter notebook.
The main things to note are the package line, the object name, and the main function definition. The package line is important because it determines the name you will use when you do an import in the Jupyter notebook. The object is important because it is what we will refer to in the notebook (a Scala object is a singleton, so it is used by name rather than instantiated), and finally we will call its main function (this will be clear in the notebook).
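The actual source file lives in the repo cloned in step 5. As a rough sketch of the shape described above, and not necessarily the exact file in the repo, the commands below write an equivalent source file into a scratch directory. The package name dv.hw and the object name HelloWorld are inferred from the class files in the jar listing in step 6; the scratch directory, the file path and the body of main are assumptions.

```shell
# Scratch location so the cloned repo is left untouched (hypothetical path)
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/src/main/scala/dv/hw"

# Write a minimal Scala source file with the three pieces discussed above
cat > "$demo_dir/src/main/scala/dv/hw/HelloWorld.scala" <<'EOF'
package dv.hw                               // the name you import in the notebook

object HelloWorld {                         // singleton object, referenced by name
  def main(args: Array[String]): Unit = {   // entry point called from the notebook
    println("Hello, world!")
  }
}
EOF

# Show the file that was written
cat "$demo_dir/src/main/scala/dv/hw/HelloWorld.scala"
```

The directory layout src/main/scala/<package path> is the default place where sbt looks for Scala sources.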
To compile this code, we will use sbt as shown.
# Path for your code, it should have build.sbt in this dir
[info] Packaging .. scala-2.10/dv-hw-scala-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Dec 12, 2016 2:04:27 PM
SBT has a number of commands available, but here we use the compile command, which compiles our code into class files. Next we use the assembly command to create a skinny jar file. SBT uses the build.sbt file in the project directory to control the build. It performs many tasks, but the key ones here are automatic dependency management and defining how the jar should be built.
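Put together, the two commands are run from the directory that contains build.sbt. The sketch below wraps them in a small shell function; build_jar is a hypothetical helper name, not part of sbt, and the directory you pass in is wherever you cloned the repo.

```shell
# Hypothetical helper: compile and assemble the project found in the directory given as $1
build_jar() {
  project_dir="$1"
  # sbt needs build.sbt in the directory it is started from
  if [ ! -f "$project_dir/build.sbt" ]; then
    echo "no build.sbt found in $project_dir" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged
  (cd "$project_dir" && sbt compile && sbt assembly)
}
```

For example, build_jar "$HOME/code/dv-hw-scala" if the repo was cloned under a hypothetical $HOME/code. The assembly command only exists because the project loads the sbt-assembly plugin.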
Note: I am using the sbt-assembly plugin for this example. It is added to the project automatically because I declared it in the ./project/plugins.sbt file. Sbt-assembly provides the capability to build portable jar files. It can be used to build “uber” jar files that have all the library dependencies built into the jar, but it can also be used to build skinny jar files, which are useful in notebook environments where most of the libraries are already available. I am using the latter method here.
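For orientation, here is a sketch of what such a configuration can look like with sbt 0.13 and sbt-assembly 0.14.x. It is not copied from the repo: the plugin version, the Scala version and the scratch directory are assumptions, while the project name and version are inferred from the jar file name. The two include options are what keep the Scala library and other dependencies out of the jar.

```shell
# Scratch project directory (hypothetical) so nothing in the cloned repo is overwritten
sketch_dir="$(mktemp -d)"
mkdir -p "$sketch_dir/project"

# project/plugins.sbt: load the sbt-assembly plugin (version is an assumption)
cat > "$sketch_dir/project/plugins.sbt" <<'EOF'
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
EOF

# build.sbt: project metadata plus the option that makes the jar skinny
cat > "$sketch_dir/build.sbt" <<'EOF'
name := "dv-hw-scala"

version := "1.0"

// Assumed from the scala-2.10 directory in the jar path
scalaVersion := "2.10.6"

// Leave the Scala library and all dependencies out of the assembled jar
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false, includeDependency = false)
EOF
```

With these settings, sbt-assembly's default output name is <name>-assembly-<version>.jar under target/scala-<binary version>, which matches the jar path used in this post.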
To verify the contents of the jar file, use the jar -tvf command as shown below.
$ jar -tvf .../target/scala-2.10/dv-hw-scala-assembly-1.0.jar
273 Mon Dec 12 14:04:26 EST 2016 META-INF/MANIFEST.MF
0 Mon Dec 12 14:04:26 EST 2016 dv/
0 Mon Dec 12 14:04:26 EST 2016 dv/hw/
603 Mon Dec 12 14:04:26 EST 2016 dv/hw/HelloWorld$.class
600 Mon Dec 12 14:04:26 EST 2016 dv/hw/HelloWorld.class
As you can see, we have only the class files from our simple build. If you had built an uber jar file, you would potentially have hundreds of lines for different Scala packages. We don’t want this, as it would conflict with the Scala packages already installed in DSX.
7. Test your HelloWorld library
Once you have built your jar file, you need to make it available to the DSX environment. So far, the best method I have found is to host the jar right back on GitHub. I tried to use the freely available object storage that comes with DSX, but I couldn’t quickly figure out how to access that file within my environment. So let’s get back to the DSX notebook you created in step 2.
In your first cell, add this line to load your jar file (feel free to use my URL for testing):
%AddJar https://github.com/dustinvanstee/dv-hw-scala/raw/master/target/scala-2.10/dv-hw-scala-assembly-1.0.jar -f
In the next cell, add two lines that import the library and call its main function, for example import dv.hw.HelloWorld followed by HelloWorld.main(Array()), and run the cell.
If all was successful, you should see Hello, world! echoed to the screen.
While this tutorial may not seem too exciting, you have accomplished quite a bit in terms of your environment. You are now ready to create your own custom Scala libraries that you can call from within your Jupyter Scala notebooks. In my next post, we will see how to build on this simple pattern to write some custom Spark benchmarking code that we can call from our notebooks!