A How To: Apache Spark & Jupyter

Published in DeveloperAcademy, Apr 25, 2018

The Jupyter Notebook, which grew out of the IPython Notebook, is a web-based interactive environment that works well for Spark development. Jupyter lets users write Scala, Python, or R code against Apache Spark, execute it in place, and document it using markdown syntax.

Writing code in an interactive web page feels natural. The user can write a few lines of code, execute them, fix errors, and then add more code (and fix that). All of this is easier than using the cursor keys to step through the command history, or using a text editor that has no interpreter and no Spark connection. On top of this, with the Docker image used below, the Jupyter Notebook user does not need to perform any configuration or be concerned about the details of the Spark installation.

Running and Installing

Running Jupyter is as easy as installing Docker and then running this one command, which downloads the image from Docker Hub and starts a container:

docker run -d -p 8888:8888 jupyter/all-spark-notebook

Then open Jupyter by navigating to localhost:8888 in your browser. If the page asks for a token, run docker logs against the container; the startup output includes a login URL that contains the token.

When you click New, Jupyter lets you choose to write Scala, Python 2 or 3, or R code. Interpreters exist for other languages as well.

At this point, a new notebook opens with an empty box into which you can type. Each of these boxes is called a cell. A cell can contain code to be executed or markdown to be rendered.

Construct an RDD

Just as when you use the Spark shells, when you write code in Jupyter there is no need to create the SparkContext or SQLContext or to import them, since both are already brought into scope automatically.
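A quick way to confirm this is to inspect the context in a fresh cell. The sketch below assumes a Scala kernel that binds the SparkContext to the name sc, as the Spark shell does; the values printed depend on your installation.

```scala
// Assumes the notebook kernel has already created `sc` (a SparkContext),
// as the Spark shell does. No imports or setup are needed.
val sparkVersion = sc.version   // version of Spark the kernel is running
val master = sc.master          // master URL the context is connected to
val appName = sc.appName        // application name chosen by the kernel

println(s"Spark $sparkVersion, master = $master, app = $appName")
```

If the cell fails with an error saying sc is not found, the kernel you picked is probably not a Spark-enabled one.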

Now we can construct an RDD. You simply write this code into a cell and then click Cell/Run Cells.

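// A local Scala collection to distribute
val data = Array(1, 2, 3, 4, 5)

// sc is the SparkContext the notebook provides; parallelize turns the array into an RDD
val distData = sc.parallelize(data)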
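An RDD only becomes useful once you run an operation on it. As a minimal sketch, assuming the sc provided by the notebook kernel and repeating the two definitions so the cell can run on its own, the following sums the elements and doubles each one:

```scala
// Assumes `sc` is the SparkContext provided by the notebook kernel.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// reduce is an action: it runs the job and returns the sum to the driver
val total = distData.reduce((a, b) => a + b)

// map is a lazy transformation; collect is the action that brings results back
val doubled = distData.map(_ * 2).collect()

println(s"total = $total")                        // total = 15
println(s"doubled = ${doubled.mkString(", ")}")   // doubled = 2, 4, 6, 8, 10
```

Avoid collect on large RDDs, since it copies every element back to the notebook process.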

Working with the Notebook

You can change the title of the notebook by typing over the word “Untitled” at the top of the screen. You do not need to save by hand: Jupyter autosaves your changes to a .ipynb file as you work.

Add blank cells by clicking Insert.

As you work on your program, the screen will fill with errors and run output. Click Cell/All Output/Clear to clear all output.

Markdown

Markdown is the syntax used to write README.md pages on GitHub. Use it to create headers, bulleted and numbered lists, and code blocks. You can use this cheat sheet for markdown.

To change a cell from code to markdown, click Cell/Cell Type/Markdown.

Jupyter may highlight the markdown as you type, but it renders the formatted text only when you run the cell. To render it, click Cell/Run Cells as usual.

Deploying Jupyter

If you want to run Jupyter over the public internet, put Nginx or Apache in front of it as a reverse proxy server. The proxy serves Jupyter on port 80, so there is no need to change your firewall rules to open port 8888. Be sure to set a password, since Jupyter notebooks also let you run Bash commands. An attacker could do real damage to your computer if you left the server open.

Jupyter is generally configured to work for one person, i.e., with a local installation of Spark. But you can make it run atop a Spark cluster managed by Mesos. Here are some instructions for that.
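As a rough sketch only, Spark addresses a Mesos cluster through a master URL of the form mesos://host:port. The host name, port, and application name below are placeholders, and in a notebook the master is normally supplied when the kernel starts rather than in a cell, so treat this as an illustration of the setting, not a deployment recipe.

```scala
import org.apache.spark.SparkConf

// Hypothetical host and port: replace with the address of your Mesos master.
val mesosMaster = "mesos://mesos-master.example.com:5050"

// Building a SparkConf does not contact the cluster; it only records settings.
val conf = new SparkConf()
  .setAppName("jupyter-on-mesos")   // hypothetical application name
  .setMaster(mesosMaster)

println(conf.get("spark.master"))
```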

About the Author: Al Nelson

Al is a geek about all things tech. He’s a professional technical writer and software developer who loves writing for tech businesses and cultivating happy users. You can find him on the web at http://www.alnelsonwrites.com or on Twitter as @musegarden.

Written by Develop Intelligence