Spark Tutorial — Hello World

Luck Charoenwatana
LuckSpark

--

This tutorial will guide you through writing your first Apache Spark program in Scala, as a self-contained application rather than an interactive session in the Spark shell.

The main objective is to jump-start your first Scala code on the Spark platform with a very short and simple program, i.e., a real “Hello World”. The focus is to get the reader through a complete cycle of setup, coding, compiling, and running fairly quickly. The code does not use any fancy features of Spark at all.

Objectives

What you will learn includes:

  • To create the directory structure of a Scala Spark program
  • To set up the .sbt file
  • To set up and write some code in the .scala file
  • To print strings using print and println
  • To compile and run the Scala code on the Spark platform

Let’s get started.

0. Tutorial Environment

The Apache Spark 2.3.0 used in this tutorial was installed with the tools and steps explained in my installation tutorial. I summarize my Spark-related system information again here.

  • macOS High Sierra 10.13.3. I guess that older macOS versions like 10.12 or 10.11 should be fine. This tutorial can certainly be used as a guideline for other Linux-based OSes too (of course with some differences in commands and environments)
  • Apache Spark 2.3.0, JDK 8u162, Scala 2.11.12, sbt 0.13.17, Python 3.6.4

The directories and paths related to the Spark installation are based on this installation tutorial and remain intact. Just make sure that you can run pyspark or spark-shell from your Home directory, so that we can compile and run our code in this tutorial. A quick check is sketched below.
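If you want to verify this before proceeding, you can launch the shell and exit it again (a minimal check; :quit is the standard way to leave the Scala-based Spark shell):

cd            # change directory to HOME
spark-shell   # should start the interactive Spark shell
# once the scala> prompt appears, type :quit to exit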

1. Essential Files

There are 2 files that you have to write in order to run a Scala Spark program:

  1. A .sbt file, which is a build configuration file, similar to a Makefile in a C project.
  2. A .scala file. This is where your Scala code resides.

These files, however, must be put in a certain directory structure explained in the next section.

2. Directory Structure

  • First, you have to create your project’s directory, in this case named hello.
  • Right inside the project directory is where you put your .sbt file.
  • For the .scala file, you have to create the directories src/main/scala inside the project directory. Then you put your .scala file there.
  • The figure below shows the files and directory structure.
Directory structure of Scala Spark project. The project directory is “hello”. There must be a .sbt file inside the project directory. There must be .scala file under src/main/scala inside the project directory.
  • Let me fast-forward to the directory structure after the Scala code is compiled: there will be 2 new directories created, target and project, as shown in the figure below.
The new “target” and “project” directories after the compilation.
  • The compiled .jar file used to run the project is under the target directory. A plain-text sketch of both layouts follows.
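For quick reference, here is a sketch of the two layouts described above (the top tree is what you create by hand; project and target appear only after compilation):

hello/                        # project directory
├── hello.sbt                 # sbt configuration file
└── src/
    └── main/
        └── scala/
            └── hello.scala   # Scala source code

hello/                        # after running sbt package
├── hello.sbt
├── project/                  # created by sbt
├── src/...
└── target/                   # created by sbt; contains the .jar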

3. Create Directory Structure

Now let’s create the directory structure discussed above using command line on Terminal.

Please note that I will create a directory named scalaSpark under my Home directory. This directory will contain all my Scala-based Spark projects in the future. Thus, in this tutorial the main project, named hello, is located at /Users/luckspark/scalaSpark/hello/ or ~/scalaSpark/hello/. You can pick any other location (path) you wish and modify the path accordingly.

Here are step-by-step procedures.

  • Open the Terminal program.
  • Make sure that you are at your Home directory by entering the command cd.
  • Create a directory named scalaSpark under the Home directory, and a directory named hello under the scalaSpark directory, using the mkdir -p scalaSpark/hello command.
  • Get inside the hello directory by typing cd scalaSpark/hello.
  • Create the src/main/scala directory inside the hello directory by typing mkdir -p ./src/main/scala.

Here is a summary of all commands above.

cd                          # change directory to HOME
mkdir -p ./scalaSpark/hello # create new directories
cd scalaSpark/hello         # change directory into hello
mkdir -p ./src/main/scala   # create new directories
# Note that text after # is treated as a comment, so it won't be run.
# Thus, you can copy all lines, including the comments, and paste them onto the Terminal shell prompt.

This is how it looks when you copy and paste the lines above into the Terminal app.

Copy and Paste lines onto the Terminal.

In the Finder, the new directories should appear.

4. Create SBT Configuration File

Now it is time to set up the sbt configuration file. The location of this file is right under the project’s directory, in this case ~/scalaSpark/hello.

  • You can create hello.sbt using these commands:
cd ~/scalaSpark/hello        # change directory
touch hello.sbt              # create the hello.sbt file
open -a TextEdit hello.sbt   # open hello.sbt in the TextEdit app
Open hello.sbt using TextEdit.
  • Add these lines to hello.sbt
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
~/scalaSpark/hello/hello.sbt
  • Save the file.

Before we proceed, let’s explain the configuration in more detail.

  • In this file, you simply put keys and values with := in the middle, which makes the configuration pretty much self-explanatory.
  • In the first line, the name hello will be reflected in the name of the .jar file created by the compilation, which follows the pattern name_scalaVersion-version.jar. For example, if you change the value from hello to hello1234, the jar file will be named hello1234_2.11-1.0.jar rather than hello_2.11-1.0.jar.
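As an aside, this build file declares no library dependencies at all, because the code below never calls the Spark API. If you later write code that does use Spark classes, you would add a dependency line to hello.sbt along these lines (a sketch, assuming Spark 2.3.0 and the "provided" scope, since spark-submit supplies the Spark jars at run time):

name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
// Spark classes are needed at compile time only; spark-submit provides them when the job runs.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"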

5. Create Scala File

Now let’s create your Spark program’s source code. This file lives at ~/scalaSpark/hello/src/main/scala and is named hello.scala.

  • Create hello.scala using these command lines
cd ~/scalaSpark/hello/src/main/scala   # change directory
touch hello.scala                      # create hello.scala
open -a TextEdit hello.scala           # open hello.scala for editing
  • Add these lines to hello.scala
object hello
{
  def main(args: Array[String])
  {
    print("\n\n>>>>> START OF PROGRAM <<<<<\n\n");

    println("Hello World.")

    print("\n\n>>>>> END OF PROGRAM <<<<<\n\n");
  }
}
  • Save the file. It should look like this.
~/scalaSpark/hello/src/main/scala/hello.scala

Let’s examine the code

  • This code defines a Scala object hello, which has only one method, main.
  • It does not use any fancy features of Spark at all (a sketch of what a minimal “real” Spark program might look like follows this list).
  • It just prints out 3 messages, using print and println. The only difference between the two is that println appends a newline at the end.
  • The semicolon at the end of a line is optional in Scala. If your fingers are used to typing it at the end of each line, just do it. Otherwise, you can leave it out.
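For the curious: this program runs on Spark but never touches the Spark API. A minimal variant that actually exercises Spark might look like the sketch below. This is my illustration, not part of this tutorial’s code; it assumes the spark-core dependency mentioned in section 4 and uses the standard SparkContext entry point.

import org.apache.spark.{SparkConf, SparkContext}

object helloSpark
{
  def main(args: Array[String])
  {
    // Configure and start a Spark context; when launched via spark-submit
    // without a --master option, the master defaults to local mode.
    val conf = new SparkConf().setAppName("helloSpark")
    val sc = new SparkContext(conf)

    // Distribute a tiny dataset, collect it back, and print it on the driver.
    val words = sc.parallelize(Seq("Hello", "World"))
    words.collect().foreach(println)

    sc.stop()
  }
}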

That’s it. Let’s compile and run the code.

6. Compile and Run

  • To compile and run the project, you have to change directory back to the root of the project, which is ~/scalaSpark/hello in this case.

6.1. Compile

  • Then you can compile the code using the sbt package command.
cd ~/scalaSpark/hello   # change directory back to the project root
sbt package             # compile the code
  • It might take some time to compile, as it has to download some dependencies.
  • There could also be some warnings.
  • It should, however, end with a [success] message, as shown below.
sbt package
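If a build ever gets into a strange state, you can start fresh with sbt’s standard clean task before packaging again:

sbt clean package   # delete previous build output, then compile again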

6.2. Run

  • After compilation, a number of new files will be created under new directories named project and target.
  • Among these new files, we will use the jar file under the target/scala-2.11 directory to run the code. The file is named after the project’s name, the Scala version, and the code’s version, in this case hello_2.11-1.0.jar.
  • Run the code using this command
spark-submit ./target/scala-2.11/hello_2.11-1.0.jar
  • As expected, you should see the 3 strings printed by the code.
  • There might be some warnings, but that is fine. A couple of extra checks are sketched below.
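If spark-submit ever complains that it cannot find a main class (for example, once the project contains more than one main method), you can confirm that the jar exists and name the class explicitly with the standard --class option:

ls ./target/scala-2.11/   # confirm the jar is there
spark-submit --class hello ./target/scala-2.11/hello_2.11-1.0.jar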

That’s it. Enjoy your coding. — Peace.
