Apache Spark setup with Gradle, Scala and IntelliJ

In my previous post I outlined the setup process for Spark with Jupyter notebook. Since writing that post I have used Jupyter extensively for small Spark programs, and I felt that IntelliJ (or another IDE) could also be a good choice for development.

I went ahead and created a skeleton Apache Spark project in Scala, using Gradle for the build. It can be imported into your favorite IDE for quick bootstrapping.

Prerequisites

All you need installed is a JDK and Git. The Gradle wrapper bundled with the repository downloads Gradle itself, and the Scala library is pulled in as a build dependency.

Downloading and Running the Skeleton Project

  • Clone the repository:
git clone https://github.com/faizanahemad/spark-gradle-template.git
  • Run the project from the command line:
./gradlew clean run
[Image: output of running on the command line]

The output shows:

  • the Spark version,
  • the sum of 1 to 100,
  • the first 2 rows of a CSV file it reads,
  • the average of the age field in that file.

Setup

Setting up the Gradle build

We need to declare the Scala library and Spark library jars as dependencies in our Gradle build. We also create a run task so the project can be launched from the command line. The scalaVersion parameter is defined in the gradle.properties file.
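A minimal build.gradle along these lines gives the idea. This is only a sketch: the plugin setup, dependency versions and main class name here are assumptions, not the exact contents of the repository's build file.

// build.gradle (sketch; versions and the main class name are assumptions)
apply plugin: 'scala'        // compiles Scala sources
apply plugin: 'application'  // provides the run task used by ./gradlew run

mainClassName = "Main"

repositories {
    mavenCentral()
}

dependencies {
    // scalaVersion (e.g. 2.11) is read from gradle.properties
    compile "org.scala-lang:scala-library:${scalaVersion}.8"
    compile "org.apache.spark:spark-sql_${scalaVersion}:2.2.0"
}

run {
    // forward stdin so the program behaves like a normal console launch
    standardInput = System.in
}

With something like this in place, ./gradlew clean run compiles the Scala sources and launches the main class.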

The Real Code

If you have used the Spark shell or a Jupyter notebook for Spark, you will have noticed that both of them initialise a few variables for you from the get-go. A SparkSession, which acts as the entry point to all things Spark, was pre-initialised in them. We don't want to be typing its initialisation code every time in a new program.

val spark: SparkSession = SparkSession.builder()
  .appName("Spark example")
  .master("local[*]")
  .config("option", "some-value")
  .getOrCreate()

So we will put this and some other initialisation code in a trait, and then extend that trait whenever we want to write a Spark program.
In the InitSpark trait we do the following (a sketch of the trait follows the list):

  • Initialise a SparkSession as spark.
  • The master("local[*]") setting means Spark runs locally, using as many worker threads as there are cores in your system.
  • Expose the SparkContext as sc and the SQLContext as sqlContext.
  • Define reader for reading files with headers and schema inference. reader is of type DataFrameReader, which has methods to read CSV, text files and JSON.
  • Define and call a private init function, which suppresses logging by changing the log level to ERROR.
  • Define a close function, which we call from our driver once our Spark program is over.
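Here is a sketch of what such a trait can look like, following the steps above. The exact option keys and logging calls are my assumptions; see the InitSpark trait in the repository for the real version.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrameReader, SQLContext, SparkSession}

trait InitSpark {
  // Entry point to all things Spark; local[*] uses as many threads as there are cores
  val spark: SparkSession = SparkSession.builder()
    .appName("Spark example")
    .master("local[*]")
    .getOrCreate()

  val sc: SparkContext = spark.sparkContext
  val sqlContext: SQLContext = spark.sqlContext

  // DataFrameReader pre-configured for files with a header row and schema inference
  def reader: DataFrameReader = spark.read
    .option("header", "true")
    .option("inferSchema", "true")

  // Keep console output readable by raising the log level to ERROR
  private def init(): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)
    sc.setLogLevel("ERROR")
  }
  init()

  // Called from the driver once the Spark program is over
  def close(): Unit = spark.stop()
}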

Now on to our driver/main class, which extends this trait. The main class does not need to initialise the spark variable, and since we have already suppressed logging, the console output stays readable.
Main.scala does the following (a sketch follows the list):

  • Extends InitSpark to get the spark and reader references.
  • Imports spark.implicits._ for converting the row data read from the CSV into a Dataset[Person] (a Dataset is a typed Spark collection).
  • Creates a Dataset[Long] from 1 to 100 and finds its sum.
  • Reads the CSV using the reader we created earlier in the trait.
  • Computes the average age of the people and prints it.
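A sketch of the driver, assuming the InitSpark trait above and a Person case class with name and age fields. The field names, CSV path and output messages are illustrative; the Main.scala in the repository may differ in detail.

import org.apache.spark.sql.functions.avg

// Illustrative schema for the example CSV (field names are assumptions)
case class Person(name: String, age: Int)

object Main extends InitSpark {
  def main(args: Array[String]): Unit = {
    import spark.implicits._

    println(s"Spark version: ${spark.version}")

    // Dataset[Long] from 1 to 100 and its sum
    val numbers = (1L to 100L).toDS()
    println(s"Sum of 1 to 100: ${numbers.reduce(_ + _)}")

    // Read the example CSV with the pre-configured reader and convert to a typed Dataset
    val people = reader.csv("data/people-example.csv").as[Person]  // path is illustrative
    people.show(2)

    // Average over the age field
    val averageAge = people.agg(avg("age")).first.getDouble(0)
    println(s"Average age: $averageAge")

    close()
  }
}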

The people-example file is a small CSV with a header row (which is why the reader is configured for headers) and includes the age column we average over; you can see its full contents in the repository.

You can find the repo on GitHub: https://github.com/faizanahemad/spark-gradle-template. If you have any suggestions, please post them in the comments.

You may ask: why an IDE? Why not Jupyter?

Several IDE features are either absent or clunky in Jupyter, especially if you are using Scala. Below are a few things I found an IDE to be better at.

  • Code Navigation/Go to definition
  • Viewing Inline Documentation
  • General editing speed
  • Refactoring
  • Code completion with function signatures