Apache Spark setup with Gradle, Scala and IntelliJ
In my previous post I outlined the setup process for Spark with Jupyter notebook. Since writing that post I have used Jupyter extensively for small Spark programs, but I felt that IntelliJ or another IDE could also be a good choice for development.
I went ahead and created a skeleton Apache Spark project in Scala, using Gradle for the build. It can be imported into your favorite IDE for quick bootstrapping.
Downloading the Skeleton Project and Running
- Clone the Repository
- Run the project from the command line:
./gradlew clean run
This runs the main class, which prints:
- the Spark version,
- the sum of 1 to 100,
- the first 2 rows of a CSV file it reads,
- the average over the age field in that file.
Setting up Gradle build
We need to define the Scala library jars and the Spark library jars in our Gradle build. We also create a
run task for running from the command line. The parameter scalaVersion is defined in the gradle.properties file.
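As a sketch, a build.gradle along these lines covers those pieces (the artifact versions, dependency coordinates and main class name here are illustrative, not necessarily the repo's exact build file):

```groovy
plugins {
    id 'scala'
    id 'application'
}

repositories {
    mavenCentral()
}

// scalaVersion comes from gradle.properties, e.g. scalaVersion=2.11.12
dependencies {
    implementation "org.scala-lang:scala-library:${scalaVersion}"
    implementation 'org.apache.spark:spark-sql_2.11:2.3.1'
}

// the application plugin provides the run task used by ./gradlew clean run
mainClassName = 'Main'
```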
The Real Code
If you have used the spark shell or a Jupyter notebook for Spark, you will have noticed that both initialise a few variables for you from the get-go.
`SparkSession`, which acts as the entry point to all things Spark, was pre-initialised in them. We don't want to type its initialisation code every time we start a new program.
val spark: SparkSession = SparkSession.builder()
  .appName("spark-example")
  .master("local[*]")
  .getOrCreate()
So what we will do is put this and some other init code in a trait, and then extend that trait whenever we want to write a Spark program.
In the `InitSpark` trait we do the following:
- Initialise `SparkSession`. In the parameters you see `master("local[*]")`: this means it will use as many threads as there are cores in your system.
- Get the `SparkContext` as `sc` and the `SqlContext` as `sqlContext`.
- Create a `reader` to read files with headers and schema inference. `reader` returns an object of type `DataFrameReader`, which has methods to read from CSV, text files and JSON.
- Define and call a private `init` function. This function suppresses logging by changing the level to ERROR.
- Define a `close` function, which we will call from our driver after our Spark program is over.
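Putting the steps above together, a minimal sketch of the `InitSpark` trait could look like the following (the `appName` and exact option names are assumptions for illustration, not the repo's exact code):

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrameReader, SQLContext, SparkSession}

trait InitSpark {
  // Entry point to all things Spark; local[*] = one worker thread per core
  val spark: SparkSession = SparkSession.builder()
    .appName("spark-example")
    .master("local[*]")
    .getOrCreate()

  val sc: SparkContext = spark.sparkContext
  val sqlContext: SQLContext = spark.sqlContext

  // DataFrameReader pre-configured for headers and schema inference;
  // it has methods to read CSV, text files and JSON
  def reader: DataFrameReader = spark.read
    .option("header", true)
    .option("inferSchema", true)

  // Suppress logging so console output stays readable
  private def init(): Unit = {
    sc.setLogLevel("ERROR")
    Logger.getLogger("org").setLevel(Level.ERROR)
  }
  init()

  // Call from the driver once the Spark program is done
  def close(): Unit = spark.close()
}
```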
Now on to our driver/main class, which extends this trait. The main class does not need to initialise the
`spark` variable, and since we have already suppressed logging, the console output will be readable.
`Main.scala` does the following:
- Extends `InitSpark` to use the variables initialised in the trait.
- Imports `spark.implicits._` for converting the row data read from the CSV into a collection of `Person`, i.e. a
`Dataset[Person]`. A Dataset is a typed Spark collection.
- Creates a `Dataset[Long]` from 1 to 100 and then finds its sum.
- Reads from the CSV using the `reader` we created earlier in the trait.
- Computes the average age of the people and prints it.
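A corresponding sketch of `Main.scala` might look like this (the `Person` fields and the CSV path are assumptions based on the description above):

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.avg

// Hypothetical case class matching the CSV's columns
case class Person(name: String, age: Int)

object Main extends App with InitSpark {
  import spark.implicits._

  println(s"Spark version: ${spark.version}")

  // Dataset[Long] from 1 to 100, then its sum (5050)
  val numbers: Dataset[Long] = (1L to 100L).toDS()
  println(s"Sum 1 to 100: ${numbers.reduce(_ + _)}")

  // Read the CSV with the reader from InitSpark, typed as Dataset[Person]
  val people: Dataset[Person] = reader.csv("people-example.csv").as[Person]
  people.show(2)

  // Average over the age field
  val averageAge = people.agg(avg("age")).first.getDouble(0)
  println(s"Average age: $averageAge")

  close()
}
```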
The contents of the
`people-example` file are as follows:
You can find the repo on GitHub: https://github.com/faizanahemad/spark-gradle-template. If you have any suggestions, please post them in the comments.
You may ask: why an IDE? Why not Jupyter?
Several IDE features are either absent or poor in Jupyter, especially if you are using Scala. Below are a few things I found an IDE to be better at.
- Code Navigation/Go to definition
- Viewing Inline Documentation
- General speed of Editing
- Code completion with function definition