Creating a Spark Project with SBT, IntelliJ, sbt-spark-package, and friends

This blog post will show you how to create a Spark project in SBT, write some tests, and package the code as a JAR file. We’ll start with a brand new IntelliJ project and walk you through every step along the way.

After you understand how to build an SBT project, you’ll be able to rapidly create new projects with the sbt-spark.g8 Gitter Template.

The spark-pika project that we’ll create in this tutorial is available on GitHub.

Add sbt-spark-package

sbt-spark-package is the easiest way to add Spark to a SBT project, even if you’re not building a Spark package. Add the package in the project/plugins.sbt file.

resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")

Select the components of Spark will be used in your project and the Spark version in the build.sbt file.

sparkVersion := "2.2.0"
sparkComponents ++= Seq("sql")

Set the Scala and SBT versions

Spark works best with Scala version 2.11.x and SBT version 0.13.x. Update the build.sbt file with Scala 2.11.12.

scalaVersion := "2.11.12"

Update the project/build.properties file with SBT version 0.13.17.

sbt.version = 0.13.17

Spark doesn’t work with Scala 2.12 (this issue tracks Spark’s progress to adding Scala 2.12 support). SBT 1.x uses Scala 2.12, so it’s best to stick with SBT 0.13.x when using Spark.

Add the SparkSession

We’ll wrap the SparkSession in a trait, so it’s easily accessible by our classes and objects.

Scala follows the Java convention of deeply nesting code in empty directories. We’ll put the SparkSessionWrapper trait in the following directory structure.

spark-pika/
src/
main/
scala/
com/
github/
mrpowers/
spark/
pika/
SparkSessionWrapper.scala

The directory structure is a backwards version of the code URL: https://github.com/MrPowers/spark-pika.

The Spark codebase also follows these conventions. You’ll see imports like this when writing Spark.

import org.apache.spark.sql.DataFrame

IntelliJ does a great job making this directory structure look less ridiculous.

Here’s what the SparkSessionWrapper code looks like.

package com.github.mrpowers.spark.pika

import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper {

lazy val spark: SparkSession = {
SparkSession
.builder()
.master("local")
.appName("spark pika")
.getOrCreate()
}

}

When a class is extended with the SparkSessionWrapper we’ll have access to the session via the spark variable. Starting and stopping the SparkSession is expensive and our code will run faster if we only create one SparkSession.

The getOrCreate() method uses existing SparkSessions if they’re present. You’ll typically create your own SparkSession when running the code in the development or test environments and use the SparkSession created by a service provider (e.g. Databricks) in production.

Write some code

Let’s create a Tubular object with a withGoodVibes() DataFrame transformation that appends a chi column to a DataFrame.

package com.github.mrpowers.spark.pika

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Tubular {

def withGoodVibes()(df: DataFrame): DataFrame = {
df.withColumn(
"chi",
lit("happy")
)
}

}

Read this blog post if you’d like more background information on custom DataFrame transformations.

Let’s run sbt console and try out our code.

$ sbt console
> val df = List("sue", "fan").toDF("name")
> import com.github.mrpowers.spark.pika.Tubular
> val betterDF = df.transform(Tubular.withGoodVibes())
> betterDF.show()
+----+-----+
|name| chi|
+----+-----+
| sue|happy|
| fan|happy|
+----+-----+

Checking if code functions in the console is more time consuming than simply writing a test.

Add some tests

Let’s add spark-fast-tests and scalatest to the build.sbt file so we can add some tests.

libraryDependencies += "MrPowers" % "spark-fast-tests" % "2.2.0_0.5.0" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"

Create a src/test/scala/com/github/mrpowers/spark/pika/TubularSpec.scala file for the test code.

package com.github.mrpowers.spark.pika

import org.scalatest.FunSpec

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class TubularSpec
extends FunSpec
with SparkSessionWrapper
with DataFrameComparer {

import spark.implicits._

describe("withGoodVibes") {

it("appends a chi column to a DataFrame") {

val sourceDF = List("sue", "fan").toDF("name")

val actualDF = sourceDF.transform(Tubular.withGoodVibes())

val expectedSchema = List(
StructField("name", StringType, true),
StructField("chi", StringType, false)
)

val expectedData = List(
Row("sue", "happy"),
Row("fan", "happy")
)

val expectedDF = spark.createDataFrame(
spark.sparkContext.parallelize(expectedData),
StructType(expectedSchema)
)

assertSmallDataFrameEquality(actualDF, expectedDF)

}

}

}

Run the sbt test command and verify that the test is passing.

Let’s use spark-daria to refactor this test and make the code more concise.

Cleaning up the test

spark-daria defines a createDF method that’s terse like toDF() while allowing for full control of the DataFrame that’s created like createDataFrame(). Read this blog post for a full description on different ways to create DataFrames in Spark.

Add spark-daria to the build.sbt file.

libraryDependencies += "mrpowers" % "spark-daria" % "2.2.0_0.12.0" % "test"

Let’s refactor the test file with createDF.

package com.github.mrpowers.spark.pika

import org.scalatest.FunSpec

import org.apache.spark.sql.types.StringType

import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

class TubularSpec
extends FunSpec
with SparkSessionWrapper
with DataFrameComparer {

describe("withGoodVibes") {

it("appends a chi column to a DataFrame") {

val sourceDF = spark.createDF(
List(
"sue",
"fan"
), List(
("name", StringType, true)
)
)

val actualDF = sourceDF.transform(Tubular.withGoodVibes())

val expectedDF = spark.createDF(
List(
("sue", "happy"),
("fan", "happy")
), List(
("name", StringType, true),
("chi", StringType, false)
)
)

assertSmallDataFrameEquality(actualDF, expectedDF)

}

}

}

spark-daria’s createDF function simplifies the import statements, makes the code more readable, and shrinks the test file from 1,008 to 925 characters. I find spark-daria to be essential for all of my projects.

Building the JAR file

Running sbt package will generate a file named spark-pika_2.11–0.0.1.jar.

We should follow the spark-style-guide and include the Spark version in the JAR file name, so it’s easier for JAR file consumers to know what version of Spark they should be using.

Update the build.sbt file as follows.

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
artifact.name + "_" + sv.binary + "-" + sparkVersion.value + "_" + module.revision + "." + artifact.extension
}

Now sbt package will generate a file named spark-pika_2.11–2.2.0_0.0.1.jar.

There are a lot of complexities related to packaging JAR files and I’ll cover these in another blog post.

Next Steps

Now that you know how to create SBT projects with Spark, you can use the sbt-spark.g8 Gitter Template to bootstrap new projects.

In the next blog post, we’ll cover advanced SBT tactics for efficiently running tests, generating JAR files, and managing dependencies.