Setting up a Spark machine learning project with Scala, sbt and MLlib

In this tutorial, we will set up a Spark machine learning project with Scala, sbt and Spark MLlib.

sbt is an open-source build tool for Scala and Java projects, similar to Java’s Maven and Ant.

sbt requires the Java Development Kit 8 (JDK 8), so if you don’t have it installed, install it first using the instructions for your platform.

Installing sbt:

On Ubuntu:

$ # repository URL and keyserver taken from the official sbt installation docs
$ echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt

For other Linux distributions, refer to the installation instructions in the official sbt documentation.

On macOS, using Homebrew:

$ brew install sbt

Creating the project

To quickly start your project we will use a Giter8 bootstrap template. This will create the necessary folder structure and project files.

$ sbt new sbt/scala-seed.g8
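The generated project looks roughly like this (based on the scala-seed template at the time of writing; file names may differ slightly in newer versions of the template):

```
my-project/
├── build.sbt
├── project/
│   ├── build.properties
│   └── Dependencies.scala
└── src/
    ├── main/scala/example/Hello.scala
    └── test/scala/example/HelloSpec.scala
```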

You can quickly check if everything is working by changing directory into your newly created project and running sbt:

$ cd [my-project]
$ sbt

Inside the sbt shell, use the run command to run the template project:

sbt:myproject> run

This should return a simple hello message.

Adding Spark and Spark MLlib

The default template already includes a ScalaTest dependency. Now we will add Spark Core and Spark MLlib.

In the project root you will find the build.sbt configuration file.

Add the following two lines to the .settings(...) block to include the Spark Core and Spark MLlib dependencies (entries inside .settings are comma-separated, so also add a comma after the existing scalaTest entry):

libraryDependencies += sparkCore,
libraryDependencies += sparkMLlib
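For reference, the resulting build.sbt looks roughly like this — a sketch assuming the scala-seed layout; the project name and version numbers will match whatever your template generated, and Spark 2.4.0 supports Scala 2.11 and 2.12:

```scala
import Dependencies._

ThisBuild / scalaVersion := "2.12.8"
ThisBuild / version      := "0.1.0-SNAPSHOT"

lazy val root = (project in file("."))
  .settings(
    name := "myproject",
    // existing test dependency from the template
    libraryDependencies += scalaTest % Test,
    // the two new Spark dependencies
    libraryDependencies += sparkCore,
    libraryDependencies += sparkMLlib
  )
```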

Then we need to define these dependencies in ./project/Dependencies.scala:

lazy val sparkCore = "org.apache.spark" %% "spark-core" % "2.4.0"
lazy val sparkMLlib = "org.apache.spark" %% "spark-mllib" % "2.4.0"
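Note that these lazy vals must go inside the Dependencies object that the template generates — on their own they won’t compile. The complete project/Dependencies.scala then looks something like this (the ScalaTest version shown is illustrative; keep whatever your template generated):

```scala
import sbt._

object Dependencies {
  lazy val scalaTest  = "org.scalatest" %% "scalatest" % "3.0.5"
  lazy val sparkCore  = "org.apache.spark" %% "spark-core" % "2.4.0"
  lazy val sparkMLlib = "org.apache.spark" %% "spark-mllib" % "2.4.0"
}
```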

That’s it!

Now you can code your machine learning project with Spark and MLlib in the source folder and run with sbt.

$ sbt

Inside the sbt shell, use the run command:

sbt:myproject> run
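To check that the Spark dependencies resolve correctly, you can replace the template’s hello-world with a small MLlib program. The following is a minimal sketch, not part of the template — the object, file and column names are just examples. It fits a linear regression on a tiny in-memory dataset using the DataFrame-based spark.ml API (spark-mllib pulls in spark-sql transitively):

```scala
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

object Main extends App {
  // Local SparkSession for development; point master at a cluster in production
  val spark = SparkSession.builder()
    .appName("mllib-example")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Toy dataset: label is roughly 2 * x
  val data = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8))
    .toDF("x", "label")

  // MLlib estimators expect a single vector column of features
  val assembler = new VectorAssembler()
    .setInputCols(Array("x"))
    .setOutputCol("features")

  val model = new LinearRegression().fit(assembler.transform(data))
  println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

  spark.stop()
}
```

Running it with run in the sbt shell should print a coefficient close to 2, since the toy labels are roughly 2 * x. (The older RDD-based API lives in org.apache.spark.mllib; new code should prefer spark.ml.)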

👋 Thanks for reading!