Setting up a Spark machine learning project with Scala, sbt and MLlib
In this tutorial, we will set up a Spark Machine Learning project with Scala, Spark MLlib and sbt.
sbt is an open-source build tool for Scala and Java projects, similar to Java’s Maven and Ant.
sbt requires the Java Development Kit 8 (JDK 8), so install it first if you don't already have it.
Installing sbt:
On Ubuntu:
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
For other Linux distributions: https://www.scala-sbt.org/download.html
On macOS, using Homebrew:
$ brew install sbt
Creating the project
To get started quickly, we will use a Giter8 bootstrap template. It creates the necessary folder structure and project files.
$ sbt new sbt/scala-seed.g8
You can quickly check if everything is working by changing directory into your newly created project and running sbt:
$ cd [my-project]
$ sbt
Inside the sbt shell use the command run to run the template project:
sbt:myproject> run
This should print a simple hello message.
Adding Spark and Spark MLlib
The default template already includes a ScalaTest dependency. Now we will add Spark core and Spark MLlib.
In your project folder root you can find your build.sbt configuration file.
Add the following two lines to the end of the list of settings to include the Spark core and Spark MLlib dependencies. If the setting before them does not end with a comma, add one:
libraryDependencies += sparkCore,
libraryDependencies += sparkMLlib
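For orientation, here is a sketch of how the whole build.sbt might look after the change. The layout follows the general shape of the scala-seed template, but the project name, organization, version and scalaVersion values are assumptions for illustration; keep whatever your generated file contains, apart from making sure the Scala version is one that your Spark release supports.

```scala
// build.sbt (sketch; name, organization, version and scalaVersion are assumed values)
import Dependencies._

ThisBuild / scalaVersion := "2.12.8"
ThisBuild / version      := "0.1.0-SNAPSHOT"
ThisBuild / organization := "com.example"

lazy val root = (project in file("."))
  .settings(
    name := "myproject",
    // Already present in the template
    libraryDependencies += scalaTest % Test,
    // The two lines added in this tutorial
    libraryDependencies += sparkCore,
    libraryDependencies += sparkMLlib
  )
```

The names sparkCore and sparkMLlib resolve only once they are defined in project/Dependencies.scala, which is why the import at the top of the file matters.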
Then we need to specify what these dependencies are in ./project/Dependencies.scala
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
lazy val sparkCore = "org.apache.spark" %% "spark-core" % "2.4.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
lazy val sparkMLlib = "org.apache.spark" %% "spark-mllib" % "2.4.0"
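To show where these definitions live, here is a sketch of the complete project/Dependencies.scala. The scalaTest line stands in for whatever the template generated, and its version number is an assumption. Note that the %% operator appends the project's Scala binary version to the artifact name (for example spark-core_2.12), so the scalaVersion in build.sbt has to be one that Spark 2.4.0 is published for, namely 2.11 or 2.12.

```scala
// project/Dependencies.scala (sketch)
import sbt._

object Dependencies {
  // Generated by the template; the exact version here is an assumption.
  lazy val scalaTest = "org.scalatest" %% "scalatest" % "3.0.5"

  // https://mvnrepository.com/artifact/org.apache.spark/spark-core
  lazy val sparkCore = "org.apache.spark" %% "spark-core" % "2.4.0"

  // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
  lazy val sparkMLlib = "org.apache.spark" %% "spark-mllib" % "2.4.0"
}
```

If sbt reports an unresolved dependency after this change, a Scala version mismatch is a likely first thing to check.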
That’s it!
Now you can code your machine learning project with Spark and MLlib in the source folder and run with sbt.
$ sbt
Inside the sbt shell use the command run.
sbt:myproject> run
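As a starting point for the source folder, here is a minimal sketch of a Spark MLlib program that trains a logistic regression model on a tiny in-memory dataset. The object name SparkMLExample, the helper functions, the file location (for example src/main/scala/example/SparkMLExample.scala) and the data values are all assumptions made for illustration. The local[*] master runs Spark inside the sbt process, which is convenient for experimenting but is not how you would configure a cluster job.

```scala
package example

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkMLExample {

  // Tiny illustrative training set: each row is (label, feature vector).
  def trainingData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

  // Fit a logistic regression model using the default "label" and "features" columns.
  def train(data: DataFrame): LogisticRegressionModel =
    new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
      .fit(data)

  def main(args: Array[String]): Unit = {
    // Run Spark locally, using all available cores.
    val spark = SparkSession.builder()
      .appName("SparkMLExample")
      .master("local[*]")
      .getOrCreate()

    try {
      val data = trainingData(spark)
      val model = train(data)
      // Show the model's predictions on the training rows.
      model.transform(data).select("features", "label", "prediction").show()
    } finally {
      spark.stop()
    }
  }
}
```

If the template's Hello object is still in the project, run will ask which main class to start. You can also pick one directly from the sbt shell with runMain example.SparkMLExample.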