Compile Scala on EMR

Ted Herman
3 min read · Apr 20, 2017


Scala + EMR

Well, it’s been a few years, and Spark has grown into a standard way to write cluster applications on AWS EMR. Spark itself is written in Scala. (Other languages are growing in popularity for Spark; see the languages slide in this talk.)

The surprise? AWS EMR, with Spark installed, does not yet have a Scala compiler. Documentation on what to do about this is scarce. Here I write down my recipe, an easy path to writing and compiling Scala for Spark.

Background: Spark ships with the “spark-shell” command, which is a REPL interface to Scala with the Spark API initialized. That’s a great way to start, encouraging interactive experiments with Spark objects and method calls. But what if you want to run a Scala program using this API? The easy answer is simply this:

$ spark-shell -i myprog.scala 

(This runs myprog.scala as a script through the REPL interface.) Later, of course, you may want to compile myprog.scala into a JAR file and run it with the “spark-submit” command, since compiled programs can be more efficient.
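For instance, a myprog.scala script for this purpose could be as small as the sketch below (my own example, not from the Spark docs); spark-shell already defines the spark and sc values, so the script can use them directly:

// myprog.scala: a minimal sketch to run via "spark-shell -i myprog.scala"
// (spark-shell pre-defines `spark` and `sc`, so no setup is needed here)
val data = sc.parallelize(1 to 100)
println("count = " + data.count())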

How To: Here is a recipe for compiling Scala using the Spark API.

1. Log in to the master machine of your EMR cluster.

2. Fetch the current version of sbt (I’ll assume it’s a tgz file). For example:
$ wget https://github.com/sbt/sbt/releases/download/v0.13.15/sbt-0.13.15.tgz

3. Extract the sbt download, move it to /opt, place it in your PATH:

$ tar xf sbt-0.13.15.tgz
$ sudo mv sbt /opt
$ export PATH=$PATH:/opt/sbt/bin

4. Follow this tutorial to create a “hello” program using sbt: http://www.scala-sbt.org/0.13/docs/Hello.html (this link will change with new versions of sbt). It’s just one command:

$ sbt new sbt/scala-seed.g8

A flurry of messages will scroll by; be patient and wait for the prompt asking for a project name. Answer “hello” and you will have a hello directory.
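The generated hello project contains more files than we need; the only two this recipe touches are the build definition and the sample program:

hello/build.sbt
hello/src/main/scala/example/Hello.scala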

5. Try this command and write down the Scala version number.

$ spark-shell --version

For me it was 2.11.8; you’ll also see the Spark version number.

6. Enter your hello directory and edit the build.sbt file with your editor of choice. Change the Scala version number to the version you found in the previous step. I’m not sure this is absolutely necessary, but why take chances.
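For reference, here is a sketch of roughly what build.sbt might look like after the edit (the file generated by the template may differ in details; the important line is scalaVersion):

// build.sbt: sketch only; keep whatever else the template generated
lazy val root = (project in file(".")).
  settings(
    name := "Hello",
    version := "0.1.0-SNAPSHOT",
    scalaVersion := "2.11.8"  // match the version reported by spark-shell
  )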

7. Try this command:

$ find / -name "spark-core*jar"

There will be many errors, but you should see a line such as “/usr/lib/spark/jars/spark-core_2.11-2.1.0.jar” in the result. From that we learn the location of all the Spark API JARs needed to compile your Scala program; I’ll assume the location is “/usr/lib/spark/jars”.
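The errors come from directories that find is not allowed to read; if the noise bothers you, discard it with a standard shell redirection:

$ find / -name "spark-core*jar" 2>/dev/null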

8. Within your hello directory, create a symbolic link to the Spark JAR directory, say by the command

$ ln -s /usr/lib/spark/jars lib

Now there is a lib directory in your hello project (which is actually just a pointer to the directory with all the Spark JAR files).
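The symlink works because lib is the default place sbt looks for unmanaged JARs. If you would rather not create the link, an alternative (a sketch of my own, not something from the EMR docs) is to set that location explicitly in build.sbt:

// alternative to the symlink: point sbt's unmanaged JAR directory
// straight at the Spark JARs (add this line to build.sbt)
unmanagedBase := file("/usr/lib/spark/jars")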

9. Now edit your “hello” program, which lives at “src/main/scala/example/Hello.scala”. At this point the Hello.scala code doesn’t reference the Spark API at all, but something simple is easy to add. For instance, the code might be this:

package example

import org.apache.spark.sql.SparkSession

object Hello extends Greeting with App {
  println(greeting)

  // Create a SparkSession (and from it a SparkContext) for this application.
  val spark = SparkSession
    .builder
    .appName("demo")
    .getOrCreate()
  val sc = spark.sparkContext

  // Distribute a small array across the cluster and count its elements.
  val data = Array(1, 2, 3, 4, 5, 6, 7)
  val distData = sc.parallelize(data)
  println("count = " + distData.count())

  spark.stop()
}

trait Greeting {
  lazy val greeting: String = "hello"
}

10. Compile and package into a JAR file:

$ sbt compile
$ sbt package

The result of this step should display the location of the packaged JAR file of your Hello.scala program, something like this:

/home/hadoop/hello/target/scala-2.11/hello_2.11-0.1.0-SNAPSHOT.jar

11. Run the program using the “spark-submit” command:

$ spark-submit /home/hadoop/hello/target/scala-2.11/hello_2.11-0.1.0-SNAPSHOT.jar
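If spark-submit complains that it cannot determine the main class from the JAR’s manifest, you can name it explicitly with the standard --class option (using the package and object from step 9):

$ spark-submit --class example.Hello /home/hadoop/hello/target/scala-2.11/hello_2.11-0.1.0-SNAPSHOT.jar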

The first steps (downloading sbt, unpacking it, and moving it to /opt) can go in a bootstrap script. Setting the PATH to include sbt can also be done by writing a script in /etc/profile.d, so that sbt is ready to use each time you log in to the cluster.
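As a sketch, a bootstrap script along those lines might look like this (the version number and paths are just the ones used above; adjust as needed):

#!/bin/bash
# install sbt and put it on the PATH for future logins
set -e
cd /tmp
wget -q https://github.com/sbt/sbt/releases/download/v0.13.15/sbt-0.13.15.tgz
tar xf sbt-0.13.15.tgz
sudo mv sbt /opt
echo 'export PATH=$PATH:/opt/sbt/bin' | sudo tee /etc/profile.d/sbt.sh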
