Scala 3 and Spark?
After the release of Scala 3, one of the most common questions from developers was: “When will we be able to write Spark jobs using Scala 3?”
Until now, the answer was “Not yet”, but everything changed with the release of Spark 3.2.0, which brought Scala 2.13 support. Scala 3 projects can depend on Scala 2.13 libraries, so can we finally write Spark jobs using Scala 3? The answer is yes, in theory.
DISCLAIMER: Many things shown here are still under development or experimental and may not work properly in some cases.
In this post, we will show an example Spark project using Scala 3.
Blog posts like this one usually rely on a simple, common example, such as processing a small dataset provided as JSON files. Since our subject is Scala 3, we decided to use Scala 3 code itself as our dataset. Scala 3 introduces the TASTy format, which defines serialized Typed Abstract Syntax Trees. In short, TASTy files contain the view that the compiler has of your program, with all types inferred, implicits expanded, etc. With Scala 3, TASTy files are shipped together with bytecode inside jars as *.tasty files. In our example, we are going to analyze the code of the most popular libraries from the Scala 3 ecosystem.
Loading and parsing TASTy is out of the scope of this blog post, so we will only briefly describe what that part does. A more detailed explanation can be found in the comments in the source code on GitHub.
In the first step, we pull the .jar files from Maven Central by simply sending an HTTP request for each library, and extract the content of the .tasty files into a TastyFile case class.
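The download-and-extract step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from our repository: the fields of TastyFile, the helper names downloadJar and extractTastyFiles, and the way the jar URL is obtained are all hypothetical. It uses only the JDK's HTTP client and zip reader.

```scala
import java.io.ByteArrayInputStream
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.zip.ZipInputStream

// Hypothetical shape: the real TastyFile in the repository may carry different fields.
final case class TastyFile(lib: String, path: String, content: Array[Byte])

/** Downloads a jar into memory with a plain HTTP GET (no status-code or retry handling). */
def downloadJar(url: String): Array[Byte] =
  val request = HttpRequest.newBuilder(URI.create(url)).build()
  HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray()).body()

/** Collects every *.tasty entry of an in-memory jar, skipping classes and resources. */
def extractTastyFiles(lib: String, jarBytes: Array[Byte]): List[TastyFile] =
  val zip = new ZipInputStream(new ByteArrayInputStream(jarBytes))
  try
    Iterator
      .continually(zip.getNextEntry())
      .takeWhile(_ != null)
      .filter(entry => !entry.isDirectory && entry.getName.endsWith(".tasty"))
      // readAllBytes reads up to the end of the current zip entry only
      .map(entry => TastyFile(lib, entry.getName, zip.readAllBytes()))
      .toList
  finally zip.close()
```

Keeping the whole jar in memory is fine for library-sized artifacts; for very large jars, streaming the response body would be a better fit.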
Then, using tasty-query, we extract information about the base type, kind and position of each tree into another case class.
Taking it for a spin
Let’s try to run it on a single library, e.g. cats-core. With a brand new tool called scala-cli, we can run the code locally, straight from our repository, with just one command.
What is scala-cli? It is a tool to interact with the Scala language, and in the future, hopefully, it will replace the scala command itself. It can run, compile, test and assemble your code, manage dependencies and much more, all in a fast, convenient and frictionless way.
The code groups the results by type and prints the most popular ones, so the output ends with:
Most popular types:
scala.Tuple2$ : 2279
scala.Tuple2[_, _] : 1422
java.lang.Object : 1315
scala.Function1 : 1119
scala.package : 1013
<root>.symbol[A] : 969
cats.Semigroupal : 938
<root>.symbol[$anon] : 800
scala.Any : 667
cats.kernel.Semigroup : 568
The most popular types in cats-core are Tuple2, Object and Function1. These types are very common in the codebase and quite general, so the result looks reasonable. We take this as a sign that our approach works and decide to run the same algorithm using Spark.
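The "group by type and print the most popular ones" step is easy to picture with plain Scala collections. The sketch below is an assumption-laden illustration rather than our actual code: TreeInfo and mostPopularTypes are hypothetical names, and the real case class also stores the position of each tree.

```scala
// Hypothetical, simplified stand-in for the tree information extracted from TASTy.
final case class TreeInfo(baseType: String, kind: String)

/** Counts occurrences of each base type and returns the n most frequent, most frequent first. */
def mostPopularTypes(trees: Seq[TreeInfo], n: Int): Seq[(String, Int)] =
  trees
    .groupMapReduce(_.baseType)(_ => 1)(_ + _) // base type -> number of occurrences
    .toSeq
    .sortBy { case (tpe, count) => (-count, tpe) } // ties are broken alphabetically
    .take(n)

@main def mostPopularTypesDemo(): Unit =
  val sample = Seq(
    TreeInfo("scala.Tuple2$", "Apply"),
    TreeInfo("scala.Function1", "TypeTree"),
    TreeInfo("scala.Tuple2$", "Select")
  )
  println("Most popular types:")
  mostPopularTypes(sample, 10).foreach { case (tpe, count) => println(s"$tpe : $count") }
```

This works well for a single library on one machine; the Spark version expresses the same aggregation over a distributed dataset.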
If you take a look at our code, you will find an unexpected dependency on the spark-scala3 library, followed by a suspicious import.
There’s a good reason why we need both. If you try to compile the project without them, you are going to get a compilation error saying:
value toDF is not a member of … — did you mean libs.coll?
The background of this problem is that Scala 2 has something called TypeTags. They are part of the Scala 2 metaprogramming API, which was completely redesigned in Scala 3. Spark uses this mechanism to implement some of its implicit conversions. This is where the library mentioned above, written by vincenzobaz from the Scala Center, comes in. It adds the needed given instances (the Scala 3 counterpart of Scala 2 implicits), generated using the Scala 3 metaprogramming API. Note that this library is still under development and has known issues, but luckily it allows us to start experimenting in our project.
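For readers who use sbt rather than scala-cli, a build that mixes Scala 3 with the Scala 2.13 build of Spark might look like the fragment below. Treat it as a sketch: the Scala version is just an example, and the coordinates and version of the spark-scala3 library are our assumption about what was published at the time, so check the library's README before copying them.

```scala
// build.sbt (sketch)
scalaVersion := "3.1.0"

libraryDependencies ++= Seq(
  // Spark has no Scala 3 artifacts, so ask sbt for the _2.13 build instead of _3.
  ("org.apache.spark" %% "spark-sql" % "3.2.0").cross(CrossVersion.for3Use2_13),
  // Assumed coordinates and version of the library providing the missing given instances.
  "io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"
)
```

CrossVersion.for3Use2_13 is what makes the "Scala 3 depends on Scala 2.13" trick explicit in the build.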
With that out of the way, turning our code into a Spark application was quite straightforward. Unfortunately, some given instances are still missing, but adding them shouldn’t be too hard, using the Scala 2 implementation as a starting point.
As a workaround, we needed to turn our case classes into a shape that spark-scala3 can handle, by replacing Array[Byte] fields with Base64-encoded Strings and Option[String] fields with plain Strings (with a meaningful String value standing in for None).
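The reshaping described above boils down to a pair of conversions. The following is a minimal sketch with hypothetical names (RawEntry, FlatEntry, MissingNote, flatten and restore are not taken from our repository), and the "<none>" sentinel is just one possible choice.

```scala
import java.util.Base64

// Hypothetical "before" shape, with the field types that could not be encoded.
final case class RawEntry(path: String, content: Array[Byte], note: Option[String])

// Hypothetical "after" shape, using only Strings.
final case class FlatEntry(path: String, contentBase64: String, note: String)

// Assumed sentinel standing in for None; it must never occur as a real value.
val MissingNote = "<none>"

/** Converts to the String-only shape before handing the data to Spark. */
def flatten(e: RawEntry): FlatEntry =
  FlatEntry(e.path, Base64.getEncoder.encodeToString(e.content), e.note.getOrElse(MissingNote))

/** Restores the original shape, e.g. on the executors before parsing the TASTy bytes. */
def restore(f: FlatEntry): RawEntry =
  RawEntry(f.path, Base64.getDecoder.decode(f.contentBase64), Option(f.note).filter(_ != MissingNote))
```

The cost of this workaround is roughly a one-third size increase from Base64 and the usual risk of sentinel values: a real note equal to the sentinel would be read back as None.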
Running in a cluster
The next step was to create a Spark application to load, process and analyze the data, preferably inside a cluster. The processing part is quite simple: we are looking for the top 10 most popular types.
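A top-10 aggregation of this kind can be sketched with the DataFrame API as shown below. Note the assumptions: this is not our production code, the column name baseType and the helper topTypes are hypothetical, and the sketch deliberately builds the DataFrame from Row values with an explicit schema, which sidesteps the encoder derivation discussed earlier instead of using it. The directives at the top are scala-cli syntax, and local[*] is used only so the sketch runs on one machine (Spark 3.2 is documented for Java 8 and 11).

```scala
//> using scala 3
//> using dep org.apache.spark:spark-sql_2.13:3.2.0

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, desc}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.jdk.CollectionConverters.*

/** Counts rows per base type and keeps the n most frequent ones. */
def topTypes(df: DataFrame, n: Int): DataFrame =
  df.groupBy("baseType")
    .count()
    .orderBy(desc("count"), col("baseType")) // ties are broken alphabetically
    .limit(n)

@main def topTypesDemo(): Unit =
  // In a real cluster the master URL would come from spark-submit instead.
  val spark = SparkSession.builder().appName("types-usage-sketch").master("local[*]").getOrCreate()
  try
    val schema = StructType(Seq(StructField("baseType", StringType, nullable = false)))
    val rows = List("scala.Int", "scala.Boolean", "scala.Int", "scala.Option").map(Row(_))
    val df = spark.createDataFrame(rows.asJava, schema)
    topTypes(df, 10).show(truncate = false)
  finally spark.stop()
```

Because groupBy and count are executed by Spark, the same topTypes function applies unchanged whether the input holds one library or hundreds of artifacts.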
Our input consists of over 500 artifacts compiled with Scala 3, obtained from the Scaladex API. Our cluster has one master and one worker node, both based on Docker containers from big-data-europe, tweaked to use Spark 3.2 and Scala 2.13.
The most popular types are:
Our method for processing the data from TASTy is very naive, so the result may be somewhat noisy. Nevertheless, the high usage of Int, Boolean, Tuple2$ (Tuple2’s companion object), Function1 and Option indicates that our data makes some sense.
You can run the code on your machine and experiment with some other data using scala-cli. You can do it by either cloning our git repository and then running
or running it directly from the multi-file gist that we have prepared using:
The release of Spark 3.2.0 for Scala 2.13 opens up the possibility of writing Apache Spark jobs in Scala 3. However, it is an uphill path, and many challenges lie ahead before this can be done confidently in production.
We are going to keep you posted with all information about progress in this area. Follow us on our social media and look forward to new blog posts.