Sharing is Caring! Domain objects in BOTH Scala and R with GraalVM Polyglot bindings.
In any domain that goes beyond a sample project, it becomes almost inevitable that you want to use objects that accurately represent that domain. GraalVM does an adequate job of converting data structures from R to JVM languages and back by using sensible defaults, but what do you do when the sensible defaults are not sufficient? Given that GraalVM can perform translation between its multitude of supported languages, is it possible to define a “Domain” that can be accessed by all?
This is, of course, a rhetorical question and the answer is “Yes”.
In this article I’ll demonstrate how to share domain objects between JVM languages and guest languages on the GraalVM platform. I’m using Scala domain objects (because Scala is awesome), but you could do the same with, for instance, Java or Kotlin.
(If you’re new to GraalVM Polyglot abilities, consider also reading my previous article on the subject: using GraalVM to execute R files from Scala.)
To demonstrate the problem we are trying to solve, we first need a pretend domain. Let’s do something with Weather Forecasts, because people always talk about the weather!
Creating weather forecasts is the kind of terribly complicated modelling business that could be built in R, but luckily we don’t actually need a working model for this article. So let’s just pretend we already have this awesome R functionality that creates weather forecasts, cleanly abstracted away in a separate file.
When brought into scope with R’s source function, that file yields a magicHappensHere function that can be called and returns a data.frame with some weather forecast information. We can then return the result to Scala by simply making it the return value of our R function:
Wow, that doesn’t look too bad! This won’t get many complaints from the Data Scientist, I reckon.
So, what’s wrong with this? What’s the problem?
I’m glad you asked, interlocutor! Let’s take a look at the Scala/JVM side of this equation, to see what the Data Engineer has to deal with:
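A minimal sketch of what that Scala side might look like, assuming a GraalVM runtime with the R language installed. The inline R function, its column names and the object name CallRFromScala are hypothetical stand-ins for the real script, and the target type of the conversion follows the description in this article rather than a verified listing.

```scala
import org.graalvm.polyglot.{Context, Source, Value}

object CallRFromScala {
  // Hypothetical stand-in for the R script: a function returning a small data.frame.
  val rCode: String =
    """function() {
      |  data.frame(
      |    temperatures = c(18L, 21L),
      |    humidities   = c(65L, 70L)
      |  )
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // Creating the Context and the Source is the easy part.
    val context: Context = Context.newBuilder("R").allowAllAccess(true).build()
    try {
      val source: Source = Source.newBuilder("R", rCode, "forecast.R").build()
      val rFunction: Value = context.eval(source)

      // The conversion target: a Java Map of untyped Java Lists, keyed by String.
      val result: java.util.Map[String, java.util.List[_]] =
        rFunction.execute().as(classOf[java.util.Map[String, java.util.List[_]]])

      println(result.keySet())
    } finally context.close()
  }
}
```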
Whoa… creating the Graal Context and Source is trivial, but look at the nasty type signature on that call to R! Let’s pick it apart for a bit:
- Lists of each data.frame column keyed by its name… That makes sense, well done Graal! It’s just too bad it’s Stringly typed, rather than actual methods on an actual class, so any typo will mess us up at runtime.
- Unknown content type of the Lists?… That’s unfortunate, we know that some columns should only contain String, while others contain Int, but this information is lost in conversion… We have to do a bunch of casting!
- The returned Collections are Java? That’s just sad! The polyglot representation of collections doesn’t transfer to Scala, but Scala Lists are much more powerful than their Java equivalents, so we’ll have to convert them!
- Every element of each List doesn’t actually belong to the rest of the List, but instead should be combined with each corresponding position in every other List to actually make a WeatherReport… (The first entry of “humidities” should be paired with the first entry of “temperatures”, etc.)
Let’s see what this means when we try to use the output of this function:
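As a rough sketch of the kind of conversion this forces on us, assume the Map contains two integer columns named “temperatures” and “humidities”. The WeatherReport fields and the ColumnParsing helper are hypothetical names chosen for illustration.

```scala
import scala.jdk.CollectionConverters._

// Hypothetical row type; the real domain would have more fields.
final case class WeatherReport(temperature: Int, humidity: Int)

object ColumnParsing {
  // Looks a column up by its String key and casts every element to Int.
  // A typo in the key or a wrong element type only fails here, at runtime.
  private def intColumn(columns: java.util.Map[String, java.util.List[_]], name: String): List[Int] = {
    val column: java.util.List[_] = columns.get(name)
    column.asScala.toList.map(_.asInstanceOf[Int])
  }

  def toReports(columns: java.util.Map[String, java.util.List[_]]): List[WeatherReport] = {
    val temperatures = intColumn(columns, "temperatures")
    val humidities = intColumn(columns, "humidities")
    // Pair the columns up by position; zip silently drops unmatched trailing entries.
    temperatures.zip(humidities).map { case (t, h) => WeatherReport(t, h) }
  }
}
```

Every additional column means another String key, another cast and another position to keep aligned.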
I don’t know about you, but I’d feel quite uncomfortable at the thought of maintaining the code above. It’s verbose, error-prone, brittle, annoying and it fails at the wrong spot if any mistakes are introduced (namely at the place of conversion, rather than the place of programming error). I wish the R function would just return a proper domain object instead.
Whoops, hold on… Wait a minute…
Why don’t we just make it do that?
The Solution: Bindings
GraalVM comes with an option that makes it possible to explicitly share object instances across the language divide: you can add symbols to bindings that are accessible to other languages. The Graal Context has two functions that can be used to do this in a very similar way: getBindings and getPolyglotBindings().
In this article I will be using getBindings, because it doesn’t require an explicit import on the side of the using language and it allows you to limit which languages you are exposing each binding to. Using getPolyglotBindings() is almost identical from a coding perspective though, so pick the one you like best.
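To make the difference concrete, here is a small sketch of both options side by side, assuming a GraalVM runtime with R installed. The symbol names greeting and sharedGreeting and the object name BindingsOptions are made up for this example.

```scala
import org.graalvm.polyglot.Context

object BindingsOptions {
  // Registers one symbol in each kind of bindings object.
  def share(context: Context): Unit = {
    // Option 1: language-specific bindings, exposed to R only,
    // where the symbol is available without an explicit import.
    context.getBindings("R").putMember("greeting", "hello from the JVM")

    // Option 2: polyglot bindings, shared by all languages in this context;
    // guest code has to import the symbol explicitly before using it.
    context.getPolyglotBindings().putMember("sharedGreeting", "hello, everyone")
  }

  def main(args: Array[String]): Unit = {
    val context = Context.newBuilder("R").allowAllAccess(true).build()
    try {
      share(context)
      // The language-specific binding can be referenced directly from R code.
      println(context.eval("R", "greeting"))
    } finally context.close()
  }
}
```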
Using Domain objects on both sides of the language divide
This is what our Domain object looks like:
Domain is basically a factory that can be used to spawn new instances of all the domain classes that we want to share. The class Domain itself is immutable! (As it happens, the spawned instances are too.)
WARNING: You probably don’t want to put a mutable object into bindings. If you do, this object can be mutated from any language that can reach it. Just as you don’t want multiple threads to tangle with the same mutable object, you don’t want multiple languages to access the same mutable state! (Really! Imagine having to debug race conditions across language boundaries...)
Any instance of the Domain class provides methods to spawn new instances of the following domain case classes:
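A minimal sketch of such a factory together with the classes it spawns. The names Domain, WeatherForecast and WeatherForecastList come from this article; the fields, the add method and the factory method names are assumptions made for illustration.

```scala
// Hypothetical fields: the article only establishes that some values are Strings and others Ints.
final case class WeatherForecast(description: String, temperature: Int, humidity: Int)

// Immutable collection of forecasts.
final case class WeatherForecastList(forecasts: List[WeatherForecast]) {
  // Returns a new list instead of mutating this one.
  def add(forecast: WeatherForecast): WeatherForecastList =
    WeatherForecastList(forecasts :+ forecast)
}

// Stateless factory: the single object a guest language needs to reach.
final class Domain {
  def weatherForecast(description: String, temperature: Int, humidity: Int): WeatherForecast =
    WeatherForecast(description, temperature, humidity)

  def weatherForecastList(): WeatherForecastList =
    WeatherForecastList(Nil)
}
```

Because nothing in here holds mutable state, handing the same Domain instance to several languages cannot cause the cross-language race conditions warned about above.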
Let’s put an instance of our Domain class into the bindings for R, so it can be accessed from the R guest language context:
Easy peasy. From R, the new object will simply be known as Domain and its methods will be accessible like this:
We turn a new R file, that uses this binding, into our newest
And then we define the function:
Now that this is our return type, all we need to do to work with the returned WeatherForecasts is this:
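Put together, the whole round trip could look roughly like the sketch below, assuming a GraalVM runtime with R installed. The domain fields, the method names and the inline R function are hypothetical, and the domain classes are repeated only to keep the example self-contained.

```scala
import org.graalvm.polyglot.Context

// Hypothetical domain, repeated here so the example stands on its own.
final case class WeatherForecast(description: String, temperature: Int, humidity: Int)
final case class WeatherForecastList(forecasts: List[WeatherForecast]) {
  def add(forecast: WeatherForecast): WeatherForecastList =
    WeatherForecastList(forecasts :+ forecast)
}
final class Domain {
  def weatherForecast(description: String, temperature: Int, humidity: Int): WeatherForecast =
    WeatherForecast(description, temperature, humidity)
  def weatherForecastList(): WeatherForecastList = WeatherForecastList(Nil)
}

object SharedDomainRoundTrip {
  // Hypothetical R function that builds its result through the shared Domain binding.
  val rCode: String =
    """function() {
      |  forecasts <- Domain$weatherForecastList()
      |  forecasts$add(Domain$weatherForecast("sunny", 21L, 60L))
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val context = Context.newBuilder("R").allowAllAccess(true).build()
    try {
      // Expose the factory to R under the name "Domain".
      context.getBindings("R").putMember("Domain", new Domain)

      // The R function hands back our own Scala object: no Maps, no casts per column.
      val forecasts: WeatherForecastList =
        context.eval("R", rCode).execute().as(classOf[WeatherForecastList])

      forecasts.forecasts.foreach(f => println(s"${f.description}: ${f.temperature} degrees"))
    } finally context.close()
  }
}
```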
That is one very happy Data Engineer! (Don’t forget to compare with the incomplete parsing above.)
Now, let’s see the impact on the Data Scientist side:
As we can see, the code has become more verbose (although it’s actually quite efficient still, if you take out all the clarifying comments I put in), but not quite as bad as in the previous solution:
In this R file, we now need to convert the data.frame to proper WeatherForecast instances to be added to the WeatherForecastList we also got from Domain. But rather than doing a Parse & Pray, as we had to do with the no-bindings solution, we can now use proper constructors that will fail with intelligible errors if we make a mistake. (Sadly still only at runtime, because this is still R.) Cleanly taking values out of the data.frame is also better supported by its native language, and we could add more convenience methods to create the domain classes more succinctly if we wanted to. If we have direct control over the function that creates the weather forecasts, we can even skip the data.frame altogether and exclusively use WeatherForecastList, which eliminates the extra code seen above.
The biggest advantage, though, is that we now have a very clearly defined interface. Any user can open up the Domain.scala file to see what methods are available, what parameters they take and what things they return.
Using Bindings to provide a clean shared domain between guest languages (like R or Python) and JVM languages (like Scala, Java or Kotlin) in GraalVM is pretty easy and gets rid of a lot of ugly and fault-sensitive parsing. It also provides a crucial stepping stone for further integration of functionalities across language boundaries.
PS: I could have added a factory for each separate domain class to the bindings, instead of giving them a shared factory. This can make the code on the R side a little shorter, but creates a less clean interface (at least to my taste).