Cloud Dataflow can autoscale R programs for massively parallel data processing
Cloud Dataflow is a fully-managed service for executing data processing jobs. You don’t have to configure or manage any VMs — instead, you simply code up your data processing pipeline and lob it off to the Google Cloud Platform. The processing is autoscaled (to lots of machines) and executed in a distributed fashion on the GCP computing infrastructure.
This is very useful if you have an embarassingly parallel problem, since you only pay for the time you use those machines — running a small cluster of 10 machines for 1 hour costs the same as running a larger cluster of 100 machines for 6 minutes, and who wants to wait 60 minutes when you can have your results in six?
Cloud Dataflow currently supports pipeline programming in two languages — Java and Python (the APIs themselves are open source, in the form of Apache Beam). This is good, but what if you tend to do your analytics in R?
Can you process R scripts at massively parallel scale using Dataflow? Yes, you can, using Renjin and Java’s ScriptEngine framework to invoke R programs from Java — use this approach instead of Rjava or some other R-to-Java solution you may be familiar with because you won’t be able to install R on the Dataflow machines (remember that Dataflow is an autoscaling framework, and you don’t get to configure the VMs.)
Detailed steps for how to do this (you can see my Dataflow project that demonstrates these steps on github):
In your Java Dataflow project, to your pom.xml, add the repository for Renjin and dependencies for R and Cran (to get R libraries):
<name>bedatadriven public repo</name>
Write your R program as normal. Java’s ScriptEngine framework will let you input values to R and get the value of R variables and results. My example makes use of a R library to test whether the given set of numbers is drawn from an exponential probability distribution:
co.exp.test(x, simulate.p.value=FALSE, nrepl=2000)
In your Java pipeline, instantiate a ScriptEngine and execute the R program. The R program will be packaged along with your Java code, so that can find it from the classpath (again, absolute paths won’t work because this is an autoscaling framework and you have no way to access the local drive on Dataflow machines):
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("Renjin");
InputStream rprog = CallingRFromJava.class.getResourceAsStream("myprog.r");
Push in the variables that your R program needs (mine needs x, which is an array of doubles), run the R program, and get back the result from R to pull out the data you need. In my example, the exponential test returns a list of which the p-value is the second number. So, I have:
double inputx = c.element(); // from input
// run R program, get output from R, send to Dataflow
ListVector result = (ListVector) engine.eval(new InputStreamReader(rprog));
double pvalue = result.getElementAsDouble(1);
Execute the Dataflow pipeline as normal, whether locally or on the cloud.
See the github repo for scripts to create a Maven project, the full code for the Dataflow pipeline, and how to run it on the cloud.
P.S. A colleague suggested that there is another possible approach to running R programs in Dataflow, since the Python version of Dataflow allows users to install non-Python dependencies. You would add something like:
apt-get install -y r-base
pip install rpy2
to a setup.py and include it in the launch script for the Python job. I haven’t tried that though, and it is not clear to me how you would install R libraries using this approach. If you do try this second approach, and it works, let us know in the comments!