Cloud Dataflow can autoscale R programs for massively parallel data processing
Lak Lakshmanan

I successfully ran R with the Apache Beam Python SDK. It's a bit trickier than just adding 'apt-get install -y r-base' and 'pip install rpy2' to the setup.py, because those commands don't necessarily get executed in that order, or so it seems. Instead, I had to install rpy2 from within the pipeline itself. I also installed the R library I needed on each node from within the pipeline:

import pip
pip.main(['install', 'rpy2'])
from rpy2.robjects import r
r('if (!require(stringr)) {install.packages("stringr", repos="http://cran.us.r-project.org"); library(stringr)}')
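The install-at-runtime trick can be factored into a small helper that a DoFn calls when it first needs a package, so the install happens on each worker only when the code actually runs there. This is a sketch of the pattern only: the helper name `lazy_import` is my own, and it uses `python -m pip` via a subprocess rather than `pip.main` (which newer pip versions no longer expose).

```python
import importlib
import subprocess
import sys

def lazy_import(package, pip_name=None):
    """Import `package`, pip-installing it first if the worker lacks it.

    Calling this from inside a DoFn (rather than at module import time)
    defers the install until the code actually executes on a worker,
    sidestepping ordering issues in worker startup commands.
    """
    try:
        return importlib.import_module(package)
    except ImportError:
        # Install into the worker's Python environment, then retry.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or package])
        return importlib.import_module(package)

# Inside a Beam DoFn one might then write, e.g.:
#   robjects = lazy_import("rpy2.robjects", pip_name="rpy2")
#   r = robjects.r
```

On a worker that already has the package, the subprocess call is skipped entirely, so the helper adds no overhead after the first element is processed.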

I wrote a word length example. You can see it here.
