Sparklyr (R interface for Spark) and Kerberos on Cloudera

Below is a way (but probably not the only way) to set up and run sparklyr on a Cloudera cluster that has Kerberos enabled. It's not actually that difficult to set up, but there seems to be little documentation on how to do it.

A quick note on sparklyr: sparklyr is an R interface for Spark that is freely available from RStudio (http://spark.rstudio.com/). It is arguably a better implementation of Spark in R than SparkR: it has a dplyr backend and lets you run Spark SQL from R through the familiar DBI interface. If you really want to use Spark and R, this is probably the best way to go.

Ingredients

Installing R and RStudio (skip if you’ve already done this)

This is mostly for completeness, and maybe for those who have never done this before. Before we can use sparklyr, we need to install R. It's pretty common to use RStudio as an IDE in the R world, so I'll go over how I installed that as well.

We typically want to install both R and RStudio on a gateway or edge node. In the Hadoop world, this is typically a managed host that users connect to in order to do their work, and it may run client applications (like RStudio). It usually contains configuration files for Hadoop components (Spark, Hive, HDFS, etc.) but does not run the daemons themselves. This is in contrast to master nodes, which run critical services like the HDFS NameNode, YARN ResourceManager, or Hive Metastore, and worker nodes, which run processes like HDFS DataNodes, YARN NodeManagers, or Impala daemons.

In the Cloudera world, this process is easiest if we have a Cloudera Manager-managed node that operates as our gateway node. This is because we can tell Cloudera Manager to push client configuration files to this node and keep them synchronized as we update the cluster. It also (if you're using parcels, which you should if possible) pushes the parcels to the host, which means you won't have to worry about managing the JARs needed to run Spark, Hive, HDFS, etc. on your own. It just makes things easy. My gateway node has three roles on it: the HDFS Gateway, Hive Gateway, and Spark Gateway. At a minimum, you will want all of these roles on your node for sparklyr to work as expected.

Gateway with HDFS, Hive, and Spark Gateway roles

Once you know which node you're going to use and have it set up as above, it's time to install R and RStudio. Huge caveat here: I'm not an R or RStudio expert, and I expect there are much better ways to accomplish this task. However, this is how I did it, and it worked for me. If you have experience here, by all means do it the right way.

Here are all the commands I had to run, in order, on my AWS EC2 RHEL 6.6 host (yes, it's end of life; I didn't realize I was using this AMI):

sudo yum-config-manager --enable rhui-REGION-rhel-server-extras rhui-REGION-rhel-server-optional
sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum update
sudo yum clean all
sudo yum install libcurl-devel lapack gcc-gfortran tetex texinfo libicu
sudo rpm -Uvh http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm
sudo yum install http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm
sudo rpm -Uvh http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
sudo rpm -Uvh http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-14.el6.x86_64.rpm
sudo yum install R
sudo wget https://download2.rstudio.org/rstudio-server-rhel-1.0.44-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-1.0.44-x86_64.rpm

Assuming all of that worked, you should have R installed as well as RStudio, which should be running on port 8787 of the server you installed it on:

RStudio is up

By default, you can log in to RStudio with a user account and password that exist on the host. I'm sure there are ways to authenticate differently, but for the sake of this example I left it as is.

Set up R to know which Spark to use

Assuming you used Cloudera Manager and have the gateway roles listed above installed, there isn't a ton to do here. We need to modify our Renviron file to point to the correct Spark location. To do so, we first need to find the Renviron file:

[ec2-user@ip-10-0-0-29 ~]$ sudo locate Renviron
/usr/lib64/R/etc/Renviron
/usr/lib64/R/library/base/html/readRenviron.html
[ec2-user@ip-10-0-0-29 ~]$

Mine is located in /usr/lib64/R/etc/Renviron. We need to add just a single line to this file:

sudo vim /usr/lib64/R/etc/Renviron

Inside the file, add:

SPARK_HOME=${SPARK_HOME-'/opt/cloudera/parcels/CDH/lib/spark/'}

Because we used parcels and this is a Cloudera Manager-managed host, we can point to the above location. This gives us the correct JARs, the correct binaries, and all of the configuration files (through the Gateway roles we deployed) that we will need for sparklyr to work.
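To confirm that R actually picks this up, here's a quick sanity check you can run from any R session (assuming the Renviron path and parcel location above):

readRenviron("/usr/lib64/R/etc/Renviron")

# Should print the parcel path we set above:
# /opt/cloudera/parcels/CDH/lib/spark/
Sys.getenv("SPARK_HOME")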

Create a Kerberos keytab for use with Spark

Before we can use sparklyr, we need a keytab that will allow us to authenticate and actually do anything. To create one, run ktutil on your gateway host:

ktutil
ktutil: addent -password -p bkvarda@CLOUDERA.INTERNAL -k 1 -e rc4-hmac
Password for bkvarda@CLOUDERA.INTERNAL: [enter your password]
ktutil: addent -password -p bkvarda@CLOUDERA.INTERNAL -k 1 -e aes256-cts
Password for bkvarda@CLOUDERA.INTERNAL: [enter your password]
ktutil: wkt bkvarda.keytab
ktutil: quit

You should now have a keytab, as shown below:

My keytab located at /home/bkvarda/bkvarda.keytab

You’ll obviously want to validate that this works. You should be able to kinit successfully with your keytab:

[bkvarda@ip-10-0-0-29 ~]$ kinit bkvarda@CLOUDERA.INTERNAL -k -t bkvarda.keytab
[bkvarda@ip-10-0-0-29 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_501
Default principal: bkvarda@CLOUDERA.INTERNAL
Valid starting     Expires            Service principal
11/03/16 22:49:23 11/04/16 08:49:21 krbtgt/CLOUDERA.INTERNAL@CLOUDERA.INTERNAL
renew until 11/10/16 21:49:23
[bkvarda@ip-10-0-0-29 ~]$

Once we have this, we’re ready to use sparklyr!

Installing the sparklyr library and starting your first Spark session in R

Log in to RStudio. If you aren't using RStudio, you can probably just do this in your R shell, but I haven't tried that and will be focusing on RStudio here.

To install sparklyr, run:

install.packages("sparklyr")

This will take a little while; when it's done, there should be no errors complaining about dependencies, and it should say something along the lines of "all done".

Then we load sparklyr:

library(sparklyr)

Next, we read in the environment variables defined in Renviron:

readRenviron("/usr/lib64/R/etc/Renviron")

Then we create the sparklyr equivalent of a SparkContext, using the principal and keytab we created earlier:

sc <- spark_connect(
  master = "yarn-client",
  version = "1.6.0",
  config = list(default = list(
    spark.yarn.keytab = "/home/bkvarda/bkvarda.keytab",
    spark.yarn.principal = "bkvarda@CLOUDERA.INTERNAL"
  ))
)
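Equivalently, you can build the same settings with sparklyr's spark_config() helper, which some find more readable; a sketch assuming the same keytab path and principal:

# Start from sparklyr's defaults and add the Kerberos settings
config <- spark_config()
config$spark.yarn.keytab <- "/home/bkvarda/bkvarda.keytab"
config$spark.yarn.principal <- "bkvarda@CLOUDERA.INTERNAL"

sc <- spark_connect(master = "yarn-client", version = "1.6.0", config = config)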

If this is successful, it should happen pretty quickly. If something is wrong, it usually takes a while to time out. Anything that mentions "GSSException" is going to be due to something wrong with Kerberos: it could be your keytab, a mistyped principal, and so on. If there are no errors, type in the following:

sc

You should get back a bunch of information: a SparkContext object, a HiveContext object, a DBI connection, and so on. It looks like this:

Output from running ‘sc’

In addition, assuming the Hive Gateway role has been installed and nothing is wrong, you should be able to see the available tables in the top right:

Tables available through HiveContext
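You can also list those tables programmatically with dplyr's src_tbls(), which works against a sparklyr connection:

library(dplyr)

# List the Hive tables visible through this Spark connection
src_tbls(sc)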

We could then use something like DBI to read from a table:

library(DBI)
test <- dbGetQuery(sc, "SELECT * FROM default.test_table")
test

You should see something like the below (my test_table has only 3 rows):

Putting it all together (with some typos included)
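Because sparklyr has a dplyr backend, you can also query the table without writing any SQL; a minimal sketch, assuming the same default.test_table:

library(dplyr)

# Lazily reference the Hive table; the rows stay in Spark until collected
test_tbl <- tbl(sc, "test_table")

# Pull the results back into a local R data frame
collect(test_tbl)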

And if we look at running applications in YARN in Cloudera Manager, we should see this:

sparklyr in YARN

We even get access to the Spark logs through the RStudio interface for easy troubleshooting. Just click the 'log' icon in the Spark pane in the top right:

Spark logs available through RStudio
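If you'd rather stay in the console, sparklyr can fetch the same logs directly, and when you're finished you should close the connection so YARN releases the application's resources:

# Print the most recent 50 lines of this application's Spark log
spark_log(sc, n = 50)

# Shut down the Spark application and disconnect
spark_disconnect(sc)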

That's all for now; hopefully this helped you get set up. If you have corrections or comments, please leave them below!
