How to build an efficient R API?

Jean-Baptiste Pleynet
Apr 24, 2019

This article is based on my (painful) experience of trying to build an industrial-grade R API. During this journey, I discovered that this specific use of R is not well documented, which is what led me to share my lessons learned here.

To illustrate the article, let's pretend that we want to create a financial API: the caller asks for a specific asset at a given date, and the API computes and returns indicators such as the average or the volatility.
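For illustration only, the request and response could look something like this (the field names and values are assumptions for this article, built with jsonlite as in the examples further down):

library(jsonlite)

# Purely illustrative request and response shapes (field names and values are made up)
request  = list(Assets = "ACME", Date = "2019-04-24")
response = list(Status = "Succeed",
                Value  = list(Average = 101.3, Volatility = 0.021))

toJSON(request,  auto_unbox = TRUE)   # {"Assets":"ACME","Date":"2019-04-24"}
toJSON(response, auto_unbox = TRUE)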

Optimising R code

The first thing to do is, of course, to have an efficient R script from the start.

A lot of documentation already exists on this topic (just search for it on Google), so I will not go into too much detail. Some hints: avoid for loops (use the apply family instead), use parallel computation (the parallel package and its mclapply function), compile your functions (the compiler package and its cmpfun function), etc.
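As a minimal sketch of these hints (the volatility helper and the list of price series are purely illustrative):

library(parallel)
library(compiler)

# A hypothetical indicator: volatility of daily log returns
daily_volatility = function(prices) sd(diff(log(prices)))

# apply-style instead of a for loop
vols = sapply(list_of_price_series, daily_volatility)

# Parallel version (mclapply relies on forking, so it is Unix-only)
vols = mclapply(list_of_price_series, daily_volatility, mc.cores = 4)

# Byte-compile a frequently called function
daily_volatility_cmp = cmpfun(daily_volatility)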

Dockering and Kubernetesing your program

One major step towards industrialising the process is to pack all your code into a Docker image, so that you can deploy it quickly and in a reproducible manner.

The real pleasure with Docker is that once it works on your computer, it will work anywhere in the cloud (network problems aside).

A lot of literature already exists on how to run R in Docker. Personally, I use Rocker. Using Docker can be a bit tricky at the beginning, but once you get the hang of it, it really saves you a lot of time and energy.

The good companion to Docker is Kubernetes. Kubernetes automates a lot of the deployment of Docker containers and provides some interesting native features, such as autoscaling and load balancing.

Most cloud service providers offer an easy-to-use (as easy as it can be) managed Kubernetes. Personally, I use Google Cloud (because I also use a Google Cloud SQL database and I find AWS far too complicated for a newbie).

You can then deploy as many containers as you think are needed in parallel, and the load balancer will handle the orchestration. Keep in mind that autoscaling is interesting but a bit slow to react: it is useful for a progressive increase in load, which was not applicable in my case.

In case it can help, here is an extract of my Dockerfile:

FROM rocker/r-base

# Install some needed Ubuntu packages
RUN apt-get update && apt-get install -y libcurl4-openssl-dev libssl-dev libmysqlclient-dev

# Install only one package
RUN R -e 'install.packages("Rserve",,"http://rforge.net/",type="source")'

# Run a script that contains all the packages to install
RUN Rscript "ToInstall.R"

# Because I use Google Cloud SQL, I need to download this connector
RUN wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy && chmod +x cloud_sql_proxy

# The old launcher, used with plumber
# CMD ./cloud_sql_proxy -instances=InstanceName -credential_file='MyJSONCredentials.json' & Rscript "MyPlumberAPIScript.R"

# The new launcher, using Rserve and an entry point application
CMD bash ./OneCustomScript.sh & R CMD Rserve --RS-conf "./Rserv.conf" --RS-port 6311 --no-save & /usr/bin/java -jar /opt/r-runner/r-runner.jar

Using plumber

Plumber is the first step in transforming an R program into a proper API.

Plumber is simple to set up and reliable.
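As an example, a minimal plumber endpoint for the financial API described above could look like this (the file name, the route and the data-loading helper are assumptions):

# endpoints.R
#* @get /indicators
function(asset = "", date = "") {
    prices = load_prices(asset, date)   # hypothetical helper that fetches the price series
    list(Average = mean(prices), Volatility = sd(diff(log(prices))))
}

# Launcher (in a separate script):
# pr = plumber::plumb("endpoints.R")
# pr$run(host = "0.0.0.0", port = 8000)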

But it suffers from a major weakness: it is single-threaded. Whatever you do downstream (even if you parallelise your own code), plumber is a single-threaded entry point, meaning that if your code takes 2 seconds to produce an answer, you can only handle one request every 2 seconds, no matter how powerful your server is.

The only way to get around this is to use the autoscaling and load-balancing capabilities of Kubernetes: deploy multiple containers, each with a single-threaded plumber inside. It works, but it is very memory expensive.

Entry point application

The other idea is to use R for what it is good at, calculation, and to use another language (Java, for example) to provide the API and act as a front end or entry point for R.

So calls arrive in a Java app, this Java app launches an independent thread that calls an R program, the R program does the calculations and sends back the results, and Java returns them as the answer.

This sounds more efficient, but the question remains: how do you hand the calculation over to an R script?

One solution is rather simple: the Java program calls an R script using the "Rscript" system command, and passes the content of the call as a single argument, in one string.

This is easy to put in place, but in my case it suffered from a serious problem: if, every time the script starts, you have to load multiple libraries and define a lot of custom functions and constants, this can take seconds. That is far too much time and computing power lost on every call.

Here is a simple example of an R script that can answer such a call:

# Called as: Rscript this_script.R '<JSON string>'
args = commandArgs(trailingOnly = TRUE)
invisible(capture.output(library(jsonlite, quietly = TRUE)))
if (length(args) != 1) {
    resultat_final = list(Status = "Error", Error = "The script must have at least 1 argument")
} else {
    recuperation = fromJSON(args[1])
    resultat_final = list(Status = "Succeed", Value = paste0("Assets : ", recuperation$Assets))
}
resultat_JSON = toJSON(resultat_final, auto_unbox = TRUE)
resultat_JSON

Using Rserve

Rserve is a powerful tool that lets you launch an R server locally and call it to delegate computations from another programme (typically in Java).

This has 2 main advantages:

  • Rserve will launch a new thread (by forking) for every request sent. So this is naturally multi-threaded.
  • You can preload things into your Rserve instance that will be available for every future computation, so you are able to load all your libraries and functions before launching Rserve.

Those two elements made Rserve the perfect candidate in my quest for performance. I now have an API that is multi-threaded, memory efficient and time efficient.
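As a sketch of the second point, the Rserv.conf referenced in the Dockerfile above can point to a preload script, for instance with a line such as "source /opt/app/init_rserve.R" (the paths, file names and packages below are assumptions):

# init_rserve.R -- everything defined here is loaded once and inherited by every forked request
library(jsonlite)
library(DBI)                  # assuming a DBI-based MySQL driver is used

source("MyFunctions.R")       # hypothetical file with all custom functions and constants

connection_pool_size = 15     # used below for the database connection pool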

Here is an example of a function I defined:

My_function = function(original_request) {
    request = fromJSON(original_request)
    [Do stuff]
    JSON_result = toJSON(final_result)
    return(JSON_result)
}

And here is an example of the call in Java:

RConnection c = new RConnection(rserveHost, rservePort);
REXP result = c.eval(function + "('" + json + "')");
return result.asString();

Bonus: Use database connections

One other problem I had: in order to do the calculations, I needed data from a MySQL database.

This operation is time-consuming, but there are multiple ways to optimise it:

  • Delegate as much of the work as possible to the database engine. You need to request 3 chunks of data, then filter and order them? Put that in the SQL query, and try to do it in as few SQL requests as possible, with as much filtering and joining as possible inside them. Optimising an SQL query is a science in itself, but it can greatly improve efficiency (see the sketch after this list).
  • Use the caching ability of the database: if you send the same request several times, the database will not recompute the result from the data but will serve it from the cache where it stores previous results. This can save a significant amount of time.
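As a sketch of the first point, assuming a DBI-based connection and a driver that supports parameterised queries (the table and column names here are made up):

library(DBI)

get_prices = function(con, asset_id, as_of_date) {
    # One single query: the filtering and ordering are done by the database engine
    dbGetQuery(con,
        "SELECT trade_date, price
           FROM prices
          WHERE asset_id = ? AND trade_date <= ?
          ORDER BY trade_date",
        params = list(asset_id, as_of_date))
}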

Last point: you obviously need to open a connection to the database in order to request the data. This operation is not very long, but it is long enough to be worth optimising.

With Rserve, you can define variables that you will use afterwards. But if you define a single connection and use it for every call, you will have a problem, because you cannot use the same connection in multiple threads at the same time.

The solution I chose was to create a pool of connections.

In the Rserve initialisation, I define several connections to the same database, more than the maximum number of threads. For example, if I allow a maximum of 10 threads in my Java application, I create 15 connections.

After that, when a calculation is requested, the first thing I do is pick one connection from the pool. To do that, I use the ID of the process launched (by fork) by Rserve to do the calculation. Those IDs are, in my experience, incrementing. So, by doing that, I am reasonably sure that two threads will never use the same connection at the same time.

Here is the code to illustrate that. Definition of the pool:

pool_connections = list()
for (i in 1:connection_pool_size) {
    pool_connections[[i]] = create_DB_connection()
}

And selection of a connection for a given call:

select_DB = function(){
    number = (Sys.getpid() %% (connection_pool_size - 1)) + 1
    connectionDB = pool_connections[[number]]
    return(connectionDB)
}
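Each computation then simply picks its connection at the start of the call, for example (a sketch based on the function shown earlier):

My_function = function(original_request) {
    request = fromJSON(original_request)
    connectionDB = select_DB()
    # ... query the data with connectionDB and compute the indicators ...
}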

Bonus 2: Use chronometers and report them

Another interesting hint: put as many timers as you can in your code, and report them as part of the result of the call, for example in a "debug" node of your JSON response.

By doing that, you will be able to track what is really happening in your target infrastructure and when the system is under stress. This is very useful for seeing where it can be optimised: does it take time to fetch the data? Strengthen the database. Does it take time to compute? Add more CPU power, and so on.

Here is how I measure time:

beginning <- proc.time()
[Do stuff]
debug[["StuffName"]] = as.numeric(proc.time()[3] - beginning[3])
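To make these timings visible to the caller, the debug list can then be attached to the response, for example (a sketch consistent with the JSON structure used earlier; "value" stands for the result computed above):

resultat_final = list(Status = "Succeed", Value = value, Debug = debug)
toJSON(resultat_final, auto_unbox = TRUE)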

Conclusion

Building an R API was a challenge, first because R is not made for that, and second because, for that very reason, it is not well documented on the internet.

But it is possible! R is a powerful calculation tool, and building an API that uses all the libraries and the power of R is entirely feasible.

I hope this helped. In any case, do not hesitate to comment and to add anything that is missing.

PS: A great thanks to Kevin and Guillaume for their review.


Jean-Baptiste Pleynet

An actuary by training, he has been working for life insurers in Luxembourg for five years; passionate about blockchains, he loves making this technology accessible.