Business analytics on the cloud — a scalable model with R

Sahaj Software
inspiringbrilliance
6 min read · Aug 8, 2018

By Praveen C

The language R lets you write all kinds of data analysis functions and perform the math that goes with them. Becoming effective with R, however, requires programming knowledge, and not everyone has the time or the inclination to play with R functions. The need was to build a platform that makes this analysis a point-and-click activity.

The Problem

Web applications are designed to be used by many users concurrently and must scale accordingly. Using R as the backend for such processing poses a serious issue, because R is single-threaded. Our first challenge was to use R in this execution environment while still being able to scale the web application.

The other challenge was twofold: enabling business users who are not programmers to run analytic functions without writing code, and efficiently processing and presenting large datasets in the browser.

Out-of-the-box solutions

  1. Microsoft R Server: Microsoft offers a black-box solution with Microsoft R Server. However, this becomes a costly proposition, both monetarily and in terms of latency when executing functions via the browser.
  2. Spawning a new R runtime per request: Loading functions dynamically and spawning new runtimes is expensive. To do it effectively, you have to bear the burden of engineering it to scale.
  3. R web packages (e.g. Shiny): Paid professional versions of these packages are available but get expensive at scale.
  4. Lack of stable native message-queue client libraries: The unavailability of these libraries makes plugging into a horizontally scalable architecture difficult.

Picking the one that fits

A few factors heavily influenced the solutioning:

  1. With an ever-growing number of functions, many of which are interdependent and need external libraries, any solution that has to spawn a new execution environment won't be responsive enough. We needed a cluster of R processes with all functions and libraries loaded in memory.
  2. Given the above, the best architectural fit was an asynchronously scalable cluster of R processes talking to a message queue.

Scaling with R on the cloud

Building the Message Queue Client for R

There had been a few attempts at writing R queue client libraries on the web, but they had fizzled out. Additionally, we were looking at a client for RabbitMQ, which didn't have an R client distribution. After further research, we came across the rJava library, which lets R interface with a Java runtime over JNI (Java Native Interface). We decided to write a thick wrapper above the RabbitMQ Java client, the idea being to reduce JNI calls and complete queue operations in fewer method invocations. Our custom-written R queue client library interfaces with this coarse-grained Java message queue client, picks up execution messages from the queue, and processes them. After processing, a summary of the execution result is posted back to the queue using the same library.
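The rationale for a thick, coarse-grained wrapper can be illustrated with a minimal Python sketch (all names here are hypothetical; the real client wraps the RabbitMQ Java library via rJava). The point is that each queue operation completes in a single method call, so a bridge like JNI is crossed as few times as possible:

```python
import json
import queue

class CoarseGrainedQueueClient:
    """Hypothetical sketch: one call per queue operation, so an R-to-Java
    bridge like rJava crosses the JNI boundary as few times as possible."""

    def __init__(self):
        # Stands in for a RabbitMQ channel; a real client would connect here.
        self._queue = queue.Queue()

    def publish(self, routing_key, payload):
        # Serialize and enqueue in a single call.
        self._queue.put((routing_key, json.dumps(payload)))

    def receive(self, timeout=1.0):
        # Dequeue, decode, and return the whole message in one call,
        # instead of separate calls for body, headers, and ack.
        try:
            key, body = self._queue.get(timeout=timeout)
        except queue.Empty:
            return None
        return {"routing_key": key, "payload": json.loads(body)}

client = CoarseGrainedQueueClient()
client.publish("r.execute", {"function": "subset", "dataset": "sales.csv"})
msg = client.receive()
```

A fine-grained client would instead expose separate calls for fetching the body, reading headers, and acknowledging, each of which would be a round trip across the JNI boundary.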

Point-and-click Function Execution

The user executes a function by choosing the data columns they want to work with and searching for or clicking the appropriate function in the menu bar.

The function execution message essentially contains the metadata needed for processing: the dataset and column references, function details, and user input parameters. This metadata is passed to a Java service, which is the interface between the queue and the R execution environment. Based on the function chosen for execution, the service populates additional metadata, e.g. a flag indicating whether the function generates a new dataset (like a Subset operation) or modifies the existing one (like Treat Missing by Mean).

It also stamps each execution with a unique ID and stores it in the database. Finally, a message encompassing the function execution metadata and the execution ID is pushed to a queue for consumption by R.
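A Python sketch of what such an execution message might look like (field names and the dataset-flag rule are illustrative assumptions, not the service's actual schema):

```python
import json
import uuid

def build_execution_message(dataset_ref, columns, function_name, params):
    """Hypothetical sketch of the execution message pushed to the queue:
    function metadata plus a unique execution id, which is also stored in
    the database so status can be tracked later."""
    execution_id = str(uuid.uuid4())
    message = {
        "executionId": execution_id,
        "dataset": dataset_ref,
        "columns": columns,
        "function": function_name,
        "params": params,
        # Flag populated by the service based on the chosen function, e.g.
        # Subset creates a new dataset, Treat Missing by Mean modifies it.
        "createsNewDataset": function_name == "subset",
    }
    return execution_id, json.dumps(message)

exec_id, payload = build_execution_message(
    "sales.csv", ["revenue"], "subset", {"rows": 100})
```

Only this small JSON payload crosses the queue; the dataset itself never does.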

REngine — Preload functions and optimization

We wrote an REngine component in R which orchestrates message execution. On start, it loads all functions and libraries into memory and polls indefinitely for messages on the configured message queue, talking to the queue through the R queue client library we wrote. Once a message is received, it reads the metadata of the function to be executed, looks up and executes the preloaded function with the parameters that came in the message, and writes the output to the file system.
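The shape of that loop can be sketched in Python (the registry, message fields, and in-process queues here are stand-ins; the real REngine is written in R, preloads R functions, and polls RabbitMQ):

```python
import json
import queue

# Hypothetical function registry: in the real REngine, all R functions and
# their libraries are loaded into memory once at startup.
FUNCTIONS = {
    "mean": lambda values, **_: sum(values) / len(values),
    "max": lambda values, **_: max(values),
}

def run_engine(in_queue, out_queue, max_messages):
    """Sketch of the REngine poll loop: read a message, look up the
    preloaded function, execute it, and report the result."""
    for _ in range(max_messages):
        msg = json.loads(in_queue.get())
        fn = FUNCTIONS[msg["function"]]
        result = fn(msg["values"], **msg.get("params", {}))
        # The real engine writes output files to shared storage; here we
        # just post a status summary back, as the REngine does.
        out_queue.put(json.dumps({"executionId": msg["executionId"],
                                  "status": "complete",
                                  "result": result}))

inq, outq = queue.Queue(), queue.Queue()
inq.put(json.dumps({"executionId": "e1", "function": "mean",
                    "values": [2, 4, 6]}))
run_engine(inq, outq, max_messages=1)
reply = json.loads(outq.get())
```

Because the registry is built once at startup, per-message work is just a lookup and a call, which is what makes the cluster responsive.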

Input and Output file delivery

When processing completes, only the execution status is sent back to the queue, where it is read and updated in the database by the upstream Java service. If the execution is complete, the message also carries links to the output files generated by the execution, which are written to a cloud file storage system. In this case, we used Azure File Storage (AFS), mounted as a local drive on the R servers as well as on the Nginx servers used for delivering input and output files to the UI. The UI uses the links received in the response to fetch the output files through the Nginx web servers.
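A small sketch of that completion message (the field names and base URL are hypothetical): only status and links travel through the queue, while the files themselves stay on the shared drive and are served by Nginx.

```python
import json

def completion_message(execution_id, output_files, base_url):
    """Hypothetical sketch: the reply carries only the status and links to
    outputs on the mounted cloud drive, never the file contents."""
    return json.dumps({
        "executionId": execution_id,
        "status": "complete",
        "outputs": [f"{base_url}/{name}" for name in output_files],
    })

msg = json.loads(completion_message(
    "e1", ["result.csv", "plot.png"],
    "https://files.example.com/outputs"))
```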

Big Data Processing

Using an assessment of the data size and the characteristics of the function to be used, the system determines the optimal path for processing. The solution has multiple flavors of back-end processing: clusters of servers that can process R or Python functions, and high-performance Spark-based clusters for big data. We used the SparkR package as the interface from R to Spark. The other mechanisms for processing function execution requests and delivering output remain the same for Spark processing.
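The routing decision might look like the following Python sketch (the threshold and the set of Spark-only functions are invented for illustration; the real service's rules are its own):

```python
def choose_backend(row_count, function_name,
                   spark_threshold=5_000_000,
                   spark_only=frozenset({"distributed_join"})):
    """Hypothetical routing rule: send big datasets or Spark-only functions
    to the Spark cluster, everything else to the in-memory R/Python workers."""
    if row_count > spark_threshold or function_name in spark_only:
        return "spark"
    return "r-cluster"

backend = choose_backend(row_count=10_000_000, function_name="mean")
```

Because the message format and output delivery are identical for both paths, the router is the only component that needs to know the difference.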

Handling large amounts of data

In most cases, users just need a feel for the data, not necessarily every row in the dataset. Of course, they want to see all the columns to infer which variables they can analyze, and get a feel for the data in each column. We use summary-like functions in R to give users enough information about the dataset, such as the number of columns and relevant information about each column (data type, mean, max, number of missing values, etc.), and show only the first few hundred rows in the browser. We also use efficient JavaScript streaming libraries like Papa Parse to bring only a limited set of rows onto the UI, reducing network transfer and improving performance. Even when users execute functions by choosing columns, only the metadata of the dataset reference and column names is passed to the backend for execution, along with the given input parameters.
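The kind of summary the UI receives can be sketched in Python (in the product this is computed by summary-like functions in R; the structure below is an assumption for illustration):

```python
import statistics

def summarize(dataset, preview_rows=3):
    """Sketch of a dataset summary: per-column type, mean/max for numeric
    columns, missing-value count, plus only the first few rows."""
    columns = {}
    for name, values in dataset.items():
        present = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        columns[name] = {
            "type": "numeric" if numeric else "text",
            "missing": len(values) - len(present),
            "mean": statistics.mean(present) if numeric else None,
            "max": max(present) if numeric else None,
        }
    # Only a small preview of rows ever reaches the browser.
    preview = {name: values[:preview_rows] for name, values in dataset.items()}
    return {"columns": columns, "preview": preview}

data = {"revenue": [10, 20, None, 40], "region": ["N", "S", "E", "W"]}
summary = summarize(data)
```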

Keeping UI “thin” — Making the platform extensible

We went with a single-page app and used popular UI libraries like React and Redux, which performed well and led to nicely factored code. We also followed a HATEOAS style of API contract between the UI and server: the UI follows navigation links sent from the server, keeping client-side logic minimal. As we port the app to native mobile apps, this has become a life saver!
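A minimal sketch of a HATEOAS-style response (the paths and field names are hypothetical): the server embeds the next navigation links, so the client follows them instead of hardcoding routes.

```python
def execution_response(execution_id, status):
    """Hypothetical HATEOAS-style payload: which links appear depends on
    state, so the UI never has to know the route structure."""
    links = {"self": f"/executions/{execution_id}"}
    if status == "complete":
        # Outputs are only navigable once the execution has finished.
        links["outputs"] = f"/executions/{execution_id}/outputs"
    else:
        links["status"] = f"/executions/{execution_id}/status"
    return {"executionId": execution_id, "status": status, "_links": links}

resp = execution_response("e1", "complete")
```

Any client, web or native mobile, renders whatever links arrive, which is why the same contract carried over to the mobile ports.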

Wrapping up

We worked around the inherent problem of scaling single-threaded R by plugging it into a horizontally scalable architecture using a message queue. To achieve this, we had to write our own message queue client as well as a message orchestrator/executor in R with all functions and associated libraries preloaded in memory. The architecture moves metadata alone across layers; the data always stays close to execution, making smart use of cloud network storage solutions like AFS. For processing big data, we set up Spark clusters with a SparkR interface, and the upstream Java service routes requests based on data size and function characteristics. The data shown on the UI is representative but minimal. These design considerations provide a seamless experience of analytic function execution in the browser, in the context of courses or business simulations undertaken by the user. Unlike most existing products, which run on the user's desktop, these approaches have helped move a business analytics solution using R to the cloud.
