As a data science startup, we write a lot of R scripts. Since we often work with very large amounts of data, our R scripts usually have high CPU and Memory usage. Moreover, these R scripts may take hours or even days to finish. It doesn’t, therefore, make sense to run them on our workstations. That’s why we built a small in-office cluster.
Our cluster consists of 5 dual-core machines running Ubuntu. The individual machines are not exactly powerful but they were sufficient for the amount of data we used to have. The way to run a new script was to SSH to one of the cluster’s machines, make sure that the machine is not already running a script, move the data to this machine and finally start the R script. This is obviously awful for multiple reasons. First: If you wish to run different scripts on the same data, you’ll have to duplicate the data on multiple machines. Second: The scheduling task is time wasting and can be automated. Third: The same environment has to be maintained on multiple machines, if we install a new module it has to be installed on all the cluster machines. Fourth: There isn’t any way to view the output of the scripts without pulling the data from the cluster.
Dockerizing the Environment
The first problem we tackled was the idea of maintaining the same environment. The first solution that came to mind is docker. We packed our R environment in a docker container and pushed it to a local docker repository. All scripts ran into the container and the newest container version is ensured by a “docker pull” before starting the script. Updating the environment on all machines became as easy as pushing a new container version to the repo.
The second problem to tackle is moving the data to the cluster and from one machine to the other. To solve this problem, we decided to go with the easy solution. Network File System (NFS). Setting up NFS is straight forward, the main disadvantage of NFS is centralization. If the NFS server fails, the whole setup will fail. In a cluster of 5 machines, we can afford to restart it manually in case of failure.
We started an NFS server on one of the cluster machines, and mounted the NFS partition to the same path on all the other machines. The mounted partition is then mounted to all the script containers started. By this, all the scripts have access to all the data they need. Also, their output is accessible from any other machine.
The Distributed OS
The next problem to tackle was the scheduling problem. Searching for a machine with enough resources is trivial and time wasting. Also if there isn’t any free machine while you are searching, you won’t be able to queue this job. That’s when Apache Mesos came into play. “Program against your datacenter like it’s a single pool of resources” .. that’s exactly what we were looking for. Setting up and configuring Mesos on the cluster wasn’t exactly a walk in the park but eventually we managed to get it done.
At this point, one needs to go through the following sequence to schedule a job: Move the data and script to NFS, SSH into “any” machine in the cluster, schedule a command to start a container, specify the resources needed by this script, and wait for the script to finish. In the background, Mesos schedules the container to one of the slave machines provided it has the resources needed by the script at hand. If there isn’t any free machine, the job gets queued until some resources are freed.
The R Console
Although at this point scheduling a job has become much easier, we felt we could do better. We aimed for an easier, more usable way to run scripts. Eventually, we started developing a simple Go server to add a management UI for the process. We named it R-Cluster. Through R-Cluster, you can :
- Upload data/script files.
- Write a (syntactically highlighted) R script and run it directly.
- Execute arbitrary commands in the container (e.g. ls, cat, wget). The Standard output of those commands is streamed from the server.
- Access the files generated by the script.
To get it working we decided to have the following directory layout in the NFS partition :
| |-- code --> /code
| |-- input --> /input
| |-- output --> /output
| | -- ...
| | -- ...
- Using the same TaskId for different jobs insures the same directory, hence maintaining the state.
- All directories in the “working_dir” of each task are mounted to a corresponding path in the container.
- Whatever is in the “/output” directory is intended to be publicly accessible.
We deployed this on our cluster, forwarded a certain port from the router to the server and we were done. Anyone can now schedule scripts on the cluster and tail its output.
As a bonus, the Go server offers a kind of HTTP API which allows any script to programmatically run R scripts on the cluster.
As we grew, the data we handle also grew. Soon enough, our in-office cluster became quite incapable of handling some of the data we process. So where to go?
To the Cloud!
At that point, our aim became to deploy an on-demand n-machine cluster in the cloud with the least effort. Ideally, a single click. After some research, we found a one-click deployment of a mesos cluster on Azure’s official repo of quickstart templates. Azure templates are basically JSON files specifying the resources needed for a certain deployment. Finding an Azure template that starts n machines with Mesos and docker installed was just perfect. We are also on Microsoft’s BizSpark program for startups and that gives quite a bit of Azure credit.
We forked the template and started adding the missing parts. We started by modifying the template to pull and start the R-Cluster and open the required ports in the security group and the load balancer. We removed some of the features that we don’t use such as swarm/marathon. Last but not least, we had to find a way to share storage. Fortunately, Azure provides a service called Azure Files. Azure files can be mounted easily to multiple VMs at once which is awesome! We got a kind of NFS alternative out of the box. We pushed our R environment docker image to docker hub to make it accessible to the slaves and we were done!
You can try it yourself by clicking on the “Deploy to Azure” button in the template repo! Knock yourself out! ;)
Note: At the time of this writing, the R-Cluster and the Azure template are still under heavy development. They are not production ready yet. That said, they can still be very helpful for quick standard tasks. Also they are both open source, so your contributions/feature requests are more than welcome.
We are a team of data scientists and software engineers working to help enterprises makes the most out of their data. Our projects range from data analysis to extract insights, to predictive analytics to support decision making, to scalable production ready data products. Our focus areas are (1) personalization and (2) operational efficiency.