Data Scientist’s Toolbox for Data Infrastructure I

by Ke Shen and Paul Grech

Recently, the terms “Big Data” and “Data Science” have become important buzzwords; massive amounts of complex data are being produced by business, scientific applications, government agencies and social applications.

“Big Data” and “Data Science” have captured the business zeitgeist because of extravagant visualizations and the amazing predictive power from today’s newest algorithms. We’ve nearly approached mythical proportions of Data Science as a quixotic oracle. In reality, Data Science is more practical and less mystical. We as Data Scientists spend half of our time solving engineering infrastructure problems, designing data architecture solutions and preparing the data so that it can be used effectively and efficiently. A good data scientist can create statistical models and predictive algorithms, a great data scientist can handle infrastructure tasks, data architecture challenges and still build impressive and accurate algorithms for business needs.

Throughout this blog series, “Data Scientist’s Toolbox for Data Infrastructure”, we will introduce and discuss three important subjects that we feel are essential for the full stack data scientist:

  1. Docker and OpenCPU
  2. ETL and Rcpp
  3. Shiny

In the first part of this blog series we will discuss our motivations behind implementing Docker and OpenCPU. Then follow our discussion with applicable examples of how Docker containers reduce complexity of environmental management and OpenCPU allows for consistent deployment of production models and algorithms.


Environment configuration can be a frustrating task. Dealing with inconsistent package versions, diving through obscure error messages, and waiting hours for packages to compile can wear anyone’s patience thin. The following is a true (and recent) story. We used a topic modeling package in R (along with Python, the go-to programming language for Data Scientists) to develop our recommender system. Included with our recommender system were several dependencies, one of them being “Matrix” version 1.2–4. Somehow, we upgraded our “Matrix” version to 1.2–5, which (unfortunately for us) was not compatible with our development package containing the recommender system. The terrible part of this situation was the error messages did not indicate why the error occurred (apparently due to a version upgrade), which resulted in several hours of debugging in order to remedy the situation.

Another similar example is when our R environment was originally installed on CentOS 6.5. By using ‘yum install’ we only obtained R with version 3.1.2, which was released in October, 2014, and not compatible with many of dependencies from our development and production environments. Therefore, we decided to build R from source, which took us two days to complete. This was due to a bug from the source which we had to dig into the source code to find.

This begs the question, how do we avoid these painful and costly yet avoidable problems?

SIMPLE! With docker containers, we can easily handle many of our toughest problems simultaneously. We use Docker for a number of reasons, with a few of the most relevant mentioned below:

  1. Simplifying Configuration: Docker provides the same capability of a virtual machine without the unneeded overhead. It lets you put your environment and configuration into code and deploy it, similar to a recipe. The same Docker configuration can also be used in a variety of environments. This decouples infrastructure requirements from the application environment while sharing system resources.
  2. Code Pipeline Management: Docker provides a consistent environment for an application from QA to PROD therefore easing the code development and deployment pipeline.
  3. App Isolation: Docker can help run multiple applications on the same machine. Let’s say, for example, we have two REST API servers with a slightly different version of OpenCPU. Running these API servers under different containers provides a way to escape what we refer to as, “dependency hell”.
  4. Open Source Docker Hub: Docker Hub is easy to distribute Docker images, it contains over 15,000 ready-to-use images we can download and use to build containers. For example, if we want to use MongoDB, we can easily pull it from Docker Hub and run the image. Whenever we need to create a new docker container, we can easily pull and run the image from Docker Hub
`docker pull <docker_image>`
`docker run -t -d --name <container_name> -p 80:80 -p 8004:8004 <docker_image>`

We are now at a point where we can safely develop multiple environments using common system resources without worrying about any of the mentioned horror stories simply by:

`docker ps`
`docker exec -it <container_name> bash`

Our main structure for personalized results is shown in the image below. We have three docker containers deployed in a single Amazon EC2 machine running independently with different environments yet sharing system resources. Raw data is extracted from SQL server and goes through an ETL process to feed into the recommender system. Personalized results are called from RESTful API through OpenCPU and return in JSON format.


OpenCPU is a system that provides a reliable and interoperable HTTP API for data analysis based on R. The opencpu.js library builds on jQuery to call R functions through AJAX, straight from the browser. This makes it easy to embed R based computation or graphics in apps such that you can deploy an ETL, computation or model and have everyone using the same environment and code.

For example, we want to generate 10 samples from a random normal distribution with mean equals to 6 and standard deviation equals to 1. First, we need to call a function called ”rnorm” in R base library. Performing a HTTP POST on a function results in a function call where the HTTP request arguments are mapped to the function call.

`curl -d “n=10&mean=5”`

The output can be retrieved using HTTP GET, when calling an R function, the output object is always called .val. In this case, we could GET:


And here are the 10 samples:

Now imagine this type of sharing on a large scale. Where an analytics or data team can develop and internally deploy their products to the company. Consistent reproducible results are the key to making the best business decisions.

Combining Docker with OpenCPU are great first steps in streamlining the deployment process and moving towards self serviceable products in a company. However, a Full Stack Data Scientist must also be able to handle data warehousing and understand the tricks of increasing performance of their code systems scale. In part 2, we will discuss using R as an ETL tool which may seem like a crazy idea, but in reality R’s functional characteristics allow for elegant data transformation. To handle performance bottlenecks that may transpire, we will discuss the benefits of RCPP as a way of increasing performance and memory efficiency by rewriting key functions in C++.

Originally posted at Ladders Engineering