Experiences in Using R and Python in Production
Starting my new Medium blog by resharing our great experiences from May 2016.
Python and R are two of the best open source tools for data science. They are easy to use for scripting and custom analysis, but running them automatically as part of online software requires more consideration. At Smartly.io, we’ve been using both of them extensively. In this blog post, I’ll share some of our experiences of integrating them into production.
As background, one of our first use cases that needed statistical computing was Predictive Budget Allocation. It automatically reallocates budget between multiple ad sets to optimize the total cost-per-action level of a marketing campaign. It runs automatically for a huge number of campaigns on a daily basis.
To separate the statistical analysis from the rest of our online platform, we created a new microservice. We wanted to use the statistical software for what it’s best at: doing the math. Everything else is handled elsewhere. Basically, the service gets all the input data it needs and returns an output. In our case, the input contains the necessary historical data (impressions, conversions, spends, budgets, etc.) for all the ad sets, and the service gives the new budgets as output. Below is a simplified example of input and output in JSON format.
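For illustration, input and output along these lines might look as follows (all field names and values here are hypothetical, reconstructed from the description above):

```json
{
  "ad_sets": [
    {"id": "123", "impressions": 54000, "conversions": 320, "spend": 410.5, "budget": 500.0},
    {"id": "456", "impressions": 23000, "conversions": 90, "spend": 380.0, "budget": 500.0}
  ]
}
```

And the corresponding output with the new budgets:

```json
{
  "budgets": {"123": 620.0, "456": 380.0}
}
```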
Originally, we built the microservice in R. A few R packages exist for wrapping a custom R library in an HTTP API, notably DeployR and OpenCPU. Briefly, these let you call R functions via HTTP. A year ago we evaluated the alternatives and chose OpenCPU for our case. We didn’t try the other options ourselves, so our hands-on experience is limited to OpenCPU.
Installing OpenCPU on Linux is very straightforward. To use it, you create a custom R package; RStudio provides good documentation to get started. Once your package is installed in R, it is visible to the OpenCPU server and you can start calling your R functions via HTTP.
In our case, for example, we have an R function called BudgetAllocation(d) that takes the input data in a data frame d. The function lives in our internal package smartlybuster. With OpenCPU, we can call BudgetAllocation via http://localhost:8080/ocpu/library/smartlybuster/R/BudgetAllocation/json with the above JSON as input and get the function’s output as the response.
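On the caller’s side, such a request is just an HTTP POST. A minimal stdlib-only Python sketch (the function name and URL come from above; error handling and timeouts are omitted):

```python
import json
import urllib.request

# OpenCPU maps package/function to a URL path; the /json suffix asks
# for the function's return value serialized as JSON.
OCPU_URL = "http://localhost:8080/ocpu/library/smartlybuster/R/BudgetAllocation/json"

def allocate_budgets(input_data, url=OCPU_URL):
    """POST the input payload; OpenCPU runs BudgetAllocation(d) on it
    and returns the function's result as the JSON response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(input_data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```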
We used R with OpenCPU for about half a year without any issues. As our R code base grew, we eventually realized that Python was better suited to our purposes, being the stronger general-purpose programming language. We did mostly one-to-one conversions from our R code to Python and got it done in about a week. Having proper unit tests for our R library made the conversion much easier.
With Python, we ended up using Flask, as it provides a similar HTTP API for Python as OpenCPU does for R. We run Flask under Gunicorn. Since BudgetAllocation was an isolated microservice, we could easily switch from the old R service to the new Python service. During the transition period, we ran both at the same time to verify that the results matched.
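A minimal Flask sketch of such a service (the route name and payload fields are illustrative, and the even-split allocation is only a placeholder for the actual statistical model):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/BudgetAllocation", methods=["POST"])
def budget_allocation():
    data = request.get_json()
    ad_sets = data["ad_sets"]
    # Placeholder logic: split the total budget evenly across ad sets.
    # The real service would run the ported allocation model here.
    total = sum(a["budget"] for a in ad_sets)
    share = total / len(ad_sets)
    return jsonify({"budgets": {a["id"]: share for a in ad_sets}})
```

In production, such an app would be served with something like `gunicorn app:app` rather than Flask’s built-in development server.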
The transition from R to Python was fairly straightforward. The pandas library gave us dataframes similar to R’s. One of the biggest surprises was how pandas handles subsetting and indexing. In pandas, taking a subset of rows returns a dataframe where the rows keep their original index labels. This can be fixed with a reset_index() call, allowing you to access the data again with d.loc[i, 'budget'] style indexing.
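A small pandas illustration of this indexing behaviour (the column name is made up for the example):

```python
import pandas as pd

d = pd.DataFrame({"budget": [100.0, 200.0, 300.0]})

# Filtering keeps the original index labels: the surviving rows are
# still labelled 1 and 2, so subset.loc[0, 'budget'] would raise KeyError.
subset = d[d["budget"] > 100.0]
print(list(subset.index))        # [1, 2]

# reset_index(drop=True) renumbers the rows from zero, so
# d.loc[i, 'budget'] style access works again.
subset = subset.reset_index(drop=True)
print(subset.loc[0, "budget"])   # 200.0
```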
Python, pandas and Flask have worked well for us after the migration, although some parts would still benefit from being rewritten in more idiomatic Python. All in all, we feel that both R and Python are well-suited tools that are easy to run in development. Python is the stronger general-purpose programming language, whereas R has somewhat better statistical libraries.
In the future, we may well end up using both at the same time: Python as the main tool for handling most cases, with R covering some specific statistical parts. For that integration, we can keep using OpenCPU to call R from Python, or switch to a dedicated integration library like rpy2.
Originally published at www.smartly.io.