Multiple Languages, One Team: Bringing R and Python Together

by Michael Heilman

Civis Analytics
Dec 15, 2017 · 2 min read

Data scientists work in unique contexts and use different programming languages, even within the same organization, making collaboration a constant challenge. Consider two data scientists, whom we’ll call Alice and Bob.

Alice and Bob need to work together on a project with a short timeline: slides are due to the client in the next couple of days. Alice knows the data for the project well (it’s messy log data in JSON that doesn’t fit naturally into a relational database), and she’s a Python user. Bob, on the other hand, mostly uses R, knows the business problem better, knows what the client expects, and works in a different department at a different office. Bob needs to run several analyses on the data with different parameters, but the data, six months’ worth, is too big to fit in memory on a laptop for some of those analyses.

How can they meet the deadline without major headaches?

I was in the position of Alice a few weeks ago.

We first tried to do everything on our laptops, but it quickly became clear that it’d be difficult to collaborate because of the size of the data, our programming languages of choice, the dependencies we had set up on our laptops, etc.

Thankfully, we had another option.

We used Civis Platform to set up a container script that took parameters (e.g., a date range to filter by), processed the raw data accordingly, and produced results that my colleague could use to run the different analyses she wanted to try. Basically, I deployed a very simple microservice.
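To make the idea concrete, here is a minimal sketch of what such a parameterized script could look like. It is not the code from the project: the parameter names `START_DATE` and `END_DATE`, the `timestamp` and `event` fields, and the per-day aggregation are all assumptions chosen for illustration, as is the idea that the parameters arrive as environment-style key/value pairs.

```python
import csv
import json
from collections import Counter
from datetime import date


def parse_params(env):
    """Read the date range from a mapping of parameters (hypothetical names)."""
    start = date.fromisoformat(env["START_DATE"])
    end = date.fromisoformat(env["END_DATE"])
    if start > end:
        raise ValueError("START_DATE must not be after END_DATE")
    return start, end


def daily_event_counts(lines, start, end):
    """Count records per (day, event) inside the inclusive date range."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
            # Assumed field: an ISO 8601 timestamp such as 2017-11-01T10:00:00Z.
            day = date.fromisoformat(record["timestamp"][:10])
        except (ValueError, KeyError, TypeError):
            continue  # the logs are messy, so skip records that cannot be parsed
        if start <= day <= end:
            event = str(record.get("event", "unknown"))
            counts[(day.isoformat(), event)] += 1
    return counts


def run(env, lines, out):
    """Filter by the configured date range and write a small CSV of aggregates."""
    start, end = parse_params(env)
    counts = daily_event_counts(lines, start, end)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["day", "event", "count"])
    for (day, event), n in sorted(counts.items()):
        writer.writerow([day, event, n])
```

In a container, the entry point might simply call `run(os.environ, sys.stdin, sys.stdout)`. The point of the design is that the only things a collaborator has to know are the parameter names and the shape of the output file.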

Civis container scripts run in the AWS cloud in user-specified Docker containers, with user-specified commands and code from GitHub, and with user-specified compute and memory resources (see the Civis API Client documentation page for more details). This meant that I didn’t have to worry about the Python vs. R issue, because my code ran in its own container, or about the size of the data, because the servers used by container scripts have lots of memory available. All I had to do was write some Python code and expose a simple set of parameters my colleague could configure. She was then able to use the Civis R client to run my Python code via the container script and grab the aggregated results. After that, she made the slides and shipped them. (Also, for a related project, we’re automating the generation and delivery of the slides as new data arrives.)
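The reason the caller’s language doesn’t matter is that the contract is just parameters in, results out. The sketch below is a local stand-in for that contract, not the Civis API: a child process plays the role of the container, and `run_job`, `SCRIPT`, and the parameter names are hypothetical. It also assumes parameters are handed to the script as environment variables.

```python
import os
import subprocess
import sys

# Stand-in for the code that would live inside the container (hypothetical).
SCRIPT = r'''
import os
print("rows between", os.environ["START_DATE"], "and", os.environ["END_DATE"])
'''


def run_job(params):
    """Run the script in a child process, passing parameters as environment variables."""
    env = dict(os.environ)
    env.update(params)
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout.strip()
```

A caller written in R, or anything else that can set parameters and read the output, could drive the same script without knowing it is Python underneath.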

With Civis Platform, we were able to deploy a reproducible, scalable pipeline that allowed us to work in our language of choice. Alice and Bob would approve.

Civis Analytics helps the country's largest companies and nonprofits identify, attract, and engage loyal customers and employees with a blend of proprietary data, software solutions, and an interdisciplinary team of data and survey science experts.
