CivisML: Scikit-Learn at Scale
by Stephen Hoover
Late last year, my colleagues on the Social Science team were working on a new survey weighting scheme that would greatly improve the precision of our public opinion data. To make it work, they needed to fit dozens of models for each completed survey. Each survey asks multiple questions, and each question would need to be modeled individually, using an ensemble of models fit with different features. On top of that, they would need to fit multiple sets of models to optimize hyperparameters such as the depth of the decision trees or the strength of the ridge penalty in the logistic models. That wasn’t feasible on their laptops, and they didn’t have the time to manage their own AWS instances every time a new survey came in (and we do a lot of surveys here).
We realized that the problem of scaling a modeling operation exists for countless data scientists beyond our Social Science team, and that we could use the Civis Platform to solve it. So we got to work, and today we’re introducing CivisML, a machine learning service in the platform that gives data scientists the infrastructure to run the open source algorithms they know at the scale they need.
Data scientists here at Civis Analytics use machine learning to predict health insurance status, television viewing habits, or even who should win the Oscar. We don’t just fit one model at a time, though. To end up with the best model for any given outcome, we explore different algorithms and hyperparameters. And we’re often modeling more than one outcome in parallel, such as multiple survey questions. That’s a lot of models to build. On top of that, when we find a model that works well, we need to use it to make predictions for huge numbers of people — often every adult in the United States.
As our social scientists discovered, all of this is tricky to manage yourself. You could train your models on a laptop, but only one or two at a time. And the data I often want to use for prediction doesn’t even fit on my laptop! This is where the Civis Platform comes in. The platform has always been able to parallelize your models and predict on large datasets, but with today’s release of CivisML, our modeling is much more powerful and flexible.
CivisML is a machine learning service accessible in Civis Platform. It’s built around scikit-learn, the popular open-source machine learning library for Python. Scikit-learn has a well-defined API, which lets CivisML handle any model that you can define with the library. With CivisML, you can hand a scikit-learn model to Civis Platform and the platform will fit the model, store the results, and plot model diagnostics such as the ROC curve. (You can read how to use it in our documentation.)
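To make "any model that you can define with the library" concrete, here is a minimal sketch of the kind of scikit-learn model you might hand off. It runs locally on synthetic data and does not call CivisML itself. The dataset, the step names `scale` and `logit`, and the grid of `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of survey responses (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An ordinary scikit-learn estimator: scaling followed by logistic regression.
# LogisticRegression applies an L2 (ridge) penalty by default.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("logit", LogisticRegression(max_iter=1000)),
])

# Search over the strength of the ridge penalty (smaller C = stronger penalty).
param_grid = {"logit__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Area under the ROC curve on held-out rows, a common model diagnostic.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

Even this small grid means twelve fits plus a final refit for a single outcome, which is why the number of models grows so quickly once you add more outcomes and algorithms.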
Behind the scenes, the Civis Platform leverages AWS to provide all the data storage and computational capacity you need. If you need to fit a dozen models, Civis will add enough EC2 instances to handle the load (using Docker containers for fast setup), move data from your tables to where the computation is happening, fit your model (using scikit-learn), then store the results and shut down any excess instances afterwards. If you want to take those trained models and make predictions for every row in a billion-row table, Civis will distribute the data to many EC2 instances in easily handled chunks, then collect the results afterwards. All of this uses the same modeling code you’d run on your own laptop, but now multiplied over as many instances as it takes to get the job done.
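The chunk-and-collect idea is easy to picture in plain scikit-learn. The sketch below is not how Civis Platform implements it: it scores chunks one after another in a single process rather than across EC2 instances. The helper `predict_in_chunks` and the synthetic data are hypothetical names made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_in_chunks(model, X, chunk_size):
    """Score X piece by piece so no single step needs the whole table at once."""
    scores = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Each chunk is independent, so in a distributed setting these calls
        # could run on different machines.
        scores.append(model.predict_proba(chunk)[:, 1])
    # Collect the per-chunk results back into a single array, in row order.
    return np.concatenate(scores)


# Synthetic data: train on a small sample, then score the full "table".
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
model.fit(X[:500], y[:500])

chunked = predict_in_chunks(model, X, chunk_size=300)
whole = model.predict_proba(X)[:, 1]
```

Because each row is scored independently of the others, the chunked scores match what you would get from scoring the whole table in one call. That is what makes this kind of prediction straightforward to spread over many machines.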
It’s also easy to use the Civis Platform to chain your models together with other jobs and schedule everything to run on a regular basis. For example, you might regularly import data from Salesforce, run a SQL script to join to additional sources of data, fit a model on the resulting table, and assign scores to a database of prospects — then schedule it all to run automatically every morning.
CivisML in Action
Our own data scientists recently used CivisML to help us find prospective customers for a fitness industry client’s new location. Because the client didn’t have a database of existing customers at this location, we relied on a national consumer survey as the training data to model good prospects. The survey data had answers from about 30,000 people to questions such as “How often do you exercise?” and “Do you play team sports?” We used CivisML to fit models on 300 different survey-based target variables, trying several different algorithms for each model. We used the model diagnostics output by CivisML to select the 80 models that were most predictive and then used CivisML to predict responses to those 80 survey questions for the 3 million people living closest to the client’s new location. Finally, we ensembled the predictions to create a list of the people who would likely be most interested in the client’s services. Because CivisML runs these models in parallel, the entire process could happen in one workday!
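The fit, select, and ensemble workflow can be sketched at toy scale with scikit-learn alone. Everything in this example is an assumption for illustration: the data is synthetic, ROC AUC stands in for the model diagnostics, and a simple average stands in for the ensembling step, since the details of the real project aren't covered here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_people, n_features, n_targets, n_keep = 600, 5, 6, 3

# Synthetic "survey": features plus several binary answers whose dependence on
# the first feature ranges from none (pure noise) to strong.
X = rng.normal(size=(n_people, n_features))
signal = np.linspace(0.0, 2.0, n_targets)
targets = np.column_stack([
    (s * X[:, 0] + rng.normal(size=n_people) > 0).astype(int) for s in signal
])

X_train, X_test, t_train, t_test = train_test_split(
    X, targets, test_size=0.3, random_state=0
)

# Fit one model per target and record a holdout diagnostic (ROC AUC) for each.
models, aucs = [], []
for j in range(n_targets):
    model = LogisticRegression().fit(X_train, t_train[:, j])
    models.append(model)
    aucs.append(roc_auc_score(t_test[:, j], model.predict_proba(X_test)[:, 1]))

# Keep the most predictive models, then average their scores for new people.
keep = np.argsort(aucs)[-n_keep:]
prospects = rng.normal(size=(1000, n_features))
ensemble_score = np.mean(
    [models[j].predict_proba(prospects)[:, 1] for j in keep], axis=0
)
```

Sorting `prospects` by `ensemble_score` would give a ranked list. In the real project, the same pattern ran over 300 targets and 3 million people.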
You can use CivisML either through the easy-to-use modeling module in the Civis Platform or with our Python API client. (Don’t worry, an R version of our API client is in the works and will be available later this year.) If you want to try it out, get in touch today and take a look at this sample notebook to get started!