Evolving From Descriptive to Prescriptive Analytics: Part 4, Eating the Ugly Frogs

Shaikh Quader
Inside Machine learning
5 min read · Jun 6, 2018

By Shaikh Quader and Chad Marston

In our previous blog posts, we discussed gaining leadership support, acquiring data science skills, and having the tools to manage your data. In this post, we’ll discuss keeping your data scientists productive and happy.

What activities keep data scientists happy? A recent CrowdFlower report on data scientists says that mining data for patterns, building models, and refining algorithms are their three favorite tasks. Most other tasks are not nearly as interesting. We call these other tasks the ugly frogs of data science. Only 19% of data scientists get to spend most of their time on their favorite tasks; the rest spend most of their time on tasks they loathe. Wouldn’t it be nice if data scientists could be liberated from the work they like least and freed to do what they enjoy most?

Let’s look at three tedious tasks, three of the ugliest frogs, that data scientists could stop doing themselves if they used a platform that manages these tasks for them. In particular, let’s look at IBM’s machine learning platform, the Data Science Experience (DSX).

Importing Data from Relational Databases

In one of our recent machine learning projects, the task of acquiring a large volume of data from a Db2 database significantly slowed our progress. We wrote a SQL script to generate a dump of the data from the database, exported the dump to another system, and finally loaded it into a Jupyter notebook. Acquiring the data involved multiple manual steps, and it was boring.

In fact, this export process took us nearly half a day, and we had to repeat it for every data refresh. Our search for an alternative soon led us to DSX, which let us import large tables from remote relational databases into pandas DataFrames with just a few lines of auto-generated code, or with a sample code stub from the DSX documentation. On top of that, it supported the most popular database vendors: not just Db2, as you’d expect with an IBM offering, but also Oracle, MS SQL Server, and PostgreSQL, among others.
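To give a feel for how little code this takes, here is a minimal sketch using pandas with SQLAlchemy; the connection string, credentials, and table name below are hypothetical placeholders, and the code DSX auto-generates for you will look somewhat different.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Db2 connection string (uses the ibm_db_sa dialect);
# DSX generates the equivalent with real credentials filled in.
engine = create_engine("db2+ibm_db://user:password@dbhost:50000/SAMPLE")

# Pull the remote table straight into a pandas DataFrame.
df = pd.read_sql("SELECT * FROM SALES.TRANSACTIONS", con=engine)
print(df.shape)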

Installing Machine Learning Libraries and Keeping Them Up to Date

A few months earlier, we had been trying to install XGBoost on a Mac. We’d tried everything documented on XGBoost’s official site but failed to get it installed. Finally, a co-worker gave us the working steps, which she’d found by digging through several Q&A threads. Another co-worker tried to install Apache Spark, and it took her nearly two days to figure out the right steps and get everything working.

But the pain didn’t end there. Once we’d installed these third-party libraries, we faced challenges with keeping the libraries up-to-date. Multiply that work of installing and maintaining the libraries by the number of people on the team and it’s obvious what a drag this was on our collective progress.

By contrast, DSX comes with a suite of popular ML libraries and tools pre-installed, including scikit-learn, XGBoost, Jupyter, Apache Spark, TensorFlow, Keras, and Theano. Using them can be as easy as firing up a notebook and writing code. Even better, our DSX admin can update an installed library with a single command, and the update is immediately available to the whole team. As an example, to update XGBoost, the admin simply runs a one-line command inside a notebook:

!pip install --upgrade xgboost

Or individual users can update their own copies of the library with a similar command:

!pip install --upgrade --user xgboost

Deploying and Monitoring the ML Model

In the CrowdFlower report we mentioned earlier, 31% of data scientists identified model deployment as their biggest bottleneck. Once you’ve built a model and are satisfied with its performance, you might still be unsure how to deploy it to production. Books on machine learning and data science are filled with techniques for building and evaluating models, but very few discuss the full deployment path.

That path has two main parts: deploying the ML models smoothly, and then monitoring how the models perform post-deployment. We learned about both parts the hard way. We deployed our first two ML models using non-standard methods. After building and testing a model within a Jupyter notebook, we exported it as a Python script, copied the script to a Linux server, and set up a cron job to run it at a regular interval. Every time we improved the model, we had to repeat this same chore. It wasn’t terribly difficult, but it was repetitive and dull.
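For the curious, here is a minimal sketch of what such a cron-driven scoring script looked like; the file paths, model file, and column name are hypothetical, not our actual project code, and the input CSV is assumed to contain exactly the model’s feature columns.

# score_model.py: a standalone scoring script of the kind we copied to
# a Linux server. A cron entry ran it nightly, for example:
#   0 2 * * * python /opt/scripts/score_model.py
import pickle

import pandas as pd

# Load the model we trained and pickled in the notebook.
with open("/opt/models/model.pkl", "rb") as f:
    model = pickle.load(f)

# Score the latest batch of records and write out the predictions.
batch = pd.read_csv("/data/incoming/latest_batch.csv")
batch["prediction"] = model.predict(batch)
batch.to_csv("/data/scored/latest_batch_scored.csv", index=False)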

At the same time, we were learning about the second part of deployment. The moment we put a model into production, it began to degrade as the business setting inevitably started to change. We could see that instruments for monitoring the health of the models were going to be essential.
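As an illustration of the kind of health check we had in mind, here is a minimal sketch that compares a model’s accuracy on fresh labeled data against the accuracy recorded at training time; the threshold, file paths, and column names are hypothetical.

import pickle

import pandas as pd
from sklearn.metrics import accuracy_score

TRAINING_ACCURACY = 0.87  # accuracy recorded when the model was trained
ALERT_THRESHOLD = 0.05    # flag the model if accuracy drops more than this

with open("/opt/models/model.pkl", "rb") as f:
    model = pickle.load(f)

# Fresh records whose true outcomes are now known.
feedback = pd.read_csv("/data/feedback/labeled_outcomes.csv")
predictions = model.predict(feedback.drop(columns=["actual"]))
current_accuracy = accuracy_score(feedback["actual"], predictions)

if TRAINING_ACCURACY - current_accuracy > ALERT_THRESHOLD:
    print("Model health alert: accuracy fell from "
          f"{TRAINING_ACCURACY:.2f} to {current_accuracy:.2f}")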

Again, we found relief in the latest release of DSX, which comes with a model management capability. In DSX, we can build our models and then, when we’re ready to deploy, set up a one-time job and deploy a model within DSX with a few clicks. There’s no need to export models to another system for manual deployment. On top of that, we can see a model’s performance between trainings in a built-in monitoring dashboard, which includes, for example, a performance monitoring graph for a classification model.

Banished Frogs

Those are just three of the many capabilities in DSX that have already boosted the productivity of our own data science team. It’s truly liberating to spend our time doing the things that are actually enjoyable: exploring data, and building and tuning models. If you’re curious to know more about DSX, details are here.

As we continue our blog series, we’ll discuss how to select an initial project and how to evaluate candidate projects against their conditions for success.

___________________________________________________________

Follow us on Twitter: @ShaikhQuader and @ChadEMarston
