Moving Fast, Productionizing Data Science, and Breaking Things

Shanif Dhanani
5 min read · Oct 18, 2017

A few weeks ago, I wrote a post about the data science workflow we use at Apteo. As you can tell, the majority of our efforts are spent on things that help us figure out our data, build our models, and get an ad-hoc data collection pipeline going.

And that’s where our efforts should be for a new data science task.

But this post isn’t about any of that. It’s about the last step.

That last step, which I admittedly glossed over in the original post, is one of the most important when it comes to actually running a sustainable business, because it’s the most “user-facing” one (even if your users are the members of your internal product or data science team).

In order to actually put any of your hard-won data science feats into practice, you need to have a system that supports their use.

Recently, Carlos Perez wrote a fantastic article about how Google and Uber handle their productionized machine learning systems. These are amazing systems, but they’re much harder for a small startup to build and put into place.

We’re in the process of productionizing our predictive infrastructure now as quickly as we can, and as expected, we’ve learned a few things. Specifically, we’ve moved fast, implemented a lot of new things, broken a lot of things along the way, and are now working on getting rid of that last bit of tech debt before we jump into our next big project.

Timeline

Before I jump into some of the problems we’re working through now, it helps to understand where we’ve come from.

To date, we’ve been spending most of our efforts building and researching models that will help us understand stock price movements based on fundamental analysis techniques used by equities analysts.

We’ve been able to build an initial prototype of that model, and we’re now in the process of using it to make forward predictions, and take actions based on those predictions.

In order to do so, we needed to implement a process of continuous and systemized data collection, alerts on errors in the data collection pipeline, continuous and systemized model training, and alerts on errors in the model training pipeline.
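
To make that concrete, here’s a rough sketch (not our actual code) of the kind of recurring loop involved, using the `schedule` package; `collect_data` and `train_models` are hypothetical stand-ins for our collection and training apps:

```python
import time
import schedule  # pip install schedule

def collect_data():
    # Hypothetical stand-in for our data collection app.
    print("Collecting data from providers...")

def train_models():
    # Hypothetical stand-in for our model training app.
    print("Retraining models on the latest data...")

# Pull fresh data every night, then retrain once the data has landed.
schedule.every().day.at("01:00").do(collect_data)
schedule.every().day.at("03:00").do(train_models)

while True:
    schedule.run_pending()
    time.sleep(60)
```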

These processes sound straightforward, but there are complexities to them that aren’t immediately obvious.

Collecting Up-to-Date Data Ain’t Easy

In order to make predictions using our previously trained models, we need up-to-date data that reflects the current state of the world.

Getting this data on a one-off basis for training our models was relatively easy, even if we needed to collect a lot of data at once.

Getting this data for live predictions hasn’t been as straightforward.

Essentially, our data collection process requires us to gather data from a variety of sources on a recurring basis. Doing so required us to write code that could spawn instances on AWS, run scheduled apps that interface with several different providers to gather data, and then store that data in a usable format in our systems.
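
To give a flavor of the spawning step, here’s a minimal sketch using boto3; the AMI ID, instance type, and the collection script it runs via user data are placeholders rather than our actual configuration:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Commands the instance runs on boot: collect data, then shut itself down.
user_data = """#!/bin/bash
python3 /opt/collect.py   # placeholder path to a data collection script
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI with the collection code baked in
    InstanceType="m4.large",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    InstanceInitiatedShutdownBehavior="terminate",  # terminate on shutdown to save money
)
print("Launched", response["Instances"][0]["InstanceId"])
```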

Logging

It’s easy to know whether our apps ran properly overnight: all we have to do is wake up and see if we have new data.

But what happens when that data isn’t there?

Presumably, the instance that ran our app already terminated in order to save money. But maybe our app crashed, or maybe our data providers weren’t responsive.

We quickly learned that in order to debug data collection issues, we needed to implement a more robust and sophisticated method to collect and consolidate our logs.

In order to do that, we’ve had to add code that uploads our logs to a centralized repository for later perusal.
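
Something as simple as pushing each run’s log file to object storage does the job. Here’s a sketch of what that step can look like with boto3 and S3; the bucket name and paths are hypothetical:

```python
import datetime
import boto3

def upload_logs(log_path: str, job_name: str) -> None:
    """Push a job's log file to a central bucket, keyed by job name and date."""
    s3 = boto3.client("s3")
    today = datetime.date.today().isoformat()
    filename = log_path.split("/")[-1]
    key = f"logs/{job_name}/{today}/{filename}"
    s3.upload_file(log_path, "job-logs-bucket", key)  # hypothetical bucket name

# Called at the end of a nightly collection run, for example:
upload_logs("/var/log/collector/run.log", "data-collection")
```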

At some point, it would be great to get tools like Logstash and Elasticsearch up and running, but with a small team whose main focus is developing heavy-duty models for finance, that’s more of a nice-to-have.

Alerting

In addition to basic log management, we’ve also had to develop a method for alerting us when things go wrong.

Not sure if you’ve ever worked with alerts before, but they suck. They really suck.

Before Apteo, I had to deal with PagerDuty alerts for years, and I never got sleep when I was on call. Fortunately, we haven’t had to resort to anything that heavy-duty quite yet. But we have had to implement a system that emails us whenever something doesn’t go as expected, like when we don’t have new data or a job didn’t run.
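
The checks themselves don’t need to be fancy. Here’s a sketch of the general idea using smtplib; the addresses, SMTP host, and the `count_new_rows` function that queries our data store are all hypothetical:

```python
import smtplib
from email.message import EmailMessage

def send_alert(subject: str, body: str) -> None:
    """Email the team when a check fails."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"   # placeholder address
    msg["To"] = "team@example.com"       # placeholder address
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)

def check_overnight_data(count_new_rows) -> None:
    """count_new_rows is a hypothetical function that counts rows collected overnight."""
    if count_new_rows() == 0:
        send_alert(
            "No new data collected overnight",
            "The nightly collection job produced zero new rows; check the logs.",
        )
```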

Alerting systems themselves require maintenance and sometimes your alerter needs to have alerts as well, so this is a bit of a catch-22, but having some alerting is better than being completely in the dark.

Integrating With Providers

Usually, integrations with data providers aren’t too bad. You do it once and as long as nothing changes on their end, you’re good to go.

The problem is that things do change on their end, which means you need to periodically test your integrations and make sure that what you’re getting back from providers is actually what you expect.

And when it’s not what you expect, you need to fix it. Unfortunately this is just a fact of life. The best providers will publish documentation about what has changed. That’s when you can count yourself lucky.
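
One lightweight way to catch those changes early is a scheduled check that the shape of each provider’s response still matches what you built against. Here’s a sketch, with a hypothetical endpoint and field list:

```python
import requests

# Hypothetical set of fields we expect in each record from a provider.
EXPECTED_FIELDS = {"ticker", "revenue", "net_income", "report_date"}

def validate_provider_response(url: str) -> list:
    """Fetch a provider payload and fail loudly if its shape has drifted."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    records = resp.json()
    for record in records:
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            raise ValueError(f"Provider response is missing fields: {missing}")
    return records

# Run this on a schedule and wire the exception into the alerting described above.
```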

Running Live Apps vs. Training and Backtesting Apps

Collecting and monitoring data efficiently isn’t the only problem we’ve run into.

Sometimes, a scheduled job, be it a data collection job, a training job, or a prediction job, doesn’t behave as it should.

The problem may be that the data needed to make a prediction isn’t available; or that it’s available but looks strange, so the output differs from one run to the next; or that the model performs differently on live data than it did on historical data.

There are a lot of weird issues here, and debugging them takes far more time than when you’re working with historical data, because you need to understand the new data you’re working with, check your log files to see what errors cropped up, fix the issue, add a unit test to make sure it stays fixed (you are using unit tests, right?), and then re-run the app, which itself can take days.

So the feedback cycle in live apps is just longer, which means you need to manage the risk of your live apps eating up all your developer time.

As I mentioned above, we do that by developing and relying heavily on unit tests (plus integration and functional tests), and by using a continuous integration system to make sure nothing breaks when new code is added to our main development trunk.
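
To give a flavor of those tests, here’s a minimal pytest-style example; `clean_fundamentals` is a hypothetical stand-in for one of our data-preparation functions:

```python
import math

def clean_fundamentals(records):
    """Hypothetical cleaning step: drop rows with missing or non-finite revenue."""
    return [r for r in records
            if r.get("revenue") is not None and math.isfinite(r["revenue"])]

def test_clean_fundamentals_drops_bad_rows():
    raw = [
        {"ticker": "AAA", "revenue": 125.0},
        {"ticker": "BBB", "revenue": None},
        {"ticker": "CCC", "revenue": float("nan")},
    ]
    cleaned = clean_fundamentals(raw)
    assert [r["ticker"] for r in cleaned] == ["AAA"]
```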

Wrapping Up

You’ll frequently hear data scientists say that 80% of their job is data munging and cleaning.

Well, I’ll confidently make the claim that 80% of the job of the management team of a data project is deploying and maintaining production systems. OK, maybe not that much, maybe 75%, but it’s high.

We’ve been thinking about open sourcing some of the tools we use to manage all of our systems, because we’ve found that what’s out there hasn’t been a great fit for what we’ve needed. If anyone’s interested, please let us know.

Otherwise, I’d love to hear how your company handles these issues as a small startup.

Shanif Dhanani

Creating software for businesses that want to use their data with AI. Learn more at https://www.locusive.com.