The Apteo Data Science Workflow

Shanif Dhanani
5 min read · Sep 30, 2017

--

The Need for a Process

Creating a new product from scratch is a complicated and long endeavor. Creating a new data-product from scratch is at least as complicated and long, and brings its own set of very specific challenges.

So that we’re on the same page, I’m using the term data-product to refer to any digital product (or even service) that leverages advanced analytics built on top of a dataset that needs to be maintained indefinitely.

When you’re building up one of these beasts, it’s easy to lose track of where you are and where you’re going. You can easily get caught up in the minutiae of optimizing a sub-system that ultimately doesn’t matter, or spend tons of time on a data-engineering task that may turn out to be unnecessary.

At Apteo, we’ve done a lot of work to get our initial software up and running, and in the process we’ve learned a lot about what works and what doesn’t. One of the things that has helped us manage this complexity is a defined workflow that gives us some structure around where we should spend our time and effort.

That workflow gives us a framework for understanding where we are in the process, and it lets us course-correct if we find ourselves heading down a path that’s not ideal, urgent, or important.

Over time, we’ve developed a process that has worked for us. Of course, the process wasn’t immediately obvious or intuitive. Unlike pure software development, data-products require specialized resources and additional steps within the development lifecycle.

There are a lot of blockers related to the very nature of data science. Oftentimes you don’t know whether it will be worth it to go down a path of investigation. You have no idea if adding new features will result in better or worse performance. With any data-product, there’s always going to be some amount of research that needs to be done in order to guide the team’s efforts.

Sometimes you don’t know if you can actually accomplish what you hope to, either because you don’t have the data you need, or the data you have isn’t predictive of the objective you’re trying to optimize. You also need a good mix of analysis, ML engineering, data engineering, and model tweaking and experimentation. Things like exploratory data analysis (EDA), building a baseline model, and maintaining and updating a golden set are all crucial to success.
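
To give one concrete example, a golden set can be as simple as a fixed, versioned holdout split that every future model is evaluated against. Here’s a rough sketch in Python; the file names and the 10% split are placeholders, not our actual setup:

```python
# Carve out a fixed "golden set" once, with a pinned seed, so every future
# model is evaluated against the same held-out examples.
# File names and the 10% split size are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")
working_df, golden_df = train_test_split(df, test_size=0.1, random_state=42)

golden_df.to_csv("golden_set.csv", index=False)    # versioned, kept stable over time
working_df.to_csv("working_set.csv", index=False)  # used for training and experiments
```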

Finally, all of this may vary depending on whether you’re developing a brand new product or you’re trying to improve an existing one. Suffice it to say, there are a lot of considerations when you’re working through an ML/data task. That’s why a defined workflow helps.

Our Workflow

A fellow named Eren Golge created a handy data science workflow that he posted in this article. It’s a useful piece that discusses his understanding of the workflow proposed in one of Andrew Ng’s courses.

Our workflow incorporates a lot of the same ideas; however, it also includes the steps needed to create a productionized data-product. The image below shows a graphical representation of our workflow, followed by a very brief summary of the key points.

Workflow Discussion

I won’t go into each step, since a lot of this is self-explanatory. However, what I do want to do is provide a high-level overview of some of the key areas that we try to address, and provide some insight into why our workflow is structured as it is.

You can see that when we take on a new data science task, we have a lot of initial work to do to define where we are and where we’re going. Part of this is understanding the business case for what we’re doing, part of it is understanding how we can optimize what we want to optimize, and part of it is understanding the data that we think will get us to where we want to be.

This initial roadmap definition and analysis work is crucial, but you can see that it doesn’t necessarily fit into your typical software engineering workflow. It’s important to take a data-science-specific approach to these tasks, because if you approach them with the mindset of simply building and coding a typical digital product, you’ll miss out on key insights about your data, and those insights are what guide your modeling efforts.
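
To make that concrete, here’s a minimal sketch of the kind of first-pass exploration we mean, assuming the data fits in a pandas DataFrame; the file name is a hypothetical placeholder:

```python
# A first-pass look at a dataset before any modeling.
# "working_set.csv" is a hypothetical placeholder file.
import pandas as pd

df = pd.read_csv("working_set.csv")

print(df.shape)                                       # how much data do we actually have?
print(df.dtypes)                                      # are the column types what we expect?
print(df.isna().mean().sort_values(ascending=False))  # missing-value rate per column
print(df.describe(include="all"))                     # distributions and obvious outliers
print(df.corr(numeric_only=True))                     # rough linear relationships between numeric columns
```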

You may also notice that we start small and expand, in terms of tasks, complexity, and assumptions. We gather data, then we explore it. We start with a simple baseline model to see how well we can do with an unsophisticated approach, then we see how much better we can do with more complex models.
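
As a sketch of what "start with a baseline" can look like in practice (the dataset, the "label" column, and the model choices below are illustrative placeholders, not our production setup):

```python
# Compare a trivial baseline against a more complex model on the same data.
# Assumes numeric, non-missing features for simplicity; file and column names are placeholders.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("working_set.csv")
X, y = df.drop(columns=["label"]), df["label"]

baseline = DummyClassifier(strategy="most_frequent")  # unsophisticated starting point
model = GradientBoostingClassifier()                  # a more complex candidate

print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())
print("model:   ", cross_val_score(model, X, y, cv=5).mean())
```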

We begin feature engineering at the start, then continue to add features as the project progresses. We only start to tweak model structure and optimize hyperparameters once we’re convinced that we can address our business case effectively given our resources, and that we have promising results from our initial modeling efforts.
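
Once we’re at that point, hyperparameter tuning can start as simply as a small grid search. A hedged sketch, reusing X and y from the example above; the grid values are illustrative only:

```python
# Only after the baseline looks promising do we spend time on hyperparameter search.
# Grid values are illustrative; X and y come from the earlier sketch.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```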

This workflow allows us to de-risk: we address the tasks that could derail our entire effort at the beginning, we address all other key assumptions towards the middle, and we optimize, tweak, and make things repeatable towards the end.

This may sound like a linear flow, but it’s also important to note that we cycle through the process of adding features, training models, and testing performance continuously.

This allows us to continue our research efforts while we productionize our baseline, and then, when we find a model or feature that gives us better performance, we productionize that model or those features as well.

The whole thing is very iterative, and it mixes ideas from traditional agile development with more statistical, quantitative work.

So far this has worked well for us. No doubt as we grow and we start to develop new and more complicated models, we’ll need to change our process around.

That’s something I’m looking forward to doing.

--

Shanif Dhanani

Creating software for businesses that want to use their data with AI. Learn more at https://www.locusive.com.