How to build a data science pipeline
Balázs Kégl

Interesting post! This is how the process should be organized top-down.

It works only under the assumption that you know what you are optimizing (Y) and what is needed for the optimization (X). That is not a problem for traditional scenarios like CTR prediction, but for new areas and projects, identifying Y and X can be a problem even for business units. Or Y might be defined but not realistic from a data science point of view (this is why we predict CTR, not revenue, even though revenue would be a better metric for business units).

For these new, undefined projects, a bottom-up process is a better fit: build the model first, make sure it reflects your business need, then productize it using the top-down approach (engineers) and optimize the production model (data scientists again).

We are building an open source tool, dataversioncontrol.com, to keep these two approaches in sync. The tool helps you build a model incrementally by creating workflows, and the modeling workflow can then be easily injected into production pipelines.
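To make the idea concrete, here is a minimal sketch of a workflow that grows one step at a time and is then handed to a production pipeline as a single unit. It is not the tool's API: `Workflow`, `clean`, `add_ctr`, `production_pipeline`, and the record fields are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Step = Callable[[Record], Record]


class Workflow:
    """Hypothetical container: an ordered list of named steps."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Step]] = []

    def add(self, name: str, fn: Step) -> "Workflow":
        # Append a step; returning self allows chained calls.
        self.steps.append((name, fn))
        return self

    def run(self, record: Record) -> Record:
        # Apply every step in the order it was added.
        for _name, fn in self.steps:
            record = fn(record)
        return record


def clean(record: Record) -> Record:
    # Illustrative cleaning step: negative click counts become zero.
    return {**record, "clicks": max(record["clicks"], 0)}


def add_ctr(record: Record) -> Record:
    # Illustrative feature step: CTR = clicks / impressions.
    shown = record["impressions"]
    ctr = record["clicks"] / shown if shown else 0.0
    return {**record, "ctr": ctr}


# Bottom-up: the data scientist adds steps as the model takes shape.
modeling = Workflow().add("clean", clean)
modeling.add("add_ctr", add_ctr)


def production_pipeline(records: List[Record], workflow: Workflow) -> List[Record]:
    # Top-down: the production side treats the whole workflow as one stage.
    return [workflow.run(r) for r in records]


result = production_pipeline([{"clicks": 5, "impressions": 100}], modeling)
```

The point of the sketch is the hand-off: the production code only depends on `run`, so the modeling side can keep adding or changing steps without the pipeline code changing.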

Like what you read? Give Dmitry Petrov a round of applause.
