Huge collections of heterogeneous data have become the backbone of scientific, analytic and forecasting processes. By combining simulation techniques, computer vision and machine learning with data science techniques, it is possible to compute mathematical models to understand and predict phenomena. Such models can let scientists, decision makers and citizens transform their hindsight about phenomena into insight and foresight. Eventually, scientists can gain deep control of phenomena and determine how to reproduce them. To achieve this ambitious objective, data must go through complex and repetitive processing and analysis pipelines, namely data science pipelines.
The enactment of data science pipelines must balance the delivery of different types of services: (i) hardware (computing, storage and memory), (ii) communication (bandwidth and reliability) and scheduling, and (iii) greedy analytics and mining with high in-memory and computing-cycle requirements. Current data science environments (e.g. the Microsoft ML environment) have focused particularly on the efficient provision of the computing resources required for processing data through greedy analytics algorithms. Beyond the execution of such tasks using parallel models and their associated technology, data management remains an open and key issue. How should data be distributed and duplicated across CPU/GPU farms to ensure its availability for executing parallel processes? How should data be organized (loaded and indexed) in main memory to perform efficient data processing and analytics at scale?
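As a minimal sketch of the distribution-and-duplication question, the following illustrative Python fragment hash-partitions records across a farm of workers with a replication factor, so every partition stays available if one worker fails. The function names and the ring-style placement policy are assumptions for illustration, not part of the project.

```python
from typing import Any

def place(key: Any, n_workers: int, replication: int = 2) -> list[int]:
    """Return the worker ids that should hold the record identified by key."""
    primary = hash(key) % n_workers
    # Replicas go to the next workers on the ring, wrapping around.
    return [(primary + i) % n_workers for i in range(replication)]

def partition(records, n_workers: int, replication: int = 2):
    """Group records by the workers that must store a copy of them."""
    assignment = {w: [] for w in range(n_workers)}
    for key, value in records:
        for worker in place(key, n_workers, replication):
            assignment[worker].append((key, value))
    return assignment
```

With `replication=2`, each record lands on two distinct workers, trading extra storage for availability; real systems refine this with consistent hashing so that adding a worker moves only a fraction of the data.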
The emergence of new architectures such as the cloud has opened new challenges for executing the tasks that compose data science pipelines. It is no longer pertinent to reason with respect to a fixed set of computing, storage and memory resources; instead it is necessary to design algorithms and processes considering a virtually unlimited set of resources usable via a "pay-as-you-go" model. Rather than designing processes and algorithms with resource availability as the threshold, the cloud requires considering the economic cost of the processes versus resource use, and the exploitation of available resources.
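The shift from a resource-availability threshold to an economic one can be sketched as a simple selection problem: among candidate configurations, pick the cheapest one that still meets a deadline. This is an illustrative toy model only; the class, prices and runtimes below are made-up assumptions, not figures from the project.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    n_instances: int
    price_per_hour: float   # pay-as-you-go price of one instance per hour
    runtime_hours: float    # estimated pipeline runtime on this configuration

    @property
    def cost(self) -> float:
        # Total monetary cost of running the pipeline on this configuration.
        return self.n_instances * self.price_per_hour * self.runtime_hours

def cheapest_within_deadline(configs, deadline_hours: float) -> Config:
    """Pick the least expensive configuration that meets the deadline."""
    feasible = [c for c in configs if c.runtime_hours <= deadline_hours]
    if not feasible:
        raise ValueError("no configuration meets the deadline")
    return min(feasible, key=lambda c: c.cost)
```

Note the inversion: with a loose deadline a small, slow configuration wins on cost, while a tight deadline forces the larger, more expensive one, exactly the cost-versus-resource-use trade-off the cloud imposes.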
Many efforts in databases, data mining, distributed systems and data science have been motivated by the vision of generating tailored systems for specific emerging data-centric science experiments. These experiments are implemented by several data science pipelines. Every data science pipeline has its own requirements regarding the software stack, communication, and analytics paradigms. Thus, recent research has tried to push the boundaries of tailored designs by rethinking parts of the database system stack. Frameworks for using, authoring, and training machine learning systems have proliferated. These frameworks often have dramatically different APIs, data models, usage patterns, and scalability considerations. This heterogeneity makes it difficult to combine systems and complicates production deployments. Hence, it is time to think of environments that can automatically adapt to target data science pipelines. Adaptation has long been a goal in itself, and many efforts have made big steps in this direction, under names such as "adaptive", "auto-tuned" and "just-in-time" systems. The most recent research adopts a new angle, "synthesis and learning", which can potentially push the limits of adaptive systems to the next level while also building on past techniques.
The challenge is to design and enact self-adaptive data science pipelines by separating (i) their design from (ii) the issues concerning the setting of environments that best provide computing and data management strategies. In the past, this kind of separation of concerns has driven major data management solutions. This project relies on this philosophy for exploring the design and enactment of data science pipelines.
 Data science techniques include stochastic models, data mining and automatic learning.
 Hindsight: understand what happened.
 Insight and foresight: understand what is happening and what will happen, respectively.