Challenges in enacting data science pipelines

Huge collections of heterogeneous data have become the backbone of scientific, analytic and forecasting processes. By combining simulation techniques, computer vision and machine learning with data science techniques[1], it is possible to compute mathematical models that help understand and predict phenomena. Such models let scientists, decision makers and citizens transform their hindsight[2] about phenomena into insight and foresight[3]. Eventually, scientists can gain enough command of a phenomenon to figure out how to reproduce it. To achieve this ambitious objective, data must go through complex and repetitive processing and analysis pipelines, namely data science pipelines.

The enactment of data science pipelines must balance the delivery of different types of services: (i) hardware (computing, storage and memory); (ii) communication (bandwidth and reliability) and scheduling; and (iii) greedy analytics and mining with high in-memory and computing-cycle requirements. Current data science environments (e.g. the Microsoft ML environment) have focused particularly on the efficient provision of the computing resources required for processing data with greedy analytics algorithms. Beyond the execution of such tasks using parallel models and their associated technology, data management remains an open and key issue. How should data be distributed and replicated across CPU/GPU farms to ensure its availability to parallel processes? How should data be organized (loaded and indexed) in main memory to perform efficient data processing and analytics at scale?
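To make the distribution question concrete, here is a minimal sketch that hash-partitions a dataset into disjoint shards and replicates each shard on several workers. The partition and replicate helpers are hypothetical, written for this illustration only; a real deployment would delegate placement to a distributed runtime rather than in-process lists.

```python
import hashlib

def partition(records, n_workers):
    """Hash-partition records into disjoint shards, one per worker."""
    shards = [[] for _ in range(n_workers)]
    for rec in records:
        digest = hashlib.md5(repr(rec).encode()).hexdigest()
        shards[int(digest, 16) % n_workers].append(rec)
    return shards

def replicate(n_shards, factor=2):
    """Place each shard on `factor` workers (round-robin) so the
    data stays available if a single worker fails."""
    return {i: [(i + r) % n_shards for r in range(factor)]
            for i in range(n_shards)}

shards = partition(range(1000), n_workers=4)
placement = replicate(len(shards), factor=2)
print([len(s) for s in shards])   # shard sizes
print(placement)                  # shard -> workers holding a copy
```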

The emergence of new architectures such as the cloud has opened new challenges for executing the tasks that compose data science pipelines. It is no longer pertinent to reason with respect to a fixed set of computing, storage and memory resources; instead, algorithms and processes must be designed for a virtually unlimited set of resources usable through a "pay-as-you-go" model. Rather than designing processes and algorithms with resource availability as a threshold, the cloud imposes reasoning about the economic cost of a process versus its use of resources, and about the exploitation of the resources available.
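As a toy illustration of this cost-driven reasoning, the sketch below prices a perfectly scalable job under pay-as-you-go billing; the per-node-hour price is an invented figure, not any provider's actual rate.

```python
def cloud_cost(n_nodes, hours, price_per_node_hour=0.25):
    """Pay-as-you-go: cost is proportional to resources consumed,
    not capped by the resources one owns."""
    return n_nodes * hours * price_per_node_hour

# For a perfectly scalable job of 100 node-hours, adding nodes
# shortens the deadline without changing the bill; the real
# trade-off appears once scaling is imperfect.
WORK_NODE_HOURS = 100
for n in (1, 10, 50):
    hours = WORK_NODE_HOURS / n
    print(f"{n:>3} nodes: {hours:6.1f} h -> ${cloud_cost(n, hours):.2f}")
```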

Many efforts in databases, data mining, distributed systems and data science have been motivated by the vision of building tailored systems for specific emerging data-centric science experiments. These experiments are implemented as several data science pipelines, and every pipeline has its own requirements regarding the software stack, communication and analytics paradigms. Recent research has therefore tried to push the boundaries of tailored design by rethinking parts of the database system stack. Frameworks for using, authoring and training machine learning systems have proliferated, often with dramatically different APIs, data models, usage patterns and scalability considerations. This heterogeneity makes it difficult to combine systems and complicates production deployments. Hence, it is time to think of environments that can automatically adapt to target data science pipelines. Adaptation has always been a goal in itself, and many efforts have made big steps in this direction under names such as "adaptive", "auto-tuned" and "just-in-time" systems. The most recent research works adopt a new angle named "synthesis and learning", which can potentially push adaptive systems to the next level while also building on past techniques.
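To give the flavor of such an adaptive, auto-tuned environment, here is a deliberately naive tuning loop that measures candidate configurations and keeps the fastest. The run_task function merely simulates a measured pipeline stage; none of this reflects the design of any system mentioned above.

```python
import random

def run_task(batch_size):
    """Simulated stand-in for a measured pipeline stage: latency is
    lowest near batch_size=512, plus some noise."""
    return abs(batch_size - 512) / 1000.0 + random.uniform(0.0, 0.01)

def auto_tune(candidates, trials=5):
    """Naive auto-tuner: average the observed latency of each
    candidate configuration and keep the fastest one."""
    best_cfg, best_latency = None, float("inf")
    for cfg in candidates:
        latency = sum(run_task(cfg) for _ in range(trials)) / trials
        if latency < best_latency:
            best_cfg, best_latency = cfg, latency
    return best_cfg

print(auto_tune([128, 256, 512, 1024]))  # typically prints 512
```

A real adaptive system would of course learn across runs and react online rather than re-measuring every candidate, but the feedback loop is the common core.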

The challenge is to design and enact self-adaptive data science pipelines by separating (i) their design from (ii) the issues concerning the setting of environments that best provide computing and data management strategies. In the past, this kind of separation of concerns has led to major data management solutions. This project relies on the same philosophy for exploring the design and enactment of data science pipelines.
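This separation can be sketched in a few lines: the Pipeline object below captures only the design (an ordered list of steps), while a separate enactment function decides how and where the steps run. All names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    fn: Callable

class Pipeline:
    """The design: an ordered list of steps, with no commitment to
    where or how they execute."""
    def __init__(self, steps: List[Step]):
        self.steps = steps

def enact_locally(pipeline, data):
    """One possible enactment: sequential, in-process execution.
    A cloud backend could interpret the very same design with its
    own scheduling and data management strategy."""
    for step in pipeline.steps:
        data = step.fn(data)
    return data

design = Pipeline([
    Step("clean", lambda xs: [x for x in xs if x is not None]),
    Step("scale", lambda xs: [x * 2 for x in xs]),
])
print(enact_locally(design, [1, None, 3]))  # -> [2, 6]
```

In this scheme, the enactment side is where learning-based adaptation would plug in, choosing data placement and scheduling strategies per pipeline.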

[1] Data science techniques include stochastic models, data mining and machine learning.

[2] Hindsight: understand what happened.

[3] Insight and foresight: understand what is happening and what will happen, respectively.

