Data science pipelines view data mathematically first (e.g., through measures and value distributions) and establish a context for it later, relying on computed models that approximate insights and foresights about the phenomena they represent. What a data scientist would really want is to look at the whole data set in ways that tell her things and answer questions that she is not asking. The design of pipelines and their results remain empirical and only partially explicit about how statistical tools and computer technologies are used to identify meaningful patterns of information. How should significant data correlations be interpreted? What is the role…
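As a rough illustration of this mathematics-first view, the following sketch computes descriptive measures and a value distribution before any question is posed about the data; the synthetic data, parameters and printed histogram are assumptions made for illustration only:

import numpy as np

rng = np.random.default_rng(1)
observations = rng.lognormal(mean=0.0, sigma=0.7, size=1000)

# Measures: summarize the data set without a hypothesis in mind.
print("mean:", observations.mean(), "std:", observations.std())

# Value distribution: a coarse histogram hints at patterns the analyst
# was not explicitly asking about (here, the skew of the data).
counts, edges = np.histogram(observations, bins=10)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"[{left:5.2f}, {right:5.2f}) {'#' * (count // 10)}")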
DBMS architectures have evolved toward the notion of a service-based infrastructure, where services are adapted and coordinated to implement ad hoc data management functions (storage, fragmentation, replication, analysis, decision making, data mining). These functions are tuned for managing huge, distributed, multiform, multimedia data collections. Applications can extend the functionality of a DBMS through specific tasks that have to be provided by the data management system; these tasks, called services, enable interoperability between the DBMS and other applications [11].
A service-based DBMS externalizes the functions of the different system layers and enables the programming of personalized data management as services…
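As a rough illustration of this externalization, consider the following sketch, in which each data management function sits behind a uniform service interface that a coordinator composes ad hoc; the interface, class names and toy analytics are assumptions of this sketch, not an API taken from [11]:

from abc import ABC, abstractmethod
from typing import Any, List

class DataService(ABC):
    """A data management function externalized as a service."""

    @abstractmethod
    def invoke(self, data: List[Any]) -> List[Any]:
        ...

class StorageService(DataService):
    """Hypothetical storage function exposed as a service."""

    def __init__(self) -> None:
        self._store: List[Any] = []

    def invoke(self, data):
        self._store.extend(data)             # persist the incoming records
        return list(self._store)

class AnalysisService(DataService):
    """Hypothetical analysis function exposed as a service."""

    def invoke(self, data):
        values = [float(v) for v in data]
        return [sum(values) / len(values)]   # a single toy measure

def coordinate(services, data):
    """Compose services ad hoc into a personalized data management function."""
    for service in services:
        data = service.invoke(data)
    return data

print(coordinate([StorageService(), AnalysisService()], [1, 2, 3]))  # -> [2.0]

Because every function shares the same interface, new combinations of storage, analysis or other services can be programmed without modifying the system core, which mirrors the interoperability property described above.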
Since the emergence of the 5V (more generally, n-V) models describing the non-functional properties of data, new visions of querying have emerged. Batch, on-demand queries, with expected complete and sound results, have evolved into complex data science pipelines combining processing and analytics tasks. Like a query, which can be described as a data flow, a data science pipeline is a combination of tasks. Unlike classic queries, however, which rely on well-defined data structures with associated operators, data science pipelines combine data visualization, cleaning, preparation, modelling and prediction, and assessment tasks. These tasks use input data with different structures. …
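The following sketch illustrates such a pipeline as a combination of cleaning, preparation, modelling/prediction and assessment tasks; it assumes the scikit-learn library, and the synthetic data, stage names and chosen estimators are illustrative assumptions rather than a prescribed design:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(size=X.shape) < 0.05] = np.nan     # simulate dirty input data

pipeline = Pipeline([
    ("cleaning", SimpleImputer(strategy="mean")),  # cleaning task
    ("preparation", StandardScaler()),             # preparation task
    ("model", LogisticRegression()),               # modelling/prediction task
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipeline.fit(X_train, y_train)
print("assessment:", accuracy_score(y_test, pipeline.predict(X_test)))  # assessment task

In contrast to a classic query plan over well-defined structures, each stage here consumes and produces data of a different shape (raw records with missing values, normalized matrices, fitted model parameters, a scalar score), which is precisely what makes such pipelines harder to optimize than queries.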
Huge collections of heterogeneous data have become the backbone of scientific, analytic and forecasting processes. By combining simulation techniques, computer vision and machine learning with data science techniques [1], it is possible to compute mathematical models to understand and predict phenomena. Such models can let scientists, decision makers and citizens transform their hindsight [2] about phenomena into insight and foresight [3]. Eventually, scientists can acquire full control of phenomena and figure out how to reproduce them. To achieve this ambitious objective, data must go through complex and repetitive processing and analysis pipelines, namely data science pipelines.
The enactment of data science pipelines must…