Expand your training limits! Generating Training Data for ML-based Data Management

Francesco Ventura
The Agora Technology Blog
6 min read · Jun 11, 2021

Joint work by Francesco Ventura, Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl.

Machine Learning (ML) is quickly becoming a prominent method in many data management components, especially in query optimizers, which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) severely limits further advancement in research and hinders the technology transfer from research to industry. Collecting a labeled query workload is very costly in terms of time and money, as it requires developing and executing thousands of realistic queries.

In this post, we analyze the limits of the current solutions and discuss DataFarm [1], an innovative framework for efficiently generating and labeling large query workloads. DataFarm enables users to reduce the cost of getting labeled query workloads by 54× (and up to an estimated factor of 104×) compared to standard approaches.

Check out DataFarm on GitHub: https://github.com/agora-ecosystem/data-farm

Collecting Training Data for ML-Based Data Management is a Problem!

Nowadays, we are facing an increasing number of machine learning applications in modern data management. For example, several works have demonstrated that query optimization can be improved by introducing ML-based algorithms.

However, to successfully apply machine learning to their pipelines, researchers have to satisfy some requirements in terms of quantity and quality of training data. Indeed, a sufficient amount of heterogeneous data is necessary for a machine learning model to learn even the most complex patterns. Also, the feature engineering process has to extract sets of meaningful and high-quality characteristics from the data. Finally, supervised machine learning requires labels, which in the case of data management can be the execution times or the output cardinalities of the jobs.
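To make this concrete, a single training sample for a learned cost model could look something like the record below. The feature names are purely illustrative placeholders, not DataFarm's actual feature set.

```python
# A minimal sketch of one labeled training sample; all names are illustrative.
sample = {
    "features": {
        "num_operators": 12,              # operators in the execution plan
        "num_joins": 3,                   # join operators in the plan
        "input_cardinality": 6_000_000,   # rows read from the input tables
        "selectivity_estimate": 0.12,     # fraction of rows surviving filters
    },
    "label": 184.7,  # e.g. execution time in seconds (or output cardinality)
}
```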

Sadly, very few researchers talk about the elephant in the room: collecting training data for ML-based data management is a real and complex problem!

Limits of the Current Solutions

To generate training data for data management, one has to manually create thousands of different jobs and then execute them all to collect labels and extract features. As you can imagine, this is a very tedious task. Following the current practice of manually collecting labels, you may end up executing thousands of jobs over very large inputs, which can require days or even months (Figure 1).

Figure 1. Runtimes for manually collecting query labels.

We extrapolated that manually collecting labels can take more than 6 months when executing 10,000 jobs on one terabyte of data.
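For reference, the manual approach essentially boils down to a loop like the following sketch (our own illustration, not an existing tool), where `run_job` stands for whatever function submits a job to the cluster and waits for it to finish:

```python
import time

def collect_labels_manually(jobs, run_job):
    """Naive baseline, for illustration only: execute every single job and
    record its wall-clock runtime as the label."""
    labels = {}
    for job_id, job in jobs.items():
        start = time.time()
        run_job(job)                         # every job is fully executed
        labels[job_id] = time.time() - start
    return labels
```

With thousands of jobs over terabytes of input, this loop is exactly what takes days or months.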

The current state of the art, TDGen [2], is also limited by its heuristic approach: it approximates job labels through polynomial forecasting and provides no guarantee of producing a workload that is representative of the user's use case.

The DataFarm Framework

To address the lack of training data for data management components and the difficulty of generating it tailored to users' needs, we developed DataFarm.

DataFarm is an innovative framework for efficiently generating and labeling large query workloads.

We follow a data-driven & white-box approach to learn from pre-existing small workload patterns, input data, and computational resources. Thus, our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component.

Figure 2 shows the main components of our framework: the Abstract Plan Generator, the Synthetic Job Instantiator, and the Label Forecaster.

Figure 2. DataFarm’s training data generation process.
  • Abstract Plan Generator: It analyzes the user’s real workload, learns its execution patterns, and generates new abstract plans that are representative of the real ones.
  • Synthetic Job Instantiator: It instantiates an augmented set of executable jobs from each input abstract plan, letting users include the metadata of their real input data and custom user-defined functions (UDFs) in the instantiation process.
  • Label Forecaster: It first characterizes the generated jobs by means of interpretable and representative features. Then it runs an Active Learning process based on a Quantile Regression Forest: it iteratively exploits the model’s uncertainty to select a small number of jobs to execute on the user’s computational resources and forecasts the labels of the non-executed ones together with their uncertainty values (see the sketch below).

The final output of the whole process is the augmented set of jobs, together with the forecasted labels and the uncertainties associated with the non-executed job instances.
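To make the Label Forecaster's loop more tangible, here is a minimal, self-contained sketch of uncertainty-driven active learning. It is not DataFarm's actual implementation: it approximates the Quantile Regression Forest with the per-tree predictions of a standard scikit-learn random forest, and `execute_job` is a hypothetical callback that runs a job on the user's cluster and returns its runtime.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forecast_labels(features, execute_job, n_seed=20, batch=20, rounds=6):
    """Uncertainty-driven active learning sketch (not DataFarm's actual code).

    features: (n_jobs, n_features) NumPy matrix describing the generated jobs.
    execute_job: callback that runs job i on the user's cluster and returns
                 its true runtime -- the expensive step we want to minimize.
    """
    rng = np.random.default_rng(0)
    runtimes = {int(i): execute_job(int(i))
                for i in rng.choice(features.shape[0], n_seed, replace=False)}

    model = RandomForestRegressor(n_estimators=200, random_state=0)

    def fit_and_predict():
        model.fit(features[list(runtimes)], np.array(list(runtimes.values())))
        # The per-tree predictions stand in for the Quantile Regression Forest:
        # their spread across trees gives a rough prediction interval per job.
        per_tree = np.stack([t.predict(features) for t in model.estimators_])
        return per_tree.mean(axis=0), np.percentile(per_tree, [5, 95], axis=0)

    for _ in range(rounds):
        preds, (lower, upper) = fit_and_predict()
        # Execute only the few jobs the model is currently most uncertain about.
        ranked = np.argsort(-(upper - lower))
        for i in [int(j) for j in ranked if int(j) not in runtimes][:batch]:
            runtimes[i] = execute_job(i)

    preds, (lower, upper) = fit_and_predict()   # final forecast for all jobs
    return preds, lower, upper, runtimes
```

With the arbitrary defaults above, the sketch would execute 20 + 6 × 20 = 140 jobs, in the same ballpark as the 142 executed jobs reported below; in practice, the numbers depend entirely on the workload and the user's budget.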

Accurate Labeling of Non-Executed Jobs

Thanks to its Active Learning approach, DataFarm predicts the labels by executing only a few jobs from the generated query workload.

Let’s consider the generation process of 2,000 jobs starting from 6 TPC-H queries, with DataFarm predicting job execution times. To measure the quality of the forecasted labels for the non-executed jobs, we computed the R² score between the ground-truth values, collected by executing all the jobs manually, and the values predicted by DataFarm.
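Concretely, the evaluation boils down to a comparison like the helper below (our own simplification, not DataFarm code). It only makes sense in an offline study, since it needs the ground-truth runtimes of all generated jobs:

```python
from sklearn.metrics import r2_score

def r2_on_non_executed(ground_truth, predicted, executed_ids):
    """R² of the forecasted labels, computed only over the non-executed jobs.

    `ground_truth` holds runtimes obtained by running every job once (feasible
    only offline, for evaluation), `predicted` the forecasted labels, and
    `executed_ids` the jobs the active learner actually ran."""
    idx = [i for i in range(len(ground_truth)) if i not in executed_ids]
    return r2_score([ground_truth[i] for i in idx],
                    [predicted[i] for i in idx])
```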

In our experiments, DataFarm achieves R² = 0.67 by executing just 142 jobs, meaning that the model has properly learned the performance trend of the synthetic jobs (Figure 3).

Figure 3. Forecasted labels with uncertainty, obtained by executing 142 jobs (R² = 0.67).

You can also notice that each predicted label comes with an uncertainty range.

These uncertainty values help the user better understand the performance of the predictive model and decide whether to continue the learning process.
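DataFarm leaves this decision to the user. One possible rule of thumb, purely our own and not DataFarm's built-in criterion, is to keep executing jobs while the prediction intervals are still too wide on average:

```python
import numpy as np

def keep_learning(preds, lower, upper, max_rel_width=0.5):
    """Toy stopping rule: continue executing jobs while the average relative
    width of the prediction intervals exceeds a user-chosen threshold."""
    rel_width = (np.asarray(upper) - np.asarray(lower)) / np.maximum(
        np.asarray(preds), 1e-9)
    return float(rel_width.mean()) > max_rel_width
```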

So, thanks to DataFarm, one can avoid executing more than 1,800 jobs, saving a considerable amount of time while still obtaining accurate predicted values for the non-executed jobs.

Effectiveness and Efficiency of DataFarm

In our experiments, DataFarm significantly reduces the time needed to collect labels compared to the manual approach.

Two factors affect the time required to collect the labels: the size of the input data and the number of generated jobs.

As Figure 4 shows, when increasing the input data size up to 50 GB and the number of generated jobs up to 2,500, DataFarm speeds up job labeling by 3.4× and 16×, respectively. Combining the two results yields a total improvement factor of roughly 54× (3.4 × 16 ≈ 54).

Figure 4. Effectiveness and efficiency of the framework.

We also extrapolated that, with one terabyte of input data and 10,000 generated jobs, the improvement factor would reach 104×.

Summary

In conclusion, DataFarm is an effective and efficient framework for generating training data for ML-based data management that takes into account the user’s existing workload and input data. Furthermore, it provides accurate labels for the generated jobs by executing only a small fraction of them, while taking the user’s computational resources into account.

Moreover, one can achieve an improvement factor of more than two orders of magnitude with respect to manual labeling.

DataFarm can be easily extended to support any data management platform and other kinds of target labels, such as output cardinalities.

Acknowledgments

This work was funded by the German Ministry for Education and Research as BIFOLD — Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A).

References

[1] F. Ventura, Z. Kaoudi, J. Quiané-Ruiz, and V. Markl. Expand your Training Limits! Generating and Labeling Jobs for ML-based Data Management. In SIGMOD, 2021.

[2] Z. Kaoudi, J. Quiané-Ruiz, B. Contreras-Rojas, R. Pardo-Meza, A. Troudi, and S. Chawla. ML-based Cross-Platform Query Optimization. In ICDE, 2020.
