Faster Collaborative Data Science

Behrouz Derakhshan
The Agora Technology Blog
5 min read · May 19, 2021

Data science and machine learning are interdisciplinary fields. Effective data science requires knowledge of math and statistics, computer science, and the application domain. On top of that, there are many tools and software products, such as R, Python, and scikit-learn, that one needs to master.

Data Science is an Interdisciplinary Field

Becoming an expert in all of these fields, tools, and software products is difficult, if not impossible.

High-quality data science and machine learning applications require effective collaboration.

Kaggle and Google Colab are prime examples of how collaboration can advance the state-of-the-art. You can write a notebook and make it publicly available to thousands of other interested users. In Kaggle, this has helped in creating better data science solutions. One problem, however, is the lack of resources for everyone. To help alleviate that, Kaggle and Google Colab enforce some constraints (a disk and memory limit, as well as a time limit). Since GPUs are more expensive, only a limited number of them is available. Therefore, one might need to wait in a queue until a notebook with a GPU becomes available.

Solution: Materialization and Reuse Optimizations

Materialization and reuse optimizations in databases have been studied for over 30 years [1, 2, 3]. Essentially, these optimization techniques have to deal with two aspects:

  1. Find costly, compute-heavy operations and cache (materialize) their results, usually within a limited storage budget.
  2. When new computations are executed, find matching cached results to load instead of recomputing everything (see the sketch after this list).
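To make these two steps concrete, here is a minimal Python sketch of the general idea: artefacts are ranked by compute cost per byte and materialized greedily under a budget, and new computations are keyed so that a matching cached result can be loaded instead of recomputed. This is a toy illustration under simplifying assumptions, not the algorithms from the paper [4]; the function names and the cost model are hypothetical.

```python
import hashlib

def artefact_key(operation: str, input_keys: tuple) -> str:
    """Identify a computation by its operation and the identities of its inputs."""
    return hashlib.sha256(f"{operation}:{input_keys}".encode()).hexdigest()

def select_artefacts(candidates, budget_bytes):
    """Step 1 (toy version): greedily keep artefacts with the best
    compute-cost-per-byte ratio until the storage budget is exhausted."""
    ranked = sorted(candidates, key=lambda a: a["compute_cost"] / a["size"], reverse=True)
    selected, used = [], 0
    for artefact in ranked:
        if used + artefact["size"] <= budget_bytes:
            selected.append(artefact)
            used += artefact["size"]
    return selected

def run_or_reuse(cache, operation, input_keys, compute_fn):
    """Step 2 (toy version): load a matching cached result if one exists,
    otherwise execute the computation and materialize its result."""
    key = artefact_key(operation, input_keys)
    if key in cache:
        return cache[key]      # reuse: skip the computation entirely
    result = compute_fn()      # no match: execute
    cache[key] = result        # materialize for future runs
    return result
```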

In our SIGMOD paper [4], we investigate how much computational redundancy exists in collaborative environments such as Kaggle and Google Colab. We then offer algorithms for selecting which computations to materialize and for finding cached results when new data science scripts are executed.

Lots of Redundant Data Processing

Kaggle and Google Colab make it easy to publish a notebook, but only the code is shared, not its results. This means that every time users want the final result (a trained model) or a partial result (an intermediate preprocessed dataset) of a notebook, they have to re-execute it. Only then can they move on and apply their own code to another user's result.

The top 3 notebooks of the Home Credit Default Risk competition generate hundreds of gigabytes of intermediate data and have been copied more than 10,000 times.

Let’s look at a real example: the Home Credit Default Risk Kaggle competition, which ran from May 18, 2018, until August 30, 2018. The top three notebooks (based on the number of votes) were copied more than 10,000 times, and each run generates hundreds of gigabytes of data. Furthermore, the number of copies reported by Kaggle only indicates how many users ‘forked’ the notebook into their own user space. Kaggle does not report the actual number of executions of the notebooks (or their modified versions), which can be many times higher.

The intermediate data and models that are generated in every run are typically thrown away at the end of the run. Although users can choose to manually store some of the data they feel is important, there is no guarantee that the saved data is what other users are looking for. Typically, each user ends up running the notebook from the beginning up to the point where the specific preprocessed dataset or ML model they need is produced.

The lack of data management in existing collaborative environments leads to thousands of hours of redundant data processing and model training.

The first part of the solution is to find the appropriate set of intermediate preprocessed datasets and trained ML models, which we refer to as artefacts, to store. The goal is to store the artefacts in a central database to which every notebook is connected. The second part of the solution is to find matching artefacts. When users run a notebook, we search the database for artefacts that match the ones the notebook is about to compute. Then, instead of executing the computation in the notebook, we simply load the artefact from the central database. It is important to note that the whole process is transparent to the user and should not disrupt their normal flow. This means that users still write their Python code (that’s what we currently support) as if there were no central database containing the artefacts. Our solution examines the code and looks for matching artefacts in the database. Details of the materialization and reuse algorithms can be found in the paper [4].
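As a rough intuition for how reuse can stay transparent, here is a hypothetical, simplified sketch in Python: the user's preprocessing code is left unchanged, and a thin wrapper keys each step on its source code and input, checking a shared artefact store before executing it. The decorator, store path, and keying scheme are illustrative assumptions for this post, not the mechanism implemented in the paper [4].

```python
import hashlib
import inspect
import os
import pandas as pd

# Hypothetical shared artefact store (in the real setting this would be a
# central database shared by all notebooks).
ARTEFACT_STORE = "/tmp/artefact_store"
os.makedirs(ARTEFACT_STORE, exist_ok=True)

def reuse_artefact(func):
    """Transparently load a matching artefact, or execute and materialize it."""
    def wrapper(path):
        # Key the artefact on the step's source code and its input file.
        key = hashlib.sha256((inspect.getsource(func) + path).encode()).hexdigest()
        cached = os.path.join(ARTEFACT_STORE, f"{key}.pkl")
        if os.path.exists(cached):
            return pd.read_pickle(cached)   # matching artefact found: load it
        result = func(path)                 # no match: run the user's code
        result.to_pickle(cached)            # materialize for later runs
        return result
    return wrapper

@reuse_artefact
def preprocess(path):
    # The user's notebook code stays unchanged.
    df = pd.read_csv(path)
    return df.fillna(0)
```

In the actual system the user does not even add a decorator; the script is analyzed and rewritten automatically, which is exactly what keeps the workflow undisrupted.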

Results

To test our optimizations, we use three notebooks from the Home Credit Default Risk Kaggle competition:

  1. Notebook 1: Start Here A Gentle Introduction
  2. Notebook 2: Introduction to Manual Feature Engineering Part 1
  3. Notebook 3: Introduction to Manual Feature Engineering Part 2

These notebooks have been copied more than 9,000 times. In our experiment, we run each notebook twice. Currently, Kaggle does not support any form of materialization and reuse. As a result, the run times of Run 1 and Run 2 of each notebook are similar (denoted by KG in the figure). Using materialization and reuse optimizations (denoted by CO in the figure), after the first run some of the key artefacts are materialized (i.e., the artefacts that are compute-heavy and result in high-quality models). In the second run, many of those artefacts are reused instead of recomputed, reducing the run time by an order of magnitude.

Repeated Executions of Kaggle Notebooks. From Left to Right, the notebooks are: Gentle Introduction, Intro to Manual Feature Engineering 1, and Intro to Manual Feature Engineering 2

Here, we only show the result of two runs. In the actual Kaggle scenario, the notebooks are potentially executed tens of thousands of times. This can lead to hundreds of hours of reduction in run time, which not only results in a more interactive experience for users but also reduces their wait time and the compute-resource cost for platforms such as Kaggle.

References

[1] Blakeley, Jose A., Per-Ake Larson, and Frank Wm Tompa. “Efficiently updating materialized views.” ACM SIGMOD Record 15.2 (1986): 61–71.

[2] Roy, Prasan, et al. “Efficient and extensible algorithms for multi query optimization.” Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 2000.

[3] Mami, Imene, and Zohra Bellahsene. “A survey of view selection methods.” ACM SIGMOD Record 41.1 (2012): 20–29.

[4] Derakhshan, Behrouz, et al. “Optimizing machine learning workloads in collaborative environments.” Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020.
