
Evolving an Organization’s Data Management

Supporting Machine Learning Activities with More Disciplined Data Management

Dave Cohen
4 min read · Aug 31, 2022


In this blog, I’ll discuss the use of a native, C++-based shared library as a means of encapsulating access to data and generalizing the computations applied to that data. Such a library represents the refactoring of bespoke, application-specific code into a general capability that is leveraged across applications in an organization.

An early example is Microsoft’s TRILL, with its use in their SCOPE and YARN deployments. Google’s F1 Query client library serves as a second example. This library is used within Google by a diverse set of workloads that includes their Dataflow stream processing engine, the Procella near-real-time query engine, and their Napa data lake. Google and Microsoft were early innovators in the use of MapReduce frameworks and compute-store disaggregation. It is within these “data lake” deployments where the use of a shared, database acceleration library has taken hold.

In parallel to Google and Microsoft, Databricks and their Spark project emerged out of the University of California, Berkeley. Around the same time, Meta (formerly Facebook) released their Presto SQL query engine. Both Spark-SQL and Presto are massively parallel processing query engines that operate over data residing in remote, disaggregated storage. In other words, the Databricks and Presto query engines operate within the same data lake scope as Google and Microsoft. The key difference is that Spark-SQL is a Scala-based application that runs over the Spark computational framework, whereas the Presto query engine is independent of any particular computational framework. In fact, the Presto community has enabled Presto to run as a Spark application; in this deployment, the Presto capability is delivered as a Java-based library.

Databricks has recently introduced their native, C++-based Photon database acceleration library as part of their Spark-based, software-as-a-service (SaaS) platform. Photon is proprietary to Databricks and so, like Google’s F1 Query and Microsoft’s TRILL libraries, is not available to the general developer community. With today’s announcement of the open-source software (OSS), native, C++-based Velox database acceleration library, developers have an alternative to these closed-source, proprietary libraries.

For more than a year, Intel has been working with Meta, Ahana, Voltron, and others in the Velox community. Our contributions are two-fold. First, we are working with Meta and Voltron to adopt the Apache Arrow project as the standard columnar representation for data exchange between the Velox library and other libraries and applications, and we are extending Velox with support for an abstract representation of a query engine’s computational model using the Substrait.io project. Second, Intel’s OSS-based OAP project has been refactored to support the Velox library.
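To make the columnar-exchange idea concrete, here is a minimal, stdlib-only Python sketch of the kind of layout Arrow standardizes: contiguous value buffers plus a per-column validity bitmap, grouped into equal-length record batches. The class and field names are illustrative, not the actual Arrow or Velox API.

```python
from dataclasses import dataclass

@dataclass
class Column:
    """One columnar array: a contiguous run of values plus a validity
    list marking which slots are non-null (Arrow uses a bitmap)."""
    values: list
    valid: list  # True where the value at the same position is non-null

    def null_count(self):
        return self.valid.count(False)

@dataclass
class RecordBatch:
    """A named set of equal-length columns -- the unit of exchange a
    query engine would hand to an execution library, and vice versa."""
    columns: dict

    def num_rows(self):
        return len(next(iter(self.columns.values())).values)

# A two-column batch as an engine might hand it across the boundary.
batch = RecordBatch({
    "user_id": Column([1, 2, 3, 4], [True, True, True, True]),
    "score":   Column([0.9, None, 0.4, 0.7], [True, False, True, True]),
})
```

Because both sides agree on this layout, the batch can cross the engine/library boundary without per-row conversion, which is the point of standardizing on Arrow.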

We envision a diverse set of query engines being refactored to offload their computations to the Velox library: stream processing engines such as Apache Flink; near-real-time query engines such as Apache Druid and Apache Pinot; data lake query engines such as PrestoDB; and the data management pipelines supported in machine learning (ML) frameworks such as PyTorch and TensorFlow.
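The offload pattern can be sketched as a shared expression evaluator that several engines call instead of each implementing its own. This is a conceptual Python sketch, not Velox's C++ API; the `evaluate` function and its tuple-based expression form are assumptions made for illustration.

```python
# Hypothetical shared "execution library": an engine hands it a columnar
# batch (name -> list of values) and an expression; it evaluates the
# expression over whole columns at a time.
def evaluate(expr, batch):
    """expr is an (op, col_a, col_b) tuple over column names in batch."""
    op, a, b = expr
    if op == "add":
        return [x + y for x, y in zip(batch[a], batch[b])]
    if op == "mul":
        return [x * y for x, y in zip(batch[a], batch[b])]
    raise ValueError(f"unsupported op: {op}")

# Two different "engines" -- say, a batch ETL job and a streaming job --
# can reuse the same evaluator rather than duplicating it:
batch = {"clicks": [1, 2, 3], "weight": [2.0, 0.5, 1.0]}
weighted = evaluate(("mul", "clicks", "weight"), batch)
```

Centralizing evaluation this way is what lets optimizations (vectorization, better algorithms) land once in the library and benefit every engine that calls it.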

We are working on a set of data management pipelines in support of a recommendation system platform. These include:

  1. A PrestoDB-based offline, batch ETL pipeline to write data in Parquet format to the data lake.
  2. A PyTorch-based preprocessing pipeline to handle the transformation of these Parquet files into dense feature embeddings to be consumed by GPUs in the training of the models used in the recall and ranking processes of the recommendation engine.
  3. A Spark-SQL-based vector similarity search capability:
    First, a batch serving pipeline that can be configured to use libraries such as FAISS, HNSW, and others to generate the feature-embedding inverted index from the pretrained model.
    Second, a query engine that executes the user’s vector similarity search using the index generated in the previous step.
  4. A Flink-based pipeline that processes micro-batches of user interaction log events to update feature embeddings used in the pretrained ranking model.
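The vector similarity search step in the pipeline above reduces, at its core, to finding the stored embeddings closest to a query embedding. Libraries like FAISS and HNSW make this fast at scale with approximate indexes; as a hedged, stdlib-only illustration of what they compute, here is a brute-force cosine-similarity top-k search (the item names and vectors are made up for the example):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, index, k=2):
    """Brute-force cosine-similarity search over (item_id, embedding) pairs.
    Real systems replace this linear scan with an approximate index."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, normalize(emb))), item_id)
              for item_id, emb in index]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Toy two-dimensional "embedding index".
index = [
    ("item_a", [1.0, 0.0]),
    ("item_b", [0.0, 1.0]),
    ("item_c", [0.7, 0.7]),
]
```

A query near `[1.0, 0.1]` ranks `item_a` first, then `item_c`. The batch serving pipeline's job is to precompute the index side of this computation offline so that the online query engine only pays the (indexed) search cost.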

The motivation for developing these pipelines is to refactor the data management functionality used to support the recommendation system as an isolated set of capabilities so we can work with the community in optimizing the end-to-end flow of data through the recommendation system.

I will be presenting more details on this “End-to-End Data Management in Support of an ML RecSys” at the upcoming Composable Data Management Systems (CDMS) workshop on Friday, September 9. The workshop is part of the annual International Conference on Very Large Databases (VLDB) being held this year in Sydney, Australia. Meta will publish their “Velox: Meta’s Unified Execution Engine” paper in the VLDB industrial track, but in the meantime you can read “Introducing Velox: An Open Source Unified Execution Engine.”
