ML Prediction Pipeline Orchestration with ML Control Center

A YAML based approach for building data processing pipelines

Vance Thornton

Published in

Glassdoor Engineering Blog

7 min readApr 1, 2022

Introduction

In a previous blog post we described Glassdoor’s a la carte approach to MLOps. We chose this approach because the MLOps landscape is rapidly evolving and we want to have the flexibility to use the best available options. ML Control Center (MLCC) is a project that we are developing at Glassdoor which will act as the glue to connect and unify these disparate components. Our plan is to move to a GitOps oriented approach where YAML configuration files in our Git repositories control the deployment and execution of all the tasks and services needed for our ML projects.

One part of this is support for ML prediction pipelines which we recently released as an open source project at: https://gitlab.com/glassdoor-open-source/ml-control-center. A ML prediction pipeline typically involves gathering feature data from various sources, providing the data to feature extractors and/or ML models for prediction, and then writing the output to a data store. In MLCC these pipelines are defined using YAML files which specify the configuration of the operations to perform and the flow of data input/output of those operations. One of our primary design goals is to make it easy to implement the most common use cases quickly with simple YAML configuration and minimal coding. We want to allow ML scientists and engineers to focus on defining what they want the pipeline to do with many of the engineering implementation details such as parallelization, metrics, and retry logic taken care of automatically. MLCC has a library of reusable components which provide the functionality that is typically needed for our use cases and it is easy to add custom components when needed. This approach promotes flexibility and reuse by encouraging a modular implementation.

Another trend that we have seen at Glassdoor is that online and near real time ML processing is becoming more common. The design of MLCC prediction pipelines makes them well suited to this. The transfer of input/output data between pipeline steps is done in memory on the same machine so low latency can be achieved. The same pipeline definition can often be used in different contexts. For example to process individual items for online / real time use cases, batches of items for offline / batch use cases, or for automated testing. Configuration properties can be used to customize the behavior of a pipeline for specific use cases when needed.

MLCC Architecture

MLCC is responsible for executing the pipelines defined in the YAML config files. This involves operations such as reading data from data stores and calling REST services to fetch the data needed for model prediction. Once the required data has been fetched MLCC will use it to build a request that is sent to the model which is served using MLFLow. The model prediction result is written to the Feature Store or other data store. MLCC has a REST interface for manually triggering updates, viewing task status information, and performing management functions such as cleanup. Each component in the pipeline publishes metrics such as execution time, error count, etc. which are used for monitoring and alerting.

Example: Review Intelligence

At Glassdoor we use ML to extract the topics mentioned and sentiment from company reviews. The ML prediction pipeline for this has the following steps:

Read reviews from the SQL database
Send a REST request to get the model prediction result
Build an Elasticsearch document from the review and model prediction result
Write the documents to the Elasticsearch index used for real time analytics
Write the extracted data to a Feature Store for offline analysis

YAML Configuration

This pipeline is implemented using the MLCC configuration described below:

feature:
  id: reviews
  version: 1
  store:
    type: sql
    clusterId: primary
  key: [reviewId, source]
  schema: reviewId:int, source:string, languageId:string,
          mentionsAttribute:json, questionAttribute:json,
          reviewCategoryAttribute:json, highlightAttribute:json

This section of the configuration defines the feature group id, version, key, schema, as well as information about the feature store to use.

Fetching Input Data

The input to the pipeline is specified in the input section of the config. In this case we fetch the review data from the SQL database using a SQLQueryValueProvider. SQLQueryValueProvider is part of the library of reusable components provided with MLCC.

input:
- !SQLQueryValueProvider
  id: glassdoor
  clusterId: reviews
  baseQuery: >
    SELECT r.FK_reviewId AS reviewId,
      r.FK_languageId AS languageId,
      ...

MLCC drives the process of calling the value provider to fetch the review data to process. It will pass in the appropriate parameters to the value provider depending on the use case to fetch all reviews for a full update or recently changed reviews for an incremental update, or individual reviews for a specific ids update. MLCC takes care of retry logic and this behavior can be customized in the configuration. MLCC also takes care of the logic to manage the last update time for incremental updates.

Data Processing Pipeline

The pipeline to process the input records is specified in the pipeline section. This same pipeline definition is used for full update, incremental update, and specific ids update. It can also be run in dry-run mode for testing purposes. The first step of the pipeline is to send a REST request to perform model prediction. This is done with a RESTServiceValueProvider:

pipeline:
- !RESTServiceValueProvider
  id: modelPrediction
  serviceId: reviews
  path: /extract
  input:
    body:
      clientType: mlControlCenter
      data: $(inputItems)
      config:
        mentionsExtractorConfig:
          includeSentenceInOutput: true
        jobDescriptionHighlightsExtractorConfig:
         sentenceMinWordCount: 3
         sentenceMaxWordCount: 25

The $(identifier) syntax is used to refer to the output of another component/step in the pipeline. $(inputItems) is defined by MLCC and refers to the values that were fetched from the input value providers. The RESTServiceValueProvider has properties for all of the information needed to perform the REST call. The host name, timeouts, etc. for the REST service are defined in a separate MLCC config file and are referenced using the serviceId. The input is the JSON body of the REST request. MLCC automatically converts the configuration in the input section into JSON which is sent with the request.

The next step is to build the Elasticsearch document using the review data and the model prediction result. The review documents are built using a custom component, ReviewDocsBuilder:

- !ReviewDocsBuilder
  id: reviewDocs
  decorators:
  - !TextTruncateDecorator
  - !AllTextDecorator
  - !ExtractedAttributesDecorator
  - !QuestionsDecorator
  - !RelevanceScoreDecorator
  - !CategoryDecorator
  input:
    records: $(inputItems)
    modelPrediction: $(modelPrediction)

This builder passes the records through a number of decorators which set values in the documents. The decorators are regular Java classes with an annotated constructor that allows them to be constructed from YAML config.

Custom Components

A custom component is implemented in MLCC by defining a class annotated with DataFlowComponent.

@DataFlowComponent
public class ReviewDocsBuilder {

    @DataFlowConfigurable
    public ReviewDocsBuilder(
      @DataFlowConfigProperty List<ReviewDocDecorator> decorators) {           
        ...
    }

    @OutputValue
    public Map<String, ReviewDoc> getValue(
      @InputValue DataRecords records,
      @InputValue DataRecords modelPrediction) {
        ...
    }
}

The component class should have a constructor that is annotated with DataFlowConfigurable. Property values defined in the component’s YAML config will be converted to the appropriate types as needed and passed to the constructor. The class should also define a get value method. The properties in the input section of the component’s YAML config will be passed to this method.

Pipeline Output

The Elasticsearch documents built in the previous step are written to an Elasticsearch index using another built in component:

- !ElasticsearchWriteDocs
  id: reviewsWriteDocs
  clusterId: reviews
  type: reviews-$(featureVariant)
  name: $(featureId)_$(featureVariant)_$(featureVersion)
  version: $(featureVersion)
  waitForIndexingToFinish: $(waitForIndexingToFinish)
  input:
    documents: $(reviewDocs)

Here we see references to other values provided by MLCC including featureId, featureVariant, and featureVersion. $(reviewDocs) is a reference to the review documents built in the previous step. Finally the output section of the configuration specifies the records that should be written to the feature store.

output: $(modelPrediction)

Update Configuration

The pipeline can be used for full update, incremental update, and update of specific review ids. MLCC allows additional configuration to be defined to customize the behavior of each of these cases. This includes the schedule, batchSize, parallelism, and retry behavior. Task specific properties can also be defined and referenced within the pipeline configuration. Finally additional pipelines can be specified to run before the update starts or after it completes. This is used to create the index and switch the active index when performing a full update.

fullUpdate:
  schedule: '0 0 7 * * *' # daily at 7am
  batchSize: 75
  parallelism: 12
  prefetchCount: 8
  retry:
    maxAttempts: 120
    intervalMillis: 5000
  properties:
    waitForHealthyCluster: true
    switchDurationMinutes: 60
  dryRunProperties:
    waitForHealthyCluster: false
    switchDurationMinutes: 0
  beforeStart:
    Actions:
    - !ElasticsearchCreateIndex
      clusterId: reviews
      type: reviews-$(featureVariant)
      name: $(featureId)_$(featureVariant)_$(featureVersion)
      version: $(featureVersion)
      disableReplicas: true
      transactionLogAsync: true
      schema:
        !ClassPathFileDataSource
        path: reviews.index
      schemaVersion: 4
      ...

Summary

Prediction pipelines are often needed as part of an overall ML solution. We have found that they tend to follow a similar high level pattern which lends itself to implementation using reusable components that are defined and connected using YAML configuration. Given that we are moving more toward real time freshness for our ML based predictions it is useful to have a solution that allows the same pipeline definition to be used in both real time and batch contexts. As we build out our library of reusable components developers can focus more on defining what they want the pipeline to do and spend less time dealing with repetitive implementation details. If you are interested in learning more be sure to check out the open source project: https://gitlab.com/glassdoor-open-source/ml-control-center