Enterprise MLOps part 2: ML Pipelines with Vertex AI

Paul Balm
Google Cloud - Community
13 min read · Jan 27, 2024

Many organizations using Vertex AI are working on operationalizing their machine learning work using Google Cloud infrastructure, so that they can scale their work and expand the impact of ML in their business.

Operationalization requires automation and orchestration of the different tasks involved: data loading, preprocessing, model training, evaluation… And we typically implement these tasks as a pipeline.

It turns out that deploying a series of tasks that are orchestrated to run in collaboration with each other on remote cloud infrastructure involves a steep learning curve. This technology is not easy to learn, and building a good pipeline requires a number of design decisions. In this post, we will describe the guiding principles that we use across our projects at Google Cloud Consulting.

We will focus our examples on TFX and Kubeflow Pipelines, because these are the pipelines that can be executed natively on Google Cloud using Vertex AI Pipelines, Google Cloud’s fully managed ML pipeline service. It’s possible to execute many other types of pipelines on Google Cloud, for example using Spark or Airflow. But Vertex AI Pipelines is a pipeline orchestrator that provides some valuable features for machine learning.

We also limit our focus to ML pipelines that take (training) data as an input and have a model as their primary output. These are not the only ML pipelines: other processes that are part of an ML system can be implemented as pipelines, and can therefore be considered “ML pipelines” as well. Examples could be a monitoring pipeline or an evaluation pipeline. But here we focus on the “ML training pipeline”.

This is the second article in our series on MLOps processes.

But before we get started, let’s take a step back and consider:

Why pipelines?

The problem at hand is to operationalize the process of training machine learning models. Training a model requires loading training data to where the model is being trained, for example by extracting it from a database. Before training the model, we probably want to apply some preprocessing: data cleaning and imputation, feature engineering, and data validation are typical activities. We then want to train the model, evaluate its performance, deploy it… In short, we have a large number of activities that need to be executed in a certain order.

“Pipeline” is a design pattern. It’s a way of implementing an application that runs tasks in a certain order. But at least in simple cases, we can use a shell script to do this, which is not hard to learn at all. So why use the pipeline pattern?

For some applications, the “monolith”, where all the code is in one application that runs on one computer, is a perfectly good design. Even if our application trains an ML model, if the model is fairly trivial and training requires few resources, then building a monolith can be reasonable. But in real-life ML scenarios, a pipeline is usually a better choice. A pipeline:

  • Has a modular structure, which improves maintainability and understandability
  • Allows us to execute different “modules” (pipeline steps) on different hardware, so each step can run on the resources that suit it
  • Facilitates tracking of intermediate artifacts, which in turn helps debugging, monitoring, and also efficiency because it permits caching intermediate results and restarting from previously cached results.

What’s a good pipeline?

Without trying to be exhaustive, let’s review some attributes of good pipelines.

First, a good pipeline achieves the right level of modularity. When we organize the work that needs to be done to train a new model (and after training it), we split it into different pipeline steps. What is the right “size” of each step? When should we continue to split, and when are we done?

One consideration is efficiency: larger steps tend to use the hardware more efficiently, because there is less hand-off between steps, but they are harder to parallelize. Breaking the application down into steps can also make it easier to understand as it executes, but very many very small blocks can actually make it harder to follow. The same goes for reusability of pipeline components: if a component is too small, reusing it adds little value, and if it is too large, it becomes unwieldy and difficult to reuse for that reason. Finally, you may want to track the output of each block, but if the blocks are too small, the output may not be meaningful or easy to interpret. In summary, there are no hard rules, but the important considerations are efficiency, understandability, reusability and traceability.

It’s also important for a pipeline to be complete in terms of functionality. If we work back from the core task of training the model: preparing the training data should not involve manual steps that need to be performed before the pipeline can be started. Looking at what comes after training: the pipeline needs to leave the model (weights) where it can be used. This can mean, for example, uploading it to a model registry, or deploying it, or both.

Aside from uploading the model and leaving it ready to use, the pipeline should prepare metadata or documentation that gives us the information that we need to use this model confidently, in terms of performance, legal aspects, and more.

Completeness also means having the functionality that a robust ML system requires. If we want to implement full “continuous deployment” (the often overlooked half of CI/CD), then we should deploy our model from our pipeline. A robust deployment implements monitoring that allows us to track the model’s performance, so we know when something goes wrong, or when gradual degradation has reached the point where we want to retrain our model with the latest data.

Pipeline frameworks: Kubeflow vs. TFX

The Kubeflow Pipelines SDK provides different types of components and intermediate artifacts, and allows you to use these to construct pipelines. For example, you can use a “light-weight Python component” that transforms a Python function into a step in your pipeline. The return values from this component can be passed to the next pipeline component.
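As a quick illustration, here is a minimal sketch (not code from the reference pipeline described later) of two lightweight Python components chained into a pipeline with the KFP v2 SDK; the component and pipeline names are made up for the example:

```python
# Minimal sketch: two lightweight Python components wired into a pipeline.
from kfp import dsl


@dsl.component(base_image="python:3.10")
def add_prefix(text: str) -> str:
    # Each lightweight component runs as its own containerized pipeline step.
    return "processed: " + text


@dsl.component(base_image="python:3.10")
def print_text(text: str):
    print(text)


@dsl.pipeline(name="lightweight-component-demo")
def demo_pipeline(message: str = "hello"):
    step1 = add_prefix(text=message)
    # The return value of one component is passed as input to the next.
    print_text(text=step1.output)
```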


This is very different from how TFX works. TFX is short for Tensorflow Extended. It was originally built around Tensorflow, which lets you define and train models, but does nothing else. TFX is a collection of predefined pipeline components that you configure: the components fit together in one particular way, and you configure them to do what you want. For example, you have to provide the source of your training data so that the process can start, and you have to provide the training script for the component that trains the model. Some components are optional, but essentially TFX gives you one pipeline that has the same steps in all your ML projects. The advantage is that the TFX components have already solved many of the problems that an operational ML system must solve. Kubeflow components are easier to understand, but unless you use the reference pipeline that we will describe later, you are essentially starting from scratch. A downside of TFX is that it fits most naturally with Tensorflow models, although you can train any type of model using TFX.


Best Practices

We will start by providing a reference pipeline implemented using Kubeflow, and one using TFX. Then we will follow up these reference implementations with the best practices that they implement.

Kubeflow Reference pipeline

Our Kubeflow reference pipeline has the following components:

  • Load and split data: Extract the training data from its storage (for example, BigQuery or an on-prem database or API) and split it into training data, evaluation data and test data
  • Model training produces a trained model plus performance metrics
  • Upload the trained model to the Vertex Model Registry
  • Upload a model evaluation to the Vertex Model Registry:
    - Upload the evaluation data to BigQuery
    - Optionally, sample the data down to obtain a reasonable amount for evaluation
    - Execute a batch prediction job to generate predictions from this sample
    - Reformat these predictions to match the expectation of the following step
    - Provide an evaluation by joining the predictions back to the evaluation data sample and comparing them to the ground truth
    - Import the evaluation to the Model Registry to enrich the model uploaded previously
  • Generate a model card:
    - Generate some plots/graphics (like histograms) using the evaluation data
    - Combine these graphics with statistics of the training and test data produced earlier to generate an HTML model card

The full code for this pipeline is available in our Github repository.
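For completeness, this is roughly how a pipeline like this is compiled and submitted to Vertex AI Pipelines; it is a sketch, and the project, region, bucket and pipeline function names are placeholders:

```python
# Sketch: compile a KFP pipeline and run it on Vertex AI Pipelines.
from google.cloud import aiplatform
from kfp import compiler

# "demo_pipeline" stands in for the reference pipeline's @dsl.pipeline function.
compiler.Compiler().compile(
    pipeline_func=demo_pipeline,
    package_path="pipeline.json",
)

aiplatform.init(project="my-project", location="europe-west1")

job = aiplatform.PipelineJob(
    display_name="reference-training-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={"message": "hello"},  # pipeline parameters, if any
)
job.submit()  # or job.run() to block until completion
```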

A diagram of our Kubeflow reference pipeline, as displayed by the UI of Vertex AI Pipelines.

TFX Reference pipeline

The TFX pipeline is constructed from predefined components:

  • TrainDataGen loads the training data, for example from BigQuery, and transforms it into Tensorflow’s efficient binary “Example” format
  • Generate statistics on the training data (min, max, median, number of missing values, etc)
  • Import a predefined schema for the training data, including value ranges and dictionaries for categorical variables
  • Validate the training data against the imported schema and abort the pipeline if the training data does not match expectations. This is especially useful when retraining happens automatically using the latest data.
  • Transform the input data to produce the features that we will use for model training. This step also exports a persistent transform operation that we can reuse during serving.
  • Train the model:
    - Retrieve the hyperparameters to be used during training (learning rate, number of epochs, etc). This is convenient when tuning the hyperparameters.
    - Retrieve a previously trained model to avoid starting from scratch (i.e. use a warm start)
    - Train the actual model
  • To evaluate the trained model:
    - Identify the baseline model for comparison (this could be the currently deployed model, or best trained model thus far)
    - Retrieve the baseline model
    - Evaluate both baseline and currently trained model.
  • Depending on the evaluation output, publish the model (in this case, push to Google Cloud Storage)

We have the full source code for this TFX pipeline in our Github repository as well.
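To give a feel for how such a TFX pipeline is assembled, here is a simplified skeleton built from standard TFX components; it is a sketch under assumed paths and module files, not the full reference pipeline (which, for example, reads from BigQuery and imports a hand-curated schema):

```python
# Simplified sketch of a TFX pipeline assembled from standard components.
from tfx import v1 as tfx


def create_pipeline(pipeline_name, pipeline_root, data_root, module_file, serving_dir):
    # Ingest training data (the reference pipeline uses BigQuery instead of CSV).
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    # The reference pipeline imports a hand-curated schema; here we infer one.
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs["statistics"],
        schema=schema_gen.outputs["schema"])
    transform = tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        module_file=module_file)  # module_file defines preprocessing_fn
    trainer = tfx.components.Trainer(
        module_file=module_file,  # module_file also defines run_fn
        examples=transform.outputs["transformed_examples"],
        transform_graph=transform.outputs["transform_graph"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_dir)))

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen,
                    example_validator, transform, trainer, pusher])
```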

A diagram of the TFX pipeline, as displayed by the UI of Vertex AI Pipelines.

We will now review some of the best practices implemented by these pipelines.

Data Ingestion and Split

Before starting the training process, the training data (including test and evaluation data) needs to be efficiently accessible to the training process. If the data needs to be retrieved from an API with latency, or a heavily loaded database, then we should copy it over to a low latency, high throughput storage such as Google Cloud Storage. Very large datasets can still be expensive to store, but per TB, storage costs are low. You should consider maintaining a copy of the original dataset, so that you can trace any trained model back to the input that was used. This is useful not only for training, but for testing and validation as well.

Another good practice is to split the input data into the datasets that you require (typically, training, testing and evaluation) and output these splits as separate artifacts of the pipeline step. This allows you to check the dependency diagram of the pipeline, like the diagrams shown before, and ensure that the dataset used to evaluate the model is never used during the training process. Outputting the splits as independent artifacts also allows you to inspect them later, perhaps to understand specific inference results.
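To make this concrete, here is a hedged sketch of a KFP component that emits each split as its own Dataset artifact; the CSV source and split ratios are illustrative, not taken from the reference pipeline:

```python
# Sketch: a KFP v2 component that writes train/eval/test splits as
# separate Dataset artifacts.
from kfp import dsl
from kfp.dsl import Dataset, Output


@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def load_and_split(
    source_csv: str,                 # hypothetical input location
    train_data: Output[Dataset],
    eval_data: Output[Dataset],
    test_data: Output[Dataset],
):
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv(source_csv)
    train_df, rest_df = train_test_split(df, test_size=0.3, random_state=42)
    eval_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)

    # Each split becomes its own artifact, visible and traceable in the
    # Vertex AI Pipelines UI.
    train_df.to_csv(train_data.path, index=False)
    eval_df.to_csv(eval_data.path, index=False)
    test_df.to_csv(test_data.path, index=False)
```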

Generate a model evaluation

Communicating the performance characteristics of your models is vital for the confidence in these models when using them. A good ML system will make the evaluation available, together with the model. The evaluation should provide a comprehensive overview of the performance of the model. In order to avoid overhead and mistakes, especially since the purpose is to gain confidence, it’s crucial that this evaluation is generated and published as part of the pipeline, and not through a separate process.

An example of the UI for model evaluation, as part of Google Cloud’s Vertex AI Model Registry.
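The reference pipeline imports its evaluation into the Vertex AI Model Registry via a batch prediction job; as a simpler illustration of the same principle (the evaluation is produced by the pipeline itself, not by a separate process), a KFP component can log metrics artifacts that appear alongside the pipeline run. The metric values below are placeholders:

```python
# Sketch: a KFP evaluation component that logs metrics as pipeline artifacts.
from kfp import dsl
from kfp.dsl import ClassificationMetrics, Dataset, Input, Metrics, Model, Output


@dsl.component(packages_to_install=["pandas", "scikit-learn"])
def evaluate_model(
    model: Input[Model],
    test_data: Input[Dataset],
    metrics: Output[Metrics],
    class_metrics: Output[ClassificationMetrics],
):
    # ... load the model and test data, compute predictions ...
    metrics.log_metric("accuracy", 0.92)  # placeholder value
    class_metrics.log_confusion_matrix(
        categories=["negative", "positive"],
        matrix=[[85, 15], [10, 90]],      # placeholder counts
    )
```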

Generate a model card

Another approach is to generate a model card, which is a free-form document that can be adjusted to external documentation requirements, such as those from a governance team within your own organization, or legal requirements. One example that we have come across is an organization that had inspected the source code of a few specific versions of an ML library, as a security measure. This organization therefore requires the ability to verify that any model that is approved for use in production was trained with an approved version of the library.

Aside from the performance KPIs that are typically included in the evaluation as well, external requirements often demand “explainability”: Feature importance and performance across different data splits, such as different demographics, geographies, or, for instance, income ranges. This helps to spot any biases that your model may have, and it’s also a good place to look when trying to improve the model performance.

Example of a part of a model card, generated using the model card toolkit.
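As an illustration, the model card toolkit can be driven from a pipeline step roughly like this; it is a sketch that assumes a recent version of the model-card-toolkit package, and all field values are placeholders:

```python
# Sketch: generating an HTML model card with the Model Card Toolkit.
import model_card_toolkit as mct

toolkit = mct.ModelCardToolkit("model_card_assets")  # output directory
model_card = toolkit.scaffold_assets()

# Fill in the fields required by your governance or legal requirements.
model_card.model_details.name = "my-classifier"
model_card.model_details.overview = (
    "Trained by the reference pipeline; see quantitative analysis for metrics."
)

toolkit.update_model_card(model_card)
html = toolkit.export_format()  # the rendered HTML model card
```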

Upload model to a model registry

Rather than leaving a model in a place that is convenient for the training system, we should store it somewhere where it can be properly cataloged, discovered and used. Such a place is often called a model registry. Important functions of a model registry are the storage of metadata, which includes the model name and other descriptive information, version control, and performance evaluation. The performance evaluation could include a precision-recall curve, a confusion matrix, etc, in case of a classifier.

Another aspect is that the model registry should contain all the required information to use the model for inference. Vertex AI Model Registry solves this by associating each model (version) with a container image for inference.
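This is roughly what the upload looks like with the Vertex AI SDK; the project, bucket and serving image are placeholders (Vertex AI provides pre-built serving containers, or you can supply your own):

```python
# Sketch: registering a trained model in the Vertex AI Model Registry.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")

model = aiplatform.Model.upload(
    display_name="my-classifier",
    artifact_uri="gs://my-bucket/models/my-classifier/1/",
    # The serving container tells the registry how to serve this model;
    # the image URI below is illustrative.
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
print(model.resource_name)  # e.g. projects/.../locations/.../models/...
```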

Validate training data against a schema

Validation of the training data is frequently done manually, by producing statistics of the different features (min, max, mean, median, etc), as well as histogramming the distributions. But for production pipelines, it’s important to produce a machine-readable file that describes our expectations for the data that is used for training, retraining, and inference. This way, we can be alerted when new data deviates from our expectation, and we can think about this before we train a new model with this data, and potentially even put it into production.

TFX implements this functionality out of the box through a SchemaGen component that will write the file, based on the training data. Like any automatic schema detection, this does not always work flawlessly, so one should edit the file by hand to prepare it for use. The schema file contains the data types of the features, but also value dictionaries for categorical variables (by default, TFX will write the observed values from the training data), value ranges, and so on.
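Outside of TFX, the same functionality is available through Tensorflow Data Validation (TFDV), the library that TFX uses under the hood; here is a minimal sketch with placeholder paths:

```python
# Sketch: infer, curate and enforce a data schema with TFDV.
import tensorflow_data_validation as tfdv

# Infer a schema from the training data statistics, then curate it by hand.
train_stats = tfdv.generate_statistics_from_csv("gs://my-bucket/train.csv")
schema = tfdv.infer_schema(train_stats)
tfdv.write_schema_text(schema, "schema.pbtxt")

# Later (e.g. before an automatic retraining run), validate new data against
# the curated schema and inspect any anomalies.
new_stats = tfdv.generate_statistics_from_csv("gs://my-bucket/new_data.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```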

Avoid training-serving skew due to data transformations

Training-serving skew is not a term that is used consistently. The definition I use is “a difference between training and serving in the data or the treatment of the data”. That means it includes data drift, when the distribution of one or more features changes, compared to the training dataset. We should monitor for this effect, especially if we do not frequently retrain our model regardless of any data drift, but it’s not part of the training pipeline that we are concerned with here.

What does affect the training pipeline is that the training pipeline often defines the transformations that create the features. For example, normalization of a variable involves subtracting the mean of the distribution and dividing by its standard deviation. This mean and standard deviation are calculated from the training dataset, so how do we apply this transformation during inference in exactly the same way? One way is to store the numbers we need somewhere, but it needs to be done in such a way that we always use the right ones. We may, for example, have multiple versions of the same model in production at the same time, in a canary deployment or A/B test. If the models are trained on different training data, then the transformations to be applied will be different.

Normalization is a simple example, but this problem can produce increasing levels of annoyance. Transformations that we need to replicate from training to inference in this way include one-hot encoding, embeddings, bucketing and more.

The way TFX solves this problem is by storing the transformations in a binary format during training and incorporating them into the model itself, before writing the model to disk.
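Concretely, the transformations are defined in a preprocessing_fn that the TFX Transform component turns into a graph saved alongside the model; here is a hedged sketch with made-up feature names:

```python
# Sketch: a Tensorflow Transform preprocessing_fn as used by TFX Transform.
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    """Transformations defined here are exported with the model graph,
    so serving applies exactly the same operations as training."""
    outputs = {}
    # Normalization: mean and stddev are computed over the full training set
    # and baked into the exported transform graph.
    outputs["age_normalized"] = tft.scale_to_z_score(inputs["age"])
    # Categorical feature: the vocabulary observed at training time is reused
    # verbatim at serving time.
    outputs["country_id"] = tft.compute_and_apply_vocabulary(inputs["country"])
    outputs["label"] = inputs["label"]
    return outputs
```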

If you’re not using TFX, but you are using a Tensorflow model and your transformations are part of Tensorflow Transform, then you can do this yourself too. If you’re not using Tensorflow, then the standard solution is to use the “pipeline” concept from scikit-learn. If your ML framework plays nicely with scikit-learn, then you can build one pipeline out of your transformation operations and your model together. This is a mini pipeline that concatenates the feature transformations and the model inference and makes them look like a single operation. This is the case for xgboost and lightgbm, for example, but not for PyTorch.
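Here is a minimal sketch of that approach, bundling the feature transformations and an xgboost model into a single scikit-learn Pipeline; the column names are made up:

```python
# Sketch: feature transformations and model bundled into one sklearn Pipeline.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

model = Pipeline([
    ("preprocess", preprocess),   # fitted on the training data only
    ("classifier", XGBClassifier()),
])

# model.fit(X_train, y_train); model.predict(X_new) then applies exactly the
# same transformations at inference time as were learned during training.
```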

In the case of PyTorch, you could still encapsulate all your feature engineering operations into one scikit-learn pipeline, and then inference would require two steps: running the feature engineering pipeline and passing the output to PyTorch. But scikit-learn pipelines do not scale to large datasets. If we need to run feature engineering on large datasets, then we have to distribute the work, for example using Spark. Again, TFX has solved this problem for us already: TFX uses Tensorflow Transform, which implements the transformations as an Apache Beam pipeline. These pipelines can be run “locally”, but they can also be trivially executed on other runners, such as a Spark cluster, so the transformations defined using Tensorflow Transform can execute as a Spark job. Google Cloud also provides a managed service to run Beam jobs, called Dataflow.
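The runner swap itself is just a matter of pipeline options; here is a hedged sketch of a trivial Beam job, with placeholder project and bucket names, that runs locally with the DirectRunner or on Dataflow by changing the options:

```python
# Sketch: the same Beam pipeline runs locally or on Dataflow via options.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" to run locally
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Write" >> beam.io.WriteToText("gs://my-bucket/processed/part")
    )
```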

Conclusions

We’re sure we’ve missed many valuable best practices in this review, and even the ones we listed can be implemented more consistently in the reference pipelines that we provide. But we don’t want the perfect to be the enemy of the good: we hope this is a useful overview of our experience from the field. The source code links point to the head versions of the code repository, and we will continue to make improvements.

Feel free to continue with our last article in this MLOps series on CI/CD, where we will discuss the automation of the development process.


Paul Balm
Google Cloud - Community

I’m a Strategic Cloud Engineer with Google Cloud, based in Madrid, Spain. I focus on Data & Analytics and ML Ops.