MLOps end-to-end system on Google Cloud Platform (I): Empowering Forecasting Solutions

A big picture of our MLOps-driven forecasting system, addressing all key points of ML Operations

Roberto Hernández
gft-engineering
9 min read · May 8, 2024


Given the need to operationalize machine learning (ML) workflows, the Machine Learning Operations (MLOps) paradigm has recently emerged. MLOps aims to facilitate a smoother transition towards real ML systems that can easily be moved into production. This approach integrates several key principles that build upon best practices in software engineering, machine learning, and DevOps. At the same time, time series forecasting is one of the most common ML problems in industry, present in sectors such as manufacturing, retail, energy and finance.

In this first article of our “MLOps end-to-end system on Google Cloud Platform” series, we present our MLOps-based approaches adopted during the ML solution development.

MLOps at a glance (image from Patrik Sharma)

The second article, MLOps end-to-end system on Google Cloud Platform (II): Our solution in detail, can be read here.

MLOps fundamentals

Understanding MLOps as the set of principles and best practices that ensure efficient and reliable ML systems, our focus during development has been to follow all the key points of the paradigm.

Modularity: loosely coupled architecture

Modularity refers to developing systems based on independent components or modules, each one with its own specific function or task. This facilitates the scalability and reusability of solutions, as well as the flexible integration of these single components to build more complete systems.

We have had this idea in mind from the very beginning and, although it is not only an MLOps best practice but a general software development one, it has guided the design of the components (i.e., pipeline steps) of the training and inference pipelines. Each of these steps is thus designed to have its own concrete purpose, and it should ideally be deterministic (same inputs = same outputs). Leveraging this practice, we have reused identical steps in both the training and inference pipelines (for example, our custom add_lags pipeline step). Inputs and outputs enable communication and artifact transmission between steps.
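To make this concrete, here is a minimal sketch (not our actual code; the column names, lag values and base image are illustrative) of how a reusable step such as add_lags can be written as a self-contained KFP component with explicit inputs and outputs:

## add_lags component sketch (illustrative, not our exact implementation)
from typing import List

from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "pyarrow"])
def add_lags(
    input_data: Input[Dataset],
    output_data: Output[Dataset],
    target_column: str,
    lags: List[int],
):
    """Adds lag features deterministically: same inputs always yield same outputs."""
    import pandas as pd

    df = pd.read_parquet(input_data.path)
    for lag in lags:
        # Shift the target column backwards by `lag` steps to create a lag feature
        df[f"{target_column}_lag_{lag}"] = df[target_column].shift(lag)
    df.to_parquet(output_data.path, index=False)

Because the component only depends on its declared inputs and outputs, the same definition can be imported into both the training and the inference pipeline.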

Fragment of a pipeline execution graph. Each component performs a specific task in its step of the pipeline (image by authors).

Machine learning pipelines orchestration and automation

ML pipelines represent a succession of independent but connected tasks (represented as components), each with its particular purpose, that are concatenated to build a complete workflow. MLOps requires the automation and orchestration of these complex workflows, which must work in concert. These tasks are mostly handled by Vertex AI, Google Cloud’s fully managed ML platform.

Our core effort throughout the system development has been to build reusable, scalable and consistent ML pipelines using Vertex AI and Kubeflow Pipelines (KFP). As previously mentioned, although both pipelines are developed separately and use modular components (each with its own independent task), they must behave consistently as parts of a complete ML system. For example, the same feature engineering process must be followed (e.g., adding certain datetime or lag features), and the forecasting horizon must be configured identically in both settings (training and inference).

Thanks to pipeline orchestration and automation on the Vertex AI platform, Continuous Training (CT) is also guaranteed. This is especially valuable when working with time series, since features usually drift naturally over time (e.g., due to seasonality or periodic trends) and staying up to date with the latest records is sometimes strictly necessary. Besides, our system enables on-demand re-training, which can be triggered from our UI if the engineer detects a significant decay in model performance over time.
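As an illustration of how such scheduling can be set up with the Vertex AI SDK (the project, template path, cron expression and parameter values below are placeholders, not our actual configuration):

## Sketch: scheduling a compiled pipeline for Continuous Training (illustrative values)
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

pipeline_job = aiplatform.PipelineJob(
    display_name="forecast-training",
    template_path="training_pipeline.yaml",      # compiled KFP template (placeholder)
    pipeline_root="gs://my-bucket/pipeline-root", # placeholder bucket
    parameter_values={"forecast_horizon": 24},
)

# Recurrent training runs, e.g. every Monday at 03:00
pipeline_job.create_schedule(
    display_name="weekly-retraining",
    cron="0 3 * * 1",
    max_concurrent_run_count=1,
)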

Scheduled pipelines in our ML forecasting system (image by authors).

Continuous Integration and Delivery (CI/CD) pipelines

CI/CD pipelines are also at the center of the MLOps paradigm. They automate tasks that would otherwise require manual steps.

By leveraging Cloud Build, GCP’s serverless CI/CD platform, we have constructed several workflows that automate frequent procedures for specific parts of the system.

  • Trainer Image CI/CD workflow. Triggered by a pushed commit that involves changes in (a) the trainer source code, (b) the trainer dependencies, or (c) the trainer Dockerfile definition, this CI/CD pipeline (1) builds an updated version of the trainer Docker image, and (2) pushes it to our Artifact Registry.
### cloudbuild-trainer-image-workflow.yaml
steps:
- id: 'Clone Cloud Source Repository'
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['source', 'repos', 'clone',
         '${_CLOUD_SOURCE_REPOSITORY}',
         '--project=${PROJECT_ID}'
  ]

- id: 'Build updated Trainer Docker image'
  name: 'gcr.io/cloud-builders/docker'
  args: ['build',
         '-f', '${_DOCKERFILE_NAME}',
         '-t', '${_ARTIFACT_REGISTRY_REPO_LOCATION}-docker.pkg.dev/${PROJECT_ID}/${_ARTIFACT_REGISTRY_REPO_NAME}/${_TRAINER_IMAGE_NAME}:${_TRAINER_IMAGE_TAG}',
         '.'
  ]
  # workdir to execute this step from
  dir: '${_CLOUD_SOURCE_REPOSITORY}/training-pipeline'

substitutions:
  _CLOUD_SOURCE_REPOSITORY: 'poc-mlops-asset'
  _CLOUD_SOURCE_REPOSITORY_URI: 'workspace/${_CLOUD_SOURCE_REPOSITORY}'
  _ARTIFACT_REGISTRY_REPO_LOCATION: 'us-central1'
  _ARTIFACT_REGISTRY_REPO_NAME: 'poc-mlops-asset'
  _TRAINER_IMAGE_NAME: 'time-series-trainer'
  _TRAINER_IMAGE_TAG: 'latest'
  _DOCKERFILE_NAME: 'trainer.Dockerfile'

options:
  dynamicSubstitutions: true
  logging: CLOUD_LOGGING_ONLY

tags: ['training-pipeline']

# This automatically pushes the built image to Artifact Registry
images: ['${_ARTIFACT_REGISTRY_REPO_LOCATION}-docker.pkg.dev/${PROJECT_ID}/${_ARTIFACT_REGISTRY_REPO_NAME}/${_TRAINER_IMAGE_NAME}:${_TRAINER_IMAGE_TAG}']
Cloud Build history for Trainer Image CI/CD workflow (image by authors).
  • Dashboard UI continuous deployment workflow. Following a similar approach to the previous one, it is triggered by a pushed commit that involves changes in (a) the customized Grafana instance Dockerfile, (b) the predefined dashboard and datasource configuration, or (c) the provisioned dashboards. This pipeline (1) builds an updated version of the Dashboard UI Docker image, (2) pushes it to our Artifact Registry, and (3) seamlessly deploys a new Cloud Run revision based on this updated version.
  • (Training / Inference) YAML pipeline templates workflow. Triggered by a pushed commit that includes a new compiled version of a pipeline template in the pipeline templates directories, this CI/CD pipeline (1) extracts the pipeline version using regex from the filename, (2) tags the template file with the version, and (3) pushes it to our associated Kubeflow Pipelines Artifact Registry.
Kubeflow Pipelines repository in GCP Artifact Registry (image by authors).
  • Cloud Functions workflows. These Cloud Build workflows have been created for each Cloud Function whose HTTPS trigger runs on a schedule (e.g., the data collection function, the pipeline runs history retrieval function or the monitoring reports generation function). They are triggered by pushed commits that involve changes in (a) the source code of the Cloud Function, (b) the requirements.txt file, or (c) the runtime environment variables stored in a .env.yaml file in Cloud Source Repositories. The steps are (1) re-deploy the Cloud Function with the updates, and (2) re-deploy the Cloud Scheduler jobs that trigger the function on the specified cron schedule.

Experiment tracking

Storing pipeline runs lets us track and compare machine learning experiments across the ML system lifecycle and over time. This allows further exploration of training model performance under different combinations of hyperparameters.

## task.py snippet from Trainer source code
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, experiment=EXPERIMENT_NAME)
with aiplatform.start_run(PIPELINE_RUN_NAME):
    # Log training params
    aiplatform.log_params(xgb_params)
    # ...
    # Log training performance (TensorBoard)
    for i, rmse in enumerate(val_error):
        aiplatform.log_time_series_metrics({'rmse': rmse}, step=i)
    # ...
    # Log training metrics
    aiplatform.log_metrics({"mse_train": mse_train,
                            "mse_test": mse_test})

We use Vertex AI Experiments to track training pipeline runs. The Trainer script is responsible for logging the selected hyperparameters and the resulting training metrics to our specific experiment, so that all this data is stored for further analysis if required.

Vertex AI Experiments training runs performance comparison (image by authors).

Model Registry

A Model Registry is an important concept within MLOps. It enables easier model management and versioning, acts as a repository for ML models, and represents the connection between our training pipeline (which registers the model) and our inference pipeline (which retrieves it).

In our MLOps system, Vertex AI Model Registry is the centralized storage for our forecasting models. In addition, our training pipelines store model artifacts (e.g., model.joblib, model.pickle) in a specified bucket directory of Google Cloud Storage (GCS).
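As a rough sketch of this handover (the display name, URIs and serving container below are illustrative, not our actual values), the training side can register the model and the inference side can look it up by name:

## Model Registry handover sketch (illustrative names and URIs)
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Training pipeline side: register the trained model stored in GCS
model = aiplatform.Model.upload(
    display_name="time-series-forecaster",
    artifact_uri="gs://my-bucket/models/forecaster/",  # directory holding model.joblib
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)

# Inference pipeline side: retrieve the most recently updated version by display name
latest = aiplatform.Model.list(
    filter='display_name="time-series-forecaster"',
    order_by="update_time desc",
)[0]
print(latest.resource_name, latest.version_id)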

Testing

Testing ML systems requires a slightly different approach than the one followed in traditional software. Pipeline components should be designed so that they can be covered by unit tests. Nevertheless, some components that interact with external sources (e.g., BigQuery) fall outside this design rule.

To this end, we have used pytest as our testing framework. Since our Python-based KFP components are decorated Python functions, the key idea is to test not the components themselves but their inner functions. In addition, mocking input and output artifacts is sometimes needed.
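For illustration, a minimal test of this kind could look as follows (the helper add_lag_features and its signature are assumptions for the sketch, not our actual component code):

## test_add_lags.py sketch (the helper shown here is illustrative)
import pandas as pd

def add_lag_features(df: pd.DataFrame, target: str, lags):
    """Hypothetical inner function wrapped by an add_lags-style component."""
    out = df.copy()
    for lag in lags:
        out[f"{target}_lag_{lag}"] = out[target].shift(lag)
    return out

def test_add_lag_features_is_deterministic():
    df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0]})
    first = add_lag_features(df, target="y", lags=[1, 2])
    second = add_lag_features(df, target="y", lags=[1, 2])
    # Same inputs must yield the same outputs (component determinism)
    pd.testing.assert_frame_equal(first, second)
    assert first["y_lag_1"].iloc[1] == 1.0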

ML Metadata and Logging

Managing the lifecycle of the metadata consumed and produced by ML workflows is crucial. This also involves storing all ML pipeline logs in order to assist with debugging when issues appear.

GCP services assist us in this task: Vertex AI Pipelines stores all metadata and pipeline execution artifacts in Vertex ML Metadata and Google Cloud Storage (GCS), while logs are centralized in Logs Explorer.
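For example, artifacts tracked in Vertex ML Metadata can be queried programmatically when debugging a run (a minimal sketch with illustrative project values; the filter syntax may vary with the SDK version):

## Sketch: inspecting pipeline artifacts in Vertex ML Metadata (illustrative values)
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# List model artifacts registered by pipeline runs
for artifact in aiplatform.Artifact.list(filter='schema_title="system.Model"'):
    print(artifact.display_name, artifact.uri)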

Continuous monitoring

One of the most demanding tasks in ML system design is continuously monitoring data drift, concept drift and model performance. Especially in a forecasting scenario such as ours, this is extremely challenging, since actual values usually arrive only after model predictions have already misled us. Delayed ground truth makes the feedback loop slow, and it is hard to notice under-performing models. Monitoring data and target drift is therefore key for model maintenance, for understanding drift in features or target, and even for debugging model decay. Re-training may sometimes be needed, but it is crucial to understand the whole environment well to determine whether it will improve forecasts or whether it will be an unnecessary extra expense that could be avoided.

For this task, Evidently.AI, an open-source ML observability platform, has been our choice. It provides methods to track data and target drift, study incoming data distributions, check the expected data schema, run several validations (e.g., the ratio of features that have drifted, or features whose value range is unexpected), and compare current model performance with training performance to detect model decay or possible overfitting.

Our strategy here consists of generating daily monitoring reports and test suites right before the scheduled inference, so that all ground truth records are already available and we can compare the input data at inference time with the data the forecasting model was trained on. Thus, we examine (1) the data used at inference time, to check whether data drift is present, and (2) model performance on that day’s forecasts.
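A simplified sketch of this daily step is shown below (using the Evidently 0.4-style API; the dataset paths are placeholders, not our actual sources):

## Daily monitoring sketch with Evidently (0.4-style API; paths are placeholders)
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

reference = pd.read_parquet("training_reference.parquet")    # data the model was trained on
current = pd.read_parquet("inference_inputs_today.parquet")  # today's inference inputs

# (1) Data drift report: inference-time inputs vs. training data
drift_report = Report(metrics=[DataDriftPreset()])
drift_report.run(reference_data=reference, current_data=current)
drift_report.save_html("data_drift_report.html")

# (2) Test suite with pass/fail checks (e.g., share of drifting features)
drift_tests = TestSuite(tests=[DataDriftTestPreset()])
drift_tests.run(reference_data=reference, current_data=current)
drift_tests.save_html("data_drift_tests.html")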

Evidently dashboards: test suites and monitoring reports (data drift, model performance) (image by authors).

NOTE: Ideally, these daily test suite reports could trigger a re-training if considered necessary.

Versioning and reproducibility

This practice implies not only versioning source code, but also versioning compiled pipelines, trainer Docker images, forecasting models and even training datasets. This enables a broad overview of what is happening in all steps of the ML lifecycle, and also helps during debugging when unexpected events occur.

Pipelines repository in GCP Artifact Registry.

In our MLOps system, pipeline templates (both training and inference) and trainer Docker images are versioned and stored using a Kubeflow Pipelines repository and a Docker repository in Artifact Registry. As stated before, Vertex AI Model Registry is leveraged for ML model versioning. Finally, Vertex AI Datasets is in charge of versioning training datasets (thanks to a GCP pre-built component called ‘TimeSeriesDatasetCreateOp’ that is included in our training pipeline once data processing finishes).
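As a sketch of how such a dataset versioning step can plug into a training pipeline (the pipeline and parameter names are illustrative; check the parameters of the google-cloud-pipeline-components version you use):

## Sketch: dataset versioning step inside a training pipeline (illustrative names)
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import TimeSeriesDatasetCreateOp

@dsl.pipeline(name="training-pipeline-with-dataset-versioning")
def training_pipeline(project: str, location: str, bq_processed_table: str):
    # Register the processed training data as a managed Vertex AI dataset,
    # tying every training run to a versioned snapshot of its data
    dataset_op = TimeSeriesDatasetCreateOp(
        project=project,
        location=location,
        display_name="forecasting-training-data",
        bq_source=bq_processed_table,
    )
    # ... downstream training steps consume dataset_op.outputs["dataset"]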

Use-cases of our MLOps-driven system

To conclude this first article, we have compiled some possible use cases in which an end-to-end forecasting solution such as the one we propose could be a great fit:

  • Production optimization: enable proactive adjustments to optimize factory efficiency.
  • Energy management and cost reduction: optimization of HVAC systems, achieving significant savings in energy costs.
  • Efficient resource management and utilization.
  • Quality issue prevention: anticipating environmental changes helps prevent quality problems, avoiding defects in the final products.
  • Predictive maintenance and planning: forecasting conditions affecting machinery enables proactive maintenance scheduling, reducing unplanned downtime.

For a detailed explanation of how we developed our ML solution, refer to the second article in our series: MLOps end-to-end system on Google Cloud Platform (II): Our solution in detail.

Any questions or suggestions? Just reach out to us!

Authors

Roberto Hernández Ruiz

Ferran Aran Domingo

@GFT Technologies, Artificial Intelligence Offering
