Evolution of TVB Machine Learning Infrastructure

From On-demand Kubernetes (k8s) ML Pipeline to Serverless MLOps Pipeline

tech@TVB
5 min read · Mar 3, 2021

--

by Benny Tang and Solomon Leung

myTV SUPER and Big Big Shop are two of TVB's flagship online products. Over the past decade, Machine Learning (ML) has matured rapidly and is now ready for deployment in commercial environments. TVB has adopted many popular ML models in production to improve app engagement and provide personalized recommendations to users. These ML models have helped the applications become some of the most popular services in Hong Kong.

As the scale and complexity of ML models and ML pipelines continue to grow, a robust and reliable ML infrastructure has become critical for both prototyping and productionising ML models. By adopting several open-source technologies and cloud managed services, TVB's ML infrastructure has become a "Serverless MLOps" one, which has significantly improved both the development efficiency and the manageability of TVB's ML pipelines.

This article first discusses the shortcomings of the previously implemented ML infrastructure and the respective improvements that were made. In upcoming posts, more tips will be shared on using these open-source technologies and managed cloud services.

Before: On-demand Kubernetes (k8s) ML Pipeline

Before adopting the current serverless architecture, on-demand compute was spun up in Kubernetes clusters, together with BigQuery, to run ML model training, and the training jobs were orchestrated by Apache Airflow.

Here is the typical development and deployment cycle:

  1. Develop the data preprocessing and training pipeline on a Jupyter server and export the notebook code to .py files
  2. Push the code to the git server (for better version control)
  3. Build and push the training Docker image to the container registry for the training job (because every task has its own package dependencies)
  4. Run the training job on a schedule with Apache Airflow and the KubernetesPodOperator (to avoid "idle" resources); a sketch of this pattern follows the figures below
Conceptual ML development and deployment workflow
Screen capture of the Airflow pipeline
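For reference, a training task in the old setup looked roughly like the minimal sketch below, assuming Airflow 2.x with the cncf.kubernetes provider installed; the DAG id, namespace, image name and schedule are hypothetical.

```python
# Old pattern: every training step runs as a container inside the k8s
# cluster via KubernetesPodOperator. All names and the image are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="recsys_training_k8s",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="ml-jobs",
        image="gcr.io/my-project/recsys-train:latest",  # task-specific image
        cmds=["python", "-m", "trainer.task"],
        get_logs=True,
        is_delete_operator_pod=True,  # no idle resources left behind
    )
```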

What are the problems?

  1. A Jupyter server has many problems, and Jupyter notebooks on a shared server cause even more (e.g. strange package dependency errors in Anaconda, or one heavy task triggered by someone slowing down the whole server)
  2. Focus on ML model development is distracted by manual deployment tasks and infrastructure management (k8s cluster, container images …)
  3. The KubernetesPodOperator constrains the compute resources available to ML tasks (no GPU, no distributed training …)
  4. ML tasks are not modularised, and troubleshooting can sometimes be inconvenient (occasionally we need to dig deep into the containers …)

After: Serverless MLOps Pipeline

There are only 2 changes in the overall architecture.

  1. Task-specific container images have been broken down into multiple granular Airflow operators
  2. K8s training compute has been replaced by the GCP AI Platform service (a sketch of the reworked DAG follows this list)
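A minimal sketch of the reworked DAG is shown below, assuming the Airflow Google provider package; the project, region, bucket, table and module names are hypothetical.

```python
# New pattern: ETL runs as a BigQuery job and training runs on AI Platform,
# so no task-specific container image is needed. Resource names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.mlengine import (
    MLEngineStartTrainingJobOperator,
)

with DAG(
    dag_id="recsys_training_serverless",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    build_features = BigQueryInsertJobOperator(
        task_id="build_features",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE ds.features AS "
                    "SELECT user_id, item_id, event_ts FROM ds.raw_events"
                ),
                "useLegacySql": False,
            }
        },
    )

    train_model = MLEngineStartTrainingJobOperator(
        task_id="train_model",
        project_id="my-project",
        region="asia-east2",
        job_id="recsys_train_{{ ds_nodash }}",
        package_uris=["gs://my-bucket/packages/trainer-0.1.tar.gz"],
        training_python_module="trainer.task",
        training_args=["--epochs", "10"],
        runtime_version="2.3",
        python_version="3.7",
        scale_tier="BASIC_GPU",  # GPU by changing one parameter
    )

    build_features >> train_model
```

Each operator is an atomic, restartable task, which is what makes troubleshooting and incremental changes much easier later on.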

The development and deployment practices have also changed:

  1. ETL deliverables become SQL and Airflow BigQuery operators (if needed, you can add Airflow operators for Dataflow)
  2. ML models are developed as Python packages and test-run both locally and on Google services (BigQuery and GCP AI Platform) with the gcloud command-line tool, so the same runtime environment can easily be set up; a minimal packaging sketch follows the figures below
  3. Individual tasks no longer need to be developed on a shared Jupyter notebook (with the git server, there is already enough collaboration)
  4. Once the Airflow pipeline is in place, deployment of the Python package can be scheduled as the first step of the pipeline itself
Conceptual ML development and deployment workflow
Airflow DAG with multiple operators
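The Python package mentioned in point 2 can be very small; a sketch of its setup.py is shown below (the package name, version and dependencies are hypothetical, pinned to match the chosen AI Platform runtime).

```python
# setup.py for a hypothetical "trainer" package. The same source distribution
# is used for local test runs and for AI Platform training jobs.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=[
        # keep versions aligned with the chosen AI Platform runtime version
        "tensorflow==2.3.*",
        "cloudml-hypertune",
    ],
    description="Training package for the recommendation model",
)
```

The same package can then be exercised with gcloud ai-platform local train for a quick local run and submitted with gcloud ai-platform jobs submit training for a cloud run, so both environments stay identical.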

What are the advantages?

  1. Code in development and in production runs on the same environment (GCP AI Platform). No more "it works on my Jupyter server, but it is not working in production, why?!"
  2. It is an MLOps pipeline. Less manual effort is needed for packaging and deploying the ML training container image.
  3. There is no longer any infrastructure (k8s) to manage for the ML pipeline
  4. ETL and data validation tasks can easily be added to or modified in the pipeline as atomic Apache Airflow tasks (e.g. adding a new Dataflow task before the model training task)
  5. A wider range of on-demand hardware and compute is available for ETL and ML training jobs (GPU compute, TPU compute, parameter servers, master-worker topologies, etc.): just type in the parameters and everything is done, as in the sketch below!
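As a rough illustration of point 5, the hardware topology of an AI Platform training job is just a set of request parameters; the sketch below follows the TrainingInput fields of the AI Platform Training API, with hypothetical machine types and counts.

```python
# Hypothetical TrainingInput payload for an AI Platform training job:
# switching to GPUs or a parameter-server topology only means changing
# these fields, with no cluster to manage.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-highmem-8",
    "masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_T4"}},
    "workerType": "n1-standard-8",
    "workerCount": 2,
    "parameterServerType": "n1-standard-4",
    "parameterServerCount": 1,
    "packageUris": ["gs://my-bucket/packages/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "asia-east2",
    "runtimeVersion": "2.3",
    "pythonVersion": "3.7",
}
```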

What are the extra benefits of the new infrastructure?

  • Monitoring and troubleshooting of tasks become easier with the performance dashboards and operation logging
  • Configuring ML model hyper-parameters becomes easier with config files
  • Hyper-parameter tuning becomes easier with the cloud managed service and the Python package "hypertune" (see the sketch after this list)
  • There is a metric, "Consumed ML units", which makes calculating model training cost easy!
  • There are more handy tools in AI Platform (e.g. ML model version control, evaluation tools, data labeling services, etc.)
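For the hyper-parameter tuning point, the training code only needs to report its metric back to the tuning service through the cloudml-hypertune package; a minimal sketch, with a hypothetical metric tag, is shown below.

```python
# Report the metric that AI Platform's hyper-parameter tuning service
# optimises; called once per evaluation inside the training loop.
import hypertune


def report_metric(validation_loss: float, epoch: int) -> None:
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="val_loss",  # must match the tuning config
        metric_value=validation_loss,
        global_step=epoch,
    )
```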

Summary

A robust and manageable ML infrastructure is critical for an ML team to develop more successful ML products, because it alleviates developers' daily operational burden. Let's start adopting MLOps practices and enjoy these benefits.
