MLOps: Big Picture in GCP

Park Chansung
Google Developer Experts
14 min read · May 11, 2021

Twitter has become one of the most important platforms for getting recent news, since famous theorists, practitioners, and others you are interested in continuously post what they have found, built, and more.

I believe I am not the only one who retweets whenever I encounter an interesting tweet. I tell myself, “OK, this looks interesting, but I am not going to read it right now. Let me put it in my feed so I can read it later.” But that later reading never happens.

So I decided to build a small application that pushes a weekly newsletter containing all of my retweeted tweets. However, if I just send them without any pre-processing, it is no different from scrolling through the Twitter application. This is where machine learning can play a great role by classifying tweets in a meaningful way.

Overview

The name of this project is “personalized/curated newsletter service”, but I am not going to give an in-depth explanation of it in this blog post. Instead, I am going to provide a big picture of what MLOps (Machine Learning Operations) is and how you can realize it within the GCP (Google Cloud Platform) environment.

The following components will be covered in this blog post with short descriptions. In-depth explanations will follow in dedicated blog posts in the future.

  • GCP Kubernetes Engine: manages the underlying hardware resources and allocates appropriate resources to the right jobs.
  • GCP AI Platform (Pipeline): manages machine learning job dependencies by guaranteeing that certain jobs run before others in the Kubernetes environment.
  • TensorFlow Extended (TFX): a high-level framework providing a handy way of writing job specifications for AI Platform Pipeline and Kubernetes.
  • GCP AI Platform (Training): a distributed training service at scale with highly flexible configurability.
  • GCP AI Platform (Prediction): a scalable model serving service with a unified endpoint and model version control.
  • GCP Cloud Storage: a storage service where you can save files in the formats commonly used for machine learning (CSV, Parquet, TFRecord, …).
  • GCP Cloud Function: logic triggered by events. You can listen for changes to data in Cloud Storage and trigger a run of the machine learning pipeline on the newly collected data.
  • GCP Cloud Build / GitHub Actions: a service that tests and deploys a newly integrated/merged code base. The machine learning pipeline can be re-run on the new code for different data pre-processing or modeling.

These components can be combined to build the entire architecture for my toy project, shown in Fig1. Don’t worry if you don’t understand what each component does yet. I will give short descriptions of each, and you can come back and check whether it makes sense after you finish reading this blog post.

Fig1. Overall Architecture

A “curated newsletter service” sounds easy if you think only about the application part, because it is simply querying public APIs. However, imagine you are the one building those APIs, and that they are built on top of machine learning.

Why Do We Need Machine Learning Operations?

This is better understood through another question: “Why do we need a different CI/CD pipeline for a machine learning project?”

I believe the simplest answer is “because the data changes over time, whether you want it to or not”. We are not the main agent here; the environment is, and the environment changes the data, and we cannot stop those changes from happening. So it makes sense that we always have to be prepared for such situations. Otherwise, we could be criticized for a lack of responsibility.

There are two common scenarios that we, as developers, have to handle when data changes (assume that I have collected brand-new data and labeled it appropriately).

  • The currently deployed model should re-learn from the changed data. In this case, the system listens for the data-change event and runs the whole machine learning pipeline.
  • The currently deployed model cannot achieve the desired accuracy, so we need a new model with appropriate data pre-processing. A new model means the code base has changed. In this case, the system should unit-test the new code base, listen for the code integration, and run the whole machine learning pipeline.

Why GCP and Why Cloud in General?

The whole machine learning pipeline consists of multiple components. Fig2 was introduced in the popular paper “Hidden Technical Debt in Machine Learning Systems” to highlight that modeling is just a small part of the whole pipeline.

I disagree with the size of the box for “ML Code”. Modeling is hard work in my opinion, and I don’t think any one part is more important than the others. I just read Fig2 as saying that there are multiple parts we have to consider when planning to roll out machine-learning-powered products.

The thing is, there are several components that we have to hook together. If you have ever tried to integrate different software components into a complete system, you know how hard that can be. Building the infrastructure in particular is hard, because data scientists and engineers normally don’t have deep knowledge of hardware, software, operating systems, and so on.

Fig2. Components Consisting of Machine Learning as a Product

At the same time, infrastructure is very important since everything should run at scale. For instance, more high-performance CPUs and high-capacity RAM are needed when you have more data. As state-of-the-art (SOTA) models in the machine learning world get bigger and bigger, the model training system should be able to expand accordingly. You cannot predict when the number of users of your service will explode, so the serving system should be able to expand accordingly as well.

It is not easy to do all of these things at once, even if you are an open source genius. You have to spend quite a large amount of time building such a system, and it has to be constantly monitored. In my opinion, the best option instead is to leverage public cloud services and gradually move from the cloud to an on-premise environment as necessary (when you think you are ready).

Just like other cloud platforms, you can consider Google Cloud Platform (GCP) an all-in-one solution for your machine learning product. You still need to hook together the different services GCP provides, but it is much easier because all of them live on the same platform. Also, they are managed products rather than open source projects, which means much of their functionality is proven to work in most cases.

I chose GCP not because I am a Machine Learning Google Developers Expert (ML GDE) but because GCP is the best environment for working with TensorFlow, TensorFlow Extended, and other frameworks released by Google. In my opinion, PyTorch is a great option for modeling, but its ecosystem is not yet mature enough to support MLOps.

Role of Each GCP Service in MLOps

In this section, I am going to briefly overview each GCP service comprising the whole MLOps system in a bottom-up approach (from infrastructure to higher-level services). In-depth explanations and hands-on practice will be covered in separate blog posts in the future.

Google Kubernetes Engine

Google Kubernetes Engine (GKE) is a fully managed Kubernetes hosted on GCP. Here, “fully managed” means that you only care about what it does, not how it works under the hood. What GKE does is simple. It manages the underlying hardware resources by grouping them into what are called “node pools”. Then GKE picks the nodes suitable for the requested jobs. For instance, a request could be “I need a GPU” or “I need a node with more than 10G of RAM”.

Fig3. Google Kubernetes Engine

In a native GKE or Kubernetes environment, you would have to write long specifications consisting of metadata, labels, Docker images, policies, etc. all by yourself. Don’t worry though. You don’t need to handle very specific requests and resource allocation rules yourself when GKE is combined with TFX and the AI Platform Pipeline service, which we will discuss shortly. You can think of them as high-level wrappers that control GKE in a way that is meaningful for a machine-learning-specific pipeline.
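To give a feel for what such a raw specification looks like, below is a minimal, hypothetical sketch using the Kubernetes Python client to ask GKE for a node with a GPU and 10G of RAM. The image name and namespace are placeholders; when you use TFX with AI Platform Pipeline, an equivalent (much longer) specification is generated for you.

```python
# A hypothetical sketch: submitting a one-off job to GKE with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl already points at your GKE cluster

container = client.V1Container(
    name="trainer",
    image="gcr.io/my-project/trainer:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"memory": "10Gi"},          # "I need a node with more than 10G RAM"
        limits={"nvidia.com/gpu": "1"},       # "I need a GPU"
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="one-off-training-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```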

AI Platform Pipeline

As you know, the jobs making up the whole machine learning pipeline depend on other jobs. That means certain jobs must finish before proceeding to the next ones. For example, data should be ingested before data preprocessing is performed, data preprocessing should be done before model training, and so on.

Generally speaking, GKE doesn’t care about ordering dependencies. It just rents out available resources to execute whatever jobs it is given. Also, jobs running on GKE are stateless, which means the application itself doesn’t store any permanent information, so it is the programmer’s duty to manage attached storage for saving permanent information. And attached storage alone is not enough for a machine learning pipeline, since information should be shared across different jobs.

Fig4. AI Platform (Pipeline)

Kubeflow is the savior for this complicated dependency management. As the “flow” in its name suggests, it manages the flow of the given jobs. You write code specifying the dependencies between jobs, then Kubeflow wraps each job in a container and passes them to GKE. Information sharing is handled internally.
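To make the idea of dependency management concrete, here is a minimal sketch in the Kubeflow Pipelines (kfp v1) DSL, assuming two hypothetical container images for preprocessing and training. The point is just the `train.after(preprocess)` line; Kubeflow takes care of the ordering when it hands the jobs to GKE.

```python
# A minimal sketch of job dependencies in the Kubeflow Pipelines (kfp v1) DSL.
import kfp
from kfp import dsl


@dsl.pipeline(name="toy-newsletter-pipeline",
              description="preprocess first, then train")
def toy_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:latest",  # placeholder image
    )
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",       # placeholder image
    )
    train.after(preprocess)  # Kubeflow enforces this ordering when jobs are handed to GKE


# Compile into the package format that AI Platform Pipeline (hosted Kubeflow) accepts.
kfp.compiler.Compiler().compile(toy_pipeline, "toy_pipeline.tar.gz")
```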

However, Kubeflow itself is very hard to manage. A whole team, or at least one person, needs a dedicated role for installing, managing, and monitoring it. Fortunately, GCP also provides a fully managed version of Kubeflow, and that is AI Platform Pipeline. With AI Platform Pipeline, you no longer need to worry about these hassles.

AI Platform Notebook

AI Platform Notebook is just JupyterLab hosted on GCP. The only thing I want to mention about AI Platform Notebook is that it can be hooked up with AI Platform Pipeline pretty easily. That means you can easily experiment with writing code for AI Platform Pipeline, because a notebook is a nice interactive environment.

Fig5. AI Platform (Notebook)

However, you almost certainly don’t want to write code for AI Platform Pipeline by yourself, because that requires a deep understanding of AI Platform Pipeline’s internal runtime environment. Of course you may want to go deeper and deeper, and you will probably have to master Kubeflow eventually. But you don’t want to spend too much time on it if what you really care about is building up the whole pipeline first and inspecting the details later (say, because you are desperate to launch a product right now).

TensorFlow Extended (TFX) is the rescuer. TFX is a kind of wrapper framework for defining job specifications that can easily be handed to AI Platform Pipeline. It lets you focus only on your business logic while leaving the internal details aside.

TensorFlow eXtended(TFX)

As the name says, TFX is an extension of TensorFlow. TensorFlow is a modeling framework for machine learning and deep learning, and by itself it has nothing to do with handling a production environment. TFX comes with lots of add-ons on top of TensorFlow that make models written in TensorFlow production ready.

Fig6. TensorFlow Extended

Fig6 shows the list of standard TFX components. To name a few, let me list some of them below with short descriptions; a minimal pipeline sketch follows the list.

  • ExampleGen: responsible for data ingestion. Data can come from a local filesystem (in a variety of formats like Parquet, CSV, TFRecord, …), BigQuery, Google Cloud Storage, and so on.
  • Transform: builds a data transform graph which is attached to the model graph later. This lets you avoid the training/serving skew problem. You can simply use the TensorFlow library along with the TensorFlow Transform (TFT) library.
  • Trainer: is all about defining and training the model. In early versions of TFX, only tf.estimator-based models were allowed, but now you can write tf.keras-based models as well, which should be very familiar to you. Also, you can easily combine this component with the AI Platform Training service for distributed model training.
  • Pusher: pushes the blessed model to a designated location. “Blessed” here means the model judged best by evaluating the given metrics and comparing with the latest model. Also, you can easily combine this component with the AI Platform Prediction service, and if you do, you get a model version management system and an endpoint for serving the model with a small amount of effort.
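Putting the components above together, a minimal TFX pipeline could look roughly like the sketch below (written against the TFX 1.x style API). The paths, the `modules.py` file holding the preprocessing and model code, and the pipeline name are hypothetical placeholders, and the Evaluator component that actually produces the “blessing” is omitted for brevity.

```python
# A rough sketch of a minimal TFX pipeline built from the components described above.
# Paths and the preprocessing/model code in "modules.py" are hypothetical placeholders,
# and the Evaluator that produces the model "blessing" is omitted for brevity.
from tfx import v1 as tfx


def create_pipeline(pipeline_root: str, data_root: str, serving_dir: str):
    # Ingest raw examples (CSV here) from a filesystem location.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Compute statistics and infer a schema used by Transform and Trainer.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])

    # Build the transform graph that will be attached to the model graph.
    transform = tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        module_file="modules.py")  # defines preprocessing_fn

    # Train a tf.keras model defined in the same module file.
    trainer = tfx.components.Trainer(
        module_file="modules.py",  # defines run_fn
        examples=transform.outputs["transformed_examples"],
        transform_graph=transform.outputs["transform_graph"],
        schema=schema_gen.outputs["schema"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))

    # Push the trained model to the designated serving location.
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_dir)))

    return tfx.dsl.Pipeline(
        pipeline_name="newsletter-pipeline",
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen,
                    transform, trainer, pusher])
```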

Initial Launch for the Pipeline

Actually, you don’t need to run the TFX pipeline in a notebook environment. However, it is recommended in the initial phase of the project, because you have to experiment interactively with TFX, the cloud services, and the cloud environment quite a lot.
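For that kind of interactive experimentation, TFX ships an InteractiveContext that runs components one by one inside the notebook. A minimal sketch, assuming a hypothetical local data directory:

```python
# A minimal sketch of running individual TFX components interactively in a notebook.
from tfx import v1 as tfx
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()  # stores pipeline metadata in a temporary local directory

example_gen = tfx.components.CsvExampleGen(input_base="data/")  # hypothetical local path
context.run(example_gen)  # execute just this component and inspect its outputs in place
```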

Fig7. GKE Job Allocation

Fig7 shows what happens when you run the TFX pipeline. Job specifications are tossed to AI Platform Pipeline, which wraps each job in a container and passes them to GKE, and GKE then executes each job on currently available resources. As you can see, each job is guaranteed to run in the right order.
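Once the interactive experiments look good, the same pipeline can be compiled for the hosted Kubeflow and submitted. A rough sketch, assuming the `create_pipeline` function from the earlier TFX section and a placeholder endpoint for your AI Platform Pipeline instance:

```python
# A rough sketch: compile the pipeline for AI Platform Pipeline (hosted Kubeflow) and submit it.
# Assumes the create_pipeline function sketched earlier; bucket and endpoint are placeholders.
import kfp
from tfx.orchestration.kubeflow import kubeflow_dag_runner

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig()
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
    create_pipeline(pipeline_root="gs://my-bucket/pipeline-root",
                    data_root="gs://my-bucket/data",
                    serving_dir="gs://my-bucket/serving"))

# The runner writes <pipeline_name>.tar.gz, which is submitted to the hosted Kubeflow endpoint.
client = kfp.Client(host="https://<your-pipeline-endpoint>")  # placeholder endpoint
client.create_run_from_pipeline_package("newsletter-pipeline.tar.gz", arguments={})
```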

AI Platform Training/Prediction

GCP comes with a lot more services than just AI Platform Notebook/Pipeline and GKE, and some of them are dedicated to data pre-processing, model training, model tuning, and model prediction. You want to leverage those services because of their scalability and flexible configurability. Also, note that GKE knows nothing about distributed training, in which more than one GPU collaborates.

Just imagine the following situations.

  • Your data grows bigger: you continuously collect more data, and accordingly you need more computing resources for data preprocessing. For this particular job, high-end CPUs and high-capacity RAM are normally required.
  • Your model and data grow bigger: as you collect more data, you can expect to get even more out of the current model. Or, since SOTA models are announced so often these days, you may want to replace the existing model with a more promising one. In either case, you certainly need more GPUs or even TPUs in an environment where distributed training is guaranteed.
  • The number of end users grows bigger: one server serving the model endpoint could be enough if you have around 100 users. Then the number of end users grows to something like 100,000. Now one server is not enough, and you have to configure more servers for your model to accommodate the growing number of users.

Fig8. Dedicated GCP Services for Data Preprocessing, Model Training/Tuning, and Prediction

The listed situations can also go in the reverse direction, so the key is scalability, and scalability comes from powerful infrastructure. However, infrastructure is very hard to configure and manage, since you need a solid understanding of hardware, system programs, operating systems, software infrastructure like Kubernetes, machine learning in general, and more. There would have to be a whole team doing this. But it becomes super easy if you leverage the services provided by GCP.

Each listed situation can be mapped to a dedicated GCP service, and you just need to make small changes to the TFX configuration (see the sketch after this list).

  • Your data grows bigger: Dataflow
  • Your model and data grow bigger: AI Platform Training/Tuning
  • The number of end users grows bigger: AI Platform Prediction
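As a sketch of how small those configuration changes are, the snippet below points the same pipeline at Dataflow and swaps in the AI Platform flavored Trainer and Pusher. It assumes the TFX 1.x GCP extension API, the `transform` component from the earlier pipeline sketch, and placeholder project, region, and bucket values.

```python
# A sketch of the configuration changes, assuming the TFX 1.x GCP extension API and the
# transform component from the earlier pipeline sketch; project/region/bucket are placeholders.
from tfx import v1 as tfx

# Dataflow: run the Beam-based components (ExampleGen, Transform, ...) on Dataflow workers
# by passing these args to tfx.dsl.Pipeline(..., beam_pipeline_args=beam_pipeline_args).
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
]

# AI Platform Training: swap in the extension Trainer so training runs as a managed, scalable job.
trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file="modules.py",
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
    custom_config={
        tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY: {
            "project": "my-gcp-project",
            "region": "us-central1",
        }
    })

# AI Platform Prediction: the extension Pusher deploys the blessed model to a versioned endpoint.
pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
    model=trainer.outputs["model"],
    custom_config={
        tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY: {
            "model_name": "newsletter_model",
            "project_id": "my-gcp-project",
        }
    })
```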

Data Sources

The data for machine learning can be stored in multiple ways, but the most usual cases are a filesystem (in a number of different formats such as CSV, Parquet, TFRecord, …) or a DBMS (Database Management System). In GCP, you can leverage the Cloud Storage service for the former case and BigQuery for the latter.

Fig9. Data Storage Options

Of course, both of them can be hooked up with TFX easily. You just need to import the appropriate component (ExampleGen for a filesystem, BigQueryExampleGen for BigQuery). Also note that with the Cloud Storage service, you get data version control for free.
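For illustration, here is a minimal sketch of the two ingestion options, assuming the TFX 1.x API, a hypothetical Cloud Storage prefix, and a hypothetical BigQuery table:

```python
# A minimal sketch of the two ingestion options (TFX 1.x API); paths and table are placeholders.
from tfx import v1 as tfx

# Files (CSV here) stored on Cloud Storage:
csv_example_gen = tfx.components.CsvExampleGen(
    input_base="gs://my-bucket/tweets/")  # new data versions land under this prefix

# Rows pulled directly from BigQuery:
bq_example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query="SELECT text, label FROM `my-project.newsletter.tweets`")
```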

Cloud Function & Cloud Build

At this point, the entire machine learning pipeline is established. However, there are two missing parts. We don’t want to run the entire pipeline by opening up a notebook every time. Rather, we need automatic ways to run the pipeline.

Then, in what situations do we want to run the pipeline automatically?

  • When we have gathered more data
  • When we code up better data pre-processing strategies or newly announced (published) SOTA models.

In the former case, there should be something capable of listening for changes to the data version and triggering actions based on them. In the latter case, you should be able to run the pipeline by calling an endpoint. This is the same procedure as what is done from a notebook, but this time we need an automatic way to do it.

Fig10. Cloud Function Listening to the Changes of Data Source and Launching the Pipeline

The Cloud Function in Fig10 is a serverless service. That means it wakes up on a certain event and runs the specified code. In the GCP environment, one of the events that can trigger a Cloud Function is a change in Cloud Storage or BigQuery. So, by simply listening for that event, we can determine when the whole pipeline should be run.
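As a sketch, such a Cloud Function could look like the following, assuming the compiled pipeline package from earlier is shipped with the function, and that the endpoint, data prefix, and run name are placeholders:

```python
# A sketch of a background Cloud Function triggered by a Cloud Storage "object finalize" event.
# The endpoint, data prefix, and pipeline package name are placeholders.
import kfp


def trigger_pipeline(event, context):
    """Kicks off the hosted Kubeflow pipeline when new data lands in the bucket."""
    uploaded_file = event["name"]                  # path of the newly written object
    if not uploaded_file.startswith("tweets/"):    # hypothetical data prefix to watch
        return

    client = kfp.Client(host="https://<your-pipeline-endpoint>")  # placeholder endpoint
    client.create_run_from_pipeline_package(
        pipeline_file="newsletter-pipeline.tar.gz",  # compiled TFX pipeline shipped with the function
        arguments={},
        run_name=f"retrain-on-{uploaded_file}")
```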

Fig11. Cloud Build Building Source Code and Launch the Pipeline

Cloud Build in Fig11 is a service that builds source code when there are changes in the code repository. You can run a number of unit tests, and when everything is good to go, the new machine learning pipeline code can be deployed and run. This is useful when we modify the existing code to support different data preprocessing strategies, different models, or even different deployment options like TFJS or TensorFlow Lite.

Summary

We have seen which GCP services could comprise the entire end-to-end machine learning pipeline system.

Hardware and software infrastructure is managed with GKE and AI Platform Pipeline, and the job specifications passed to AI Platform Pipeline can be written with the high-level TFX framework. Scalable data pre-processing, distributed model training, and scalable model serving can be handled by dedicated GCP services, namely Dataflow, AI Platform Training, and AI Platform Prediction. There are a number of storage options, but my recommendation is to use Cloud Storage or BigQuery. When those storage options are combined with Cloud Function, changes in data can be detected and the pipeline can be triggered to re-train the model on newly collected data. Finally, Cloud Build lets you write/modify code and trigger the pipeline based on the newly integrated code base.

However, you can map each GCP service to an equivalent service or open source project if you still think you will have to build everything by yourself.

Who am I?

My name is Chansung Park, and I am one of the Machine Learning Google Developers Experts (ML GDE). Although I am interested in everything in machine learning, my focus this year (2021) is MLOps, especially in the GCP environment.
