MLOps — Is it a Buzzword??? Part Two

Kailash Thiyagarajan
Walmart Global Tech Blog
5 min read · Jul 30, 2021
Image: CloudZone

This article is part two of a three-part series which aims to implement MLOps using managed Kubeflow pipelines in GCP. In part one of the series, we learned about MLOps, various orchestration options and the basics of Docker/Kubernetes.

This article will focus more on building Kubeflow pipelines, including more coding examples.

It will cover a few principles of MLOps, including:

  • Portability/Scalability
  • Artifacts Tracking
  • Repeatability/Reproducibility

Kubeflow Component

Image by Author

A pipeline component is a self-contained set of code that performs a specific function. It has a name, parameters, return values and a body. Containers provide portability, repeatability and encapsulation. In the rest of this article, I limit the scope to Python-based components.

The component code contains the logic needed to perform a specific step in the ML workflow (it may be either data extraction/processing using Spark or model training using sklearn/TensorFlow). It is nothing but a pure Python function annotated with the kfp.v2.dsl.component decorator. The component should follow the rules below:

  • It should not use any code declared outside of the function definition.
  • Import statements must be added inside the function.
  • Helper functions must be defined inside this function.

Let’s start simple by looking at an example of a Python component which adds two numbers and returns the result.
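Below is a minimal sketch of such a component, assuming the kfp.v2 SDK; the function name and the add_component.yaml file name are illustrative choices that the later snippets reuse.

from kfp.v2.dsl import component

@component(base_image="python:3.9", output_component_file="add_component.yaml")
def add(a: float, b: float) -> float:
    # Everything the component needs must live inside the function body;
    # this one needs no imports or helpers, so the body is a single expression.
    return a + b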

As observed, all of your function’s arguments and its return type must have data type annotations. This enables input/output artifact logging, which is tied to one of the principles of MLOps.

The @component decorator converts your Python function into a Kubeflow component and allows us to define three optional parameters:

  • base_image: (Optional) Specify the Docker container image to run this function in. It supports custom images as well. Default value is the Python 3.9 image.
  • output_component_file: (Optional) Writes your component definition to a file. You can use this file to share the component with colleagues or reuse it in different pipelines.
  • packages_to_install: (Optional) A list of versioned Python packages to install before running your function. It’s recommended to create a custom image with all required libraries pre-installed, though.

Here is the sample component file that is produced when output_component_file is specified with a file name. It can be shared and reused across pipelines. It is essentially a YAML specification that captures the container image and the instructions needed to run the component.
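An abbreviated sketch of what that file could look like is shown below; the real file also embeds the serialized function and the full install/run commands, elided here as [...].

name: Add
inputs:
- {name: a, type: Float}
- {name: b, type: Float}
outputs:
- {name: Output, type: Float}
implementation:
  container:
    image: python:3.9
    command: [...]   # installs kfp and runs the serialized add() function
    args: [...]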

The Python component can be recreated from the YAML file using the method below.

import kfp.components as comp
# Recreate a reusable component (a task factory) from the shared YAML definition
add_op = comp.load_component_from_file("add_component.yaml")

In addition, migrating from Airflow operators to Kubeflow components is fairly simple using this method:

comp.create_component_from_airflow_op()

Let’s look at a slightly more complex component example.
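Below is a sketch of what this component could look like, again assuming the kfp.v2 SDK; the component name, helper function and metric names are illustrative.

from typing import NamedTuple

from kfp.v2.dsl import component, Output, Metrics

@component(base_image="python:3.9", packages_to_install=["numpy"])
def divide(
    a: float, b: float, metrics: Output[Metrics]
) -> NamedTuple("DivideOutput", [("quotient", int), ("remainder", int)]):
    # Import statements must be added inside the function
    import numpy as np
    from collections import namedtuple

    # Helper functions must be defined inside the component as well
    def compute(x: float, y: float):
        return np.divmod(x, y)

    quotient, remainder = compute(a, b)

    # Log values to the Metrics output artifact so they show up in the pipeline UI
    metrics.log_metric("quotient", float(quotient))
    metrics.log_metric("remainder", float(remainder))

    # Explicit multi-output declared via NamedTuple
    divide_output = namedtuple("DivideOutput", ["quotient", "remainder"])
    return divide_output(int(quotient), int(remainder))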

This component takes two inputs and uses the NumPy package to calculate the quotient and remainder using a helper function defined within the component code. Two outputs are returned from this component: one defined explicitly using NamedTuple, and another using a parameter of the artifact type (Output[Metrics]).

If your function output is a primitive type, the return type can be annotated as float, str, None, etc. If your component returns multiple outputs, you can annotate your function with the typing.NamedTuple type hint and use the collections.namedtuple function to return your function’s outputs as a new subclass of tuple.

Average component:

This is the final component in our pipeline, where the output is an artifact of type Dataset. Artifacts represent large or complex data structures (a dataset or a model).
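A minimal sketch of this component is shown below; the averaging logic, column name and use of pandas are illustrative.

from kfp.v2.dsl import component, Output, Dataset

@component(base_image="python:3.9", packages_to_install=["pandas"])
def average(quotient: int, remainder: int, output_dataset: Output[Dataset]):
    # Imports live inside the component function
    import pandas as pd

    avg = (quotient + remainder) / 2

    # Artifacts are passed by reference: write the result to the artifact's file path
    pd.DataFrame({"average": [avg]}).to_csv(output_dataset.path, index=False)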

Pipelines

Image by Author

Now, the components are ready and it’s time to build the pipeline.

Pipelines are created by connecting the input/output interfaces of the components. Alternatively, you can order tasks explicitly with the after method on the pipeline tasks.

Component inputs and outputs are classified as either parameters or artifacts (depending on their data type) and the data sharing takes place through them.

  • Parameters are passed into your component by value and can be any of the following types: int, float (double), str, or a collection such as a list or dict. They are well suited for smaller data (numeric values, strings, dictionaries or small collections).
  • Artifacts represent large or complex data structures like datasets or models and are passed into components as a reference to a file path.
  • If you have large amounts of string data to pass to your component, such as a JSON file, annotate that input or output as a type of Artifact, such as Dataset, to let Kubeflow Pipelines know to pass this to your component as a file.

(If you are coming from an Airflow background, this can be related to the XCom push and pull operations.)

Demo pipeline (Dummy example)

The kfp.dsl.pipeline annotation helps you define the pipeline. It takes three parameters: name, description and pipeline_root. The first two are self-explanatory. The pipeline root refers to a storage location in GCS or S3 which your project has access to, for example gs://my-project/your-folder/. This is the location where all of your pipeline task outputs will be stored. Here is the dummy flow of this pipeline, followed by a code sketch:

  • Add two numbers and return the result.
  • Connect the add component with the divide component by passing the output of the add component to the latter.
  • Use the “after” method of the divide task to connect with the next component.
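Assuming the add, divide and average components sketched above, the pipeline definition could look roughly like this; the pipeline name, default inputs and bucket path are illustrative.

from kfp.v2 import dsl

@dsl.pipeline(
    name="demo-pipeline",
    description="Add two numbers, divide the sum, then average the results",
    pipeline_root="gs://my-project/your-folder/",  # GCS location your project can access
)
def demo_pipeline(a: float = 5.0, b: float = 7.0):
    # Step 1: add two numbers
    add_task = add(a=a, b=b)

    # Step 2: connect add -> divide by passing the output of add into divide
    divide_task = divide(a=add_task.output, b=2.0)

    # Step 3: connect divide -> average; ordering is also stated explicitly with after()
    average_task = average(
        quotient=divide_task.outputs["quotient"],
        remainder=divide_task.outputs["remainder"],
    )
    average_task.after(divide_task)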

Compile and Run

Once you have the pipeline defined, it can be compiled to generate a pipeline spec file which can be uploaded directly to your pipeline UI, or you can use kfp.Client or the AIPlatformClient (GCP) to trigger your pipeline.
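A compile-and-submit sketch for Vertex AI could look like the following; the project ID, region and output file name are placeholders.

from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient

# Compile the pipeline function into a job spec file
compiler.Compiler().compile(
    pipeline_func=demo_pipeline,
    package_path="demo_pipeline.json",
)

# Submit the compiled spec to Vertex AI Pipelines
api_client = AIPlatformClient(project_id="my-project", region="us-central1")
api_client.create_run_from_job_spec("demo_pipeline.json")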

Image by Author

Here is the demo pipeline run using GCP Vertex AI. Here we used simple Python functions with a Python base image, but based on your needs you can code the individual components in Java/Scala or any other language with appropriate images. The highlighted items in this pipeline are not components but rather the output artifacts produced by the corresponding components.

Image by Author

I will wrap up here. In the next article, we will discuss an end-to-end ML workflow implementation.

I hope you found this article useful, and if you did, consider giving at least 50 claps. 👏 :)

Kailash Thiyagarajan
Senior ML @ Apple. Architecting machine learning solutions at scale.