Building end-to-end MLOps pipelines with Spark ML, MLFlow, k8s/Helm and CI/CD tools

Zhi Li
Published in DataPebbles
Nov 21, 2022

Data scientists or machine learning engineers can implement and train an ML model with good predictive performance on an offline holdout dataset, given relevant training data for their use case. However, the real challenge isn’t building an ML model; it is building an integrated ML system and continuously operating it in production.

In recent years, machine learning techniques have been applied in more and more business scenarios. Many popular machine learning toolkits like Scikit-Learn, XGBoost, LightGBM and Pandas have already reached a high level of maturity in helping data scientists and ML engineers build performant models in local development environments. As discussed in an article from Google (MLOps: Continuous delivery and automation pipelines in machine learning), building a machine learning model is no longer the hard part. Instead, when building ML systems in production, the main upfront challenges are:

  • building distributed and scalable machine learning systems to handle the large amounts of data generated by the business
  • maintaining structured and dedicated development workflows to manage the lifecycle of machine learning models
  • designing highly automated processes for model monitoring, model version management, continuous training, integration and deployment.

Therefore, in this article I will discuss how to build an MLOps pipeline that overcomes the aforementioned difficulties, with a proper selection of tech stacks.

Please contact us at zhi.li@datapebbles.com if you are interested in the source code and would like to reuse it for your own projects.

Defining the use case and collecting the dataset

The San Francisco Airbnb listings dataset from Kaggle is used. The purpose of this Kaggle challenge is to develop regression models that predict the rental prices of listings, based on the given dataset.

The problem itself sounds very basic and simple. But as discussed above, building ML models is not the difficult part. Our focus is on building an integrated MLOps pipeline, which can be adapted to more complicated use cases with minimal adjustments.

Pipeline architecture

Figure 1. End-to-end MLOps pipeline with Spark ML, MLflow, k8s/Helm and CI/CD

Figure 1 shows the architecture of the MLOps pipeline we built. Let’s take a look at each part of the tech stack in more detail:

  1. Spark ML. Since most of the top machine learning toolkits are not designed for parallel architectures, Spark stands out among the few competing big data frameworks for parallel computing, providing a combination of in-memory processing, fault tolerance, scalability, speed and ease of programming. In addition, H2O Sparkling Water is used to implement advanced ML models and AutoML. It allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.
  2. Kubernetes. Kubernetes makes running Spark applications easy with automated deployment on an as-needed basis, in contrast to an always-online, resource-chomping Spark setup. K8s also makes moving Spark applications across different service providers a seamless process.
  3. Helm is used to manage Kubernetes applications.
  4. MLflow. MLflow is an open-source machine learning management framework created by Databricks. It helps data scientists manage the machine learning lifecycle, from experiment tracking and model versioning to model serving.
  5. CI/CD tools. CircleCI is a continuous integration (CI) tool that enables developers to quickly release code and automate the build, test, and deployment processes. ArgoCD is a Kubernetes-native continuous deployment (CD) tool which deploys code changes directly to Kubernetes clusters by pulling them from Git repositories.
  6. MinIO is an open-source object storage tool. It is API-compatible with the Amazon S3 cloud storage service. MinIO is used to store the raw data for model training as well as the model artifacts that are tracked and logged by the MLflow tracker (see the sketch below for one way to upload data to it).
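Because MinIO is S3 API-compatible, the raw dataset can be uploaded with any S3 client. Below is a minimal sketch using boto3; the endpoint, credentials and bucket/object names are illustrative placeholders, not the project’s actual configuration.

import os
import boto3
from botocore.exceptions import ClientError

# Point boto3 at the MinIO endpoint instead of AWS (values are hypothetical)
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("MLFLOW_S3_ENDPOINT_URL"),  # e.g. "http://minio:9000"
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

try:
    s3.create_bucket(Bucket="data")
except ClientError:
    pass  # the bucket already exists

# Upload the raw Kaggle CSV so Spark can later read it via s3a://
s3.upload_file("listings.csv", "data", "raw/listings.csv")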

Code examples

Building classic ML models with pySpark and tracking them with MLflow

In any machine learning project, building an ML model consists of four steps:

  • Data preprocessing
  • Model training and hyperparameter tuning
  • Model evaluation
  • Model validation

Let’s see how we can follow these steps to build ML models in pySpark.

First, import all the needed packages:

from pyspark.sql import SparkSession
import socket
import os

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, GeneralizedLinearRegression
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

from hyperopt import fmin, tpe, Trials, hp
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

Second, as with any other Spark application, create a Spark session. Since we need to load data from S3 storage here, some configurations are set accordingly.

def get_sparkSession(appName='MLOps'):
    spark_master = os.environ.get('SPARK_MASTER')  # e.g. "spark://spark-master:7077"
    # Setting the driver host is important in k8s mode, otherwise executors cannot find the driver
    driver_host = socket.gethostbyname(socket.gethostname())

    spark = SparkSession \
        .builder \
        .master(spark_master) \
        .appName(appName) \
        .config("spark.driver.host", driver_host) \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1') \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    ACCESS_KEY = os.environ.get('AWS_ACCESS_KEY_ID')
    SECRET_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
    MLFLOW_S3_ENDPOINT_URL = os.environ.get('MLFLOW_S3_ENDPOINT_URL')

    # Configure the s3a filesystem to talk to the MinIO endpoint
    hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoopConf.set('fs.s3a.access.key', ACCESS_KEY)
    hadoopConf.set('fs.s3a.secret.key', SECRET_KEY)
    hadoopConf.set("fs.s3a.endpoint", MLFLOW_S3_ENDPOINT_URL)
    hadoopConf.set('fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    hadoopConf.set("fs.s3a.connection.ssl.enabled", "true")
    hadoopConf.set("fs.s3a.path.style.access", 'true')

    return spark

Then we can read the data into the Spark session and do some data cleaning. In this example, simple outlier removal, feature selection and missing-data imputation are used.

def clean_impute_dataframe(spark, file_uri, keep_cols, impute_cols, impute_strategy="median"):

    raw_df = spark.read.csv(file_uri, header="true", inferSchema="true", multiLine="true", escape='"')
    base_df = raw_df.select(*keep_cols)

    from pyspark.sql.functions import col, translate, when
    from pyspark.sql.types import IntegerType

    # Cast datatypes into doubles & simply remove outliers with prices beyond normal ranges
    doubles_df = base_df.withColumn("price", translate(col("price"), "$,", "").cast("double")) \
        .filter(col("price") > 0).filter(col("minimum_nights") <= 365)

    integer_columns = [x.name for x in doubles_df.schema.fields if x.dataType == IntegerType()]

    for c in integer_columns:
        doubles_df = doubles_df.withColumn(c, col(c).cast("double"))

    for c in impute_cols:
        doubles_df = doubles_df.withColumn(c + "_na", when(col(c).isNull(), 1.0).otherwise(0.0))

    from pyspark.ml.feature import Imputer
    imputer = Imputer(strategy=impute_strategy, inputCols=impute_cols, outputCols=impute_cols)
    imputer_model = imputer.fit(doubles_df)
    imputed_df = imputer_model.transform(doubles_df)

    return imputed_df
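As a quick usage sketch, the two helpers can be wired together as follows. The bucket path and column lists are illustrative picks from the listings dataset, not necessarily the exact ones used in the project.

spark = get_sparkSession()

# Hypothetical example values; adjust the path and columns to your setup
file_uri = "s3a://data/raw/listings.csv"
keep_cols = ["neighbourhood", "room_type", "bedrooms", "minimum_nights", "number_of_reviews", "price"]
impute_cols = ["bedrooms"]

imputed_df = clean_impute_dataframe(spark, file_uri, keep_cols, impute_cols, impute_strategy="median")
imputed_df.printSchema()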

After that, we can start building machine learning models. pySpark supports classic ML models like Linear Regression, Logistic Regression, Decision Trees and Random Forests. The complete list of supported algorithms can be found here.

There are many ways to decide which algorithms to use and how to tune them, and model optimization is a topic that deserves a discussion of its own. So in this article I only give three simple examples of building basic ML models:

  • a basic linear regression model with no tuning
  • a random forest regression model with grid search cross-validation
  • a random forest regression model with advanced hyperopt tuning algorithms

MLflow is used to track model performance metrics and artifacts.

def run_LinearRegression(imputed_df, labelCol="price"):

    train_df, test_df = imputed_df.randomSplit([.8, .2], seed=42)

    with mlflow.start_run(run_name="LinearRegression") as run:

        # Define pipeline: index and one-hot encode categorical columns, then assemble features
        categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
        index_output_cols = [x + "Index" for x in categorical_cols]
        ohe_output_cols = [x + "OHE" for x in categorical_cols]
        string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
        ohe_encoder = OneHotEncoder(inputCols=index_output_cols, outputCols=ohe_output_cols)
        numeric_cols = [field for (field, dataType) in train_df.dtypes if ((dataType == "double") & (field != labelCol))]
        assembler_inputs = ohe_output_cols + numeric_cols
        vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

        lr = LinearRegression(labelCol=labelCol, featuresCol="features")

        stages = [string_indexer, ohe_encoder, vec_assembler, lr]

        pipeline = Pipeline(stages=stages)
        pipeline_model = pipeline.fit(train_df)

        # Log parameters
        # mlflow.log_param("label", labelCol)
        # mlflow.log_param("features", "multiple")

        # Evaluate predictions
        pred_df = pipeline_model.transform(test_df)
        regression_evaluator = RegressionEvaluator(labelCol=labelCol, predictionCol="prediction")
        rmse = regression_evaluator.setMetricName("rmse").evaluate(pred_df)
        r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)

        # Log both metrics
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        # Log model
        mlflow.spark.log_model(pipeline_model, "model", input_example=train_df.limit(5).toPandas())

def run_RandomForestCV(imputed_df, maxBins=40, labelCol="price"):

    train_df, test_df = imputed_df.randomSplit([.8, .2], seed=42)

    with mlflow.start_run(run_name="RF-GridSearchCV") as run:

        categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
        index_output_cols = [x + "Index" for x in categorical_cols]

        string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")

        numeric_cols = [field for (field, dataType) in train_df.dtypes if ((dataType == "double") & (field != labelCol))]
        assembler_inputs = index_output_cols + numeric_cols
        vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

        rf = RandomForestRegressor(labelCol=labelCol, maxBins=maxBins)

        param_grid = (ParamGridBuilder()
                      .addGrid(rf.maxDepth, [2, 5])
                      .addGrid(rf.numTrees, [5, 10])
                      .build())

        evaluator = RegressionEvaluator(labelCol=labelCol, predictionCol="prediction")

        # Pipeline in CV: takes much longer if there are estimators in the pipeline,
        # since they have to be refitted in every validation fold
        # stages = [string_indexer, vec_assembler, rf]
        # pipeline = Pipeline(stages=stages)
        # cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=param_grid,
        #                     numFolds=3, parallelism=4, seed=42)
        # cv_model = cv.fit(train_df)

        # CV in pipeline: potential risk of data leakage, because the indexer and assembler
        # are fitted on the full training set before the folds are split
        cv = CrossValidator(estimator=rf, evaluator=evaluator, estimatorParamMaps=param_grid,
                            numFolds=10, parallelism=4, seed=42)
        stages_with_cv = [string_indexer, vec_assembler, cv]
        pipeline = Pipeline(stages=stages_with_cv)
        pipeline_model = pipeline.fit(train_df)

        # Log parameter
        # mlflow.log_param("label", "price")
        # mlflow.log_param("features", "all_features")

        # Create predictions and metrics
        best_model = pipeline_model.stages[-1].bestModel
        best_pipeline_model = Pipeline(stages=[string_indexer, vec_assembler, best_model]).fit(train_df)
        pred_df = best_pipeline_model.transform(test_df)
        rmse = evaluator.setMetricName("rmse").evaluate(pred_df)
        r2 = evaluator.setMetricName("r2").evaluate(pred_df)

        # Log both metrics
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        mlflow.spark.log_model(best_pipeline_model, "model", input_example=train_df.limit(5).toPandas())

        # Log feature importance
        features_df = pd.DataFrame(list(zip(vec_assembler.getInputCols(), best_model.featureImportances)), columns=["feature", "importance"])
        features_df = features_df.sort_values(by='importance', ascending=False).head(10)
        fig, ax = plt.subplots()
        features_df.plot(kind='barh', x='feature', y='importance', ax=ax)
        mlflow.log_figure(fig, "feature_importance.png")

def run_RandomForest_Hyperopt(imputed_df, maxBins=40, labelCol="price"):

    train_df, val_df, test_df = imputed_df.randomSplit([.6, .2, .2], seed=42)

    categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
    index_output_cols = [x + "Index" for x in categorical_cols]
    string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
    numeric_cols = [field for (field, dataType) in train_df.dtypes if ((dataType == "double") & (field != labelCol))]
    assembler_inputs = index_output_cols + numeric_cols
    vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

    rf = RandomForestRegressor(labelCol=labelCol, maxBins=maxBins)
    pipeline = Pipeline(stages=[string_indexer, vec_assembler, rf])
    evaluator = RegressionEvaluator(labelCol=labelCol, predictionCol="prediction")

    def objective_function(params):
        # Set the hyperparameters that we want to tune (hyperopt passes floats, so cast to int)
        max_depth = int(params["max_depth"])
        num_trees = int(params["num_trees"])
        with mlflow.start_run():
            estimator = pipeline.copy({rf.maxDepth: max_depth, rf.numTrees: num_trees})
            model = estimator.fit(train_df)
            preds = model.transform(val_df)
            rmse = evaluator.evaluate(preds)
            # mlflow.log_metric("rmse_val", rmse)
            return rmse

    search_space = {
        "max_depth": hp.quniform("max_depth", 2, 5, 1),
        "num_trees": hp.quniform("num_trees", 10, 100, 1)
    }

    num_evals = 4
    trials = Trials()
    best_hyperparam = fmin(fn=objective_function,
                           space=search_space,
                           algo=tpe.suggest,
                           max_evals=num_evals,
                           trials=trials,
                           rstate=np.random.default_rng(42))

    with mlflow.start_run(run_name="RF-Hyperopt") as run:
        # hp.quniform returns floats, so cast back to ints before setting the params
        best_max_depth = int(best_hyperparam["max_depth"])
        best_num_trees = int(best_hyperparam["num_trees"])
        estimator = pipeline.copy({rf.maxDepth: best_max_depth, rf.numTrees: best_num_trees})
        combined_df = train_df.union(val_df)  # Combine train & validation together

        pipeline_model = estimator.fit(combined_df)
        pred_df = pipeline_model.transform(test_df)

        rmse = evaluator.setMetricName("rmse").evaluate(pred_df)
        r2 = evaluator.setMetricName("r2").evaluate(pred_df)

        # Log params and metrics for the final model
        # mlflow.log_param("maxDepth", best_max_depth)
        # mlflow.log_param("numTrees", best_num_trees)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        mlflow.spark.log_model(pipeline_model, "model", input_example=combined_df.limit(5).toPandas())

        best_model = pipeline_model.stages[-1]
        features_df = pd.DataFrame(list(zip(vec_assembler.getInputCols(), best_model.featureImportances)), columns=["feature", "importance"])
        features_df = features_df.sort_values(by='importance', ascending=False).head(10)
        fig, ax = plt.subplots()
        features_df.plot(kind='barh', x='feature', y='importance', ax=ax)
        mlflow.log_figure(fig, "feature_importance.png")
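To tie the three experiments together, a minimal driver sketch could look like the following. The tracking URI environment variable is an assumption, and the experiment name matches the one shown later in the MLflow UI.

if __name__ == "__main__":
    # Assumed environment: MLFLOW_TRACKING_URI points at the tracking server, e.g. "http://mlflow-tracking:5000"
    mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
    mlflow.set_experiment("MLOps_Experiment")

    spark = get_sparkSession()
    # Hypothetical path and columns, as in the earlier usage sketch
    file_uri = "s3a://data/raw/listings.csv"
    keep_cols = ["neighbourhood", "room_type", "bedrooms", "minimum_nights", "number_of_reviews", "price"]
    impute_cols = ["bedrooms"]
    imputed_df = clean_impute_dataframe(spark, file_uri, keep_cols, impute_cols)

    run_LinearRegression(imputed_df)
    run_RandomForestCV(imputed_df)
    run_RandomForest_Hyperopt(imputed_df)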

H2O Sparkling Water ML models and AutoML

Spark ML is powerful for building ML applications, but it only supports classic ML algorithms and lacks AutoML capabilities. Databricks provides its own AutoML solution on Spark, but it is only available on the commercial Databricks platform. With H2O Sparkling Water, which is open source, we can extend the capabilities of Spark with more advanced and efficient algorithms such as XGBoost, Stacked Ensembles and Deep Learning. It also has an easy-to-use AutoML API to accelerate our model development process.

Here we show an example of using AutoML with H2O and tracking it with MLflow.

from pysparkling.ml import H2OAutoML
from pysparkling import *


def run_H2OAutoML(file_uri, keep_cols, impute_cols):
    spark = get_sparkSession(appName='H2OautoML')
    imputed_df = clean_impute_dataframe(spark, file_uri, keep_cols, impute_cols, impute_strategy="median")
    train_df, test_df = imputed_df.randomSplit([.8, .2], seed=42)

    # Start an H2O context on top of the Spark session
    hc = H2OContext.getOrCreate()

    with mlflow.start_run(run_name="H2O-autoML") as run:

        automl = H2OAutoML(labelCol="price", convertUnknownCategoricalLevelsToNa=True)
        automl.setExcludeAlgos(["GLM", "DeepLearning"])
        automl.setMaxModels(10)
        automl.setSortMetric("rmse")

        model = automl.fit(train_df)

        from pyspark.ml.evaluation import RegressionEvaluator

        pred_df = model.transform(test_df)
        regression_evaluator = RegressionEvaluator(labelCol='price', predictionCol="prediction")
        rmse = regression_evaluator.evaluate(pred_df)
        r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)

        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.spark.log_model(model, 'model')
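Calling it mirrors the earlier sketch; the path and column lists are again hypothetical:

run_H2OAutoML(
    file_uri="s3a://data/raw/listings.csv",
    keep_cols=["neighbourhood", "room_type", "bedrooms", "minimum_nights", "number_of_reviews", "price"],
    impute_cols=["bedrooms"],
)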

You can find more details about H2O AutoML here.

MLflow Tracking and Serving

After running the code above, MLflow will have tracked the defined metrics and logged the models for later use. In the MLflow UI you will see:

There are four models trained and tracked under the experiment name “MLOps_Experiment”, with RMSEs of 215.1, 214.9, 214.3 and 206.1 respectively. We can see that H2O AutoML performs best among the four models.

Click on the model details link to go to the details page with all the logged metrics and artifacts:

Then we can select one model for serving based on certain criteria. Here, as an example, we select the RF-Hyperopt model. Click on the “Register Model” button to register it and then go to the Models page.

Tag the first version of the “RF-Hyperopt” model with the Production stage, and we are then able to serve this model using the MLflow model serving functionality. Go to your terminal and run

mlflow models serve -m "models:/RF-Hyperopt/Production" --env-manager=local -h 0.0.0.0

A RESTful API service is now available at http://localhost:5000 and can be queried over HTTP.
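As a quick smoke test, the endpoint can be queried from Python. Note that the expected JSON schema depends on your MLflow version: MLflow 2.x expects the pandas split payload wrapped under a "dataframe_split" key, as sketched below, while MLflow 1.x accepts the bare split orientation. The feature values here are hypothetical.

import requests

# One hypothetical feature row matching the columns the model was trained on
payload = {
    "dataframe_split": {
        "columns": ["neighbourhood", "room_type", "bedrooms", "minimum_nights", "number_of_reviews"],
        "data": [["Mission", "Entire home/apt", 2.0, 3.0, 50.0]],
    }
}

resp = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # the predicted price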

We can also register and tag models using the MLflow Python API, which helps us increase the level of automation in model deployments. More details on model tracking and serving can be found in the MLflow documentation.
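For instance, the model logged by a run can be registered and promoted programmatically. A minimal sketch, assuming the run ID is looked up from the tracking server:

import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run_id_of_the_RF-Hyperopt_run>"  # placeholder: find it in the UI or via mlflow.search_runs()

# Register the model logged under the run's "model" artifact path
result = mlflow.register_model(f"runs:/{run_id}/model", "RF-Hyperopt")

# Promote the new version to the Production stage
client = MlflowClient()
client.transition_model_version_stage(
    name="RF-Hyperopt",
    version=result.version,
    stage="Production",
)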

Building Docker images

Containerizing applications and services is the first step to making your application scalable, k8s-ready and easy to deploy.

In this example multiple Docker images are used, four of which are built from Dockerfiles.

  1. spark-master:3.3
FROM ubuntu

RUN apt-get -y update && \
apt-get install --no-install-recommends -y openjdk-11-jre-headless ca-certificates-java unzip wget && \
apt-get -y autoclean && \
apt-get -y clean && \
rm -rf /var/lib/apt/lists/*

ENV MASTER "local[*]"
ENV SPARK_HOME='/spark-3.3.0-bin-hadoop3'

ENV BASE_URL=https://archive.apache.org/dist/spark
ENV SPARK_VERSION=3.3.0
ENV HADOOP_VERSION=3

# https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

RUN cd / \
&& wget ${BASE_URL}/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

EXPOSE 8080 7077 6066

CMD /spark-3.3.0-bin-hadoop3/sbin/start-master.sh && bash

2. spark-worker:3.3

FROM spark-master:3.3

ENV SPARK_HOME='/spark-3.3.0-bin-hadoop3'
ENV SPARK_MASTER=spark://spark-master:7077
EXPOSE 8081

CMD /spark-3.3.0-bin-hadoop3/sbin/start-worker.sh $SPARK_MASTER && bash

3. mlflow-tracking-server

FROM continuumio/miniconda3:latest

RUN pip install mlflow boto3 pymysql
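
# No CMD is set here; the actual server command is supplied at deploy time by the
# k8s manifest, for example (illustrative values, not the project's exact config):
# mlflow server --host 0.0.0.0 --port 5000 \
#     --backend-store-uri mysql+pymysql://user:password@mysql:3306/mlflow \
#     --default-artifact-root s3://mlflow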

4. pyspark-runtime

This is the main Docker image built for this project. All code and manifests are copied into this image, which is then run to train the ML models.

FROM python:3.9.7

ENV MASTER "local[*]"
ENV SPARK_HOME='/spark-3.3.0-bin-hadoop3'

ENV BASE_URL=https://archive.apache.org/dist/spark
ENV SPARK_VERSION=3.3.0
ENV HADOOP_VERSION=3

RUN cd / \
&& wget ${BASE_URL}/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

RUN apt-get -y update && \
apt-get install --no-install-recommends -y openjdk-11-jre-headless ca-certificates-java unzip wget && \
apt-get -y autoclean && \
apt-get -y clean && \
rm -rf /var/lib/apt/lists/*

RUN pip --no-cache-dir install --upgrade pyspark nltk mlflow h2o_pysparkling_3.3 hyperopt notebook boto3 matplotlib

ADD ./ /home

WORKDIR /home

After building all the necessary Docker images, the next step is to push them to a Docker registry. This can be a public registry like Docker Hub, or a private registry like AWS ECR.

Creating k8s manifests and Helm charts

When all images are ready to use, we can edit k8s manifests and Helm charts to deploy the services to a Kubernetes cluster. See the following screenshot of all the manifests used in this example.

And the structure of helm charts is:
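A chart following standard Helm conventions is laid out roughly as below; the template file names are illustrative, chosen to match the services deployed in this example, and may differ from the actual repository.

mlops-chart/
├── Chart.yaml          # chart metadata
├── values.yaml         # configurable values such as image tags and replica counts
└── templates/          # k8s manifests rendered by Helm
    ├── spark-master.yaml
    ├── spark-worker.yaml
    ├── mlflow-tracking.yaml
    ├── minio.yaml
    └── notebooks.yaml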

Go to your terminal and run

kubectl apply -f ./k8s
# or
helm install mlops mlops-chart

# port forward services
kubectl port-forward services/mlflow-tracking 5000:5000 -n default
kubectl port-forward services/spark-master 8080:8080 -n default
kubectl port-forward services/minio 9000:9000 9001:9001 -n default
kubectl port-forward services/notebooks 8888:8888 -n default

The application shown in figure 1 is then deployed to the k8s cluster. You can go to the corresponding addresses in your browser to check the status of each service.

Please contact me if you are interested in the source code.

CI/CD pipeline in MLOps

In machine learning practice, a CI/CD pipeline is often used to rapidly test, build, and deploy new implementations of ML pipelines. A well-designed CI/CD pipeline can help you increase the maturity of your ML processes.

Figure 2. CI/CD pipeline

In this example, CircleCI and ArgoCD are used, and we follow the GitOps concept of using Git repositories to manage infrastructure and application code deployments.

As shown in figure 2, the steps of a CI/CD pipeline are:

  • Develop application code, Dockerfiles and YAML manifests locally
  • Push commits to Git repository 1
  • CircleCI detects changes in repo 1 and starts the user-defined automated tests
  • Once all tests pass, CircleCI builds Docker images and pushes them to a Docker registry. At the same time, commits of all k8s manifests or Helm charts are pushed to repo 2
  • ArgoCD detects changes in repo 2 and starts automatically synchronizing the deployments in the managed k8s cluster

Setting up CircleCI

In your project repository, create a config.yml file under the .circleci directory, and then follow the official documentation to link your Git repository to CircleCI. CircleCI will then run your defined CI steps every time you push changes to your Git repository.

On the CircleCI application page, you will be able to see the status of the tests for each commit you push.

Here is a simple example of config.yml, which defines a Docker image build step in CircleCI. In production, there will typically be many more steps to define.

version: 2.1

jobs:
  docker-build:
    machine:
      image: ubuntu-2204:2022.04.2
    resource_class: large
    steps:
      - checkout
      - run:
          name: Build docker images
          command: |
            docker build -t spark-master:3.3 -f ./Docker/Spark/Dockerfile .
            docker build -t spark-worker:3.3 -f ./Docker/Spark-worker/Dockerfile .
            docker build -t mlflow-tracking -f ./Docker/mlflow/Dockerfile .
            docker build -t pyspark-runner -f ./Docker/pySpark-runner/Dockerfile .

workflows:
  Step1:
    jobs:
      - docker-build

Setting up ArgoCD

Follow the getting started guide from ArgoCD to install it into your k8s cluster:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Once the installation is done, port-forward the ArgoCD API server:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Then you are able to reach the ArgoCD UI at localhost:8080. Log in and press the “+ NEW APP” button in the upper left corner of the page.

Fill in the necessary information about the application, including the address of your Git repository. ArgoCD is then ready to manage the automated CD process of your ML application on the k8s cluster.

Future improvements

In this example, a static dataset was used to build the MLOps pipeline. But in real businesses, new data is constantly coming in and keeps changing (drifting). Continuous model training and model drift monitoring are key parts of MLOps.

We will therefore keep working on integrating Kafka and Spark Streaming into the MLOps pipeline. Accordingly, continuous training (CT) and monitoring processes will be added to it.

Additionally, we are also working hard to make our source code production-ready and open source. If you are interested, please subscribe to the DataPebbles publication to be notified.
