MLOps in Glassdoor: an à la carte approach

Glassdoor Engineering Blog
Jan 12, 2022

Authors: Zixin Wu, Srinivasan Ramaraju

Introduction

In recent years, Data Science and Machine Learning (ML) have become an essential toolkit for businesses to increase their competitiveness and provide greater value to their customers, largely thanks to abundantly available large data sets, more cost-effective computation resources, and advancements in ML algorithms. At Glassdoor, we use ML to build intelligent products that help our job seekers find the jobs and companies they love.

While ML can unlock new capabilities for businesses, it comes with its own complexity and challenges compared to traditional software engineering, mainly due to data dependency and a shift in team skillset.

Data dependency: a major reason for using ML approaches to build complex software is precisely that the desired behavior of the software is so complex that it is more efficient to drive it with data. This dependency introduces new components into the software development process:

  • Instead of simply compiling source code, the software (ML models) now needs to be trained or retrained on (usually a large amount of) data before it can be used to serve clients
  • Evaluation and testing of ML model behavior are data-centric and often involve offline experimentation
  • Version tracking of ML models needs to consider both code and data changes
  • The data can change over time for various reasons, such as changes in the way the data is collected or processed, or changes in customer behavior which, in turn, could be driven by the effect of using the ML model itself. This essentially creates a hidden feedback loop. All of these changes in data need to be monitored and acted upon accordingly.

Team skillset: in many companies, the majority of ML team members are data scientists or ML researchers who specialize in working with data and ML algorithms. However, integrating an ML solution into existing systems and operating it continuously in production requires many more components beyond model training and evaluation. Automating and standardizing those components makes better use of ML team members' expertise and improves productivity.

This is where the concept of MLOps comes into play. MLOps is an engineering culture, practice, and methodology that applies DevOps principles to ML systems. Under this concept, end-to-end automation and monitoring solutions are built to improve the productivity and reliability of building and operating ML systems at scale. MLOps solutions usually provide components such as experiment orchestration, metrics tracking, a model registry, deployment automation, and model performance monitoring to address the challenges mentioned above.

When implementing MLOps at Glassdoor, we took an à la carte approach instead of adopting a single end-to-end solution, for several reasons. MLOps is a relatively new concept, and no single solution is mature enough to satisfy all of our needs. An à la carte approach lets us pick the best option for each component without compromise, minimize the learning curve where possible, retain flexibility in migration timing, and keep a quicker feedback loop. In the next sections, we highlight some details of what we have done for these components.

Components

[Diagram: MLOps components, adapted from the one in this article]

With the flexibility of implementing MLOps à la carte, we looked into the components where automation and standardization could give us the biggest boost in productivity. The following sections reflect our implementation timeline, based on the priority of those components.

Real-time model serving

At Glassdoor, most back-end systems are written in Java, so deployment pipelines are usually Java-centric. ML programs, on the other hand, are mostly developed in Python, and ML developers generally lack experience in Java. This creates a skill gap when integrating ML solutions into existing systems. In some instances, we had to rewrite ML models from Python to Java for seamless integration, which is neither efficient nor sustainable.

To overcome this bottleneck, we developed an automation pipeline, based on MLflow, Jenkins, and Spinnaker, that deploys ML models as REST services on Kubernetes and thereby provides language-agnostic access (a sample client call is sketched after the list below).

  • MLflow: unlike back-end engineers, ML developers don't have the luxury of tools like Artifactory or sophisticated CI/CD pipelines to manage the ML development lifecycle. MLflow gives us that flexibility and fits our use case well. It offers UI components for tracking multiple experiment runs, packages ML models, manages package dependencies, and builds real-time serving services. Historically, our teams used disparate tools for model storage and versioning and wrote their own REST services; with MLflow, we've created a unified experience across teams.
  • Deployment: Glassdoor has been actively migrating its batch workloads and stateless services to Kubernetes. As we envisioned our new MLOps platform, we wanted to run the inference service as a container workload, which gives us greater flexibility in managing our infrastructure environments at scale. To conform with our security practices at Glassdoor, we didn't use MLflow's command-line utilities for building images, because they use community-published base images; instead, we wrote our own Dockerfile that bakes the MLflow libraries and the registered model artifacts into the Docker image of the inference service. The image builder, running on Jenkins, pulls the latest registered version of the model, builds a new container image, and pushes it to our private container registry on AWS. We use Spinnaker to deploy the machine learning apps to the cluster; it uses a Helm chart that creates the HorizontalPodAutoscaler, Ingress, Deployment, ServiceAccount, and Service resources for the app being deployed.
  • Performance, logging, and system monitoring: each of our inference services typically responds within low double-digit milliseconds under loads of 20k-60k requests/second. All Kubernetes Pod logs from the cluster are forwarded to and collected in our centralized logging solution. We also export all Pod metrics and app metrics with the help of the Prometheus exporter, and use Grafana for visualization and alerting.
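
As a concrete illustration, here is a minimal sketch of what a client call to one of these inference services could look like, assuming an MLflow scoring server that accepts pandas-split JSON on its /invocations endpoint; the host name, feature names, and values are hypothetical, and the exact payload envelope depends on the MLflow version.

import requests

# Hypothetical internal endpoint of an inference service running on Kubernetes
SCORING_URL = "http://job-relevance-model.ml.internal/invocations"

# MLflow scoring servers accept pandas "split"-oriented JSON; newer MLflow
# versions wrap this payload in a {"dataframe_split": ...} envelope instead
payload = {
    "columns": ["job_id", "title_match_score", "location_match_score"],
    "data": [[12345, 0.82, 0.67]],
}

response = requests.post(
    SCORING_URL,
    json=payload,
    headers={"Content-Type": "application/json; format=pandas-split"},
    timeout=0.5,  # generous, given low double-digit millisecond latencies
)
response.raise_for_status()
print(response.json())  # one prediction per input row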

Looking ahead, we believe in infrastructure as code and aim to unlock our CI/CD capabilities with GitLab. We've started migrating our build and deployment pipelines from Jenkins and Spinnaker to GitLab, where we foresee a complete GitOps model in which our engineers can do everything in one place.

Batch model serving and model training

Next, we looked into use cases where model predictions are made in batch mode. Platforms supporting these use cases process data in bulk, and throughput matters more than the response time of individual requests. We use a heterogeneous solution to accommodate different situations.

  • Jenkins: when a relatively simple and lightweight ML model is used, Jenkins can check out from a repo a Python script that loads the model from the model registry through MLflow's API, and then run or schedule it (see the sketch after this list). Our in-house Platform-as-a-Service solution can create custom Jenkins jobs and bootstrap them on the desired node types. The node type, Python version, and training scripts are checked in, in a config-like manner. This solution has a very gentle learning curve for data scientists running their ML models in production.
  • Airflow: if the workflow around the ML model(s) is more complex, we use Airflow to orchestrate its steps. Multiple models can be involved in a DAG, and dependencies, conditions, and parallelism can be specified. Airflow DAGs are copied to the Airflow scheduler server by Jenkins once they are merged into the master branch, and Airflow tasks run on Kubernetes as Pods.
  • Spark: on the other dimension, if we need to scale up the data size or processing throughput, we run models on Spark to utilize its distributed computation across a cluster of machines. Spark provides ML-oriented libraries such as MLlib and Spark NLP, which are handy for our ML projects.
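
For illustration, here is a minimal sketch of the kind of batch-scoring script a Jenkins job or an Airflow task might run, assuming a model registered in the MLflow model registry; the tracking URI, model name and stage, column names, and S3 paths are hypothetical.

import mlflow
import mlflow.pyfunc
import pandas as pd

# Hypothetical tracking server and registered model
mlflow.set_tracking_uri("https://mlflow.ml.internal")
model = mlflow.pyfunc.load_model("models:/job-seniority-classifier/Production")

# Score a day's worth of data in bulk and write the predictions back out
batch = pd.read_parquet("s3://example-bucket/features/dt=2022-01-12/")
batch["prediction"] = model.predict(batch.drop(columns=["job_id"]))
batch[["job_id", "prediction"]].to_parquet("s3://example-bucket/predictions/dt=2022-01-12/")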

ML model training can leverage the same paradigm explained above. Instead of loading models through MLflow's API, we submit the trained models to MLflow's model registry for versioning and management purposes, as sketched below.
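
As a rough illustration of that registration step, here is a hedged sketch that trains a toy scikit-learn model and registers it; the tracking URI, experiment name, and model name are hypothetical.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

mlflow.set_tracking_uri("https://mlflow.ml.internal")  # hypothetical tracking server
mlflow.set_experiment("job-seniority-classifier")      # hypothetical experiment name

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

with mlflow.start_run():
    model = GradientBoostingClassifier().fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # registered_model_name creates (or bumps) a version in the model registry,
    # which the serving and batch pipelines above can then pick up
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="job-seniority-classifier",
    )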

Model development

We use MLflow to track experiments, log metrics, and register models. MLflow offers out-of-the-box support for many popular libraries, including XGBoost and TensorFlow. For custom models, we use the mlflow.pyfunc module to run pre-processing steps in generic MLflow models, which we can then package together with custom code, packages, and (usually small) data. Generic models carry custom inference logic and are loadable as a python_function model, which allows us to deploy them as REST endpoints. This makes testing easier and enables faster feedback. Teams have taken advantage of MLflow CLI commands to test locally. For example:

mlflow models serve -m s3://path_to_model/ -p <port_number>
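
For custom models, here is a hedged sketch of how a model with its own pre-processing can be packaged through mlflow.pyfunc; the class, its logic, and the registered model name are purely illustrative.

import mlflow
import mlflow.pyfunc
import pandas as pd

class TitleSeniorityModel(mlflow.pyfunc.PythonModel):
    """Illustrative generic model that runs custom pre-processing before predicting."""

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        # Custom inference logic: normalize job titles, then apply a simple rule
        titles = model_input["job_title"].str.lower().str.strip()
        return titles.str.contains("senior").astype(int)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=TitleSeniorityModel(),
        registered_model_name="title-seniority-model",  # hypothetical registry name
    )

Once logged this way, the model is loadable as a python_function model and can be served locally with the mlflow models serve command above, or deployed through the Kubernetes pipeline described earlier.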

Besides MLflow, we evaluated several other options for our MLOps implementation, such as Kubeflow, Qubole, Luigi, and Argo CD. Kubeflow is a particularly attractive option, given its native support for Kubernetes and its wide range of features. We didn't move forward with Kubeflow at this point, mainly because Kubeflow focuses on supporting complex ML workflows through Kubeflow Pipelines, while our use cases are not yet sophisticated enough to take advantage of it, and Kubeflow has a steeper learning curve than many tools already in place, such as Airflow and Jenkins. Still, features such as hyperparameter tuning, notebook hosting, and TensorBoard are definitely appealing for future consideration.

We also looked into SageMaker for the model development step. It provides a good set of ML tools and is a leading player in this field. Currently, most of our use cases do not demand that model development be done in a remote environment, given their data sizes and computation complexity, so we can simply do it on developers' workstations and benefit from the ease of use and quicker turnaround. For the cases that do require a remote environment, we have a few dedicated ML servers with powerful CPUs or GPUs for ad-hoc tasks.

Feature Store

The feature store serves as the interface between models and data. It standardizes the way we store, discover, access, and track ML features. In a nutshell, we use SageMaker Feature Store as the backbone and have built a metadata management layer and a feature access layer on top of it, because we need more metadata management capability than SageMaker Feature Store provides, as well as the flexibility to manage features in other locations and solutions. The details could be another blog post in the future; if you want to be notified when it's published, please subscribe to Glassdoor's Engineering Blog.
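
As a rough, hedged sketch of the kind of call our feature access layer wraps, here is an online read from SageMaker Feature Store through boto3; the feature group name, record identifier, and feature names are hypothetical.

import boto3

# Runtime client for the SageMaker Feature Store online store
featurestore = boto3.client("sagemaker-featurestore-runtime", region_name="us-east-1")

record = featurestore.get_record(
    FeatureGroupName="job-features",            # hypothetical feature group
    RecordIdentifierValueAsString="job-12345",  # hypothetical record identifier
    FeatureNames=["title_embedding_norm", "days_since_posted"],
)

# Each feature comes back as a {"FeatureName": ..., "ValueAsString": ...} pair
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
print(features)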

Model monitoring

As mentioned at the beginning of this article, a unique challenge of building ML systems is the continuously changing data that ultimately drives the behavior of ML models. It’s critical that we track and monitor the data and model performance through predefined metrics to detect data drift and concept drift in order to maintain the predictive power of the models.

We use whylogs, provided by WhyLabs, for its lightweight integration and the ease of use of its dashboard. Our next step is to integrate this solution with Opsgenie for alerting, and to automate model retraining when certain thresholds are met.
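
As a small, hedged sketch of the profiling step, here is what logging a batch with whylogs can look like, assuming the whylogs v1 Python API (the exact calls vary by version, and the column names are made up).

import pandas as pd
import whylogs as why

# A hypothetical batch of model inputs and outputs to profile; in practice this
# would be the data flowing through one of the serving paths above
batch = pd.DataFrame(
    {
        "title_match_score": [0.82, 0.67, 0.91],
        "predicted_relevance": [1, 0, 1],
    }
)

# Profiling produces lightweight statistical summaries; comparing profiles across
# time windows is what surfaces data drift and concept drift
results = why.log(batch)
print(results.view().to_pandas())

# With WhyLabs credentials configured as environment variables, the same profile
# can be uploaded to the WhyLabs dashboard:
# results.writer("whylabs").write()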

Summary

With these automations and tools in place, our data scientists and researchers on the ML Science team can focus on data analysis and model development, while our engineers on the ML Engineering team continuously build platforms and tools to support our ML initiatives. MLOps serves as the interface between these two teams with different skillsets; it has greatly improved their productivity and made scaling up the teams much easier.

A lesson we learned from this process is that it's important to pick the options best suited to our situation and put them to work for us.
