MLOps Infrastructure at Mission Lane (Part 2)

Using BentoML to future-proof our Machine Learning Operations

Mike Kuhlen
Mission Lane Tech Blog

--

This is the second installment of our two-part series on MLOps Infrastructure at Mission Lane. In the first part we described our BentoML-based MLOps stack in detail, covering the motivation for switching from our previous home-grown stack to BentoML, the definition of our bentos, the CI/CD process we use to build and test them, and how we deploy them into our Kubernetes clusters. This second part focuses on operational use cases, including service observability, how we use bentos in conjunction with Airflow to perform batch evaluations over large datasets, and some areas of interest for the future. Let's go!

Operational uses of bentos at Mission Lane

Live Decisioning

At Mission Lane we make API calls to bentos in a variety of live decisioning (online) contexts, including initial credit line assignment, cash flow underwriting, and payment fraud. Live decisioning means that a potential customer is waiting for an answer, so response times matter for conversion and customer experience. Fortunately, our bento response times are fast enough — tens to hundreds of milliseconds for pure model evaluations, a few seconds if computationally expensive feature calculations are required — that evaluating our machine learning models is not a bottleneck in completing a decision. At our current scale we make tens of thousands of live bento API calls every day, and this volume will grow as our customer base grows and we use more and more bentos in our decisioning flows.
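For illustration, a live scoring call is just an HTTP request to the bento's prediction endpoint. The minimal sketch below uses the Python requests library; the service URL, endpoint name, and feature names are placeholders rather than our actual production values, and the exact payload shape depends on the bento's input descriptor.

```python
import requests

# Placeholder service URL and features, for illustration only.
BENTO_URL = "http://credit-line-model.bentos.internal"

# A single applicant's feature values.
applicant_features = {
    "fico_score": [712],
    "monthly_income": [4500.0],
    "utilization_ratio": [0.31],
}

# Live decisioning: keep the timeout tight so a slow model call cannot
# hold up the overall decision flow.
response = requests.post(
    f"{BENTO_URL}/score",
    json=applicant_features,
    timeout=2,
)
response.raise_for_status()
print(response.json())  # e.g. the model score for this applicant
```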

Observability

The BentoML platform comes with several valuable resources for gaining insight into the behavior of active bento services and their runners. Yatai uses Prometheus to collect and store metrics for any BentoDeployment, and these metrics can be aggregated and visualized in Grafana. BentoML even provides a nice dashboard template that we have adopted (see Fig. 1).
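Although Prometheus and Grafana handle collection and visualization for us, it can occasionally be handy to inspect a bento's raw metrics directly. A minimal sketch, assuming the default BentoML metrics endpoint and metric namespace, with a placeholder service URL:

```python
import requests

BENTO_URL = "http://credit-line-model.bentos.internal"  # placeholder

# BentoML API servers expose Prometheus-format metrics at /metrics; this is
# the endpoint that Prometheus scrapes for each BentoDeployment.
metrics_text = requests.get(f"{BENTO_URL}/metrics", timeout=5).text

# Show only the BentoML metrics (assumed default "bentoml" namespace) to keep
# the output readable.
for line in metrics_text.splitlines():
    if line.startswith("bentoml_"):
        print(line)
```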

Fig.1 Screenshot of the BentoDeployment Grafana dashboard.

We also make use of BentoML’s OpenTelemetry tracing capability through Grafana Tempo to trace requests through the nodes of our distributed systems, including our bentos. By following a request’s flow and data progression, we can observe and profile the entire system holistically and spot bottlenecks or troubleshoot problems.

Batch evaluation with Airflow

One of the major advantages of the BentoML platform is that it unifies real-time/online and batch/offline model evaluation. The same bento can be used to score a single record for live decisioning and a batch of, for example, 10,000 records at once. At Mission Lane we have numerous batch evaluation needs, including:

  • Daily retro-scores of our current and past models on any applications and inquiries from the previous day (10+ bentos on ~10k records every day)
  • Occasional full portfolio retro-scores of a newly developed model (1 bento on millions of records)
  • Daily scoring of our CLIP models on cardholders whose statement has cycled (2 bentos on ~100k records every day)
  • Monthly scoring of large direct-mail marketing datasets (10+ bentos on ~150 million records once per month)

Prior to our move to BentoML we were forced to use Spark to distribute and parallelize the scoring of individual records, but with BentoML we don't need Spark at all¹. Instead, we can score even our largest datasets in a reasonable amount of time by making batch DataFrame requests and, in some cases, adding explicit parallelization. The orchestration of our batch scoring jobs is handled by Airflow, which runs in the same Kubernetes cluster as our bentos. In the following we describe two examples of these batch processes.
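To make the contrast with a live call concrete, here is a minimal sketch of a batch DataFrame request. The service URL and feature names are placeholders, and the exact payload format depends on the bento's input descriptor; in our DAGs the chunks are loaded from parquet files in cloud storage rather than constructed inline.

```python
import pandas as pd
import requests

BENTO_URL = "http://retro-model.bentos.internal"  # placeholder

# One chunk of records to score; in production this would be a 50k-row
# DataFrame read from a parquet file in cloud storage.
chunk = pd.DataFrame(
    {
        "fico_score": [712, 655, 780],
        "utilization_ratio": [0.31, 0.78, 0.05],
    }
)

# Send the whole chunk in a single request; the bento returns one score per row.
response = requests.post(
    f"{BENTO_URL}/score",
    data=chunk.to_json(orient="records"),
    headers={"Content-Type": "application/json"},
    timeout=300,  # batch requests may take far longer than a live call
)
response.raise_for_status()
scores = pd.DataFrame(response.json())
print(scores.head())
```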

Single bento retro-scoring DAG

Data scientists and credit/growth analysts commonly need to reference historical model scores for our entire portfolio of cardholders, or even for declined applications/inquiries. To support this need, we have a daily retro-scoring Airflow DAG (retroscores_acquisitions_bento) that identifies any new applications/inquiries that arrived in the previous day (i.e., since the DAG last ran) and retro-scores each of our bentos on these records. The same DAG can also be used in an ad hoc manner to score a newly created model on the entire portfolio in one go.

Fig. 2 Structure of a retro-scoring DAG for a single bento

The retroscores_acquisitions_bento Airflow DAG uses a separate TaskGroup for each bento, each of which contains the following tasks (see Fig. 2); a simplified sketch of this structure follows the list:

  • fetch_feature_names_task: Obtain the feature names required by this bento by making a call to the /feature_names endpoint.
  • feature_extraction_task: Query Snowflake² to obtain the necessary feature values for all records that need to be scored. We stream the query results into chunks of 50k rows, which are saved to cloud storage as pandas DataFrames in parquet format.
  • batch_score_bento_task: Load the DataFrames individually from cloud storage and make an API request to the bento’s /score endpoint. The resulting DataFrame of scores is again saved to cloud storage.
  • The remaining tasks: Copy the scores from cloud storage into a destination table in Snowflake.
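Below is a simplified sketch of this per-bento TaskGroup structure using the Airflow TaskFlow API. The bento names, internal URLs, task bodies, and the name of the final copy task are placeholders; the real tasks stream Snowflake query results into parquet chunks in cloud storage and copy the resulting scores back into Snowflake.

```python
import pendulum
import requests
from airflow.decorators import dag, task
from airflow.utils.task_group import TaskGroup

# Placeholder bento names; in production this list covers our retro-scored models.
BENTOS = ["acquisition_model_v3", "fraud_model_v2"]


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def retroscores_acquisitions_bento_sketch():
    for bento_name in BENTOS:
        with TaskGroup(group_id=bento_name):

            @task
            def fetch_feature_names_task(bento: str) -> list:
                # Ask the bento which features it needs.
                url = f"http://{bento}.bentos.internal/feature_names"
                return requests.post(url, timeout=30).json()

            @task
            def feature_extraction_task(feature_names: list) -> list:
                # Placeholder: query Snowflake for yesterday's new records,
                # stream the results into 50k-row parquet chunks in cloud
                # storage, and return the chunk URIs.
                return ["gs://placeholder-bucket/features/chunk_0.parquet"]

            @task
            def batch_score_bento_task(bento: str, chunk_uris: list) -> list:
                # Placeholder: load each chunk, POST it to the bento's /score
                # endpoint, and save the returned scores to cloud storage.
                return [uri.replace("features", "scores") for uri in chunk_uris]

            @task
            def copy_scores_to_snowflake_task(score_uris: list) -> None:
                # Placeholder: COPY the score files into the destination table.
                print(f"COPY INTO retro_scores FROM {score_uris}")

            feature_names = fetch_feature_names_task(bento_name)
            chunk_uris = feature_extraction_task(feature_names)
            score_uris = batch_score_bento_task(bento_name, chunk_uris)
            copy_scores_to_snowflake_task(score_uris)


retroscores_acquisitions_bento_sketch()
```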

In this DAG, the bento TaskGroups can be scheduled simultaneously and run in parallel, but the scoring of each bento is serial — one 50k-row DataFrame at a time. This is fast enough for the datasets that we typically need to score — a full portfolio run (~5M records) of the DAG takes 25 minutes for a single bento (10 min. Snowflake query, 15 min. bento scoring), and the daily runs complete in a few minutes. In the next example we describe a case that requires additional parallelization even at the bento request level.

Direct mail scoring DAG

To expedite our computationally expensive monthly direct mail batch scoring process, we implemented two types of parallel computation with our bentos, which allows us to deliver model scores to analysts in 5 hours or less. Our direct mail marketing process requires a large number of ML models (currently 10) to be scored on a large anonymized TransUnion credit bureau archive extract consisting of ~150 million records, approximately once a month. We have an internal SLA requiring us to make the model scores available to growth analysts within 24 hours of the data becoming available at TransUnion.

A straightforward serial evaluation would take 75 hours, so we leverage parallelization. In Part 1 of this series I described how we utilize a base bento to make parallel network calls to each of its child bentos. But even with this optimization, a 20,000-row chunk takes about 15 seconds to score, which would correspond to a total runtime of more than 30 hours. Fortunately, the problem is "embarrassingly parallel": we can trivially split it into sub-tasks that run independently. Airflow is very well suited to this kind of parallelization, and we now use 10 parallel bento scoring tasks (see Fig. 3), each of which iterates over its assigned 20k-row chunks.
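The fan-out itself is straightforward. Below is a minimal sketch of the pattern, assuming roughly 7,500 chunks of 20k rows each; the chunk handling and the bento call are stubbed out, and the task and DAG names are illustrative. With 10 tasks each working through ~750 chunks at roughly 15 seconds per chunk, the arithmetic comes out to a little over three hours of scoring.

```python
import pendulum
from airflow.decorators import dag, task

# Illustrative numbers: ~150M records split into 20k-row chunks, scored by
# 10 parallel Airflow tasks. 7,500 chunks / 10 tasks * ~15 s per chunk is a
# little over 3 hours of scoring per task.
N_PARALLEL_TASKS = 10
N_CHUNKS = 7_500


@dag(schedule=None, start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def direct_mail_scoring_sketch():
    @task
    def score_partition(partition: int) -> None:
        # Each task serially works through its share of the chunks, sending
        # each one to the base bento, which fans out to its child bentos.
        for chunk_id in range(partition, N_CHUNKS, N_PARALLEL_TASKS):
            print(f"scoring chunk {chunk_id}")  # placeholder for the bento call

    # Ten independent task instances that Airflow runs side by side.
    for i in range(N_PARALLEL_TASKS):
        score_partition.override(task_id=f"score_partition_{i}")(i)


direct_mail_scoring_sketch()
```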

Utilizing Airflow parallelization yields a total DAG runtime of merely 3 hours (after about 90 minutes of downloading and pre-processing the data) for approximately 1.5 billion model evaluations and the associated post-processing. This is a remarkable speed increase over the prior (pre-BentoML) Spark-based process, which routinely took more than 12 hours to complete. In addition to being slow, the old process was brittle, and its failures and restarts would often endanger or break the SLA, causing undue stress to the data scientist or MLE responsible for delivering the results. Thanks to Airflow's native task-retry capability, even occasional unexpected bento errors are handled gracefully and barely noticed by the process owners.
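For reference, enabling those retries is just a matter of the DAG's default task arguments; the values below are examples, not our actual configuration.

```python
from datetime import timedelta

# Example retry settings applied to every task in a DAG: a failed bento
# request is retried automatically instead of failing the whole run.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
}
```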

Fig. 3 The structure of our direct-mail scoring DAG

What’s next?

The adoption of BentoML has greatly improved our MLOps infrastructure and has made us a more nimble, more robust, and more future-proof organization. The next important upgrade of our MLOps infrastructure will be the incorporation of the Chalk.ai feature store into our model scoring flows. Chalk will handle feature calculation and serving in both online and offline contexts, which will allow our bentos to focus on what they do best: ML model evaluations. We also see an exciting opportunity to extend our auto-model process (see Auto-model bentos) so that newly trained auto-models are automatically deployed and immediately available as bento services. We're also intrigued by BentoML's plugin integrations with third-party model monitoring services. If our experience with BentoML so far is any indication, we will continue to reap the rewards of adopting it for years to come.

Acknowledgements

We would never have gotten to the current state of our BentoML integration without invaluable contributions from many people. On the BentoML team I would like to especially call out Chaoyu Yang, Bozhao Yu, Tim Liu, and Sean Sheng. Thank you for building BentoML! On the Mission Lane side (current and past) I would like to acknowledge Stan Bartlett, Joe Bond, Chris Cureau, Alex Daidone, Alex Hasha, Rajat Jatana, Hans Knecht, Esteban Quevedo, and Kirill Stolz, plus the numerous Mission Lane data scientists and MLEs who have built and deployed bentos and worked on the Airflow DAGs. A special thanks to Steve Stevenson and Lani Allen for editing this article.

Footnotes

¹ Note that BentoML does natively support Spark. https://docs.bentoml.org/en/latest/integrations/spark.html

² In the future we will be querying a proper feature store (chalk.ai).
