Scaling machine learning monitoring

Ralph Rassweiler
Blog Técnico QuintoAndar
4 min read · Jan 12, 2023

Context

Machine Learning (ML) services are widely used at QuintoAndar to help answer complex business questions and to enhance several key features of our ecosystem.

To support these services, the MLOps team maintains and constantly upgrades a distributed and scalable architecture, as described in this excellent post by Lucas Cardozo. One of our recent lines of work has been to create and roll out active monitoring components that help data science teams debug the lifecycle of the features their services rely on.

Our first project, which has been running for about a year, delivered a solution to monitor the quality of the features that arrive in our feature store. We use a thin abstraction over an off-the-shelf open-source data quality library to apply validations and generate reports on the many features produced by our ETL pipelines.

This year, we set out to implement a solution to monitor ML service inputs and outputs (I/O), that is, events from a later stage in the feature lifecycle. For I/O monitoring we had to follow a different path than in the first project, because ML services have strict latency requirements: it is inadvisable to apply validations or any other processing that would slow the service down.

Overall logging architecture

As stated before, our goal was to monitor the I/O of our ML services without increasing the latency of the predictions. With that in mind, we created the architecture illustrated below.

The solution is composed, from left to right, of (a) asynchronous producers that allow any ML service to log data at any moment; (b) a centralized consumer that reads batches of logged data, applies a structured schema, and stores the result in the appropriate location; and (c) integrations with visualization tools and data platforms that let teams consult and query the logs.

The producer, used inside the ML services, is a wrapper we developed to abstract the logic of constructing and connecting to the external resource (Kafka, in this case) and to enforce a standard payload format.
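As a minimal sketch of what such a wrapper might look like, assuming a Python service and the kafka-python client (the class name, topic, and payload fields below are illustrative, not our actual implementation):

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client (assumed)


class MLIOLogger:
    """Illustrative async logger that standardizes ML I/O payloads."""

    def __init__(self, bootstrap_servers: str, topic: str, service_name: str):
        self.topic = topic
        self.service_name = service_name
        # Sends are asynchronous and batched by the client, so the
        # prediction path is not blocked waiting for the broker.
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
            acks=1,
            linger_ms=50,
        )

    def log(self, payload: dict, event_type: str) -> None:
        # Standard envelope shared by every ML service.
        message = {
            "service": self.service_name,
            "event_type": event_type,  # e.g. "input" or "output"
            "timestamp": time.time(),  # epoch seconds
            "payload": payload,
        }
        # send() returns immediately; delivery happens in the background.
        self.producer.send(self.topic, value=message)
```

Inside a prediction handler, the service would then call something like `logger.log(features, "input")` before inference and `logger.log(prediction, "output")` afterwards, without blocking the response.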

The payloads are sent to an Apache Kafka topic. Kafka is known to handle very large volumes of data with robustness and good fault tolerance. The topic distributes messages (payloads) uniformly across partitions to avoid skew, so that Spark can later consume them in parallel. For "n" ML services we have a single topic with "k" partitions. Finally, after reading and processing all the data available in a batch, we store it in our data lake, which is accessible to our data scientists and to our visualization and data platform tools.
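As a rough illustration of the consumer side (step (b) above), the PySpark sketch below reads the shared topic and applies a structured schema to the raw payloads; the broker address, topic, and field names are placeholders, not our real configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Requires the spark-sql-kafka connector package on the classpath.
spark = SparkSession.builder.appName("ml-io-consumer").getOrCreate()

# Structured schema for the standardized payload envelope (illustrative fields).
envelope = StructType([
    StructField("service", StringType()),
    StructField("event_type", StringType()),
    StructField("timestamp", DoubleType()),  # epoch seconds set by the producer
    StructField("payload", StringType()),    # nested JSON kept as a raw string
])

# Read the single shared topic; Spark parallelizes across its k partitions.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder address
    .option("subscribe", "ml-io-logs")                # hypothetical topic name
    .load()
)

# Kafka exposes binary key/value columns; decode the value and apply the schema.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS value")
    .select(F.from_json("value", envelope).alias("event"))
    .select("event.*")
)
```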

Engineering tweaks

We learned some valuable lessons from this project, which has now been running for a few months. These lessons allowed us to improve the services and balance cost, throughput, and freshness of the logged I/O data. Today, a handful of ML services send, on average, 8 million records a day, and we know we are ready for much more. In summary, this is what we applied and learned.

  1. Read data in batches instead of live streaming. Using a streaming job as the consumer is compelling, since we need to read from Kafka. However, because the data we are reading is not mission-critical, we can read it in periodic batches. To keep track of Kafka offsets, we use Spark's Structured Streaming write API with the "availableNow" trigger, which ensures a single read per run while keeping the offsets updated, making our consumer a hybrid of streaming and batch (see the sketch after this list). By reading data in batches instead of keeping a 24x7 live streaming job, we save about 50% in infrastructure costs.
  2. Adapt file sizes and the number of Kafka partitions. As our data volume increased, we noticed that writing a single file was becoming a limitation. Our data lake's blob storage has file-size restrictions, so we set a limit on records per file in our stream writer to break the result into several files of controlled size. We also noticed that we could improve Spark's parallel reading by adjusting the number of partitions in the topic; we found 20 partitions to be adequate for this purpose and the current data volume.
  3. Add an enriched layer to aggregate data. As data science teams started using the monitored data to explore and build detailed analyses, we noticed we could improve the system with a third (gold) layer in our data lake that holds pre-computed aggregations, simplifying queries and reducing processing time for our data scientists and data analysts.
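Continuing the PySpark sketch from the previous section (same hypothetical names, with `parsed` defined there), items 1 and 2 roughly translate into a write like the one below; the paths, file format, and record limit are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-io-consumer").getOrCreate()

# Item 2: break output into several controlled-size files; the exact limit
# below is illustrative, not our production value.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 500_000)

# `parsed` is the structured stream built in the consumer sketch above.
query = (
    parsed.writeStream
    .format("parquet")                                            # assumed lake file format
    .option("path", "s3://datalake/ml-io-logs/")                  # placeholder location
    .option("checkpointLocation", "s3://datalake/_checkpoints/ml-io-logs/")
    .partitionBy("service")
    # Item 1: process everything currently available in the topic, then stop;
    # the checkpoint keeps the Kafka offsets for the next scheduled run.
    .trigger(availableNow=True)
    .start()
)
query.awaitTermination()
```

Because the availableNow trigger stops the query once the backlog is consumed, this job can be scheduled periodically like a regular batch, while the checkpoint preserves exactly where the previous run stopped in each partition.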

Remarks

The strategy adopted by QuintoAndar's MLOps team for ML service I/O monitoring was deliberately narrow. Instead of committing our engineers to a large, time-consuming scope, such as a full ML monitoring platform with many components, automated analyses, and fancy dashboards, we delivered a scalable, independent, and modular solution that gives our scientists the freedom to run their own analyses and do whatever the needs of a very diverse and dynamic area require. Our scientists can now build their own views with little effort, because the data is where it needs to be, when it needs to be there.

We believe we can achieve great goals by evolving our platform step by step, delivering solid and reliable components. With the data that our monitoring system processes, scientists can observe common issues of machine learning models in production, such as data and concept drift.
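As one hypothetical example of such an analysis (not a component of the platform itself), a scientist could pull the logged inputs for two periods from the data lake and compute a Population Stability Index to flag data drift:

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a recent sample of one feature."""
    # Bin edges taken from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    lo, hi = edges[0], edges[-1]
    # Clip both samples into the reference range so every value falls in a bin.
    e_counts, _ = np.histogram(np.clip(expected, lo, hi), bins=edges)
    a_counts, _ = np.histogram(np.clip(actual, lo, hi), bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    # A PSI above ~0.2 is a common rule of thumb for drift worth investigating.
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

For example, `population_stability_index(reference_window, recent_window)` compares a feature's logged values from a past period against the most recent one.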

As we proudly watch our progress with every completed project, we remind ourselves that there is much more to do. For what it's worth, we are eager to solve more and more machine learning operations challenges.
