Revving into ML observability: a data generalist’s take

Anastasiia Kulakova · Jedlix Tech Blog · Aug 2, 2023 · 12 min read
Source: Egle Plytnikaite — Electric Vehicle

At Jedlix, we’re always pushing the boundaries of sustainable transportation, whether it’s a sunny day and our users are making the most of our solar features, or the middle of winter with its high energy demand. Machine learning models play a crucial role in this process, as we discussed in our previous blog post, and that is why it’s essential to understand how they work and perform. Recently, we embarked on a journey to build a monitoring platform that our small but diverse team can handle on our own. In this blog post, we’ll share the twists and turns we encountered in our ML observability journey and discuss the important steps we took to get started. These steps include scheduling, collecting metrics, logging requests and responses, and visualizing data in Grafana.

But first, why do you need observability?

Source: Bridging AI’s Proof-of-Concept to Production Gap — Andrew Ng, my adaptation.

Shipping the model to production is not the final destination but a crucial step in the lifecycle of MLOps. It is important to ensure that the model delivers the performance and uptime promised during the model testing phase, bringing the desired value to the business. In our case, the goal is to make EV charging smarter and more affordable, and each model contributes to this mission. If the model fails to meet these expectations, it should not fail silently: developers should be notified, and analysis can be performed on the outcomes, enabling the team to prioritize model improvements. Basically, if we look at the picture above, observability is exactly what enables the loop between “deploy” and “train” or “deploy” and “collect data.”

As an EV driver, consider your own car. You rely on a certain range and speed and expect to be notified when your battery is low so that you can plan accordingly. Now imagine the need to observe thousands of cars… That’s exactly what we’re doing! In summary, powerful models come with great responsibilities, and implementing observability practices is the first step in ensuring the responsible treatment of your models.

Which signals are important to monitor?

When it comes to monitoring metrics and taking action based on them, real-life production systems can be complex, generating numerous signals. Similar to a car’s dashboard, where various measurements are available but only a select few are displayed, choosing the signals to include in your monitoring is important.

Exploring metrics: a journey from the surface to the depths of the iceberg

Let’s dive into the process by visualizing performance as an iceberg with multiple layers. In the drawing above, these layers are arranged based on visibility: starting with the ones easily measurable but unlikely to be the root cause of the issue on the surface and going deeper to uncover those that are hard to measure but crucial for understanding and debugging your model.

  1. Business Metrics: These metrics are directly linked to your business goals and outcomes. They may include things like financial metrics, user count, and customer satisfaction ratings — basically, the vital signs of your business. They give you an answer to the question, “How’s the business doing?” While they’re widely known among teams and serve as the starting point for almost any investigation, we’ll skip them in this blog post and focus on the metrics established by data science and engineering teams.
  2. System Metrics: These metrics provide insights into your model’s overall health and performance within the infrastructure. After all, what’s the impact of your ML model if its API endpoint is down? They keep track of important factors like API response time, system availability, and resource utilization. In simple terms, they answer the question, “How are the applications that drive our business decisions performing?” These system metrics sit close to the tip of the iceberg, but their visibility can vary depending on how tech-savvy your data team is.
  3. Proxy Metrics: These metrics act as middlemen, giving insights into your model’s behavior and performance without directly measuring the business outcomes. They become especially handy when the outcomes themselves take time or depend on various factors. For instance, you can look at click-through rates instead of actual sales in an online store or evaluate performance in ideal simulation scenarios. Proxy metrics help us answer questions like, “Is my model showing improvement in a metric closely related to performance?” or “How well does my model fare in simulated scenarios?” Just like performance metrics, keeping an eye on proxy metrics can reveal flawed assumptions in our model design or raise concerns about data quality.
  4. Model Performance: These metrics gauge how well your model is doing in terms of accuracy, mean absolute error, and other evaluation criteria. They help answer questions like, “How is my model behaving?” and “Does it meet the desired performance standards?” Now, here’s the catch — this part is mostly hidden beneath the surface of the iceberg. Part of that is due to data availability and the complex nature of the use case, which we already covered in the proxy metrics section. But there’s one more thing: when starting with machine learning, it’s common for engineering teams to focus on testing the model with a separate dataset and put off monitoring its performance in real-world scenarios. So, don’t miss this important step!
  5. Data Quality: Keeping tabs on data quality is essential for identifying and resolving issues related to the input data used by your model, as model performance rarely degrades on its own. This involves assessing factors like data completeness, consistency, correctness, and timeliness to ensure the model operates with reliable and accurate data. Common questions that arise include, “Are my model’s missing features increasing? Has the freshness of my features decreased?” Sometimes, the issue may not lie with the model itself but rather with incorrect data quality and availability assumptions. Investigating data quality can guide improvements in data engineering, prompt exploration of streaming architectures, or even necessitate retraining the model to account for less fresh features.
  6. Distribution Drift: Distribution drift refers to monitoring changes or shifts in data distribution over time. It helps us detect when the training data used by the model deviates from the real-world data it encounters during deployment. This misalignment can affect the model’s performance and indicate the need for retraining or adaptation. The key question to ask is, “Does my data still resemble what it was before?” In certain cases, this analysis uncovers generalization issues or prompts us to explore scaling the target variable. In more extreme scenarios, a significant change in the target distribution may lead us to reconsider the entire business case. We will skip this layer in this blog post, as it isn’t yet automated on our end.

By watching these different layers of signals, you aim to get as complete a picture as possible of your model’s performance, system health, data quality, and any unexpected deviations. This valuable visibility lets you be proactive and prioritize improvements, ensuring your model delivers value effectively. Now, the question is, what tools would you employ to monitor these signals? Especially knowing that the more important a signal is for debugging, the harder it tends to be to measure…

Monitoring platform: Build vs. Buy (personal opinion)

When it comes to choosing a monitoring platform, the decision between building one in-house or purchasing an existing solution can be quite challenging. While there’s no “one-size-fits-all” solution, there are several factors to consider in this decision-making process:

  • Cost and Return on Investment (ROI): With third-party tooling, companies often pay a fixed monthly or annual fee for unlimited access. On the other hand, building an in-house solution may involve assembling different components and incurring separate expenses. This way, you can decide which components you need and which you don’t, and pay only for what you actually use. The latter option might sound appealing, but…
  • Time: Regardless of the approach chosen, incorporating a new platform into existing infrastructure requires time. While larger organizations can allocate resources to build and maintain their own platforms, this isn’t feasible for most companies, especially startups. As a data science team, it’s important to balance core analysis and model development with infrastructure management. This is what we call the opportunity cost. According to Qwak, one of the many MLOps platforms, data scientists spend as much as 65 percent of their time on such tasks. Although third-party observability platforms claim to free up data science teams’ time for more important work, we believe it’s often just sales talk, and the ROI of integrating these platforms can be overestimated, particularly regarding flexibility.
  • Flexibility: When using a third-party tool or platform, there’s a possibility that it may not integrate smoothly with the other tools your team uses. In contrast, an in-house platform can perfectly align with your organization’s specific needs. For example, you can build around the tools supported by your cloud provider or collaborate with your development team to customize their tools for data science requirements. However, it’s important to remember that automation can sometimes take on a life of its own. After a while, you might spend more time on automation than on the tasks you intended to prioritize, as explained in this insightful blog post.
Source: ML Observability — Hype or Here to Stay, my adaptation.

You could reason about the Build vs. Buy dilemma by dividing companies into four categories:

  • The Leaders: Established, often Big Tech, companies heavily reliant on machine learning with numerous models in production. Likely, they started with MLOps practices way before the rest of the groups and thus had to develop their own tooling.
  • The Innovators: Companies where ML is still essential, but they have fewer use cases and models in production, resulting in lower demand for observability tooling and a more cautious approach to spending. Legacy data and infrastructure tooling pose challenges to advancing ML maturity. Thus, outsourcing observability to third-party solutions could actually help to free up resources and focus on the nature of the models rather than infrastructure to advance in the race.
  • The Experimenters: Companies exploring various ML use cases, creating a need for observability tooling at scale. Willingness to spend is moderate as ML is not deemed mission-critical, but the emergence of AI-native startups may push them toward becoming leaders in the future. Here, build vs. buy is a big topic of discussion.
  • The Laggards: Companies that are slow to adopt ML and generally do not use ML technology, making them resistant to implementing ML tooling. When the question arises, these companies would be better off buying ML observability software to ease ML adoption.

In short, we believe going for “Buy” is a better option for companies in “The Laggards” and “The Innovators” quadrants. As “The Experimenters,” we found a sweet spot by repurposing engineering tools to meet our data science needs. In the next section, we’ll delve into the steps we took to get started.

How to start building with ML observability in mind

Now to the fun part: building your DIY platform. Fast-forward to the result: here’s our initial ML observability setup in the diagram below. In the next section, we will break it down into multiple parts and explain how we arrived at it.

Jedlix ML observability set-up: simplified view

Considering your deployment: batch or online?

Before diving into ML observability, it’s important to consider your deployment strategy. Determine whether your model is deployed in a batch or online setting. At Jedlix, for example, we mostly deploy our models online. If external systems or users are not yet utilizing your model, or it’s simply not called often enough to draw statistics from its outputs, take the initiative to call it yourself. For that, multiple pipeline orchestration solutions can be used: take a look at Airflow, Dagster, and Prefect, which are widely used in data teams for various workflows. Configuring alerts is also straightforward, so you’ll receive a Slack message if a call to your model API fails. In our case, we decided to go with Dagster, as it offers the valuable decentralization that allows us to experiment without the fear that our new jobs will break the old ones.
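As a rough illustration, here is a minimal Dagster sketch of such a scheduled probe. The endpoint URL, payload, and job names are hypothetical placeholders rather than our actual setup; the point is simply that a failed call surfaces as a failed run, which your alerting (for example, a Slack-connected failure sensor) can then pick up.

```python
import requests
from dagster import Definitions, ScheduleDefinition, job, op


@op
def call_model_api() -> dict:
    # Hypothetical endpoint and payload; adapt to your model's API contract.
    response = requests.post(
        "https://models.example.com/charging-forecast/predict",
        json={"vehicle_id": "demo-vehicle", "horizon_hours": 12},
        timeout=10,
    )
    # Raising on HTTP errors makes the run fail loudly instead of silently.
    response.raise_for_status()
    return response.json()


@job
def probe_charging_forecast():
    call_model_api()


# Call the model every 15 minutes so there is always fresh output to monitor;
# failed runs show up in Dagster and can be routed to Slack.
probe_schedule = ScheduleDefinition(
    job=probe_charging_forecast, cron_schedule="*/15 * * * *"
)

defs = Definitions(jobs=[probe_charging_forecast], schedules=[probe_schedule])
```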

Treating your model as software: system metrics with Prometheus

As we delve deeper into the iceberg, we encounter the realm of system metrics. After all, what’s the point of a model if its API is on the fritz? Tools like Prometheus or Datadog can boost your model’s observability by keeping tabs on its API behavior. Set up alerts for timely notifications when the model goes offline or takes ages to respond. The great thing about these tools is that they are already the industry standard, so even if your analytics team isn’t using them yet, chances are that the development team knows its way around them very well.

Also, you’re not limited to existing metrics alone; you can create your own! For example, you can track a data quality metric like “Missing feature X” whenever your model receives a NULL input. Just remember that defining metrics in your code requires a bit of effort. If you have too many features to care for and prefer a smoother ride, consider using a third-party MLOps platform to handle these tasks. When working with metrics, keep an eye on retention periods and how alerts are calculated. Will the count reset after an incident, or will it keep growing? Consult the documentation of your chosen metrics software for more details. And speaking of visualization, we'll circle back to that topic in the next leg of our journey.
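To make the idea concrete, here is a minimal sketch using the prometheus_client library. The metric and feature names are invented for illustration; the pattern is simply a latency histogram for the system layer plus a counter that increments whenever an input feature arrives as NULL.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Standard system-level signal: how long predictions take.
PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Time spent producing a prediction"
)
# Custom data quality signal: how often a feature arrives empty.
MISSING_FEATURE = Counter(
    "model_missing_feature_total", "Inputs received with a missing feature", ["feature"]
)


def predict(features: dict) -> float:
    with PREDICTION_LATENCY.time():
        for name, value in features.items():
            if value is None:
                # Each NULL input bumps the counter for that feature,
                # which Prometheus scrapes and Grafana can alert on.
                MISSING_FEATURE.labels(feature=name).inc()
        return 0.0  # placeholder for the actual model call


if __name__ == "__main__":
    # Expose /metrics on port 8000 for Prometheus to scrape; in a real service
    # this runs inside your long-lived API process.
    start_http_server(8000)
    predict({"state_of_charge": None, "outside_temperature": 4.2})
```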

Setting up request-response logging & analytics database

Request-response logging to ClickHouse

When it comes to monitoring model performance, it’s all about recording your model outputs and comparing them to the actual results. While searching for a robust analytics database, we found ClickHouse to be a game-changer — blazing fast, user-friendly, and with valuable custom aggregation functions. We already ingested our actual data into ClickHouse; however, one thing needed to be sorted out before we could start writing our predictions there. As ClickHouse's best practices explain, each insert sent to ClickHouse causes it to create a part on storage containing the data from the insert together with other metadata. Sending a smaller number of bulk inserts is therefore more efficient than a larger number of inserts that each contain less data. But how do we deal with that when writing single outputs is exactly what we need to do?

Instead of single inserts, we leveraged request-response logging via Envoy sidecar, seamlessly synced to ClickHouse using Vector. This way, we effortlessly get the raw data without the hassle of manual batching and later process it into tables containing valuable performance, proxy, and data quality metrics. The good news is that handling these streaming tasks usually falls within the realm of expertise for your development team. Plus, working with JSON data in ClickHouse is a breeze, making the process smooth for the analytics team.
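Once the raw request-response logs land in ClickHouse, turning them into performance metrics is a matter of aggregation. Below is a hedged sketch using the clickhouse-connect client; the table and column names are illustrative, not our actual schema.

```python
import clickhouse_connect

# Connect to ClickHouse; host and credentials are placeholders.
client = clickhouse_connect.get_client(host="localhost")

# Daily mean absolute error computed from logged predictions joined with actuals.
result = client.query(
    """
    SELECT
        toDate(logged_at) AS day,
        avg(abs(predicted_energy_kwh - actual_energy_kwh)) AS mae
    FROM model_request_response_log
    WHERE actual_energy_kwh IS NOT NULL
    GROUP BY day
    ORDER BY day
    """
)

for day, mae in result.result_rows:
    print(day, round(mae, 3))
```

A query along these lines, parameterized by model name, is also the kind of thing that can sit behind a Grafana panel, which brings us to the next step.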

Visualization and alerting in Grafana

Visualization is a powerful tool for understanding how the model is doing. And with a user-friendly tool like Grafana, you can easily create charts and graphs showing what’s going on. But it’s not enough to just look at the visuals — you also need to set up alerts to be notified when something important happens. Grafana lets you do this by setting up labels that route alerts to the right people on Slack. While standard metrics like MAE and MAPE are useful, it’s always a good idea to explore additional metrics that can give you an even deeper understanding of your model’s behavior. Currently, we use Grafana for both system and performance metrics, with specific metrics for each model. And to make dashboard creation easier, we use dashboard variables like the model name and API route, which make it easy to duplicate and customize dashboards. If you’ve done this already, congratulations — your ML observability setup 1.0 is now good to go!

In place of a summary: embrace a generalist mindset

As we reach the end of this blog post, let’s wrap up with the three key takeaways from our observability journey:

  1. Think like a generalist when setting up your initial observability — it’s all about maximizing the resources you have.
  2. Use the 80/20 rule to prioritize the essentials and be open to refining your setup along the way.
  3. Don’t hesitate to seek help whether you’re using one vendor or multiple ones. Solution architects and friendly communities on Slack and Discord are there to assist. We would especially like to thank those of ClickHouse and Dagster.

By embracing this generalist mindset and tapping into diverse expertise, you’ll craft a customized observability framework that empowers you to monitor and optimize your machine-learning models effectively. At least, that’s how we revved into ML observability at Jedlix!
