To Update or Not to Update: Managing Production Models in ML and CV

Anton Maltsev
7 min read · Jun 13, 2024


Have you ever had a situation like this? A new neural network is released, and the entire management demands that it be implemented. Half of your colleagues are excited about the new layer that improves the network’s accuracy. YoloV(N+1)? LLAMA100?

Everyone in the company is excited about the integration of a new YOLO model. DALLE-2 generated image.

Qwen2 was recently released. So was YOLOv10. Is it time to switch?
Nothing in ML generates more hype than the release of new models. Every video about a new YOLO release gets tons of views. Everyone wants to swap out their models! So what should you do? Implement it?

In this article, I will explain how to avoid giving in to the hype and make informed decisions instead.

Let’s focus on replacing a network with a similar one.
If the new network can do something fundamentally new, that decision belongs more to product owners, and the question there is: “Do we need new functionality?”

So, let’s discuss situations like:

  • YOLOv(N)->YOLOv(N+1)
  • MobileNetV(N)->MobileNetV(N+1)
  • LLAMA->LLAMA2

By the way, this article is also available as a video on my YouTube channel.

Objective

Let’s start with the most crucial point: define the purpose of implementing a new network. “Change for change’s sake” is an abysmal return on investment. A researcher coming in with their eyes lit up doesn’t look good to the business. What sounds more appropriate?

  1. Your current neural network is terrible. The system needs at least X accuracy, but you have only Y.
  2. You are continuously optimizing your models, and there is a separate budget for this (there may be no business logic behind it).
  3. Your accuracy is directly related to your income/expenses.
  4. You need to optimize model speed, so you have a time budget for testing a few new models.
  5. For LLM-based models: up-to-date knowledge about the world, new objects, and events.
  6. You need to show off and emulate AI activity.
  7. You haven’t chosen a model yet and need to analyze your options.

There may be a few other reasons, but it all comes down to situations with a direct link between accuracy and money. Reasons 2 and 6 we will not even consider; they have nothing to do with ML. There is also nothing to add about knowledge of the world around you: if it is critical, the model needs to be changed.

Let’s start with cases related to accuracy. It’s important not to get confused:

Accuracy is always directly related to speed. You need to compare models of similar size and speed and make decisions among those.

If you see that a new, super-big transformer has been released, you should only test it if you already have a similarly big model running in production. You can compare it with ResNet-18, of course, but then the main question is: why haven’t you done that before? If you can afford a heavier model, do it now.

“You need to improve accuracy by ΔX% to be successful.”

Sometimes, a product owner comes to you and says, “Our MVP isn’t taking off. We need to improve the accuracy by ΔX before our system will work!” Why not? Let’s discuss how large this ΔX margin can realistically be.

Replacing a model with one of similar size from roughly the same generation will never give a significant boost.

It is better to check this on a custom dataset (COCO is too far from most production domains to draw conclusions from), and here are a few examples. Here, YOLOv8 is better than YOLOv9. Here, YOLOv9 is better than YOLOv8. What should you believe? Both! It depends on the dataset, and the difference is negligible.
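If you want to run such a comparison on your own data, a minimal sketch with the `ultralytics` package might look like this. The `custom.yaml` dataset config and the exact checkpoint names are assumptions (recent `ultralytics` versions ship both YOLOv8 and YOLOv9 weights):

```python
# Hedged sketch: compare two detector generations on YOUR dataset, not on COCO.
# Assumes `pip install ultralytics` and a custom.yaml dataset config;
# checkpoint names are illustrative.
from ultralytics import YOLO

for weights in ("yolov8m.pt", "yolov9m.pt"):
    model = YOLO(weights)
    metrics = model.val(data="custom.yaml", imgsz=640)
    # map = mAP@0.5:0.95, map50 = mAP@0.5 on your validation split
    print(weights, f"mAP50-95={metrics.box.map:.3f}", f"mAP50={metrics.box.map50:.3f}")
```

Run it on your validation split, not on a public benchmark, and keep the image size and postprocessing identical for both models so the numbers are actually comparable.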

Replacing ResNet (2015) with ConvNeXt (2022) may noticeably reduce errors, but it will rarely “cut the error in half.” For comparable model sizes of the same generation, the gain is somewhere around a 5–10% relative error reduction. The difference will be minimal if you swap between similar models from neighboring years:

ResNet (2015) and ConvNeXt (2022) are seven years apart. The picture is from the arXiv paper https://arxiv.org/pdf/2201.03545 (A ConvNet for the 2020s).

ResNet-50 (76.1% top-1) vs ConvNeXt-T (82.1% top-1).
That cuts the error by 6 percentage points from the initial ~24%, which still leaves about ¾ of the errors in place (a ~25% relative error reduction). It’s not a silver bullet that will make your system 2x or 3x better. If you have 99.6% accuracy with ResNet-50, your accuracy after ConvNeXt will probably be around 99.7%.
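To make the arithmetic explicit, here is the same back-of-the-envelope calculation as a few lines of Python (76.1/82.1 are the ImageNet top-1 figures quoted above; the 99.6% production accuracy is the hypothetical example from the text):

```python
# Relative error reduction from swapping ResNet-50 for ConvNeXt-T (ImageNet top-1).
resnet_err = 1 - 0.761      # ~23.9% error
convnext_err = 1 - 0.821    # ~17.9% error
rel_reduction = (resnet_err - convnext_err) / resnet_err
print(f"relative error reduction: {rel_reduction:.0%}")   # ~25%, i.e. ~3/4 of errors remain

# Applied to a hypothetical production model with 99.6% accuracy:
prod_err = 1 - 0.996
print(f"expected new accuracy: {1 - prod_err * (1 - rel_reduction):.3%}")  # ~99.7%
```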

But what about the beautiful graphs? YOLOv10 is the best!!!

YOLOv10 Article — https://arxiv.org/pdf/2405.14458 (YOLOv10: Real-Time End-to-End Object Detection)

Look at the numerous comparisons of YOLOv8|YOLOv9|YOLOv10, etc: 1, 2

The extra accuracy is also not where you might expect it:

  1. The detection boundary is a little more accurate.
  2. Detection works a little bit better for tiny objects.
  3. Slightly more stable for larger objects.

The same goes for LLMs. Better metrics on some specific benchmark? Maybe because of other languages that you don’t need in production. For 95% of your input distribution, nothing will change. An LLM is primarily determined by its training data: it won’t understand medicine better if there are no medical datasets in training.

The result: you will reduce the error rate by 5–10% by swapping models. The best case is a 30–40% reduction when you are replacing a very old model. If your ΔX target is within this range, it’s OK.

But in my experience, when “everything is not working now,” people usually need a 3x-4x error reduction.

Performance improvement as a process

You want to improve your model’s performance because mistakes directly affect your income. Congratulations! This is a rare situation for data scientists: one where you can actually spend time optimizing networks and processes. 90% of companies are more about “do it fast” / ”do only the crucial things.”

I want to remind you that before choosing a more powerful model, you should check several other approaches that often add more accuracy:

  1. Have you analyzed your errors? Can you collect more problematic examples? Adding current production errors to the dataset is the best way to improve accuracy. This can lower the error even by several orders of magnitude.
  2. Have you tried training more powerful models? This allows you to estimate the maximum achievable accuracy and how close you are to it. If such a model is much more accurate, you can train your production model through distillation (see the sketch after this list).
  3. Have you tried optimizing augmentations for your task? People often forget that you can squeeze out accuracy by choosing augmentations based on the errors you see, or, vice versa, by disabling augmentations that degrade quality.
  4. Have you tried tuning the LR/optimizer/loss function? This is often forgotten, but more accuracy may lurk here than in changing the model.
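For point 2, a minimal knowledge-distillation sketch in PyTorch could look like this. The temperature, the loss weighting, and the assumption that teacher and student are classification models with matching label spaces are illustrative choices, not a prescription:

```python
# Hedged sketch: distill a big "teacher" into the small production "student".
# Models, temperature T and weighting alpha are illustrative.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels, optimizer, T=4.0, alpha=0.5):
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)          # soft targets from the heavy model
    s_logits = student(images)

    # KL between softened distributions, scaled by T^2, plus the usual hard-label CE
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(t_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    ce_loss = F.cross_entropy(s_logits, labels)
    loss = alpha * kd_loss + (1 - alpha) * ce_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The deployed architecture stays exactly the same; only the training signal changes.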

If you have already tried all these things, it makes sense to test other models.

Case: “Choosing a model from scratch” / “Optimizing the performance for existing pipelines.”

These are the only two places where testing multiple models is meaningful. But I suggest optimizing the dataset and augmentations before getting involved here.

Other things you should check before trying a new model

Quantization. Different models often lose different amounts of accuracy after quantization, and a top-performing model may stop being top-performing. Don’t forget to re-check everything after quantization.
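As a sketch of such a check (assuming a recent PyTorch/torchvision; dynamic quantization shown here mainly touches Linear layers, and a full static PTQ/QAT flow will differ per backend):

```python
# Hedged sketch: measure accuracy before and after quantization instead of
# assuming the FP32 ranking survives. Your validation DataLoader is assumed.
import torch

@torch.no_grad()
def evaluate(model, loader):
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

model_fp32 = torch.hub.load("pytorch/vision", "resnet50", weights="IMAGENET1K_V2").eval()
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# val_loader is your own validation DataLoader (an assumption here):
# acc_fp32 = evaluate(model_fp32, val_loader)
# acc_int8 = evaluate(model_int8, val_loader)
# print(f"fp32: {acc_fp32:.3f}  int8: {acc_int8:.3f}  drop: {acc_fp32 - acc_int8:.3f}")
```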

Different architectures. Sometimes, another architecture works better for the task. Segmentation is sometimes better than detection. CLIP/DINOv2 embeddings sometimes give better accuracy on metric-learning tasks than training from scratch on a small dataset.
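As a hedged illustration of the last point, here is a sketch of using frozen DINOv2 features as a metric-learning/retrieval baseline instead of training an embedding network from scratch (the `facebookresearch/dinov2` hub entry point is public; the query/gallery tensors and the 224-pixel, ImageNet-normalized inputs are assumptions):

```python
# Hedged sketch: frozen DINOv2 features as a retrieval / metric-learning baseline.
import torch
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(batch):
    # batch: float tensor [N, 3, 224, 224], ImageNet-normalized (sides must be multiples of 14)
    feats = backbone(batch)            # [N, 384] global features for ViT-S/14
    return F.normalize(feats, dim=1)   # L2-normalize so cosine similarity = dot product

# query: [1, 3, 224, 224], gallery: [M, 3, 224, 224] are your own data (assumed):
# sims = embed(query) @ embed(gallery).T
# best_match = sims.argmax(dim=1)
```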

Preprocessing. The right choice of data preparation can affect both speed and accuracy. Standard YOLOv5/YOLOv8 preprocessing looks like this:

A square with black borders. Photo by Author.

If you keep rectangular inputs and preprocess them accordingly, inference will be faster (and speed buys you accuracy):

Original image. Photo by Author.
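If you want to compare the two approaches, here is a minimal sketch with OpenCV (the 640 input size, the grey padding value 114, and the stride-32 rounding are assumptions taken from typical YOLO settings, not from any specific repo):

```python
# Hedged sketch: square letterbox vs rectangular resize. Sizes/stride are
# illustrative; the rectangular variant feeds fewer padded (wasted) pixels.
import cv2
import numpy as np

def letterbox_square(img, size=640, pad_value=114):
    h, w = img.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(img, (round(w * scale), round(h * scale)))
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    nh, nw = resized.shape[:2]
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized   # image centered, borders padded
    return canvas

def rect_resize(img, long_side=640, stride=32):
    h, w = img.shape[:2]
    scale = long_side / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # Round each side up to a multiple of the network stride instead of padding to a square
    nh = int(np.ceil(nh / stride) * stride)
    nw = int(np.ceil(nw / stride) * stride)
    return cv2.resize(img, (nw, nh))
```

Fewer padded pixels means fewer computations per frame, which you can spend on a bigger model or a higher resolution.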

Data preprocessing can improve your model in many other ways.

The difference between training and production. Remember that a representative training set matters more than a more powerful model. Keep the training dataset up to date.

Want to maximize quality at fixed performance? Remember that there are ways to search for the optimal architecture for your hardware (neural architecture search). Several companies specialize in this (https://deci.ai/, https://enot.ai/), and there are some open-source solutions (1, 2). However, the companies are usually expensive, and the open-source options are unstable.

A little summary: does your model need to be updated? In general, you should keep evaluating. But there is more hype around new releases than real gains.

Here is a little table you should go through before switching models:

Do you still have any questions? Do not hesitate to ask them here or:

And don’t forget to check out my video comparing YOLO(N) vs YOLO(N+1):
