Small Scale Computer Vision in 2024

Charles Ollion
9 min read · Dec 19, 2023

Introduction

Over the past decade, many projects involving computer vision (CV) have emerged, from small-scale proof-of-concept projects to bigger production applications. Typical examples include:

  • assistance with medical diagnosis using radiography, biopsy and other medical images
  • satellite imagery to analyse buildings, land use, etc.
  • object detection and tracking in various contexts, like traffic estimation, waste estimation, etc.
Plastic waste identification in rivers by Surfrider Foundation

The go-to method for applied computer vision is quite standardised:

  • define the problem (classification, detection, tracking, segmentation), the input data (size and type of picture, field of view) and the classes (precisely what we are looking for)
  • annotate some pictures
  • pick a network architecture, train and validate, and gather some statistics
  • build the inference system and deploy it (a minimal sketch of the training and deployment steps follows below)
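As a concrete illustration of the last two steps, here is a minimal, hedged sketch using the ultralytics YOLO package; the dataset config my_dataset.yaml is a hypothetical file pointing at your annotated images and class names.

```python
# Minimal sketch of the train / validate / deploy steps, assuming the
# `ultralytics` package and a hypothetical dataset config `my_dataset.yaml`
# describing the annotated train/val images and class names.
from ultralytics import YOLO

# Start from a small pretrained detector (nano variant, CPU-friendly)
model = YOLO("yolov8n.pt")

# Train and validate on our annotated data
model.train(data="my_dataset.yaml", epochs=50, imgsz=640)
metrics = model.val()          # mAP and other statistics on the validation split
print(metrics.box.map50)       # detection mAP at IoU 0.5

# Export to ONNX for lightweight CPU or embedded inference
model.export(format="onnx")
```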

By the end of 2023, the AI field has been taken by storm by the success of generative AI: large language models (LLMs) and image generation models. They are on everyone’s lips, but do they change anything for small-scale computer vision applications?

We’ll explore whether we can leverage them to build datasets, adopt new architectures and new pre-trained weights, or distil knowledge from big models.

Small Scale Computer Vision

What we are typically interested in here are applications that can be built and deployed at a relatively small scale:

  • 💰 the cost of development should not be too high
  • 💽 it should not require a monster infrastructure to train (think compute power and data scale)
  • 🧑‍🔬 it should not require strong research skills, but rather apply existing techniques
  • ⚡ the inference should be lightweight and fast, so that it could be embedded or deployed on CPU servers
  • 🌍 The overall environmental footprint should be small (think compute power, general size of models / data, no specific hardware requirement)

This is clearly not the trend in AI these days, as models with billions of parameters are becoming standard in some applications. We hear a lot about these, but it’s important to remember that caring about a smaller scale and footprint is critical, and that not all projects should follow the scaling trends of Google, Meta, OpenAI or Microsoft. Even if they’re not in the spotlight, most interesting computer vision projects actually operate at a much smaller scale than the ones making the headlines.

This does not mean that the impact of the application should be small or narrow, just that we actively care about the development and inference costs.

With this in mind, can we still take advantage of recent developments in AI for our applications? Let’s first dive into the world of foundation models to understand the context.

Foundation models in Computer Vision

Large Language Models (LLMs) have become popular because you can easily use foundation models in your applications (many are open source, or usable through an API). Think of GPT, BERT or Llama as such models. A foundation model is a very large, generic neural network that is useful as a basis for most downstream tasks. It contains knowledge about a very broad range of topics, semantics, syntax, different languages, etc.

In Computer Vision we’ve been using such models for a while: it’s been standard in the last 10 years to use a neural network pre-trained on ImageNet (1 million labelled images) as a “foundation” model for a downstream task. You can build your neural network on top of it, and fine-tune it on your own data if needed.

For the last 10 years, the big question was the performance trade-offs of models on ImageNet
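To make this concrete, here is a minimal sketch of that classic recipe, assuming PyTorch/torchvision and a hypothetical 3-class dataset: the ImageNet-pretrained backbone is frozen and only a new classification head is trained.

```python
# A minimal sketch: take an ImageNet-pretrained backbone from torchvision and
# fine-tune it on your own classes (here a hypothetical 3-class problem).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 3)   # replace the ImageNet head

# Cheap option: freeze the backbone, train only the new head
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# ...then a standard training loop over your own DataLoader:
# for images, labels in train_loader:
#     loss = criterion(model(images), labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Unfreezing the backbone with a lower learning rate is the usual next step when the frozen-feature baseline is not good enough.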

There are two main conceptual differences between ImageNet pre-trained networks and LLMs:

  • the type of data they are trained on: ImageNet pre-training relies on purely supervised learning (a large-scale classification task), while LLMs are generative models trained in a self-supervised manner on raw text (the task is simply to predict the next word).
  • the adaptation of these foundation models to new tasks: an ImageNet pre-trained network systematically requires a new learning procedure to be adapted to a new task. For LLMs, while fine-tuning is possible, the model is often powerful enough to be used for a downstream task without any further training, just by prompting it with the right information.

Most current Computer Vision applications such as classification, object detection, segmentation still use ImageNet pre-trained networks. Let’s review new models that are available or about to be, and could be of use for our Computer Vision tasks.

New foundation models for computer vision: a short review

In the world of Computer Vision, moving away from ImageNet, there have been many examples of self-supervised networks, some of them generative models (think GANs and, more recently, diffusion models). They are trained on raw images alone, or on image-text pairs (for instance an image and its description). They are sometimes called LVMs (Large Vision Models).

Self- or weakly supervised vision models trained on extremely large amounts of data:

  • DINOv2 (Meta) — a collection of large ViTs (vision transformers, up to 1B parameters) explicitly aimed at being a good foundation model for Computer Vision, trained in a fully self-supervised manner (a minimal feature-extraction sketch follows below).
Unsupervised depth estimation using DINOv2, with very good image understanding in the wild
  • SAM Segment Anything (Meta) — a ViT working on high resolution images, specifically designed to be good at segmentation, and enabling zero-shot segmentation (no annotation required to produce new segmentation masks). SAM can be fine-tuned cheaply using LoRA, drastically reducing the number of training images needed. Another use case is to use SAM as an additional input in medical image segmentation.
A generic model (SAM) gives a segmentation prior, which is not good enough by itself, but which helps with final segmentation
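As an illustration of how DINOv2 can serve as a foundation for downstream tasks, here is a minimal, hedged sketch that uses it as a frozen feature extractor via torch.hub; example.jpg is a placeholder image and the preprocessing follows the usual ImageNet statistics.

```python
# Sketch: DINOv2 as a frozen feature extractor (e.g. to train a linear probe
# or a k-NN classifier on top of the embeddings).
import torch
from torchvision import transforms
from PIL import Image

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # small ViT variant
dinov2.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),   # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    features = dinov2(img)        # (1, 384) global embedding for ViT-S/14
# `features` can feed a linear probe, a k-NN index, or a lightweight task head.
```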

Vision-language foundation models trained on image-text pairs:

  • CLIP (OpenAI) — alignment of images and short descriptions, well suited to low-shot classification and used in practice as a foundation model for various downstream CV tasks (a zero-shot classification sketch follows below)
  • Scaling Open-Vocabulary Object Detection (Google)
CLIP contrastive learning procedure, matching image encoder features with text encoder features
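As a hedged illustration of the low-shot usage mentioned above, here is a minimal zero-shot classification sketch with OpenAI’s clip package; the candidate labels and river_frame.jpg are placeholders to adapt to your own classes.

```python
# Zero-shot classification with CLIP: no training, just candidate text labels.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a plastic bottle", "a photo of driftwood", "a photo of clear water"]
image = preprocess(Image.open("river_frame.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Similarity between the image embedding and each text embedding
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```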

Large generative models, which are now multimodal (including large language models whose architecture allows them to understand complex text):

Vision-specialised multi-task large models:

  • Florence-2 (Microsoft) — a multi-task vision-language model, with both semantic and spatial understanding

The big bad models: closed source, only available through APIs. Large multi-purpose models not centred on Vision, but demonstrating outstanding vision capabilities, with generation capabilities as well:

  • GPT-4V (OpenAI)
  • Gemini (Google)
  • Note that many smaller, open-source multi-purpose vision+text chat models are being developed as well, for instance LLaVA.

All these models are strong foundation models which cover many vision domains and would be good at discriminative or generative tasks in many contexts. Still, how can they be leveraged in our specific, small-scale context?

Building training datasets

A pragmatic way to use these new models is to keep our standard training pipeline, for instance with a widely used YOLO detector, but improve our dataset by generating new training images and/or generating annotations. The process is as follows:

  • A standard dataset consists of an annotated set of training and validation images
  • An augmented dataset would use a strong general purpose model to add automatic annotations:
    1) adding new annotations to unlabelled images ⇒ this requires a model already suited to the task. You may use a very large general-purpose model, carefully fed with examples or prompts, to produce zero-shot annotations, or even fine-tune that large model on your existing human annotations.
    2) adding a new layer of information to current annotations, for instance automatically deriving segmentation annotations from bounding box information using SAM (see the sketch after this list)
Standard training, augmented dataset and generated dataset
  • A generated dataset would consist of generated images alongside their annotations. You build a careful prompt consisting of images and/or text to generate thousands of images and their annotations. You may directly use an API to generate these annotated images (the cost should be small compared to finding good images and collecting human annotations).
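As a hedged illustration of point 2 above (turning boxes into masks with SAM), here is a minimal sketch assuming the segment_anything package and a downloaded ViT-H checkpoint; the image path and box coordinates are placeholders.

```python
# Sketch: derive a segmentation mask from an existing bounding-box annotation.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("train_0001.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# An existing human-annotated bounding box, in (x_min, y_min, x_max, y_max) format
box = np.array([120, 80, 340, 260])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# `masks[0]` is a boolean mask aligned with the image: store it alongside the
# original box to obtain segmentation labels at almost no annotation cost.
```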

It’s critical to keep the validation set separate from the generated or augmented set, as you want to measure actual performance on carefully human-labelled data. This means that in practice, we still need to perform some manual labelling on real images, even if we choose new generative techniques or foundation models.

Examples of augmented datasets

The idea is to start from existing images and improve the labels, by enriching them or making them easier to annotate. Several data labelling platforms now offer SAM or DINOv2 to increase labelling efficiency by pre-segmenting objects in the picture.

Annotation Tool using SAM: https://www.superannotate.com/image-annotation-tool

Examples of generated datasets

While the idea of generating datasets has been around for a long time, and is widely used to train LLMs, it’s actually quite challenging to find real small-scale applications which efficiently leverage generated data (automatic annotations or purely synthetic data).

3D generated dataset for human pose estimation, clothes segmentation
Synthesised images and their annotations

The problem with using a CV rendering pipeline to build a dataset (for instance pasting objects onto backgrounds for segmentation tasks) is that the quality of the data depends strongly on the quality of the generated images, so you will have to put a lot of effort into building the right rendering steps (even more so in 3D).

There are not so many successful examples (here is one) of generating datasets using pure generative models yet, but given the rendering quality and steerability of recent image generation AI models, it’s just a matter of time and tinkering. It might be possible to use ControlNet, starting from existing segmentation masks or contours, to generate new pictures for which we already have labels, but it’s unclear whether this would work well with out-of-distribution classes (i.e. not the standard COCO classes), or whether the resolution would be good enough.
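As a hedged sketch of that ControlNet idea, the snippet below conditions image generation on an existing segmentation map using the diffusers library; the model IDs, prompt and file names are illustrative choices, not a validated recipe.

```python
# Sketch: generate new training images whose segmentation labels are known in
# advance, by conditioning a diffusion model on an existing segmentation map.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

seg_map = Image.open("existing_segmentation_map.png")  # colour-coded class map
images = pipe(
    prompt="a river bank littered with plastic bottles, photo",
    image=seg_map,
    num_inference_steps=30,
).images
images[0].save("generated_with_known_labels.png")      # reuses the original mask
```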

A similar idea, developed in this paper, is to modify existing labelled pictures to generate new ones that share the same segmentation masks, resulting in a supercharged semantic data augmentation.

However, the cost (financial and ecological) of generating thousands of images instead of manually curating and labelling them should be considered: it’s not obvious that the benefits outweigh the cost!

Closing words

The way we do modern computer vision, training models on human-annotated data, is about to be drastically changed by the new big foundation models.

Big foundation models sometimes have “nano-scaled” versions designed for inference on lower-end servers or even embedded applications. However, they are still far too big for many of these applications, and not so cheaply tunable to new tasks. In the near term, we won’t use 500M+ parameter vision transformers in these cases, but rather smaller, more specialised models.

Still, even for small-scale inference and low-resource development, we will make use of large foundation models, either by calling them directly through APIs or local inference, or by reusing some of their knowledge. Today this mostly means helping build labelled data; tomorrow it will involve other means of knowledge transfer, such as distillation or LoRA.

Autodistill from Roboflow, a recent method to distil knowledge from Large Vision Models
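To give an idea of what this looks like in practice, here is a generic, library-agnostic sketch of response-based distillation (an illustration of the idea, not the Autodistill API): a small student network is trained to match the soft predictions of a large frozen teacher on unlabelled images; teacher, student and unlabelled_loader are placeholders.

```python
# Generic knowledge-distillation step: the student mimics the teacher's
# softened output distribution (Hinton-style distillation).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, optimizer, T=2.0):
    with torch.no_grad():
        teacher_logits = teacher(images)   # large frozen foundation model
    student_logits = student(images)       # small, deployable model

    # KL divergence between softened distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# for images in unlabelled_loader:
#     distillation_step(teacher, student, images, optimizer)
```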

For detection or segmentation, there is no standard procedure or widespread go-to method to transfer knowledge from these big or generative models, but one will probably be popularised in 2024!

About the author

Charles Ollion, cofounder @Naia Science

Remark: Transformers or Convolutions?

Most small scale CV applications involve CNNs (Convolutions), while all large recent AI systems involve Transformers. Should we switch from one to the other?

The short answer is: not yet. Transformers are used everywhere because they scale better with data size and number of parameters, which is a prerequisite for all new large generative AI systems. They also enable better integration of multimodal inputs, as we can easily concatenate context tokens with other (picture) tokens and let the attention mechanism mix this information to generate good outputs. Finally, techniques such as LoRA were developed primarily with transformer architectures in mind.
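As a hedged illustration of how LoRA attaches to transformer blocks, here is a minimal sketch using the Hugging Face transformers and peft libraries on a pretrained ViT; the target module names ("query", "value") match that particular ViT implementation, and the 3-class head is a placeholder.

```python
# Sketch: inject low-rank adapters into the attention projections of a ViT,
# so that only a small fraction of the parameters is trained.
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=3
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections inside each block
    modules_to_save=["classifier"],      # also train the new classification head
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # typically well under 1% of the full model
```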

But for most real-time applications on low-end devices, these large transformer models are not yet competitive with classical convnets in terms of efficiency! However, as many teams are working to accelerate transformer inference on edge devices, this could change fast.
