How Object Detection Evolved (Part 4)

From Region Proposals and Haar Cascades to Zero-Shot Techniques

Andrii Polukhin
8 min read · Jul 11, 2023

Object detection has evolved from the hand-crafted methods of early computer vision to deep learning, and modern systems draw on a wide range of these techniques for accurate detection.

This blog post is the next installment in my ongoing series, “The Evolution of Object Detection: From Region Proposals and Haar Cascades to Zero-Shot Techniques.”

For a complete picture, I recommend reading the preceding and following parts of this story, which are linked below:

Without further ado, let us resume our expedition through the realm of object detection.

(Zero | One | Few) — Shot Object Detection

As object detection algorithms have developed, the field has gradually moved toward Few-, One-, and Zero-Shot Object Detection. In this section, we will focus less on technical details and more on the high-level idea of how zero-shot object detection can be performed.

Multimodality

Figure 1. Workflow of a typical multimodal model. Three unimodal neural networks encode the different input modalities independently. After feature extraction, fusion modules combine the different modalities (optionally in pairs), and finally, the fused features are fed into a classification network. | Source: Multimodal Deep Learning: Definition, Examples, Applications

The key concept in this context is multimodality, which means that a neural network can understand several types of data simultaneously. For example, it can be a combination of image and text, image and sound, text and sound, or even image, text, and sound at the same time.

Figure 2. The difference between unimodal and multimodal AI | Source: Multimodal Learning: Benefits & 3 Real-World Examples in 2024

In this approach, we have several input signals, each processed by a corresponding module: a separate module for text, a separate module for images, and a separate module for audio. Together, these modules form a single neural network that is trained from input to output, which is called an end-to-end architecture.

Next, fusion modules are used. They may go by different names, but they perform the same function: they combine image, text, and audio features and perform operations on them. For example, they might look for the image feature vector most similar to a given text feature vector. This is similar to the principle behind the CLIP architecture, which we will talk about later.
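To make the fusion-by-similarity idea concrete, here is a toy sketch in PyTorch. The vectors are random placeholders standing in for the embeddings that real unimodal encoders would produce; it only illustrates the matching step, not a full multimodal network.

```python
import torch
import torch.nn.functional as F

# Toy fusion-by-similarity: random placeholders stand in for the embeddings
# that real text and image encoders would produce.
text_vec = torch.randn(512)         # embedding from the text module
image_vecs = torch.randn(10, 512)   # embeddings of 10 candidate images

# Cosine similarity between the text vector and every image vector.
sims = F.cosine_similarity(text_vec.unsqueeze(0), image_vecs, dim=-1)
best = sims.argmax().item()         # index of the most similar image embedding
print(f"Most similar image: {best} (similarity {sims[best]:.3f})")
```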

CLIP (2021)

CLIP adds an image-text connection to understand the content of the image.

Figure 3. CLIP Illustration | Source: CLIP: Connecting text and images

CLIP is a revolutionary development. The main idea behind CLIP is that it creates a connection between images and texts to better understand the context of an image. CLIP uses two models, a TextEncoder and an ImageEncoder, each of which converts its input into a vector.

CLIP is trained on a dataset of text-image pairs, each consisting of a text description and a corresponding image. During training, the model adjusts the TextEncoder and ImageEncoder parameters so that the vectors of a matching text and image are close to each other, while the vectors of non-matching text descriptions end up far from that image vector.
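As a rough sketch of this contrastive objective (not the authors' exact training code), the loss can be written as a symmetric cross-entropy over the image-text similarity matrix. The embeddings below are random placeholders, and 0.07 is simply CLIP's reported initial temperature, which is learnable in the real model.

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive loss for a batch of N image-text pairs.
# Random tensors stand in for the (L2-normalized) encoder outputs.
N, d = 8, 512
image_emb = F.normalize(torch.randn(N, d), dim=-1)
text_emb = F.normalize(torch.randn(N, d), dim=-1)

logits = image_emb @ text_emb.t() / 0.07   # similarity matrix, scaled by a temperature
targets = torch.arange(N)                  # matching pairs lie on the diagonal
loss_i = F.cross_entropy(logits, targets)        # image -> text direction
loss_t = F.cross_entropy(logits.t(), targets)    # text -> image direction
loss = (loss_i + loss_t) / 2
```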

When using CLIP for zero-shot recognition, we can feed it an image together with a list of words or phrases describing the objects we want to find. For example, for an image of a dog, we can use the TextEncoder to create a vector for each candidate prompt, such as “A photo of a dog”, “A photo of a cat”, and so on. We then compare each of these text vectors with the vector produced by the ImageEncoder for the image. The text whose vector is closest to the image vector indicates the object shown in the image.

Thus, we can use CLIP to classify objects in images without training the model on a task-specific dataset. This opens up a wide range of possibilities for applying CLIP in object detection, where the relationships between texts and images can be used to find objects in images.
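Here is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name is the public OpenAI release, and the image path and candidate texts are placeholders you would swap for your own.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder path
texts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate texts.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```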

OWL-ViT (2022)

OWL-ViT adds image-level patches to understand the location of the objects.

Figure 4. OWL-ViT: Image-level contrastive pre-training | Source: Simple Open-Vocabulary Object Detection with Vision Transformers

In 2022, a new multimodal architecture for object detection, OWL-ViT, was introduced. The network, which is available on the Hugging Face platform, has attracted considerable interest from both researchers and practitioners. Let me tell you more about it.

The basic idea is to create embeddings of an image and a text query and then compare them. The image is split into patches and processed by a Vision Transformer, which produces a set of patch embeddings; these embeddings then pass through self-attention and feed-forward layers. Although some of the steps may seem confusing, in practice they help to improve the quality of the model.

Finally, during the training phase, a contrastive loss function is used to encourage corresponding image-text pairs to have similar embeddings and non-corresponding pairs to have distinct embeddings. The model predicts a bounding box and the probability that a certain text embedding applies to a particular object.

It should be noted that without further adaptation, detection accuracy may be limited. The authors therefore fine-tuned the pre-trained model on object detection datasets using a bipartite matching loss, which improves the quality of the predicted bounding boxes. More information about this process is shown in the diagrams below.

Figure 5. OWL-ViT: Transfer to open-vocabulary detection | Source: Simple Open-Vocabulary Object Detection with Vision Transformers

Now let’s look at an additional feature of this multimodal model. In addition to text, you can use an image as the query. For example, if you have a photo of a butterfly, you can use it as the search query and find similar objects in other images. This works because text and image queries are embedded into the same space, so the model can match against either.

Figure 6. OWL-ViT: Example of one-shot image-conditioned detection | Source: Simple Open-Vocabulary Object Detection with Vision Transformers
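For reference, below is a minimal sketch of text-conditioned detection with the Hugging Face transformers implementation of OWL-ViT. The checkpoint name is the public Google release; the image path, query texts, and score threshold are placeholders, and the exact post-processing helper name can vary across transformers versions.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("scene.jpg")  # placeholder path
texts = [["a photo of a cat", "a photo of a dog"]]  # one query list per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in the original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][label]}: {score:.2f} at {[round(c, 1) for c in box.tolist()]}")
```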

GLIP (2022)

GLIP adds word-level understanding to find the objects by the semantics of the prompt.

Figure 7. GLIP zero-shot transfers to various detection tasks, by writing the categories of interest into a text prompt. | Source: Grounded Language-Image Pre-training

GLIP (2022) goes further by tying detection to the semantics of the prompt. Let’s illustrate this with an example. Suppose we have a sentence about a woman holding a hairdryer and wearing glasses, together with an image showing exactly that. GLIP reformulates object detection as phrase grounding: given both the image and the text prompt as inputs, it can locate entities such as the person, the hairdryer, and the glasses.

Figure 8. We reformulate detection as a grounding task by aligning each region/box to phrases in a text prompt. We add the cross-modality deep fusion to early fuse information from two modalities and to learn a language-aware visual representation. | Source: Grounded Language-Image Pre-training

This technology offers a new approach to finding objects in an image based on their semantic correspondence with a text prompt. Now, we are not just identifying objects, but also associating parts of the text with components of the image.

Even if you only provide the name of the object, such as “stingray”, GLIP will be able to find it, though perhaps with low accuracy. However, if you add a description, such as “flat and round”, it gives the model additional context about what you are looking for. Prompt engineering matters for modern zero-shot object detection methods just as it does for ChatGPT.

Figure 9. A manual prompt tuning example from the Aquarium dataset in ODinW. Given an expressive prompt (“flat and round”), zero-shot GLIP can detect the novel entity “stingray” better. | Source: Grounded Language-Image Pre-training
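GLIP ships with its own inference code in the official microsoft/GLIP repository rather than a single standard API, so the sketch below is purely illustrative: run_glip is a hypothetical helper, and only the prompt-construction part reflects how phrase grounding is driven in practice.

```python
from PIL import Image

def build_prompt(phrases):
    """Detection as grounding: the categories of interest become phrases in one text prompt."""
    return ". ".join(phrases) + "."

image = Image.open("aquarium.jpg")  # placeholder path

prompt_plain = build_prompt(["stingray"])                          # bare class name
prompt_rich = build_prompt(["stingray, which is flat and round"])  # extra context helps

# boxes, phrases, scores = run_glip(image, prompt_rich)  # hypothetical call, not a real API
```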

Segment Anything (2023)

Segment Anything (SAM) adds masks to see the pixel-level location of the objects.

This algorithm, introduced in 2023, makes it possible not only to detect objects in images but also to segment them by producing masks at the pixel level.

Figure 10. Segment Anything | Source: Segment Anything

One of the main features of Segment Anything is its use of an image encoder and a prompt encoder, whose embeddings are combined to segment the image based on prompts. Prompts can be spatial (points, boxes, or rough masks) or, as explored in the paper, free-form text. For instance, you could input “person” as a text prompt, and the algorithm would try to segment everything in the image related to a person.

This not only lets you segment different areas of an image but also helps you understand the layout and content of the scene. With appropriate post-processing, the segmentation masks produced by the algorithm could be used for tasks such as counting the instances of an object.

Figure 11. Example images with overlaid masks from our newly introduced dataset, SA-1B. | Source: Segment Anything
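Below is a minimal sketch of point-prompted segmentation with Meta's segment_anything package; the checkpoint file, image path, and click coordinates are placeholders you would replace with your own.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Assumes the ViT-H checkpoint has been downloaded from the official repository.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("scene.jpg").convert("RGB"))  # HxWx3 uint8 RGB array
predictor.set_image(image)  # computes the image embedding once

point_coords = np.array([[500, 375]])  # one click on the object, (x, y)
point_labels = np.array([1])           # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # return several candidate masks with confidence scores
)
print(masks.shape, scores)  # e.g. (3, H, W) boolean masks
```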

Good Visual Tokenizers (2023)

Good Visual Tokenizers (GVT) is a Multimodal Large Language Model (MLLM) built around a visual tokenizer that has been optimized through suitable pre-training. This tokenizer helps the model understand both the semantic and the fine-grained aspects of visual data.

GVT adds a Large Language Model to investigate the image together with the text.

Figure 12. Different tasks require visual understanding of different perspectives. Mainstream vision-language tasks, e.g., (a) VQA and (b) Image Captioning, mainly focus on semantic understanding of the image. In this work, we also study two fine-grained visual understanding tasks: (c) Object Counting (OC) and (d) Multi-Class Identification (MCI). | Source: What Makes for Good Visual Tokenizers for Large Language Models?

GVT introduces an optimized visual tokenizer within a Large Language Model, enabling a more comprehensive investigation of images along with the associated text. While the application of these algorithms to specific domains such as medical imagery might require additional research, GVT has already demonstrated superior performance on tasks involving visual question answering, image captioning, and fine-grained visual understanding tasks such as object counting and multi-class identification.

Figure 13. Framework of GVT. First distill the features of a pretrained CLIP via smoothed L1 loss. Then, use it to encode images into a set of tokens, which are fed into the Perceiver Resampler as soft prompts. Together with language instructions, these prompts are fed into LLM to generate responses. Only the Perceiver Resampler is optimized in this process. | Source: What Makes for Good Visual Tokenizers for Large Language Models?
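As a toy illustration of the distillation step mentioned in the caption above (not the authors' code), a student visual tokenizer can be trained to match frozen CLIP features with a smoothed L1 loss; the tensors here are random placeholders for the real features.

```python
import torch
import torch.nn.functional as F

# Toy feature distillation with a smoothed L1 loss: random tensors stand in for
# the student tokenizer's outputs and the frozen CLIP teacher's features.
student_feats = torch.randn(4, 256, 1024, requires_grad=True)  # batch, tokens, dim
teacher_feats = torch.randn(4, 256, 1024)                      # frozen CLIP features

loss = F.smooth_l1_loss(student_feats, teacher_feats)
loss.backward()  # in practice, gradients would update the student tokenizer
print(loss.item())
```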

Integrating text and images into one model expands what we can understand about the data and how we can process it. With algorithms like the ones above, significant progress can be made on tasks that previously required complex pipelines and large amounts of labeled data.

To sum up Zero-Shot Object Detection:

  1. CLIP adds an image-text connection to understand the content of the image.
  2. OWL-ViT adds image-level patches to understand the location of the objects.
  3. GLIP adds word-level understanding to find the objects by the semantics of the prompt.
  4. SAM adds masks to see the pixel-level location of the objects.
  5. GVT adds a Large Language Model to investigate the image together with the text.

Conclusion

In conclusion, the evolution of object detection algorithms has been a remarkable journey, from the early days of computer vision to today's state-of-the-art deep learning techniques. We have moved from traditional methods like the Viola-Jones and HOG detectors to more advanced approaches such as R-CNN, YOLO, SSD, and CenterNet, which introduced end-to-end architectures for improved adaptability. However, the most groundbreaking leap came with zero-shot methods like CLIP, OWL-ViT, GLIP, Segment Anything, and GVT. These techniques let us detect objects in images without training a detector on a task-specific dataset, opening up a new era of possibilities in the field of object detection.

Thank you for taking the time to read this article. If you found it informative and engaging, feel free to connect with me through my social media channels.

If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any of my communication channels.

I look forward to sharing more insights and knowledge with you in the future!

