Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP

Aryamaan Verma
6 min read · Mar 29, 2024


Aryamaan Verma (200040029) | Utkarsh Chittora (20d070085)


Hi folks! In this blog we address the problem of zero-shot out-of-distribution (OOD) detection, where the goal is to detect samples that belong to none of the known classes, without any prior training data for the unseen classes, using the pre-trained model CLIP and a novel method called ZOC.

Table of Contents

Introduction

Background and Related Work

Methodology

Key Differences

Experiments Conducted

Results and Discussions

Conclusions

References

Introduction

In the context of OOD detection, the primary objective is to identify samples that do not belong to any of the known classes. ZOC leverages recent advancements in zero-shot classification through multi-modal representation learning, extending the pre-trained language-vision model CLIP by training a text-based image description generator on top of it.

At test time, the extended model generates candidate unknown class names for each test sample and computes a confidence score over both the known class names and the candidate unknown class names for zero-shot OOD detection.
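
To make the candidate-label step concrete, here is a minimal sketch. In the paper, a text decoder trained on top of the frozen CLIP image encoder produces descriptive words for a test image; words that are not seen class names become candidate unseen labels. The `generate_description_words` function below is a hypothetical placeholder for that decoder, not the paper's actual implementation.

```python
# Hedged sketch of candidate unseen-label generation.
# `generate_description_words` stands in for the text decoder ZOC trains
# on top of the frozen CLIP image encoder.
def candidate_unseen_labels(image, seen_classes, generate_description_words):
    words = generate_description_words(image)  # e.g. ["dog", "grass", "frisbee"]
    seen = {c.lower() for c in seen_classes}
    # Any generated word that is not a seen class name becomes a candidate
    # unseen label for this particular test image.
    return [w for w in words if w.lower() not in seen]
```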

By leveraging the rich feature space shared by both image and text data in CLIP, ZOC demonstrates superior OOD detection performance, offering a promising solution for enhancing the reliability of OOD sample detection in real-world applications.

Background and Related Work

The background and related work in the field of zero-shot out-of-distribution (OOD) detection have seen significant advancements, with various methods and techniques being developed to address this challenging problem. One notable approach is the use of the pre-trained language-vision model CLIP, which has been leveraged for zero-shot image classification.
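
As a quick refresher on that background, below is a minimal sketch of CLIP zero-shot classification using the Hugging Face transformers library; the checkpoint name, label set, prompt template, and image path are illustrative assumptions rather than details taken from the ZOC paper.

```python
# Minimal CLIP zero-shot classification sketch (illustrative, not from the paper).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

seen_classes = ["airplane", "automobile", "bird", "cat", "deer"]  # example labels
prompts = [f"a photo of a {c}" for c in seen_classes]

image = Image.open("test_image.jpg")  # hypothetical test image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image embedding
# and each text embedding; softmax turns them into class scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(seen_classes, probs[0].tolist())))
```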

Recent research has focused on extending the CLIP model to work in the OOD setting, aiming to dynamically generate candidate unseen labels for each test image and define a novel confidence score based on the similarity of the test image to seen and generated candidate unseen labels in the feature space.

In addition to the advancements in leveraging CLIP for zero-shot OOD detection, previous work in the field has explored various techniques and methodologies for OOD detection. These include ODIST, which performs open-world classification via distributionally shifted instances, and DOC, which performs deep open classification of text documents.

This diverse range of methods reflects the ongoing effort to develop robust and effective approaches for identifying samples that fall outside the known classes without any training data for the unseen ones.

Methodology

Fig. 1: Model outline

At inference time, ZOC generates candidate unknown class names for each test sample and computes a confidence score over both the known class names and the candidate unknown class names for zero-shot OOD detection.
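
One way to sketch this scoring idea: embed prompts for both the seen classes and the generated candidate unseen labels, compare them with the image embedding, and treat the softmax mass that falls on the unseen candidates as the OOD score. This is a hedged sketch with illustrative names and a CLIP-style temperature; the paper's exact formulation may differ in its details.

```python
# Hedged sketch of a ZOC-style OOD score, under the assumptions above.
import torch

def ood_score(image_emb, seen_text_embs, unseen_text_embs, temperature=100.0):
    """All inputs are L2-normalized CLIP embeddings:
    image_emb: (d,), seen_text_embs: (k_seen, d), unseen_text_embs: (k_unseen, d)."""
    sims = torch.cat([seen_text_embs @ image_emb, unseen_text_embs @ image_emb])
    probs = torch.softmax(temperature * sims, dim=0)
    # The more probability mass the candidate unseen labels absorb,
    # the more likely the image is out-of-distribution.
    return probs[len(seen_text_embs):].sum().item()
```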

The methodology includes an experimental evaluation on five benchmark datasets for OOD detection: CIFAR10, CIFAR100, CIFAR+10, CIFAR+50, and TinyImageNet. The primary evaluation metric is the Area Under the ROC Curve (AUROC), a standard measure for OOD detection, and the results demonstrate that ZOC outperforms the baselines by a large margin.
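
For reference, AUROC treats OOD detection as a binary ranking problem: unseen-class samples are labeled positive, every test sample receives a score, and AUROC measures how well the scores separate the two groups. A minimal sketch with scikit-learn and placeholder values:

```python
# AUROC for OOD detection (sketch; the score arrays here are placeholders).
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = OOD (unseen class), 0 = in-distribution (seen class)
labels = np.array([0, 0, 1, 1, 0, 1])
# Higher score should mean "more likely OOD" (e.g. the ood_score sketch above)
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9])

print(roc_auc_score(labels, scores))  # 1.0 = perfect separation, 0.5 = chance
```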

Additionally, the methodology involves a comprehensive comparison of the proposed ZOC method with 11 OOD detection baselines, each of which either requires training a closed-world classifier or builds on a pre-trained model as its backbone. The comparison aims to highlight the effectiveness of ZOC in zero-shot OOD detection.

Key Differences

Fig. 2: Values for the different datasets

The key differences between ZOC and traditional methods for out-of-distribution (OOD) detection are as follows:

  • Training Approach: ZOC does not require specific training data for OOD detection, unlike traditional methods that rely on training a closed-world classifier on seen classes using labeled training data.
  • Candidate Label Generation: ZOC dynamically generates candidate unknown class names for each test sample, while traditional methods either require a set of candidate labels to represent possible OOD labels or have prior knowledge about unseen classes for detection.
  • Dependency on a Closed-World Classifier: traditional methods rely on a trained closed-world classifier for OOD detection, whereas ZOC needs neither such a classifier nor any prior knowledge about the unseen classes.

Experiments Conducted

Fig. 3: Approach

The experiments conducted in the study encompassed a comprehensive evaluation of the proposed ZOC method on various benchmark datasets, including CIFAR10, CIFAR100, CIFAR+10, CIFAR+50, and TinyImageNet, to assess its effectiveness in out-of-distribution (OOD) detection. The difficulty of each OOD detection task was measured with the openness metric, which grows with the number of unseen classes presented to the model at test time. The evaluation measure used for OOD detection was the Area Under the ROC Curve (AUROC), a commonly used metric for assessing detection performance.
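
A commonly used form of the openness measure, following the open-set recognition literature (Scheirer et al. 2012, reference 3 below), is sketched here; the class counts in the example are illustrative, not the paper's exact splits.

```python
# A commonly used form of the openness measure (assumption: this is the
# variant used; the class counts below are illustrative only).
import math

def openness(num_seen_classes: int, num_test_classes: int) -> float:
    # 0 = fully closed world; values closer to 1 mean more unseen classes
    # appear at test time, i.e. a harder OOD detection task.
    return 1.0 - math.sqrt(2 * num_seen_classes / (num_seen_classes + num_test_classes))

# Illustrative CIFAR+10-style split: 4 seen classes, 10 extra unseen at test time
print(round(openness(4, 4 + 10), 3))  # -> 0.333
```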

Results and Discussions

Fig. 4: Results

A notable difference between ZOC and the baselines is that ZOC’s inference is based on dynamically generated candidate unseen labels for each sample, which contributes to its enhanced detection capability.

Additionally, the study compared ZOC with other existing methods, highlighting its effectiveness in detecting OOD samples. The results confirmed that ZOC is superior to traditional supervised models and to baselines that use a pre-trained CLIP backbone as their encoder.

The discussion also included an error study of generated labels and histograms of confidence scores, providing insights into the performance and potential areas for improvement in future work.

Conclusions

In conclusion, the paper introduces the novel task of zero-shot out-of-distribution (OOD) detection, leveraging recent advancements in zero-shot closed-world classification with the pre-trained CLIP model. The proposed ZOC method extends the capabilities of CLIP by dynamically generating candidate unseen labels for each test image and defining a novel confidence score based on the similarity of the test image to both seen and generated candidate unseen labels in the feature space.

Experimental results confirm the superiority of ZOC over traditional supervised models and over baselines that use a pre-trained CLIP backbone as their encoder, showcasing its effectiveness in zero-shot OOD detection.

References

  1. Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  2. Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7008–7024.
  3. Scheirer, W. J.; de Rezende Rocha, A.; Sapkota, A.; and Boult, T. E. 2012. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7): 1757–1772.
  4. Scheirer, W. J.; Jain, L. P.; and Boult, T. E. 2014. Probability models for open set recognition. IEEE transactions on pattern analysis and machine intelligence, 36(11): 2317–2324.
  5. Shu, L.; Benajiba, Y.; Mansour, S.; and Zhang, Y. 2021. ODIST: Open World Classification via Distributionally Shifted Instances. In Findings of the Association for Computational Linguistics: EMNLP 2021, 3751–3756.
  6. Shu, L.; Xu, H.; and Liu, B. 2017. Doc: Deep open classification of text documents. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017.
  7. Shu, L.; Xu, H.; and Liu, B. 2018. Unseen class discovery in open-world classification. arXiv preprint arXiv:1801.05609.
