[Hands-On] Prompt-based Image Classification with CLIP

Hugman Sangkeun Jung
13 min read · Jul 13, 2024

(You can find the Korean version of the post at this link.)

In the previous post, we looked at a prompt-based approach for text classification.

In this post, we’ll extend that concept and apply it to image classification tasks. Specifically, we’ll implement actual code to perform prompt-based image classification using OpenAI’s CLIP (Contrastive Language–Image Pre-training) model and analyze the results.

What is Prompt-based Image Classification?

Prompt-based image classification is a method of classifying images using vision-language models such as CLIP. These models are trained on large-scale image-text pair data, enabling them to learn deep associations between visual information and linguistic descriptions. Instead of training a dedicated classifier, we describe each candidate class with a natural-language prompt (e.g., “a photo of a dog”) and let the model pick the prompt that best matches the image.

Starting with CLIP, models like Google's ALIGN, DALL-E, Imagen, and Stable Diffusion have emerged, greatly enhancing the ability to jointly understand images and text. These models go beyond simply classifying images: they can describe image content in natural language or generate images from textual descriptions.
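Before we get to the full CLIP implementation, the core scoring mechanism is worth sketching: the image and each class prompt are embedded into a shared vector space, compared by cosine similarity, and the similarities are turned into class probabilities via a softmax. The sketch below uses dummy random embeddings in place of real CLIP encoder outputs (the class names, embedding dimension, and logit scale of 100 are illustrative assumptions, not actual model values):

```python
import numpy as np

def cosine_sim(query, matrix):
    # Cosine similarity between one vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical class labels, each wrapped in a prompt template.
labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]

rng = np.random.default_rng(0)
dim = 512  # placeholder embedding size

# Stand-ins for the text encoder's output, one embedding per prompt.
text_embeddings = rng.normal(size=(len(prompts), dim))
# Stand-in for the image encoder's output: deliberately close to "dog".
image_embedding = text_embeddings[1] + 0.1 * rng.normal(size=dim)

# Score the image against every prompt; the scale factor (assumed here)
# sharpens the softmax, mimicking CLIP's learned logit scale.
probs = softmax(100.0 * cosine_sim(image_embedding, text_embeddings))
print(labels[int(np.argmax(probs))])  # → dog
```

The key design point is that the classifier is defined entirely by the prompts: swapping in different label descriptions changes the classes with no retraining.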


Hugman Sangkeun Jung is a professor at Chungnam National University, with expertise in AI, machine learning, NLP, and medical decision support.