
CLIP: The Multimodal Powerhouse Transforming Computer Vision


When I first encountered CLIP and saw its performance, it impressed me so much that whenever I start a new project, I take CLIP’s performance as the baseline and then check whether other state-of-the-art models can beat it before deciding which model to use for that specific project.
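For context, a zero-shot CLIP baseline of this kind takes only a few lines. Here is a minimal sketch, assuming the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a placeholder image path and label list:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its paired processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes written as natural-language prompts (placeholders)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg").convert("RGB")  # placeholder path

# Encode the image and all text prompts in a single batch
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each prompt;
# a softmax turns it into a zero-shot classification distribution
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```

Whatever accuracy this produces on your evaluation set is the bar that a task-specific model has to clear.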

Let’s start with the important terms of new-generation AI models:

Multimodal AI Models

A multimodal AI model is any model with at least one of the following properties:

  1. Inputs are of different modalities (the system can process both image and text)
  2. Outputs are of different modalities (the system can generate both image and text)

As a computer vision engineer, I will focus on what “multimodal” means in this field, and in this blog I will be examining models that combine the text and image modalities.

Traditional computer vision models take images as their input data type and extract features from them to learn what a cat should look like (assuming “cat” is one of the classes in our dataset), without having any idea of what a cat means contextually. Is it an animal or an object? The model has no idea! All it learns is to check the features of the input image and decide whether they match class 1, class 2, or class x…
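To make that concrete, here is a minimal sketch of such a classifier, using a pretrained ResNet-50 from torchvision; the image path cat.jpg is a placeholder. Note that the model only emits a score per class index, and the “meaning” of that index lives in an external label table:

```python
import torch
from PIL import Image
from torchvision import models

# A conventional ImageNet classifier: images in, class scores out
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = Image.open("cat.jpg").convert("RGB")  # placeholder path
batch = preprocess(image).unsqueeze(0)        # add a batch dimension

with torch.no_grad():
    logits = model(batch)  # one raw score per ImageNet class index

# The model itself has no notion of what the winning index *means*;
# the human-readable name comes from an external lookup table
idx = logits.argmax(dim=-1).item()
print(idx, weights.meta["categories"][idx])
```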

A multimodal computer vision model, instead, learns the classes by using both text and images, understanding the contextual relationship between an image and the words that describe it.
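As an illustration of that shared understanding, here is a sketch that compares a CLIP image embedding against a few text embeddings in the joint embedding space; the checkpoint, file name, and phrases are again placeholders for the example:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg").convert("RGB")  # placeholder path
texts = ["a cat", "an animal", "a kitchen table"]

with torch.no_grad():
    # Encode image and texts into the same embedding space
    img_emb = model.get_image_features(
        **processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=texts, return_tensors="pt", padding=True))

# Cosine similarity: contextually related phrases ("a cat", "an animal")
# should score higher against a cat photo than unrelated ones
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
sims = (img_emb @ txt_emb.T)[0]
for text, sim in zip(texts, sims):
    print(f"{text}: {sim:.3f}")
```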
