Fine-tuning Multimodal Embedding Models
Adapting CLIP to YouTube Data (with Python Code)
This is the 4th article in a larger series on multimodal AI. In the previous post, we discussed multimodal RAG systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a system using CLIP. One issue with this approach, however, is that vector search with a general-purpose embedding model (like CLIP) may perform poorly in domain-specific use cases. In this article, I'll discuss how we can mitigate these issues by fine-tuning multimodal embedding models.
Multimodal embeddings represent multiple data modalities in the same vector space such that similar concepts are co-located. A visual example of this is shown below, where semantically similar items (e.g. a picture of a dog and its corresponding caption) are close, while dissimilar items (e.g. a picture of a cat and a caption describing a dog) are far apart.
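To make this concrete, here is a minimal sketch (not taken from the article's code) of how a general-purpose model like CLIP scores an image against candidate captions in that shared vector space, using Hugging Face's transformers library. The checkpoint name is the standard openai/clip-vit-base-patch32, but the image URL is a placeholder you'd swap for your own file.

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a general-purpose CLIP checkpoint and its paired processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL -- replace with a local path or real image
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)
captions = ["a photo of a dog", "a photo of a cat"]

# Tokenize the captions and preprocess the image in one call
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Normalize the embeddings and compute cosine similarity
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).squeeze()

for caption, score in zip(captions, similarity.tolist()):
    print(f"{caption}: {score:.3f}")  # the matching caption should score higher
```

For a dog photo, the "a photo of a dog" caption should land closer to the image embedding than "a photo of a cat", which is exactly the co-location property illustrated above. Fine-tuning, covered in the rest of this article, nudges these embeddings so that the same property holds for domain-specific data.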