CLIP-as-service powered by Jina AI

Now you can create SOTA vector embeddings for text and images with a simple API call, without worrying about the underlying implementation details.

Shubham Saboo
Jina AI

--

Generate SOTA text and image embeddings with just one line of code!

Introduction

CLIP-as-service is an out-of-the-box solution by Jina AI to generate text and image embeddings on the fly using the CLIP model. It is an API-based service: you send the input data (i.e., text and images) and get back fixed-length vector embeddings. It is built on an intuitive client-server architecture, making it easy to use without any learning curve or prerequisites.
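As a quick preview, here is what a call looks like from the client side. This is a minimal sketch that assumes a CLIP server is already running at grpc://0.0.0.0:51000 (setting one up is covered later in this post):

from clip_client import Client

# Connect to a running CLIP server (address assumed; see the setup steps below)
c = Client('grpc://0.0.0.0:51000')

# One call turns sentences (or image URIs) into fixed-length vectors
vectors = c.encode(['First do it', 'then do it right', 'then do it better'])

print(vectors.shape)  # e.g. (3, 512) with the default ViT-B/32 model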

What is CLIP?

CLIP (Contrastive Language–Image Pre-training) is one of the most efficient ways to connect text and images in the form of embeddings. CLIP learns visual concepts from natural language supervision to generate state-of-the-art embeddings.

In simple terms, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It builds on work in zero-shot transfer, natural language supervision, and multimodal learning (i.e., combining different data types together).

CLIP vs. Conventional Vision Models

CLIP can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3. It acts as a bridge between computer vision and natural language processing.

CLIP is able to distinguish between different images, even if it has never seen those images before because it has a general understanding of the phrases that represent them. It can just as easily distinguish between an image of a “cat” and a “dog” as it can between “an illustration of Deadpool pretending to be a bunny rabbit” and “an underwater scene in the style of Vincent Van Gogh” (even though it has definitely never seen those things in its training data).

CLIP-as-service

CLIP-as-service is a low-latency, highly scalable service for embedding images and text. It can be easily integrated as a microservice into neural search solutions. It has the following features:

  • Fast: Serves CLIP models with ONNX Runtime or PyTorch; it is designed for large volumes of data and long-running tasks.
  • Elastic: It lets you horizontally scale multiple CLIP models up and down on a single GPU, with automatic load balancing.
  • Easy-to-use: It has no learning curve and a minimalist design on both the client and server sides, exposing a simple API for image and sentence embedding.
  • Integration: It integrates smoothly with Jina AI’s neural search ecosystem, including Jina and DocArray. These integrations let you build cross-modal and multi-modal search solutions in no time.

Install CLIP-as-service

You can install the CLIP client and server independently via pip. The only requirement is to have Python 3.7+ installed on your system:

  • To install the CLIP server, you can run the following command:
pip install clip-server
  • To install the CLIP client, you can run the following command:
pip install clip-client

Set up CLIP Server

To start, you have to set up a CLIP server that downloads the model and hosts it at a particular IP address and port. Once the server is ready, you can use the client to make requests and get the results.

To start the server, run the following command:

python -m clip_server

When the server is up, it will show the following output:

🔗  Protocol         GRPC
🏠  Local access     0.0.0.0:51000
🔒  Private network  192.168.3.62:51000
🌐  Public address   87.191.159.105:51000

Connect from Client

Once the server is up and running, you can use a GRPC client to connect to it and make requests. Depending on where the client and server are located, you can use one of the different addresses shown above. For more information, check the C-a-S documentation.

To verify the connection between the client and the server, you can run a short Python script as follows:
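A minimal sketch of such a script, assuming the server from the previous step is reachable at grpc://0.0.0.0:51000:

from clip_client import Client

# Point the client at the address printed by the server
c = Client('grpc://0.0.0.0:51000')

# profile() sends a tiny request and prints a latency breakdown
c.profile()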

If the connection is working properly, you will get a response like the following:

Roundtrip                     16ms  100%
├── Client-server network     12ms   75%
└── Server                     4ms   25%
    ├── Gateway-CLIP network   0ms    0%
    └── CLIP model             4ms  100%

Minimal Working Example

In this example, we will build a simple text-to-image search using CLIP-as-service: a user inputs a sentence and gets the matching images as the result. We’ll use the Totally Looks Like dataset and the DocArray package from Jina AI to build the entire search solution.

Note: DocArray is included within clip-client as an upstream dependency, so you don't need to install it separately.

First, we will load the images, which you can simply pull from Jina Cloud:
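A sketch of the loading step, assuming the dataset is published on Jina Cloud under the name 'ttl-original' as in the CLIP-as-service examples:

from docarray import DocumentArray

# Pull the Totally Looks Like images from Jina Cloud and cache them locally
da = DocumentArray.pull('ttl-original', show_progress=True, local_cache=True)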

The TTL dataset contains 12,032 images, so it may take a while to pull. Once done, you can visualize it using DocArray’s built-in da.plot_image_sprites(), which renders the whole collection as a single sprite image.

The next step is to encode the images. Start the CLIP server with python -m clip_server; in this example, the server listens at 0.0.0.0:51000 using the GRPC protocol (you will get this information after running the server). To encode the images, you can use the following Python client script:
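A sketch of the encoding step, reusing the DocumentArray da from above and the server address shown by the previous command:

from clip_client import Client

# Connect to the CLIP server started above
c = Client(server='grpc://0.0.0.0:51000')

# Encode every image in the DocumentArray; embeddings are written back into each Document
da = c.encode(da, show_progress=True)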

After encoding the images, it’s time to test the power of CLIP and see the search results in action:
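A sketch of the query step: encode the sentence with the same client, then run DocArray’s vector search over the image embeddings (the query text and the limit of nine results are only illustrative):

# Encode the query sentence into a vector using the same CLIP server
vec = c.encode(['a happy potato'])

# Retrieve the nine nearest images by embedding similarity and display them as a sprite
results = da.find(query=vec, limit=9)
results[0].plot_image_sprites()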

It will produce the following result:

Search result for the query → “A happy potato”

CLIP-as-service in Action!

To give you a glimpse of the potential of CLIP-as-service and how you can leverage it to create state-of-the-art search engines with just one line of code, we created a simple Text-to-Image search example and an Image-to-Text search example using the data from Pride and Prejudice!

To get started, first run the C-a-S server, which will be hosted locally and accessible to the client. Follow this notebook to run the C-a-S server 👉

Once the server is up and running, you can use the client to make requests to it and get the results. Follow this notebook to use the C-a-S client for building multimodal search examples 👉

Learning Resources

Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆

Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋
