CLIP-as-service powered by Jina AI

Now you can create SOTA vector embeddings for text and images with a simple API call, without worrying about the underlying implementation details.

Shubham Saboo
Jina AI

--

Generate SOTA text and image embeddings with just one line of code!

Introduction

CLIP-as-service is an out-of-the-box solution by Jina AI to generate text and image embeddings on the fly using the CLIP model. It is an API-based service: you send the input data (i.e., text and images) and get back fixed-length vector embeddings. It is built on an intuitive client-server architecture, making it easy to use without any learning curve or prerequisites.
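As a quick preview, here is what a call looks like from the client side. This is a minimal sketch that assumes a CLIP server is already running at grpc://0.0.0.0:51000 (setting one up is covered later in this post):

from clip_client import Client

# Connect to a running CLIP server (address assumed; see the setup steps below)
c = Client('grpc://0.0.0.0:51000')

# One call turns sentences (or image URIs) into fixed-length vectors
vectors = c.encode(['First do it', 'then do it right', 'then do it better'])

print(vectors.shape)  # e.g. (3, 512) with the default ViT-B/32 model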

What is CLIP?

CLIP (Contrastive Language–Image Pre-training) is one of the most efficient ways to connect text and images in the form of embeddings. CLIP learns visual concepts from natural language supervision to generate state-of-the-art embeddings.

In simple terms, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It builds on work in zero-shot transfer, natural language supervision, and multimodal learning (i.e., combining different data types together).

CLIP vs. Conventional Vision Models

CLIP can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3. It acts as a bridge between computer vision and natural language processing.

CLIP is able to distinguish between different images, even if it has never seen those images before because it has a general understanding of the phrases that represent them. It can just as easily distinguish between an image of a “cat” and a “dog” as it can between “an illustration of Deadpool pretending to be a bunny rabbit” and “an underwater scene in the style of Vincent Van Gogh” (even though it has definitely never seen those things in its training data).

CLIP-as-service

CLIP-as-service is a low-latency, highly scalable service for embedding images and text. It can be easily integrated as a microservice into neural search solutions. It has the following features:

  • Fast: Serves CLIP models with ONNX Runtime or PyTorch; it is designed for large volumes of data and long-running tasks.
  • Elastic: It lets you horizontally scale multiple CLIP models up and down on a single GPU, with automatic load balancing.
  • Easy-to-use: It has no learning curve and a minimalist design on both the client and server sides, exposing a simple API for image and sentence embedding.
  • Integration: It integrates smoothly with Jina AI’s neural search ecosystem, including Jina and DocArray. These integrations let you build cross-modal and multi-modal search solutions in no time.

Install CLIP-as-service

You can install the CLIP client and server independently via pip. The only requirement is to have Python 3.7+ installed on your system:

  • To install the CLIP server, you can run the following command:
pip install clip-server
  • To install the CLIP client, you can run the following command:
pip install clip-client

Set up CLIP Server

To start, you have to set up a CLIP server that downloads the model and hosts it at a particular IP address and port. Once the server is ready, you can use the client to make requests and get the results.

To start the server, run the following command:

python -m clip_server

When the server is up, it will show the following output:

🔗  Protocol         GRPC
🏠  Local access     0.0.0.0:51000
🔒  Private network  192.168.3.62:51000
🌐  Public address   87.191.159.105:51000

Connect from Client

Once the server is up and running, you can use a GRPC client to connect to it and make requests. Depending on where the client and server are located, you can use one of the different addresses shown above. For more information, check the C-a-S documentation.

To verify the connection between the client and the server, you can run a short Python script as follows:
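A minimal sketch of such a script, assuming the server from the previous step is reachable at grpc://0.0.0.0:51000:

from clip_client import Client

# Point the client at the address printed by the server
c = Client('grpc://0.0.0.0:51000')

# profile() sends a tiny request and prints a latency breakdown
c.profile()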

If the connection is working properly, you will get a response like the following:

Roundtrip                     16ms  100%
├── Client-server network     12ms   75%
└── Server                     4ms   25%
    ├── Gateway-CLIP network   0ms    0%
    └── CLIP model             4ms  100%

Minimal Working Example

In this example, we will build a simple text-to-image search using CLIP-as-service: a user inputs a sentence and gets the matching images as the result. We’ll use the Totally Looks Like dataset and the DocArray package from Jina AI to build the entire search solution.

Note: DocArray is included within clip-client as an upstream dependency, so you don't need to install it separately.

First, we will load the images, which you can simply pull from Jina Cloud:
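A sketch of the loading step, assuming the dataset is published on Jina Cloud under the name 'ttl-original' as in the CLIP-as-service examples:

from docarray import DocumentArray

# Pull the Totally Looks Like images from Jina Cloud and cache them locally
da = DocumentArray.pull('ttl-original', show_progress=True, local_cache=True)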

The TTL dataset contains 12,032 images, so it may take a while to pull. Once done, you can visualize it using DocArray’s built-in da.plot_image_sprites(), which renders the whole collection as a single sprite image.

The next step is to encode the images. Start the CLIP server with python -m clip_server; in this example, the server listens at 0.0.0.0:51000 using the GRPC protocol (you will get this information after running the server). To encode the images, you can use the following Python client script:
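A sketch of the encoding step, reusing the DocumentArray da from above and the server address shown by the previous command:

from clip_client import Client

# Connect to the CLIP server started above
c = Client(server='grpc://0.0.0.0:51000')

# Encode every image in the DocumentArray; embeddings are written back into each Document
da = c.encode(da, show_progress=True)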

After encoding the images, it’s time to test the power of CLIP and see the search results in action:
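A sketch of the query step: encode the sentence with the same client, then run DocArray’s vector search over the image embeddings (the query text and the limit of nine results are only illustrative):

# Encode the query sentence into a vector using the same CLIP server
vec = c.encode(['a happy potato'])

# Retrieve the nine nearest images by embedding similarity and display them as a sprite
results = da.find(query=vec, limit=9)
results[0].plot_image_sprites()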

It will produce the following result:

Search result for the query → “A happy potato”

CLIP-as-service in Action!

To give you a glimpse of the potential of CLIP-as-service and how you can leverage it to create state-of-the-art search engines with just one line of code, we created a simple Text-to-Image search example and an Image-to-Text search example using the data from Pride and Prejudice!

To get started, first run the C-a-S server, which will be hosted locally and accessible to the client. Follow this notebook to run the C-a-S server 👉

Once the server is up and running, you can use the client to make requests to it and get the results. Follow this notebook to use the C-a-S client for building multimodal search examples 👉

Learning Resources

Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆

Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋
