CLIP-as-service powered by Jina AI
Now you can create state-of-the-art (SOTA) vector embeddings for text and images with a simple API call, without worrying about the underlying implementation details.
Introduction
CLIP-as-service is an out-of-the-box solution by Jina AI to generate text and image embeddings on the fly using the CLIP model. It is an API-based service: you send the input data (i.e., text or images) and get back fixed-length vector embeddings. It is built on an intuitive client-server architecture, making it easy to use without any learning curve or prerequisites.
What is CLIP?
CLIP (Contrastive Language–Image Pre-training) is one of the most efficient ways to connect text and images in the form of embeddings. CLIP learns visual concepts from natural-language supervision to generate state-of-the-art embeddings.
In simple terms, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images. It builds on work in zero-shot transfer, natural language supervision, and multimodal learning (i.e., combining different data types together).
CLIP vs. Conventional Vision Models
CLIP can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. It acts as a bridge between computer vision and natural language processing.
CLIP is able to distinguish between different images, even if it has never seen those images before, because it has a general understanding of the phrases that represent them. It can just as easily distinguish between an image of a “cat” and a “dog” as it can between “an illustration of Deadpool pretending to be a bunny rabbit” and “an underwater scene in the style of Vincent Van Gogh” (even though it has almost certainly never seen those things in its training data).
CLIP-as-service
CLIP-as-service is a low-latency, highly scalable service for embedding images and text. It can be easily integrated as a microservice into neural search solutions. It has the following features:
- Fast: Serving CLIP models with the ONNX Runtime or PyTorch is fast by design, built for large data and long-running tasks.
- Elastic: It lets you horizontally scale up and down multiple CLIP models on a single GPU, with automatic load balancing.
- Easy-to-use: It has no learning curve and a minimalist design on both the client and server sides. It exposes a simple-to-use API for image and sentence embedding.
- Integration: It integrates smoothly with Jina AI’s neural search ecosystem including Jina and DocArray. These integrations let you build cross-modal and multi-modal search solutions in no time.
Install CLIP-as-service
You can install the CLIP client and server independently via pip. The only requirement is Python 3.7+ installed on your system:
- To install the CLIP server, you can run the following command:
pip install clip-server
- To install the CLIP client, you can run the following command:
pip install clip-client
Set up CLIP Server
To start, you have to set up a CLIP server that will download the model and host it at a particular IP address and port. Once the server is ready, you can use the client to make requests and get the results.
To start the server, run the following command:
python -m clip_server
When the server is up, it will show the following output:
🔗 Protocol GRPC
🏠 Local access 0.0.0.0:51000
🔒 Private network 192.168.3.62:51000
🌐 Public address 87.191.159.105:51000
Connect from Client
Once the server is up and running, you can use a gRPC client to connect to it and make requests. Depending on the locations of the client and server, you can use different IP addresses (local, private network, or public). For more information, check the C-a-S documentation.
To verify the connection between client and server, you can run a short Python script as follows:
If the connection is working, you will get a response like the following:
Roundtrip 16ms 100%
├── Client-server network 12ms 75%
└── Server 4ms 25%
├── Gateway-CLIP network 0ms 0%
└── CLIP model 4ms 100%
Minimal Working Example
In this example, we will build a simple text-to-image search using CLIP-as-service. A user can input a sentence and get the matching images as the result. We’ll use the Totally Looks Like (TTL) dataset and the DocArray package from Jina AI to build the entire search solution.
Note: DocArray is included within clip-client as an upstream dependency, so you don't need to install it separately.
First, we will load the images which you can simply pull from the Jina Cloud:
The TTL dataset contains 12,032 images, so it may take a while to pull. Once done, you can visualize it using DocArray's built-in da.plot_image_sprites() method, which will produce the following image block:
Next, encode the images. Start the CLIP server with the command python -m clip_server. For instance, suppose the server is running at 0.0.0.0:51000 with the GRPC protocol (you will see this information after starting the server). To encode the images, you can use the following Python client script:
After encoding the images, it’s time to test the power of CLIP and see the search results in action:
It will produce the following result:
CLIP-as-service in Action!
To give you a glimpse of the potential capabilities of CLIP-as-service and how you can leverage it to create state-of-the-art search engines with just one line of code, we created a simple text-to-image search example and an image-to-text search example using the data from Pride and Prejudice!
To get started, first run the C-a-S server, which will be hosted locally and accessible by the client. Follow this notebook to run the C-a-S server 👉
Once the server is up and running, you can use the client to make requests to it and get the results. Follow this notebook to use the C-a-S client for building multimodal search examples 👉
Learning Resources
Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆
Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋