How to Search Nested Data with Python, DocArray and Subindices

Multi/cross-modal queries made easier

Johannes Messner
Jina AI

--

Note: this article is better viewed in notebook format. We recommend going there so you can run the code in your browser!

Vector databases are great. They allow you to retrieve embedding vectors quickly and efficiently based on similarity, and thus form a key building block of many neural search applications.

But vector databases deal in, well, vectors, whereas you usually want to think about your data. When your data is simple, these two things are largely equivalent: Each vector represents one data point, and every data point is associated with one vector. But the real world is messy, and that’s where this isomorphism breaks down.

Luckily, there is a solution for that!

The task: product search

In this notebook we will work through one such example: Our database consists of listings of an online store. Each listing, in turn, contains multiple images and a product description.

We want to make this data searchable. Further, we want the user of our little search app to be able to use different modalities as their query input: they can search by text, by image, or by both at the same time.

To solve this problem, we need just one tool: DocArray.

Specifically we will heavily leverage three DocArray features:

  1. Multimodal documents, to model our data
  2. Subindices, to make our data points, and parts of those data points, searchable
  3. Document Stores, to store our data on disk (and efficiently retrieve it from there)

We will also use CLIP-as-Service to create embeddings for our data, but you could use your favourite image and text embedding models instead.

Data indexing

Before we actually index our data, let’s take a look at what our data will have to look like to solve the task above.

  • To search through the listing descriptions, each listing needs a semantic embedding that represents its description
  • To search through listing images, each listing needs an embedding for every image in that listing
  • To search through the listings as a whole, each listing needs an embedding that represents the listing in its entirety

So we’ll be dealing with something like this:

This kind of nested data structure usually cannot be preserved when storing the embeddings in a vector database.

However, DocArray subindices allow us to do just that.

Subindices explained

In DocArray, data is stored in the form of Documents that are organized in a DocumentArray. By default, a DocumentArray is an in-memory data structure, but it also natively supports Document Stores, which are (vector) database backends that can be used to persist data on disk.

Every DocumentArray represents one search index, so given a query, we can find elements contained in it. Once matching Documents are found, they can be loaded into memory.

Subindices extend this pattern to nested data. Each subindex represents one nested level of the parent DocumentArray, such as image or description. If subindices are enabled, you can perform a search directly on that level, without first loading all of your data into memory, just like you do with the root-level index.

Under the hood, each subindex creates a separate database index that stores the associated data independently from the other subindices or the root index.
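Concretely, a subindex configuration might look like the following sketch; the field names, embedding dimensions, and backend choice here are illustrative placeholders, not values prescribed by this article:

```python
# Hypothetical subindex configuration for a Document Store backed
# DocumentArray: one separate database index per nested level.
subindex_configs = {
    '@.[images]': {'n_dim': 512},       # index over per-image embeddings
    '@.[description]': {'n_dim': 512},  # index over description embeddings
}

# Passed at construction time, backend-dependent (not executed here):
# da = DocumentArray(
#     storage='annlite',
#     config={'n_dim': 512},
#     subindex_configs=subindex_configs,
# )
```

A query can then target one level directly, e.g. searching only the image subindex, while the root index still answers whole-listing queries.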

See the code in action

The rest of this post is better viewed in a notebook, since we’re getting down and dirty with the code:

--