Engineering Insights

The next level of multi-modality in DocArray and Jina

Johannes Messner
Published in Jina AI
7 min read · Jul 29, 2022


At Jina AI we always aim to build the best possible tools for Neural Search and other multimodal applications, such as generative art.

Over the past few weeks we have been chiseling away at new features and APIs that make working with any data (your data, whatever that may be) an absolute breeze.

Crucially, we have put our focus on making all of this work locally (in DocArray) as well as in the cloud (in Jina).

Overall, this represents a big step-change in the way you interact with DocArray and Jina. Or rather, in how they can adapt to you: You will be able to define your own idioms for your data, and reason purely in terms that are already familiar in your domain. Less thinking about DocArray and its peculiarities, more thinking about your data and your task.

Along the way we had to make a number of nuanced design decisions, until we finally reached our fleshed-out philosophy.

So let’s take this opportunity to go over these decisions, why we made them, and why we think that you will love them.

How things used to be in the olden days

Document and DocumentArray have always been extremely flexible data structures, putting up with essentially whatever kind of data you can throw at them.

But for a long time we took the same approach that most software packages take: We implemented all the features that we and the community needed, designed an API around them that we thought made sense, and then released our package to the world. Happy hacking, everyone!

Essentially, we gave you all the tools that you needed, but we also told you how you had to interact with those tools, and how to make your data fit them.

Following this approach, you could, for example, represent a scientific paper that contains an image, a description of that image, a number of references, a main body of text, and some metadata.

You would model your paper as a nested Document, with the main data at the top level and all other data at the chunk level, each chunk being its own Document and holding its own data.

This would give you an overall structure like this:
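To make that shape concrete, here is a minimal sketch in plain Python. `ToyDoc` and `outline` are hypothetical stand-ins written for this illustration, not DocArray's real `Document` class, and all values are placeholders. The exact nesting is an assumption: the image and the references each get a chunk, with the description and the individual references one level further down.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document: one payload plus nested chunks."""
    content: Any = None
    tags: Dict[str, Any] = field(default_factory=dict)
    chunks: List["ToyDoc"] = field(default_factory=list)


def outline(doc: ToyDoc, depth: int = 0) -> List[str]:
    """Return one indented line per ToyDoc in the tree, parents first."""
    lines = ["  " * depth + f"ToyDoc(content={doc.content!r})"]
    for chunk in doc.chunks:
        lines.extend(outline(chunk, depth + 1))
    return lines


# Main text and metadata live at the top level; everything else becomes a chunk.
paper = ToyDoc(
    content="main body of the paper",
    tags={"year": 2022},  # placeholder metadata
    chunks=[
        # chunk 0: the image, with its description as a chunk of that chunk
        ToyDoc(
            content="figure.png",
            chunks=[ToyDoc(content="a description of the figure")],
        ),
        # chunk 1: the references, one nested chunk per reference
        ToyDoc(chunks=[ToyDoc(content="reference A"), ToyDoc(content="reference B")]),
    ],
)

print("\n".join(outline(paper)))
```

Nothing in the tree says which chunk is the image and which holds the references; that knowledge lives only in the order in which the chunks were added.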

Then you could, of course, access the data that you just compiled:
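As a rough sketch of what that access looks like, the snippet below rebuilds a small nested structure with a hypothetical `ToyDoc` class (a stand-in for illustration, not DocArray's real `Document`) and then reads the data back. The chunk positions are assumptions of this toy layout; the point is only that every lookup goes through positional indices.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document: one payload plus nested chunks."""
    content: Any = None
    chunks: List["ToyDoc"] = field(default_factory=list)


paper = ToyDoc(
    content="main body of the paper",
    chunks=[
        ToyDoc(content="figure.png", chunks=[ToyDoc(content="a description of the figure")]),
        ToyDoc(chunks=[ToyDoc(content="reference A"), ToyDoc(content="reference B")]),
    ],
)

# Every access needs you to remember where things were put.
image = paper.chunks[0].content                       # first chunk is the image
description = paper.chunks[0].chunks[0].content       # a chunk of that chunk
first_reference = paper.chunks[1].chunks[0].content   # second chunk, first sub-chunk

print(image, description, first_reference, sep="\n")
```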

Ouch 😅 Suddenly you have to reason about chunks, a concept these people at Jina AI came up with, and chunks of chunks, and indices of chunks, when all you wanted to do was access your image and your reference!

Clearly, that needed to change.

Your data at the center

At Jina AI we still love our classic Document, our DocumentArray, and the API they provide, but here’s the thing: You know your data best, and concepts like chunks may not naturally map to your task at hand.

So why don’t we let you define your own API?
Why don’t we adapt to you instead of you adapting to us?

Enter Jina dataclasses!

Let’s re-model the article from above, but now let’s do it in a way that actually resembles the way a human thinks about the world:
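The idea can be sketched with Python's standard `dataclasses` module. Everything here is an assumption made for illustration: `ToyDoc`, `Article`, and `to_doc` are hypothetical names, the field types are plain strings rather than DocArray's modality types, and the conversion is a toy version of the mapping, not DocArray's implementation.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document (not DocArray's real class)."""
    content: Any = None
    chunks: List["ToyDoc"] = field(default_factory=list)
    # Toy bookkeeping: which chunk holds which named field.
    names: Dict[str, int] = field(default_factory=dict)


@dataclass
class Article:
    """The domain model: plain names, no chunks in sight."""
    image: str
    description: str
    references: List[str]
    main_text: str


def to_doc(obj: Any) -> ToyDoc:
    """Map each dataclass field to one chunk and remember its name."""
    doc = ToyDoc()
    for f in fields(obj):
        value = getattr(obj, f.name)
        if isinstance(value, list):
            # A list field becomes a chunk holding one sub-chunk per item.
            chunk = ToyDoc(chunks=[ToyDoc(content=item) for item in value])
        else:
            chunk = ToyDoc(content=value)
        doc.names[f.name] = len(doc.chunks)
        doc.chunks.append(chunk)
    return doc


# Define the structure once, then just fill it.
article = Article(
    image="figure.png",
    description="a description of the figure",
    references=["reference A", "reference B"],
    main_text="main body of the article",
)
doc = to_doc(article)
print(doc.names)
```

The chunks are still there underneath, but the names you chose travel with the converted object.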

We achieved the same result, and all we had to do was define the structure of our data, and then fill it.

Here the dataclass serves as a mapping from the real world, and your domain, to the world of DocArray. You can almost see it as a very fancy __init__() method.

So far so good, but what happens when we try to access data from article?
That is where the really cool part begins, and what our newest feature brings to the table. Let’s see:
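A minimal sketch of attribute-style access, assuming a hypothetical `ToyDoc` class that stores each named modality as a chunk. This is a toy written for illustration, not DocArray's real `Document`; in particular, the payload attribute is called `content` here, whereas the example discussed in the next section reads `article.image.tensor`.

```python
class ToyDoc:
    """Hypothetical stand-in for a Document (not DocArray's real class)."""

    def __init__(self, content=None, **modalities):
        self.content = content
        self.chunks = []
        self._names = {}  # modality name -> index into self.chunks
        for name, value in modalities.items():
            self._names[name] = len(self.chunks)
            self.chunks.append(ToyDoc(content=value))

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes win.
        names = self.__dict__.get("_names", {})
        if name in names:
            return self.chunks[names[name]]
        raise AttributeError(name)


article = ToyDoc(image="figure.png", description="a description of the figure")

image_doc = article.image                # a full ToyDoc, not the raw payload
print(image_doc.content)
print(article.description.content)
print(article.image is article.chunks[0])  # same object, reached by name
```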

Let’s also not forget about the DocumentArray level syntax for accessing custom modalities:
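The following toy sketch mimics the `'@.[image]'` selector string that this post uses further down. `ToyDoc` and `ToyDocArray` are hypothetical stand-ins, and the tiny parser only understands a single modality name; how DocArray itself parses and resolves selectors is not shown here.

```python
import re


class ToyDoc:
    """Hypothetical stand-in for a Document (not DocArray's real class)."""

    def __init__(self, content=None, **modalities):
        self.content = content
        self.chunks = []
        self._names = {}  # modality name -> index into self.chunks
        for name, value in modalities.items():
            self._names[name] = len(self.chunks)
            self.chunks.append(ToyDoc(content=value))

    def __getattr__(self, name):
        names = self.__dict__.get("_names", {})
        if name in names:
            return self.chunks[names[name]]
        raise AttributeError(name)


class ToyDocArray(list):
    """Hypothetical list of ToyDocs with a toy '@.[name]' selector."""

    def __getitem__(self, key):
        if isinstance(key, str):
            match = re.fullmatch(r"@\.\[(\w+)\]", key)
            if match is None:
                raise KeyError(key)
            name = match.group(1)
            # Collect the named modality of every document, in order.
            return ToyDocArray(getattr(d, name) for d in self)
        return super().__getitem__(key)


da = ToyDocArray([
    ToyDoc(image="a.png", description="first article"),
    ToyDoc(image="b.png", description="second article"),
])

images = da["@.[image]"]
print([d.content for d in images])
```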

As you can see, you maintain the ability to reason about your data, and sub-data, in terms of the modalities that you defined — even after converting the dataclass to a Document or DocumentArray.

Gone are the chunks. Instead, you can directly access image, reference, etc.: the actual data that is relevant in your domain.

Documents everywhere

One thing you may have stumbled on in the last code snippet is:
Why do I need to call article.image.tensor in order to get my image tensor? Shouldn’t a call to article.image be enough? Why the intermediary step through a Document?

And you’d be right! In this simple example the latter would indeed be the more elegant interface.

Nonetheless, there are three important considerations that pushed us to make calls like article.image return a fully-fledged Document instead of the data stored inside of it.

  1. Flexibility: Because DocArray is a data structure for any kind of data, flexibility is always one of our top priorities. So instead of returning a specific data type we return a Document, the most flexible data representation that we know of. Then you can do whatever esoteric thing you want to do with it, we don’t judge!
  2. Documents everywhere: Nearly every operation in DocArray and Jina takes a Document (or DocumentArray) as input, and returns one as output. We like to keep our stack surprise-free, so you can keep assuming that Documents are what you’re gonna get.
  3. Support in Jina (Executors): Once the serious work begins, the power of Document as a return type becomes absolutely crucial. Read on to see why!

Multi-modality to the cloud

So far we’ve only talked about local development using DocArray. Now let’s shift our focus to the microservice world of Jina.

First things first: You’ll be relieved to know that accessing a Document modality does not depend on your local environment and is absolutely possible inside of an Executor, no matter if it is running on your laptop, a Kubernetes cluster across the globe, or in JCloud.

But that’s just the beginning — let’s see how native multi-modality can be useful in a more realistic example.

Multi-modality in action

Let’s look at a practical example of a multi-modal app: We again have a very simple Article, consisting of an image and a description, and we want to create vector embeddings of such articles.

To do that, we want to encode the image and the description using respective embedding models, in order to then combine these representations into a final embedding vector for the entire article.

Here we give an example of what that might look like:
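What follows is a plain-Python sketch of that pipeline, not Jina Executor code. `ToyDoc`, the two fake "models", and the three functions are hypothetical, and combining by concatenation is an assumption made for this illustration; a real combiner could just as well average, project, or learn the fusion.

```python
class ToyDoc:
    """Hypothetical stand-in for a Document (not DocArray's real class)."""

    def __init__(self, content=None, **modalities):
        self.content = content
        self.embedding = None
        self.chunks = []
        self._names = {}  # modality name -> index into self.chunks
        for name, value in modalities.items():
            self._names[name] = len(self.chunks)
            self.chunks.append(ToyDoc(content=value))

    def __getattr__(self, name):
        names = self.__dict__.get("_names", {})
        if name in names:
            return self.chunks[names[name]]
        raise AttributeError(name)


def fake_image_model(path):
    """Placeholder for a real image embedding model."""
    return [float(len(path))]


def fake_text_model(text):
    """Placeholder for a real text embedding model."""
    return [float(len(text)), float(len(text.split()))]


def image_encoder(docs):
    for d in docs:
        # d.image is a full ToyDoc, so the embedding can be stored on it.
        d.image.embedding = fake_image_model(d.image.content)


def text_encoder(docs):
    for d in docs:
        d.description.embedding = fake_text_model(d.description.content)


def embedding_combiner(docs):
    for d in docs:
        # Assumed strategy: concatenate the modality embeddings.
        d.embedding = d.image.embedding + d.description.embedding


articles = [ToyDoc(image="cat.png", description="a cat on a mat")]
for step in (image_encoder, text_encoder, embedding_combiner):
    step(articles)

print(articles[0].embedding)
```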

Here the beauty of Documents everywhere as a principle really starts to shine through.

Since d.image and d.description return fully-fledged Documents, we can store the corresponding embeddings right there, and the EmbeddingCombiner has an easy time finding them, using them, and storing the overall embedding at the top level.

Maintaining interoperability

If you have made it this far, you are hopefully as excited as we are about this new way of interacting with Documents.

But you probably also have a concern: Surely these custom-defined modalities are something super personalized? How can you share your Executors with the community, and in turn, use Executors from Jina Hub?

Fear not, that’s not a problem at all!

The main idea to get around this concern is the following: instead of using the Document-level selector syntax (d.image), use the DocumentArray level syntax (da['@.[image]']) to maintain maximal interoperability.

To make that more concrete, let’s refactor the Executor code from above:
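Here is a toy sketch of that refactoring in plain Python rather than real Executor code. `ToyDoc`, `ToyDocArray`, and `text_encoder` are hypothetical, the selector parser handles only a single name, and the "embedding" is a placeholder. The `access_paths` parameter name is the one this post uses; the modality names `caption` and `picture` are made up for the second client.

```python
import re


class ToyDoc:
    """Hypothetical stand-in for a Document (not DocArray's real class)."""

    def __init__(self, content=None, **modalities):
        self.content = content
        self.embedding = None
        self.chunks = []
        self._names = {}  # modality name -> index into self.chunks
        for name, value in modalities.items():
            self._names[name] = len(self.chunks)
            self.chunks.append(ToyDoc(content=value))

    def __getattr__(self, name):
        names = self.__dict__.get("_names", {})
        if name in names:
            return self.chunks[names[name]]
        raise AttributeError(name)


class ToyDocArray(list):
    """Hypothetical list of ToyDocs with a toy '@.[name]' selector."""

    def __getitem__(self, key):
        if isinstance(key, str):
            match = re.fullmatch(r"@\.\[(\w+)\]", key)
            if match is None:
                raise KeyError(key)
            return ToyDocArray(getattr(d, match.group(1)) for d in self)
        return super().__getitem__(key)


def text_encoder(docs, parameters):
    """Encode whatever the caller's access path selects; no names baked in."""
    for d in docs[parameters["access_paths"]]:
        d.embedding = [float(len(d.content))]  # placeholder embedding


# One client calls its text modality "description" ...
articles = ToyDocArray([ToyDoc(image="cat.png", description="a cat on a mat")])
text_encoder(articles, parameters={"access_paths": "@.[description]"})

# ... another calls it "caption" and reuses the very same encoder.
posts = ToyDocArray([ToyDoc(picture="dog.png", caption="a dog")])
text_encoder(posts, parameters={"access_paths": "@.[caption]"})

print(articles[0].description.embedding, posts[0].caption.embedding)
```

The encoder no longer mentions `description` anywhere; the only place a modality name appears is in the parameters supplied by each caller.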

After this rewrite, every client connecting to these Executors can pass its own access_paths parameter, matching its own custom-defined modalities.

This way, every user can define their own modalities, and reason purely in terms of those — while Executors remain general enough to be re-used across the entire ecosystem.

What’s next

By now we have hopefully convinced you of our new design direction, but the features covered in this blog post aren’t everything that you can do:

  • Custom modality types: DocArray provides a number of common types for the dataclass interface, such as Text, Image, JSON, and many more. But you can also define and use your own types, including the definition of custom mappings from the dataclass to Document, and back.
  • Nested dataclasses: Some complex domains require more complex models. For example, an article might actually consist of multiple paragraphs, where each paragraph contains an image, a description, and a main_text. You can represent this easily, by nesting a list of paragraph dataclasses inside your article dataclass.
  • Secondary indices (coming soon): In the encoding example above we used EmbeddingCombiner to generate top-level embeddings for each Document, ready to use for a neural search application. But for some tasks you might want to perform search not on the top level, but on the modality level: Maybe instead of finding overall similar articles, you want to find articles that just share similar images. That is what secondary indices in our Document stores will soon be able to do!
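The nested-dataclass idea from the list above can be pictured with Python's standard `dataclasses` module. This sketch only shows the shape of the model; it uses plain strings instead of DocArray's modality types, and the class and field names follow the article-and-paragraph example from the bullet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    image: str
    description: str
    main_text: str


@dataclass
class Article:
    # A list of nested dataclasses, one per paragraph.
    paragraphs: List[Paragraph] = field(default_factory=list)


article = Article(
    paragraphs=[
        Paragraph(image="intro.png", description="an overview", main_text="first paragraph"),
        Paragraph(image="result.png", description="a result plot", main_text="second paragraph"),
    ]
)

# Access stays in domain terms, even two levels down.
print(article.paragraphs[1].image)
print([p.description for p in article.paragraphs])
```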



Johannes Messner, ML Engineer @ Jina AI