Jina Tips

Executors inside Executors. Oh my!

Take an existing Executor and just make it do a bit more

Team Jina
Jina AI


I’ve recently been building a PDF search engine with Jina, and I hit a snag when I wanted to perform more complex operations. Here’s how I solved it by going full Executor-ception on the problem.

So, what’s the problem?

Given a PDF, I want to break it down into images and sentences and store those as chunks:

[Figure: What I want]

But that’s what PDFSegmenter is for, right? Wrong! PDFSegmenter breaks PDFs down into images and whatever blobs of text it can find. Those blobs can be rather long, and are not necessarily sentences.

[Figure: What PDFSegmenter gives me]

How do we solve this?

We’ll start with a Flow that does just one thing, using PDFSegmenter to break our PDF into images and big text blobs:

Now we’ll add a new Executor after that, to process each segmented PDF’s chunks:

Our new Executor will:

  • Look at each chunk (let’s call them level-1 chunks) and ignore it if it’s an image
  • If it’s a text chunk, load up the Sentencizer Executor from Jina Hub to break it into sentences (we’ll call them level-2 chunks)
  • Copy the level-2 chunks over to level 1
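
As a rough illustration of those three steps, here’s a plain-Python sketch. It deliberately stays away from the Jina API: `Chunk`, `split_sentences` and `promote_sentences` are hypothetical stand-ins invented for this example (a real Executor would presumably loop over each Document’s chunks and call Sentencizer rather than a regex). It also assumes a text blob is replaced by its sentences once they’ve been copied up to level 1.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a Jina Document chunk (not the real class)."""
    mime_type: str
    text: str = ""
    chunks: List["Chunk"] = field(default_factory=list)


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter standing in for the Sentencizer Executor."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def promote_sentences(level1: List[Chunk]) -> List[Chunk]:
    """Return new level-1 chunks: images untouched, text blobs as sentences."""
    promoted: List[Chunk] = []
    for chunk in level1:
        if not chunk.mime_type.startswith("text/"):
            # Step 1: image chunks are not sentencized, just passed along.
            promoted.append(chunk)
            continue
        # Step 2: break the text blob into level-2 (sentence) chunks.
        chunk.chunks = [
            Chunk("text/plain", text=sentence)
            for sentence in split_sentences(chunk.text)
        ]
        # Step 3: copy the level-2 chunks up to level 1.
        promoted.extend(chunk.chunks)
    return promoted


# Made-up input: one image chunk and one text blob.
pdf_chunks = [
    Chunk("image/png"),
    Chunk("text/plain", text="Jina is fun. Executors can nest! Who knew?"),
]
for chunk in promote_sentences(pdf_chunks):
    print(chunk.mime_type, repr(chunk.text))
```

The part that matters is the `mime_type` check: it’s what lets a single component treat images and text differently.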

Here’s the full code:

Why not do this directly in a Flow?

A Flow is great when data just needs to pass straight from one Executor to the next and you don’t need to do stuff like check chunk types. If we had simply chained PDFSegmenter and Sentencizer directly in a Flow, Sentencizer would have choked on the images, since it expects only text input.
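
To see why the type check matters, here’s a small plain-Python sketch comparing the two approaches. Everything in it is hypothetical: the function names, the `(mime_type, payload)` tuple layout, and the `TypeError`, which only mimics a text-only component being handed image bytes and is not Sentencizer’s actual behavior.

```python
from typing import List, Tuple, Union

# Hypothetical chunk layout for this sketch: (mime_type, payload).
Chunk = Tuple[str, Union[str, bytes]]


def text_only_sentencizer(payload: Union[str, bytes]) -> List[str]:
    """Stand-in for a component that accepts text and nothing else."""
    if not isinstance(payload, str):
        raise TypeError("expected text input")
    return [part.strip() + "." for part in payload.split(".") if part.strip()]


def naive_pipeline(chunks: List[Chunk]) -> List[str]:
    """Sends every chunk to the sentencizer, images included."""
    sentences: List[str] = []
    for _, payload in chunks:
        sentences.extend(text_only_sentencizer(payload))
    return sentences


def guarded_pipeline(chunks: List[Chunk]) -> List[str]:
    """Checks the chunk type first and only sentencizes text."""
    sentences: List[str] = []
    for mime_type, payload in chunks:
        if mime_type.startswith("text/"):
            sentences.extend(text_only_sentencizer(payload))
    return sentences


# Made-up input: fake image bytes followed by a text blob.
chunks: List[Chunk] = [
    ("image/png", b"\x89PNG"),
    ("text/plain", "First sentence. Second sentence."),
]

try:
    naive_pipeline(chunks)
except TypeError as err:
    print("naive pipeline failed:", err)

print(guarded_pipeline(chunks))
```

The guarded version is the same idea as wrapping Sentencizer inside another Executor: the wrapper decides which chunks the inner component ever gets to see.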

What about you?

Have you built anything with Executors inside Executors (inside Executors, inside …)? Let us know via our Slack community!
