How to Search PDF text, images and tables with Python & CLIP
Build a PDF search engine in your browser with Jina, Hub, and DocArray
In previous blog posts (1, 2, 3) and notebooks, we built a basic PDF search engine. Since then I’ve heard from a lot of people that they want to go further, and search images and tables too. As if I haven’t suffered enough with PDFs…
So, in this blog post and notebook, we’re going to look at how to do just that. We’ll build a notebook that will:
- Let you index and search through PDF text, tables and images, using text or image as search term
- Filter results by type (text, image, table)
We won’t go through every little bit of code here — check the notebook for that. But we’ll go over the high-level stuff to give you a good big-picture understanding.
Problem description
PDF is a gnarly format — one PDF file can contain text, images, tables, audio, video, 3D meshes, and lots of other ridiculous things. Extracting even a subset of that information can be a lot of work.
If you can extract information, you can do all sorts of things, like searching through it, summarizing it, collating it, and so on. But if you can’t extract anything you’re just left with big blobs of almost meaningless data. Therefore being able to process and extract this data is vital for business intelligence. Or even if you just want to remix your paleontology textbook collection.
We’re going to focus on just extracting text, images and tables. Any more than that and we’d be here forever.
Our development environment
As usual we’ll use a Google Colab notebook.
Be sure to have your GPU enabled so the CLIP encoder can generate embeddings quickly.
Our dataset
Our dataset is just a couple of scientific papers pulled from arxiv.org. I made sure to use ones with images, text, and tables. We’re just using two of them for now:
- Late Ordovician geographic patterns of extinction compared with simulations of astrophysical ionizing radiation damage
- Trilobite “pelotons”: Possible hydrodynamic drag effects between leading and following trilobites in trilobite queues
Why yes, I have been very into paleontology lately…
For reference, this cutie-pie is a trilobite. It’s an extinct animal, so don’t expect to find one in your garden:
We’ll come back to our trilobite friend when we start searching through our dataset.
Our tech stack
We’ll use:
- DocArray — to load and process our PDF Documents and search term Document.
- Jina — to build a pipeline that will process each PDF (or search term entered by the user).
- Jina Hub — to use pre-built components in the pipeline, instead of hand-coding everything from scratch.
Our model
We’ll be using the CLIP (Contrastive Language-Image Pretraining) model from OpenAI, since that lets us create vector embeddings in the same embedding space for both images and text.
We’ll use the version from Jina Hub.
Our Flow
Every Jina project has a Flow — the pipeline in which all of our Documents are processed.
In prior examples we’ve used two Flows — one for indexing and one for searching. This time we’ll only use one Flow, since that’s a lot easier to work with when we host with JCloud (which we’ll do in a future notebook).
We’ll use a different name prefix for each operation (indexing or searching), namely index_, search_ or all_ (Executors prefixed with all_ are used by both operations).
This Flow will be a lot bigger than our Flows from prior examples, since we’re really going all out on functionality.
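To make that concrete, here’s a minimal sketch of what a single combined Flow with prefixed Executors can look like. The Executor names and jinahub:// URIs below are illustrative, and the notebook wires up more steps than shown here:

from jina import Flow

# A minimal sketch of one combined Flow with prefixed Executors.
# Names and Hub URIs are illustrative; see the notebook for the real wiring.
flow = (
    Flow()
    .add(name="index_table_extractor", uses="jinahub://PDFTableExtractor")
    .add(name="index_segmenter", uses="jinahub://PDFSegmenter")
    .add(name="all_clip_encoder", uses="jinahub://CLIPEncoder")
    .add(name="all_indexer", uses="jinahub://AnnLiteIndexer")
)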
Our Executors
Each Executor in our Flow performs a single task on our PDF or search term. All of our Executors come from Jina Hub.
Note: I wrote (or modified) several of these Executors myself specifically for this notebook. So those Executors in particular are not widely tested.
To start, we can consider a single top-level PDF Document and see how it changes during the Flow:
- PDF Document
- uri: data/0809.0899.pdf
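In DocArray terms, each PDF starts out as a plain Document whose uri points at the file on disk. A minimal sketch, assuming the PDFs live in a data/ directory:

from pathlib import Path
from docarray import Document, DocumentArray

# Each PDF file becomes one top-level Document; everything we extract
# later hangs off it as chunks.
docs = DocumentArray(Document(uri=str(p)) for p in Path("data").glob("*.pdf"))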
PDFTableExtractor
Uses docs2info’s PDF table extractor service to extract tables from each PDF and store them as chunks. The Executor also tags each chunk with an element_type of table, and stores the table data in CSV format inside the Document’s tags for when we want to display it in the search results.
The client code for the PDF extraction service has been released publicly, but the back-end code has not. Don’t upload any confidential files since they currently use all uploads for training their model.
During my testing I extracted data from PDFs and saved it back into CSV files. The results were much better than those of other libraries I’ve used (like Camelot), capturing tables precisely. I believe this is because this extractor is AI-powered, recognizing tables based on similar tables it’s “seen”, rather than trying to parse the (sometimes inscrutable) PDF structure.
Have you used this Executor (or others) to extract tables? How did it go? Let us know via Slack or social media!
Given time, we may replace this step with a different table extractor based on the same model so we have more transparency. But that’s a thought for another day.
Since there will be LOTS of chunks, we’ll just use one of each type in our visualization:
- PDF Document
- uri: data/0809.0899.pdf
- table chunk
PDFSegmenter
Extracts text and images from a PDF and stores them as chunks.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk
- text chunk
- image chunk
ElementTypeTagger
If a chunk is an image, tag it with an element_type of image. If it’s text, tag it with an element_type of text. If it already has an element_type (like the table chunks above), pass over it. This tag will be used for pre-filtering results when a user searches later on.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- text chunk (element_type: text)
- image chunk (element_type: image)
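The tagging logic itself is simple enough to sketch. This is a rough approximation, not the Executor’s actual source:

from docarray import DocumentArray

def tag_element_types(docs: DocumentArray):
    # Walk over every chunk of every top-level Document.
    for chunk in docs["@c"]:
        if "element_type" in chunk.tags:
            continue  # e.g. tables were already tagged upstream
        if chunk.tensor is not None or chunk.mime_type.startswith("image"):
            chunk.tags["element_type"] = "image"
        elif chunk.text:
            chunk.tags["element_type"] = "text"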
SpacySentencizer
PDFSegmenter extracted huge blocks of text lifted straight from parsing the PDF. Assigning a meaning (and thus embedding) to huge chunks is difficult, since they contain so much information. We’ll break it up into sentences to make it more semantically useful.
This Executor will store sentences as chunks. Since the text blocks are already chunks, this means it’s creating chunks of chunks.
Why not just use the vanilla Sentencizer? That splits on punctuation like . or !. In most sentences that’s fine:
- Bob likes Barbie dolls. Alice likes monster trucks. = 2 sentences, split by .
- Hey! How’s it going daddy-o? = 2 sentences, split by !
But…
- J.R.R. Tolkien turned to p.3 of M. Poirot’s novel = 1 sentence. The vanilla Sentencizer would see it as 6, though, because of all the periods.
SpacySentencizer doesn’t have this problem since it uses an AI model to split the text, not just predefined punctuation characters.
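You can see the difference with spaCy directly. This is just an illustration of the splitting behaviour (it assumes the small English model is installed), not the Executor’s own code:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "J.R.R. Tolkien turned to p.3 of M. Poirot's novel"

# Naive splitting on periods shreds the abbreviations into fragments.
naive = [s.strip() for s in text.split(".") if s.strip()]
# spaCy's statistical model generally keeps the sentence intact.
sentences = [sent.text for sent in nlp(text).sents]

print(len(naive), len(sentences))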
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- text chunk (element_type: text)
- sentence
- sentence
- image chunk (element_type: image)
TagsCopier
Previously we tagged our chunks with an element_type, but that tag didn’t get copied down to the sentences when we used SpacySentencizer. This Executor looks at each top-level chunk’s tags (stored as a dict) and merges them into the tags of the child chunks.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- text chunk (element_type: text)
- sentence (element_type: text)
- sentence (element_type: text)
- image chunk (element_type: image)
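A rough sketch of that copy-down step (again, an approximation rather than the Executor’s actual source):

from docarray import DocumentArray

def copy_tags_to_children(docs: DocumentArray):
    for chunk in docs["@c"]:
        for child in chunk.chunks:
            # Parent tags fill in anything the child doesn't already have.
            child.tags = {**chunk.tags, **child.tags}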
ChunkFlattener
Now our chunks are all over the place! ChunkFlattener puts them all on the same level, making them easier to work with (since we don’t have to use such complicated traversal_paths in our Executors’ uses_with parameters).
You may notice we still have the big chunks of text as well as the sentences. In my experience this doesn’t seem to cause much difference. However, YMMV, so please let us know on Slack if you’re getting strange results.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- text chunk (element_type: text)
- sentence (element_type: text)
- sentence (element_type: text)
- image chunk (element_type: image)
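The idea behind the flattening looks roughly like the sketch below, assuming we want every chunk to sit exactly one level below the top-level PDF Document (not the Executor’s actual code):

from docarray import DocumentArray

def flatten_chunks(docs: DocumentArray):
    for doc in docs:
        flat = DocumentArray()
        for chunk in doc.chunks:
            grandchildren = chunk.chunks
            chunk.chunks = DocumentArray()  # detach so nothing is nested any deeper
            flat.append(chunk)
            flat.extend(grandchildren)
        doc.chunks = flat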
ImagePreprocessor-skip-non-images
This normalizes the tensors of all image chunks to make them work nicely with CLIP. That way all the indexed images will have the same tensor shape, as will a search image (if the user chooses to search by image, that is). This lets us compare “apples to apples” and thus provide meaningful image search results.
Note: This is a (potentially slower) fork of the original ImagePreprocessor. I forked it so it would work effectively on any level of Document rather than just the top-level Document.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- text chunk (element_type: text)
- sentence (element_type: text)
- sentence (element_type: text)
- processed image chunk (element_type: image)
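Under the hood, that kind of preprocessing looks roughly like the sketch below: load the image if needed, resize it to the shape CLIP expects, and normalize the pixel values. The exact steps in the forked Executor may differ:

from docarray import Document

def preprocess_image(doc: Document) -> Document:
    if doc.tensor is None and doc.uri:
        doc.load_uri_to_image_tensor()       # chunks extracted from the PDF already carry a tensor
    doc.set_image_tensor_shape((224, 224))   # the input size CLIP expects
    doc.set_image_tensor_normalization()     # scale channel values to a standard range
    return doc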
CLIPEncoder
This Executor uses the CLIP model to create vector embeddings for the text, tables, and images. We use two different copies since we’re working with two different chunk hierarchy levels:
- Indexing: Encode each chunk of each top-level PDF.
- Searching: Encode the top-level text or image in the search term.
The Executors are named accordingly, with a search_ or index_ prefix.
- PDF Document
- uri: data/0809.0899.pdf
- table chunk (element_type: table)
- embedding
- text chunk (element_type: text)
- embedding
- sentence (element_type: text)
- embedding
- sentence (element_type: text)
- embedding
- processed image chunk (element_type: image)
- embedding
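Wiring up the two copies in the Flow might look something like this. The traversal_paths values are an assumption about how the Hub Executor is configured; check the notebook for the exact settings:

from jina import Flow

flow = (
    Flow()
    .add(
        name="index_clip_encoder",
        uses="jinahub://CLIPEncoder",
        uses_with={"traversal_paths": "@c"},  # encode the chunks of each PDF
    )
    .add(
        name="search_clip_encoder",
        uses="jinahub://CLIPEncoder",
        uses_with={"traversal_paths": "@r"},  # encode the top-level query Document
    )
)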
AnnLiteIndexer
Stores all of our embeddings and metadata on disk. We supply several parameters under uses_with to specify which level should be indexed or searched, in this case the chunk level. We also use columns to specify which of our tags we should be able to filter by when we search, namely element_type. Finally, n_dim refers to the 512 dimensions of the CLIP model embeddings.
The Documents here look exactly the same as above, except now they’re stored on disk in the workspace folder.
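As a rough sketch, the indexer configuration could look like this (the exact parameter spelling is an assumption based on AnnLite’s documented options):

from jina import Flow

flow = Flow().add(
    name="all_indexer",
    uses="jinahub://AnnLiteIndexer",
    uses_with={
        "n_dim": 512,                          # size of the CLIP embeddings
        "columns": [("element_type", "str")],  # the tag we can filter on at search time
    },
)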
Indexing our data
Now that we’ve built our Flow, it’s time to index:
We’ll connect to our Flow with the Jina Client and pass the following parameters:
- request_size of 1, since PDFs can be pretty big and we don’t want to overload our Flow.
- target_executor set to a regex, specifying that indexing should only use Executors with the prefix index_ or all_.
Once indexing is finished, our index will be stored in the workspace directory.
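Put together, the indexing call is a small variation on the search call shown later. A sketch, assuming docs is the DocumentArray of PDF Documents built earlier:

from jina import Client

with flow:
    client = Client(port=flow.port)
    client.post(
        "/index",
        docs,
        request_size=1,
        show_progress=True,
        target_executor="(index_*|all_*)",
    )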
Searching our data
Looking at our Flow, we’ve got relatively few Executors to target when we search. This is simply because our search term is much simpler than a full PDF Document, so we can cut out many of the steps. All we need to do is:
- Wrap our search term into a Document (since that’s the common format for the Jina ecosystem)
- If it’s an image query, convert it to a tensor and apply normalization (ImagePreprocessor-skip-non-images)
- Encode our query with CLIP (CLIPEncoder)
- Search the index for the nearest neighbors (AnnLiteIndexer)
The last three tasks are handled by Executors already in our Flow, with the name prefix search_ or all_.
We can set up our filter (for text, image, or table) using MongoDB-style syntax, so the following would search all element types:

element_type = ["text", "image", "table"]

filter = {
    "element_type": {
        "$in": element_type,
    }
}
We can then run the search with:
from jina import Client

with flow:
    client = Client(port=flow.port)
    results = client.post(
        "/search",
        query_doc,
        request_size=1,
        parameters={
            "filter": filter
        },
        show_progress=True,
        target_executor="(search_*|all_*)"
    )
Checking the results
If we input a text string like how fast can a trilobite run, we get the following results (with scores; lower is better):
Advances in trilobite research.
Score: {'value': 0.08061867952346802}
---
Secondly, we determine realistic trilobite travelling and maximal sustainable speeds.
Score: {'value': 0.09226852655410767}
---
It is thus reasonable to assume that larger trilobites were capable of higher speeds than smaller trilobites.
Score: {'value': 0.10349196195602417}
---
We may therefore expect that larger trilobites generally tended to set the aggregate pace.
Score: {'value': 0.10869520902633667}
---
This would be in line with variations in trilobite size ranges observed in different environments.
Score: {'value': 0.12568509578704834}
---
In these circumstances, larger trilobites would be expected to quickly overtake the smaller ones, returning the group to a higher speed.
Score: {'value': 0.1311168074607849}
---
If caught out of optimal drafting range or position, even momentarily, small trilobites would have been required to increase their output dramatically to sustain the speeds set by the leaders or to catch back up and relocate to optimal drafting positions.
Score: {'value': 0.13913565874099731}
---
In turn, if the energy saving quantity, 1 – D, exceeds the difference between the speed set by the front trilobite and the maximum speed of the follower, as a ratio of the speed of the front trilobite, the drafting trilobite will sustain the pace of the front trilobite.
Score: {'value': 0.14066028594970703}
---
Thus, it is apparent that if the MSO of a weaker trilobite exceeds the speed set by follow the front trilobite (S ), as reduced by D, then the weaker trilobite can sustain the front speed of a stronger trilobite by drafting.
Score: {'value': 0.14284193515777588}
---
Trilobites’ sustainable relative speeds and their corresponding metabolic requirements are unknown.
Score: {'value': 0.14556092023849487}
---
As we can see, many of the top results are related to trilobite speed. The top result (Advances in trilobite research) seems to pop up for many search terms. Finetuning the CLIP model may be a good idea for a future notebook (or you can just see this existing notebook).
What about images? Let’s try a text-to-image query: trilobite diagram
If we filter for only returning Documents with an element_type of image, we get:
Bingo! Our top result is a diagram of a trilobite. Since there aren’t any other diagrams of trilobites in the dataset, the following results are CLIP “doing its best” and returning whatever it thinks is even vaguely relevant.
As for searching tables, well, we currently only have one table in our dataset, so whatever search term we use will return that. Since the notebook doesn’t nicely display tables inline, you can check the “csvs” directory in the sidebar. The files there are saved in result order, with the closest match numbered 0.
In future notebooks we’ll try searching more tables and refining our search experience. We’ll also host and deploy our Flow on JCloud for free, allowing anyone to access it via a RESTful or gRPC gateway.
Troubleshooting
No text is being extracted from my PDF
It might be that your PDF is full of pictures of text rather than text itself. This is quite common. In a future notebook we’ll integrate an OCR Executor like PaddlePaddleOCR to get around this.
I’m getting bad search results in my language
The CLIP model we’re using is trained primarily on English. Multilingual CLIP models do exist, however. You can define which model you want to use with the pretrained_model_name_or_path argument in CLIPEncoder.
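For example, overriding the model in the Flow would look something like this. The checkpoint name is a placeholder; substitute whichever multilingual CLIP variant you settle on:

from jina import Flow

flow = Flow().add(
    name="all_clip_encoder",
    uses="jinahub://CLIPEncoder",
    uses_with={"pretrained_model_name_or_path": "your-multilingual-clip-checkpoint"},
)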
My tables aren’t being extracted
The docs2info’s table extraction service is still being tested. While it’s provided good results in my experience, it’s still under heavy development.
The notebook fails when I do anything involving images
Try restarting the runtime (there should be an option for that near the top, under the !pip install docarray[full] cell). This seems to be a notebook limitation.
It’s too slow!
Have you enabled Colab’s GPU under Runtime > Change runtime type?
Something else?
Join our Slack and ask us there in the #projects-pdf channel!
Learn more
Want to dig more into the Jina ecosystem? Here are some resources:
- Developer portal — tutorials, courses, videos on using Jina
- Fashion search notebook — build an image-to-image fashion search engine
- DALL-E Flow/Disco Art — create AI-generated art in your browser