Vertex AI Search for Wikipedia PDF Search Widget

Enterprise Search for the Wiki PDFs

Aniket Agrawal
Google Cloud - Community
5 min read · Jan 28, 2024

In the previous blog, we created a Wiki PDF chatbot powered by Google Vertex AI Conversation. Now, on the same PDF, we use Google’s enterprise search technology, Vertex AI Search, to build a search engine.

PDFs are a double-edged sword: they preserve formatting beautifully but make their contents frustratingly difficult to search. The popular ‘CTRL-F’ shortcut is probably the first thing that comes to mind. But what if you’re dealing with hundreds or thousands of PDFs? CTRL-F also depends on the literal presence of the query keywords in the text, so what about semantic search?

Google Cloud Enterprise Search using Vertex AI Search (f.k.a. Gen App Builder)

Traditional search engines often struggle with the structure of PDFs, which makes it frustrating to uncover the knowledge you need. A dedicated PDF search engine tackles this challenge head-on by understanding the nuances of PDFs and indexing not only the text but also metadata like titles, authors, and keywords.

It should let you search by meaning, not only by keywords, and help your ‘query’ find its ‘answer’, the needle in the haystack of your documents, websites, structured data, etc.

Question asks “Where is my answer?” :), Background: https://www.andyreynolds.com/image/I0000rdLqiFYoXlw

With Vertex AI Search, users can search by meaning and get results tailored to their queries across multiple PDFs. Here, however, we work with a single PDF to keep things simple and beginner-friendly.

Download a sample Wikipedia PDF to create the unstructured datastore for the Vertex AI Search engine

Our Vertex AI Search engine can answer a variety of questions about Indian tourism, based on the large amount of knowledge in our sample Wikipedia PDF, ‘Tourism in India’. The response content is retrieved from the relevant indexed PDF text, table data, picture captions, and so forth, as seen in the sample GIFs below.

Search Results (left) and entire Configuration Window (right)

Procedure

So far, we’ve explored the objective and search engine GIFs. But how do we do it? This section covers technical and product details.

As a prerequisite for creating the datastore, we’ll first create a Cloud Storage bucket and upload the PDF file like this:

Uploading Wiki PDF in a GCS Bucket
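The same upload can also be done programmatically. Below is a minimal sketch, assuming the `google-cloud-storage` Python library and application default credentials; the bucket name, file path, and the helper names `make_storage_client` and `upload_pdf` are hypothetical and are not part of the original walkthrough, which uses the Cloud Console.

```python
"""Sketch: upload a local PDF to a Cloud Storage bucket."""


def make_storage_client():
    # Imported lazily so upload_pdf can be exercised without the library installed.
    from google.cloud import storage  # pip install google-cloud-storage

    return storage.Client()


def upload_pdf(client, bucket_name, local_path, blob_name, create_bucket=False):
    """Upload local_path to gs://bucket_name/blob_name and return that URI."""
    if create_bucket:
        # Bucket names are globally unique, so this fails if the name is taken.
        bucket = client.create_bucket(bucket_name)
    else:
        # Only builds a local reference; no request is made until the upload.
        bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path, content_type="application/pdf")
    return f"gs://{bucket_name}/{blob_name}"


# Example usage (hypothetical names; needs credentials and network access):
# uri = upload_pdf(make_storage_client(), "my-wiki-pdf-bucket",
#                  "Tourism_in_India.pdf", "Tourism_in_India.pdf",
#                  create_bucket=True)
```

The function returns the `gs://` URI of the uploaded object, which is the value the datastore import step needs.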

Vertex AI Search is a managed service that enables us to create and deploy solutions based on Google Enterprise Search. It gives our search widget access to the large amount of data available in the Wikipedia PDF datastore about Indian tourism.
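For readers who prefer code over the console, importing the uploaded PDF into an unstructured datastore can be sketched as follows. This assumes the `google-cloud-discoveryengine` Python library, an already-created datastore in the `global` location, and the default collection and branch names; the project ID, datastore ID, and helper names are hypothetical.

```python
"""Sketch: import PDFs from Cloud Storage into a Vertex AI Search datastore."""


def branch_path(project_id, location, data_store_id):
    # Assumes the default collection and default branch of the datastore.
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        "/branches/default_branch"
    )


def make_document_client():
    # Imported lazily so import_pdfs can be exercised without the library installed.
    from google.cloud import discoveryengine_v1 as discoveryengine

    return discoveryengine.DocumentServiceClient()


def import_pdfs(client, project_id, location, data_store_id, gcs_uris):
    """Start an import of the given gs:// URIs and return the long-running operation."""
    request = {
        "parent": branch_path(project_id, location, data_store_id),
        "gcs_source": {
            "input_uris": list(gcs_uris),
            # "content" marks unstructured files: each file becomes one document.
            "data_schema": "content",
        },
    }
    return client.import_documents(request=request)


# Example usage (hypothetical IDs; needs credentials and network access):
# op = import_pdfs(make_document_client(), "my-project", "global",
#                  "wiki-pdf-datastore", ["gs://my-wiki-pdf-bucket/Tourism_in_India.pdf"])
# op.result()  # blocks until the import finishes
```

Indexing is asynchronous, so the datastore may take some time after the operation completes before queries return results.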

The GIF below shows how Vertex AI Search streamlines the overall process (you may have to click on it to zoom in!).

Entire procedural journey GIF
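Once the datastore is indexed, it can be queried from code as well as from the widget. The sketch below assumes the `google-cloud-discoveryengine` Python library and a serving config named `default_config` (the name may differ in your project); the IDs, the sample query, and the helper names are hypothetical.

```python
"""Sketch: query a Vertex AI Search datastore and list matching document IDs."""
from itertools import islice


def serving_config_path(project_id, location, data_store_id,
                        serving_config="default_config"):
    # Assumes the default collection; the serving config name is an assumption.
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/servingConfigs/{serving_config}"
    )


def make_search_client():
    # Imported lazily so search_pdf can be exercised without the library installed.
    from google.cloud import discoveryengine_v1 as discoveryengine

    return discoveryengine.SearchServiceClient()


def search_pdf(client, serving_config, query, max_results=5):
    """Return the document IDs of the first max_results search hits."""
    request = {
        "serving_config": serving_config,
        "query": query,
        "page_size": max_results,
    }
    pager = client.search(request=request)
    # The pager fetches further pages lazily, so cap how many results we consume.
    return [result.document.id for result in islice(pager, max_results)]


# Example usage (hypothetical IDs; needs credentials and network access):
# config = serving_config_path("my-project", "global", "wiki-pdf-datastore")
# print(search_pdf(make_search_client(), config, "best time to visit India"))
```

With a single PDF in the datastore, every hit points back to the same document; the value of listing IDs becomes clearer once more PDFs are imported.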

Summary

Building a PDF search engine with Vertex AI Search turns a complex data swamp into a well-organized knowledge base. It saves time, unlocks insights, and empowers everyone who interacts with your PDF collections.

Note: Should you have any concerns or queries about this post or my implementation, please feel free to connect with me on LinkedIn! Thanks!

Demo link: https://drive.google.com/file/d/1s8wKMnbkEfi_5jYyxH03D7uIoF4gZsu0/view?usp=sharing

Are you ready to start your quest? Embark on your Vertex AI Search adventure and conquer your PDF chaos using the following links:

Reference Links

Google Cloud Next ’23: Vertex AI S&C is now GA, hurray!

Feature releases and renames happen often for a new product; keep an eye on this:

For more hands-on practice, you can try these labs:

Aniket Agrawal

AI/ML | Cloud Engineer at Google, GenAI | Cybersecurity | ML | NLP | Image Processing Research Enthusiast https://www.linkedin.com/in/aniket-agrawal-a18990266/