Vertex AI Search for Wikipedia PDF Search Widget

Enterprise Search for the Wiki PDFs

Published in

Google Cloud - Community

5 min read · Jan 28, 2024

In the previous post, we built a Wiki PDF chatbot powered by Google Vertex AI Conversation. Now, using the same PDF, we build a search engine with Vertex AI Search, Google’s enterprise search technology.

PDFs are a double-edged sword: they preserve formatting beautifully but make their contents frustratingly difficult to search. The ‘CTRL-F’ shortcut is probably the first thing that comes to mind. But what if you’re dealing with hundreds or thousands of PDFs? CTRL-F also depends on the literal presence of the query keywords in the text, so what about semantic search?

Google Cloud Enterprise Search using Vertex AI Search (f.k.a. Gen App Builder)

Traditional search engines often struggle with the unique structure of PDFs, making it frustrating to uncover the knowledge you need. A dedicated PDF search engine tackles this challenge head-on by understanding the nuances of PDFs and indexing not only the text but also metadata such as titles, authors, and keywords.

It should let you search by meaning, not only by keywords, and help your query find its answer, the needle in the haystack of your documents, websites, structured data, and so on.

The question asks “Where is my answer?” :) Background image: https://www.andyreynolds.com/image/I0000rdLqiFYoXlw

With Vertex AI Search, users can run meaningful searches that return results tailored to their queries across multiple PDFs. For simplicity, and as a beginner-friendly starting point, we work with a single PDF here.

Download a sample Wikipedia PDF to create the unstructured datastore for the Vertex AI Search engine

Our Vertex AI Search engine can answer questions about Indian tourism in a variety of ways, drawing on the large amount of information in our sample Wikipedia PDF, ‘Tourism in India’. The response content is retrieved from the relevant indexed PDF text, table data, picture captions, and so forth, as seen in the sample GIFs below.

Search Results (left) and entire Configuration Window (right)

Procedure

So far, we’ve covered the objective and seen the search engine in action in the GIFs. But how do we build it? This section covers the technical and product details.

As a prerequisite for creating the datastore, we first create a Cloud Storage bucket and upload the PDF file to it.
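For readers who prefer code to the console, the bucket and upload step can be sketched in Python with the google-cloud-storage client library. This is a minimal sketch, not the exact procedure used in this post: the project ID, bucket name, file path, and bucket location are placeholder assumptions, and it assumes Application Default Credentials are already configured.

```python
from pathlib import Path


def gcs_uri(bucket_name: str, blob_name: str) -> str:
    """Return the gs:// URI that a datastore import would point at."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_pdf(project_id: str, bucket_name: str, pdf_path: str,
               location: str = "US") -> str:
    """Create the bucket if it is missing, upload the PDF, return its URI.

    All argument values are supplied by the caller; nothing here is
    specific to the setup shown in this post.
    """
    # Imported lazily so gcs_uri() works without the library installed.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client(project=project_id)

    # lookup_bucket() returns None when the bucket does not exist.
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)

    # Store the object under the file's own name, e.g. "Tourism_in_India.pdf".
    blob_name = Path(pdf_path).name
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(pdf_path, content_type="application/pdf")
    return gcs_uri(bucket_name, blob_name)
```

A call such as `upload_pdf("my-project", "my-wiki-pdf-bucket", "Tourism_in_India.pdf")` (all three values hypothetical) would return the `gs://` URI to reference when creating the datastore.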

Vertex AI Search is a managed service that lets us create and deploy solutions based on Google Enterprise Search. It gives our search widget access to the large amount of information about Indian tourism in the Wikipedia PDF datastore.
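The post creates the datastore through the console, but the import of the PDF from Cloud Storage into an existing unstructured datastore can also be sketched with the google-cloud-discoveryengine Python client. Treat this as an illustration under assumptions: the project ID, datastore ID, and `gs://` URI are placeholders, the datastore is assumed to already exist in the `global` location under the default collection, and credentials are assumed to be configured.

```python
def branch_path(project_id: str, location: str, data_store_id: str) -> str:
    """Build the resource name of the datastore's default branch."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/branches/default_branch"
    )


def import_pdf(project_id: str, data_store_id: str, pdf_gcs_uri: str,
               location: str = "global"):
    """Import one PDF from Cloud Storage into an unstructured datastore."""
    # Imported lazily so branch_path() works without the library installed.
    # pip install google-cloud-discoveryengine
    from google.cloud import discoveryengine_v1 as discoveryengine

    client = discoveryengine.DocumentServiceClient()
    request = discoveryengine.ImportDocumentsRequest(
        parent=branch_path(project_id, location, data_store_id),
        gcs_source=discoveryengine.GcsSource(
            input_uris=[pdf_gcs_uri],
            data_schema="content",  # "content" marks unstructured files such as PDFs
        ),
        # INCREMENTAL adds or updates documents without deleting existing ones.
        reconciliation_mode=(
            discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL
        ),
    )
    # import_documents() starts a long-running operation; result() waits for it.
    operation = client.import_documents(request=request)
    return operation.result()
```

Indexing a PDF can take a while after the import finishes, so search results may not appear immediately.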

The GIF below shows how Vertex AI Search streamlines the overall process (you may have to click on it to zoom in!).
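Beyond the widget, the same datastore can be queried programmatically. The following is a minimal sketch using the google-cloud-discoveryengine Python client, not the widget integration shown in this post. The project ID and datastore ID are placeholders, the serving config ID `default_search` is an assumption that may differ in your setup, and credentials are assumed to be configured.

```python
def serving_config_path(project_id: str, location: str, data_store_id: str,
                        serving_config_id: str = "default_search") -> str:
    """Build the serving config resource name used by search requests."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/servingConfigs/{serving_config_id}"
    )


def search_pdf(project_id: str, data_store_id: str, query: str,
               location: str = "global", page_size: int = 5) -> list:
    """Run one query and return the resource names of matching documents."""
    # Imported lazily so serving_config_path() works without the library.
    # pip install google-cloud-discoveryengine
    from google.cloud import discoveryengine_v1 as discoveryengine

    client = discoveryengine.SearchServiceClient()
    request = discoveryengine.SearchRequest(
        serving_config=serving_config_path(project_id, location, data_store_id),
        query=query,
        page_size=page_size,
    )
    # search() returns a pager; iterating it yields individual search results.
    return [result.document.name for result in client.search(request=request)]
```

For example, `search_pdf("my-project", "my-datastore", "best time to visit Kerala")` (hypothetical IDs and query) would list the documents the engine considers relevant, whether or not those exact keywords appear in the PDF.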

Summary

Building a PDF search engine with Vertex AI Search turns a complex data swamp into a well-organized knowledge base. It saves time, unlocks insights, and empowers everyone who interacts with your PDF collections.

Note: Should you have any concerns or queries about this post or my implementation, please feel free to connect with me on LinkedIn! Thanks!

Aniket Agrawal on LinkedIn: Delving into the depths of long PDFs can be a daunting task. Hence…


Demo link: https://drive.google.com/file/d/1s8wKMnbkEfi_5jYyxH03D7uIoF4gZsu0/view?usp=sharing

Are you ready to start your quest? Embark on your Vertex AI Search adventure and conquer your PDF chaos using the following links:

Reference Links

Tourism in India — Wikipedia

Tourism in India is 4.6% of the country’s gross domestic product (GDP). Unlike other sectors, tourism is not a priority…

en.wikipedia.org

Google Cloud Next ’23: Vertex AI S&C is now GA, hurray!

Vertex AI Search and Conversation is now generally available | Google Cloud Blog

Vertex AI Search and Conversation is now generally available. Build and deploy search engines and chatbots quickly and…

cloud.google.com

New products often go through feature releases and renames; keep an eye on the release notes:

Vertex AI Search and Conversation release notes | Google Cloud

The Google Cloud console and the documentation at cloud.google.com have been updated to show the current product name…

cloud.google.com

For more hands-on practice, you can try these labs:

Use Vertex AI Search on PDFs (unstructured data) in Cloud Storage from a Cloud Run service | Google…

Learn how to make a query to Vertex AI Search from a Cloud Run service.

codelabs.developers.google.com