How to

Build Neural Text Search Engine in 10 minutes

Improve search relevance by going beyond the conventional keyword-based search!

Shubham Saboo
Published in Jina AI
3 min read · May 6, 2022

With neural search, you can find what you’re looking for without even knowing what you’re looking for!

RE: Neural Search

Neural search leverages deep neural networks to intelligently search through all sorts of data, including images, videos, and PDFs. This innovative approach provides a much more comprehensive and contextual search than traditional text-based search engines.

Tutorial → Setting up the Neural Search Engine

Let’s dive into a quick tutorial to set up your neural search engine for textual data using Jina’s DocArray in Python. This blog does not serve as an introduction to neural search. Check out this article for background.

The idea behind this tutorial is to build a neural search engine over textual data. It takes a query sentence as input, matches it against the sentences in our dataset, and returns the matched sentences as output.

👩‍💻 Technical Stack: Jina’s DocArray, Python 3.7+

📚 Database Used: Pride & Prejudice e-book

Installing Dependencies

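To get started, install DocArray from PyPI or conda-forge. This walkthrough relies on the Document and DocumentArray classes, which were replaced in DocArray 0.30, so you may need to pin an earlier release (for example, docarray<0.30). Use either of the following commands: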

  1. Via Pip: pip install docarray
  2. Via conda: conda install -c conda-forge docarray

Code Walkthrough

First, we need to load the dataset from a URL, convert it into text, and put it into a Document.
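As a rough sketch of this first step, the snippet below fetches text from a URI and stores it on a small container. TextDoc and load_text are hypothetical stand-ins written for illustration, not DocArray's own classes, and the data: URI is a placeholder so the example runs offline. In the tutorial you would point it at the e-book's URL instead.

```python
from dataclasses import dataclass
from urllib.request import urlopen


@dataclass
class TextDoc:
    """Hypothetical stand-in for a Document that holds plain text."""
    uri: str
    text: str = ""


def load_text(doc: TextDoc, encoding: str = "utf-8") -> TextDoc:
    """Fetch the bytes behind doc.uri and store them as decoded text."""
    with urlopen(doc.uri) as response:
        doc.text = response.read().decode(encoding)
    return doc


# Placeholder data: URI (assumption) so the sketch needs no network access.
PLACEHOLDER_URI = (
    "data:text/plain;charset=utf-8,"
    "It%20is%20a%20truth%0Auniversally%20acknowledged"
)

book = load_text(TextDoc(uri=PLACEHOLDER_URI))
print(book.text)
```

Keeping the URI and the loaded text on the same object means later steps only need to pass one value around.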

Next, since our dataset is one long block of text, we need to break it into smaller chunks that can be collected into a DocumentArray. We split the text on the ‘\n’ character, i.e. whenever a new line is encountered, and store each resulting line as a Document in the DocumentArray.
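The splitting step can be sketched in plain Python as follows. The helper name split_into_chunks is made up for this example, and trimming whitespace and skipping blank lines is an assumption on my part, since empty lines would otherwise become empty Documents.

```python
from typing import List


def split_into_chunks(text: str) -> List[str]:
    """Split on newlines, trim whitespace, and drop empty lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


# Small made-up sample standing in for the full e-book text.
sample = (
    "It is a truth universally acknowledged,\n"
    "\n"
    "  that a single man in possession\n"
    "of a good fortune,\n"
)

chunks = split_into_chunks(sample)
print(chunks)
```

Each string in chunks would then be wrapped in its own Document.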

Next comes vectorization, i.e. we need to convert the features of each Document (here, the words in its text) into positions in a fixed-length numeric vector. That vector becomes the embedding of the Document in our DocumentArray.

There are many ways to do this, but a fast and space-efficient one is feature hashing. It applies a hash function to each feature and uses the resulting hash value as an index into the vector. DocArray saves us from the computation, and using feature hashing is as easy as a single line of code.
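To make the idea concrete, here is a minimal sketch of feature hashing in plain Python. It is not DocArray's implementation: the tokenizer, the choice of MD5 as the hash function, and the vector size of 256 are all assumptions made for illustration.

```python
import hashlib
import re
from typing import List


def hash_features(text: str, n_dim: int = 256) -> List[int]:
    """Map each word to a bucket via a hash and count the hits per bucket."""
    vector = [0] * n_dim
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        # A stable hash (unlike Python's built-in hash()) gives the same
        # index for the same word on every run.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_dim
        vector[index] += 1
    return vector


vec = hash_features("She entered the room, and she smiled.")
print(len(vec), sum(vec))
```

No vocabulary has to be built or stored, which is where the speed and space savings come from. The trade-off is that two different words can land in the same bucket.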

To get the Output 👉

We take the query sentence and convert it into a Document, vectorize it, and then match it with the vectors of the Documents in the DocumentArray.

Let’s take the query sentence "she entered the room" from Pride and Prejudice and see what response we get.
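The matching step can be sketched end to end in plain Python, as below. The embed, cosine, and search helpers are hypothetical names, cosine similarity is my assumption for the distance metric, and the three corpus sentences are made up rather than quoted from the e-book.

```python
import hashlib
import math
import re
from typing import List, Tuple


def embed(text: str, n_dim: int = 256) -> List[float]:
    """Feature-hash the words of a sentence into a fixed-length vector."""
    vector = [0.0] * n_dim
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:8], "big") % n_dim] += 1.0
    return vector


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, sentences: List[str], limit: int = 3) -> List[Tuple[float, str]]:
    """Score every sentence against the query and keep the best matches."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:limit]


# Made-up sample sentences standing in for the split e-book.
corpus = [
    "She entered the room.",
    "The weather was fine that morning.",
    "He walked out of the room as she entered.",
]

results = search("she entered the room", corpus, limit=2)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

Because the query is embedded with the same function as the corpus, a sentence with exactly the same words scores close to 1.0 and ranks first.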

🥷 Neural Text Search in Action

We have created a Colab notebook with a code walkthrough, where you can build and run the text search engine in the cloud using the Pride and Prejudice dataset.

Follow along with the Colab notebook 👉

References

  1. https://docarray.jina.ai
  2. https://github.com/jina-ai/docarray
  3. https://docs.jina.ai

If you would like to learn more or want me to write more on this subject, feel free to reach out.

If you liked this post or found it helpful, please take a minute to press the clap button. It increases the post's visibility for other Medium users.
