How to

Faceted knowledge base search with Jina and JCloud

From chatbot to AI-powered search engine, hosted fully on the cloud

Alex C-G
Published in Jina AI
5 min read · May 31, 2022

Photo by ThisisEngineering RAEng on Unsplash

A while ago we built a simple chatbot using Jina’s neural search framework:

When you think about it, a chatbot is basically a stripped-down search engine:

  • Chatbot: Input a single query, get a single result. Interface is bare-bones.
  • Search engine: Input a single query, get multiple results. Interface may contain filters and/or metadata.

This makes our chatbot a good foundation for building larger projects, in this case a search engine for a given knowledge base.
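To make the comparison concrete, here is a toy sketch in plain Python (no Jina involved). The function names, the two-dimensional embeddings, the sample texts, and the `general` source value are all made up for illustration. The point is only that a chatbot can be seen as a search that keeps the top hit, while a search engine keeps several hits and can filter on metadata.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, limit=3, source=None):
    """Search engine: up to `limit` ranked matches, optionally filtered by source."""
    candidates = [d for d in index if source is None or d["source"] == source]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_vec, d["embedding"]),
        reverse=True,
    )
    return ranked[:limit]

def chatbot(query_vec, index):
    """Chatbot: the same search, but only the single best match is returned."""
    results = search(query_vec, index, limit=1)
    return results[0] if results else None

# Invented toy data: texts, sources and embeddings are illustrative only.
TOY_INDEX = [
    {"text": "Wash your hands often", "source": "biomedical", "embedding": [1.0, 0.0]},
    {"text": "Schools may reopen", "source": "general", "embedding": [0.8, 0.6]},
    {"text": "Masks reduce spread", "source": "biomedical", "embedding": [0.0, 1.0]},
]

best = chatbot([1.0, 0.0], TOY_INDEX)
filtered = search([1.0, 0.0], TOY_INDEX, source="biomedical")
```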

You can play with our chatbot here. In this post we’ll take that code and build it into a better, faster, stronger knowledge base search engine. Alternatively, you can see the finished code here:

Get started

  1. git clone https://github.com/alexcg1/example-chatbot
  2. cd example-chatbot
  3. Create a virtual environment
  4. pip install -r requirements.txt
  5. pip install -U jina jcloud (let’s make sure we’re using the latest versions of Jina and JCloud)

Our dataset

Once again we’ll use the COVID-QA dataset:

While it’s out-of-date and should under no circumstances be referred to for medical advice, it does contain metadata that we can display and use to filter our results with faceted search.

Since we’re keeping our dataset exactly the same as our chatbot, we can just re-use our existing code which loads the CSV into a DocumentArray:
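In spirit, that loading step looks something like the following stdlib-only sketch. It is not the original loader, and it produces plain dicts rather than a DocumentArray. The function name `load_rows`, the choice of `question` as the text field, and the sample row are assumptions for illustration; the column names come from the CSV header shown later in this post.

```python
import csv
import io

def load_rows(csv_file, text_field="question", tag_fields=("answer_type", "source")):
    """Yield one dict per CSV row: the text to encode plus its metadata tags."""
    for row in csv.DictReader(csv_file):
        yield {
            "text": row[text_field],
            "tags": {name: row[name] for name in tag_fields},
        }

# Invented sample row, standing in for the real CSV file on disk.
SAMPLE = io.StringIO(
    "question,answer,answer_type,source\n"
    "Can pets carry the virus?,Made-up answer,text,biomedical\n"
)

docs = list(load_rows(SAMPLE))
```

With a real file you would pass an open file handle (for example `open(path, newline="")`) instead of the `StringIO` object.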

Our Flow

Next we’ll pass our DocumentArray into our Flow which generates our index for searching. Since our use case is quite different from a simple chatbot, we’ll need to adapt the chatbot Flow to fit our new use case:

  • Which model/encoder shall we use? Our prior model was specifically focused on answering questions, whereas we want to perform a more straightforward search operation.
  • Which indexer shall we use? Chatbot’s SimpleIndexer is…simple. It doesn’t support metadata filtering.
  • While we’re at it, let’s re-do the whole Flow in YAML so it’ll be easier to host on the cloud (more on that soon!).

Our changes

Most of the stuff we’ll keep the same, but we’ll swap out a few Executors for more suitable ones that we can get straight from Jina Hub. Since Hub provides a uniform interface for all our machine learning components, this means we’ll only need to change a few lines in our Flow (and not get bogged down with manually integrating stuff).

  • Model/encoder: Let’s go with SpacyTextEncoder and the en_core_web_md model. I’ve always found spaCy’s offerings very speedy compared to what I can get from Hugging Face.
  • Indexer: AnnLite supports filtering on metadata so let’s go with that.
  • YAML: We’ll migrate as we go along.

Our old Flow looks like this:

And when we make our changes (and rewrite in YAML) it looks like this:

Because we’re using AnnLiteIndexer, we need to pass a few extra arguments, like dim (i.e. the dimensionality of our model’s embeddings) and columns. Our columns are the fields that we’ll use to filter our search results. Let’s see the field names in our CSV file, data/community.csv. Using this command:

head -n 1 data/community.csv

Gives us:

question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source

For each field that we want to use as metadata, we need to provide:

  • Field name
  • Field type (e.g. str, int, bool, etc.)

In your own dataset, you may have useful metadata like:

  • Date
  • Quality rating
  • Number of replies
  • Author
  • Category

But in our case, only the answer_type and source fields provide any meaningful data for filtering. So now our YAML will look like:
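The same indexer arguments can also be pictured in Python. The sketch below builds them as a plain dict and checks each column name against the CSV header above. The `[name, type]` pair shape, the set of allowed type names, and the value 300 for `dim` (the vector width of spaCy's medium English model) are assumptions here, so check the AnnLiteIndexer Hub page for the exact format. `indexer_args` is a hypothetical helper, not part of Jina.

```python
HEADER = (
    "question_id,title,question,answer_id,answer,answer_type,"
    "wrong_answer,wrong_answer_type,url,source"
)

# Illustrative set of type names; the indexer's real list may differ.
ALLOWED_TYPES = {"str", "int", "float", "bool"}

def indexer_args(header, columns, dim):
    """Validate (name, type) pairs against the CSV header and build the arguments."""
    fields = header.split(",")
    for name, type_name in columns:
        if name not in fields:
            raise ValueError(f"unknown field: {name}")
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {type_name}")
    return {"dim": dim, "columns": [[name, type_name] for name, type_name in columns]}

# Only answer_type and source are useful for filtering in this dataset.
ARGS = indexer_args(HEADER, [("answer_type", "str"), ("source", "str")], dim=300)
```

Validating against the header up front catches a misspelled column name before any indexing happens.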

Adapting our app

Now we need to ensure our app loads the Flow from YAML and not from Python directly. So we can remove our previous Flow:

and just replace it with this:

Running our app

Now we can index our data with:

python app.py -t index

And open up our RESTful interface for searching with:

python app.py -t search

Searching our knowledge base

Basic search

Let’s start by using Jina Client to perform a basic search, just trying to find similar text strings.

Faceted search

But what if we want to filter by metadata? Let’s filter down to only answers that have biomedical as their source:
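One way to picture such a filter is as a small dict of conditions on the metadata fields. The sketch below assumes a MongoDB-style `$eq` operator, which is my reading of the AnnLite filter syntax and should be checked against its Hub page. The `matches` function is a hypothetical local stand-in that only shows what the filter means; it is not how the indexer evaluates it, and the sample answers are invented.

```python
def matches(tags, query_filter):
    """Return True if `tags` satisfies every {field: {"$eq": value}} clause."""
    for field, condition in query_filter.items():
        for op, expected in condition.items():
            if op != "$eq":
                # Only equality is sketched here.
                raise NotImplementedError(op)
            if tags.get(field) != expected:
                return False
    return True

# Keep only answers whose source is "biomedical".
FILTER = {"source": {"$eq": "biomedical"}}

# Invented sample answers for illustration.
ANSWERS = [
    {"text": "Answer A", "tags": {"source": "biomedical", "answer_type": "text"}},
    {"text": "Answer B", "tags": {"source": "general", "answer_type": "text"}},
]

kept = [a for a in ANSWERS if matches(a["tags"], FILTER)]
```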

You can find more information on creating search filters on the AnnLiteIndexer Hub page. If you want to see more complex filters in action, you can check the Jina fashion search example, or see its frontend code.

Hosting on the cloud

Hosting a search on our own machine is fine and dandy, but what if we want to deploy it for real on the cloud? Jina has your back. We can do that with JCloud.

We’ll need to make a few changes to our Flows. You can refer to our previous post on that:

And that gives us the following YAML:

We can deploy our Flow with:

jc deploy flow.yml  # make a note of the URL that comes up!

Since we only indexed our data on our own machine, we need to re-index it on the cloud with the following code:

(instead of the flow.index(...) code we used earlier)

Searching our knowledge base in the cloud

Now that all our data is indexed on the cloud, we can search through it:

With cURL:

curl -X POST https://8bae9ebaf9.wolf.jina.ai/search -H 'Content-Type: application/json' -d '{"data":[{"text": "will my teacher give me covid"}], "execEndpoint":"/"}'
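The same request can be built with Python's standard library, which is handy if you don't want to shell out to cURL. This is a minimal sketch: `build_search_request` is a hypothetical helper, the payload simply mirrors the cURL call above, and the host URL is the example deployment from this post, which may no longer be live.

```python
import json
from urllib.request import Request

def build_search_request(host, text):
    """Build (but do not send) a POST request mirroring the cURL call."""
    payload = {"data": [{"text": text}], "execEndpoint": "/"}
    return Request(
        url=f"{host}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request(
    "https://8bae9ebaf9.wolf.jina.ai",
    "will my teacher give me covid",
)
# To send it, pass `request` to urllib.request.urlopen and read the response body.
```

Building the request separately from sending it makes it easy to inspect the payload before hitting the deployment.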

With Jina Client:

Alternatively, you can use jfc, which I created for rapid Flow tests (no, not the COVID-y kind):

jfc search "can I catch COVID from my pet tardigrade?" -h https://8bae9ebaf9.wolf.jina.ai

Be aware that jfc is just a small project I’m building and we don’t offer support. I don’t recommend it for production use, and the Happy Fun Ball disclaimer applies.

Next steps

  • We’ll leave integrating Jina Client and your frontend as an exercise for our dear readers. Hint: You can look at what we’ve done with our fashion search example.
  • Want to build your own knowledge base search? Join Jina’s Slack community and look for the #projects-knowledge-base channel

Issues? Questions?

Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆

Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋
