Faceted knowledge base search with Jina and JCloud
From chatbot to AI-powered search engine, hosted fully on the cloud
A while ago we built a simple chatbot using Jina’s neural search framework:
When you think about it, a chatbot is basically a stripped-down search engine:
- Chatbot: Input a single query, get a single result. Interface is bare-bones.
- Search engine: Input a single query, get multiple results. Interface may contain filters and/or metadata.
This makes our chatbot a good foundation for building larger projects, in this case a search engine for a given knowledge base.
You can play with our chatbot here. In this post we’ll take that code and build it into a better, faster, stronger knowledge base search engine. Alternatively, you can see the finished code here:
Get started
git clone https://github.com/alexcg1/example-chatbot
cd example-chatbot
- Create a virtual environment
pip install -r requirements.txt
pip install -U jina jcloud
(let’s make sure we’re using the latest versions of Jina and JCloud)
Our dataset
Once again we’ll use the COVID-QA dataset:
COVID-QA
A collection of COVID-19 Q&A pairs and transformer baselines for QA evaluation
www.kaggle.com
While it’s out-of-date and should under no circumstances be referred to for medical advice, it does contain metadata that we can display and use to filter our results with faceted search.
Since we’re keeping our dataset exactly the same as our chatbot, we can just re-use our existing code which loads the CSV into a DocumentArray:
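As a rough sketch of what that loading step can look like (not necessarily the chatbot's exact code), the snippet below reads the CSV with Python's standard library and keeps every non-text column as a tag. Using `question` as the text field and the helper names `rows_to_records` and `to_document_array` are assumptions for illustration, and the `Document`/`DocumentArray` classes are the ones from the docarray versions that shipped alongside Jina 3.

```python
import csv


def rows_to_records(lines, text_field="question", limit=None):
    """Turn CSV lines into {'text': ..., 'tags': {...}} records.

    `text_field` is an assumption: the column whose text gets embedded.
    Every other column is kept as a tag so it can be shown or filtered on.
    """
    records = []
    for row in csv.DictReader(lines):
        if limit is not None and len(records) >= limit:
            break
        text = (row.get(text_field) or "").strip()
        if not text:
            continue  # skip rows with nothing to embed
        tags = {key: value for key, value in row.items() if key != text_field}
        records.append({"text": text, "tags": tags})
    return records


def to_document_array(records):
    """Wrap the records in a DocumentArray (requires docarray to be installed)."""
    from docarray import Document, DocumentArray  # imported lazily

    return DocumentArray([Document(text=r["text"], tags=r["tags"]) for r in records])


# Typical usage (path taken from the `head` command later in this post):
# with open("data/community.csv", newline="", encoding="utf-8") as f:
#     docs = to_document_array(rows_to_records(f))
```

Keeping the parsing separate from the docarray wrapper makes it easy to check the records before anything gets embedded.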
Our Flow
Next we’ll pass our DocumentArray into our Flow which generates our index for searching. Since our use case is quite different from a simple chatbot, we’ll need to adapt the chatbot Flow to fit our new use case:
- Which model/encoder shall we use? Our prior model was specifically focused on answering questions, whereas we want to perform a more straightforward search operation.
- Which indexer shall we use? Chatbot’s SimpleIndexer is…simple. It doesn’t support metadata filtering.
- While we’re at it, let’s re-do the whole Flow in YAML so it’ll be easier to host on the cloud (more on that soon!).
Our changes
Most of the stuff we’ll keep the same, but we’ll swap out a few Executors for more suitable ones that we can get straight from Jina Hub. Since Hub provides a uniform interface for all our machine learning components, this means we’ll only need to change a few lines in our Flow (and not get bogged down with manually integrating stuff).
- Model/encoder: Let’s go with SpacyTextEncoder and the en_core_web_md model. I’ve always found spaCy’s offerings very speedy compared to what I can get from Hugging Face.
- Indexer: AnnLite supports filtering on metadata so let’s go with that.
- YAML: We’ll migrate as we go along.
Our old Flow looks like this:
And when we make our changes (and rewrite in YAML) it looks like this:
AnnLiteIndexer expects a few extra arguments, like dim (i.e. the dimensionality of our model’s embeddings) and columns. Our columns are the fields that we’ll use to filter our search results. Let’s see the field names in our CSV file. Using this command:
head -n 1 data/community.csv
Gives us:
question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source
For each field that we want to use as metadata, we need to provide:
- Field name
- Field type (e.g. str, int, bool, etc.)
In your own dataset, you may have useful metadata like:
- Date
- Quality rating
- Number of replies
- Author
- Category
But in our case, only the answer_type and source fields provide any meaningful data for filtering. So now our YAML will look like:
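To make the shape of those indexer settings concrete, here is a small Python sketch that checks the chosen filter fields against the CSV header and builds the values that would sit under the indexer's `uses_with` section. The helper `build_columns`, the `(name, type)` pair layout, and `dim=300` (assumed here for spaCy's medium English model) are assumptions, so check them against the AnnLiteIndexer Hub page and your encoder before relying on them.

```python
# Header line as printed by the `head` command in this post.
CSV_HEADER = (
    "question_id,title,question,answer_id,answer,answer_type,"
    "wrong_answer,wrong_answer_type,url,source"
)


def build_columns(header_line, filter_fields):
    """Return (name, type) pairs, checking each name exists in the CSV header."""
    available = header_line.strip().split(",")
    missing = [name for name in filter_fields if name not in available]
    if missing:
        raise ValueError(f"not in CSV header: {missing}")
    return [(name, type_name) for name, type_name in filter_fields.items()]


# Values destined for the indexer's `uses_with` section.
# dim=300 is an assumption about the encoder's output size; verify it
# against the model you actually load.
indexer_settings = {
    "dim": 300,
    "columns": build_columns(CSV_HEADER, {"answer_type": "str", "source": "str"}),
}
```

Validating the field names up front catches a typo in a column name before indexing starts, rather than at query time when the filter silently matches nothing.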
Adapting our app
Now we need to ensure our app loads the Flow from YAML and not from Python directly. So we can remove our previous Flow:
and just replace it with this:
Running our app
Now we can index our data with:
python app.py -t index
And open up our RESTful interface for searching with:
python app.py -t search
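For readers wondering how those two commands might map onto code, here is a minimal sketch of an `app.py` dispatcher. It assumes the Flow lives in `flow.yml`, uses `argparse` for the `-t` flag (the real app may use a different CLI library), and expects the Flow YAML itself to select the HTTP protocol for the RESTful interface. `Flow.load_config`, `flow.index` and `flow.block` are Jina 3 calls; the function names are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse `-t index` or `-t search` from the command line."""
    parser = argparse.ArgumentParser(description="Knowledge base search")
    parser.add_argument("-t", "--task", choices=["index", "search"], required=True)
    return parser.parse_args(argv)


def load_flow(path="flow.yml"):
    """Build the Flow from YAML instead of Python (requires jina)."""
    from jina import Flow  # imported lazily so argument parsing works without jina

    return Flow.load_config(path)


def run(task, docs=None, flow_path="flow.yml"):
    """Index the given docs, or keep the Flow open to serve search requests."""
    flow = load_flow(flow_path)
    with flow:
        if task == "index":
            flow.index(docs, show_progress=True)
        else:
            flow.block()  # serve until interrupted


# Typical usage, where `docs` is the DocumentArray loaded from the CSV:
# args = parse_args()
# run(args.task, docs=docs)
```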
Searching our knowledge base
Basic search
Let’s start by using Jina Client to perform a basic search, just trying to find similar text strings.
Faceted search
But what if we want to filter by metadata? Let’s filter down to only answers that have biomedical as their source:
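A hedged sketch of such a query with Jina Client follows. The `filter` and `limit` parameter names and the MongoDB-style `$eq` operator are assumptions based on AnnLite's filter syntax, so confirm them on the AnnLiteIndexer Hub page. The helper names are hypothetical, and the shape of the return value depends on your Jina version.

```python
def build_parameters(source=None, limit=10):
    """Build the request parameters; add a filter only when a source is given."""
    parameters = {"limit": limit}
    if source is not None:
        # Assumed AnnLite filter syntax: {field: {operator: value}}
        parameters["filter"] = {"source": {"$eq": source}}
    return parameters


def search(host, text, source=None, limit=10):
    """Send one text query to the Flow and return its matches (requires jina)."""
    from docarray import Document  # imported lazily
    from jina import Client

    client = Client(host=host)
    results = client.search(
        Document(text=text),
        parameters=build_parameters(source, limit),
    )
    # Recent Jina 3 releases return a DocumentArray of the query Documents.
    return results[0].matches


# Typical usage (the host is whatever address your Flow is listening on):
# for match in search(host, "how does covid spread", source="biomedical"):
#     print(match.text, match.tags)
```

Calling `search` without a `source` gives the basic, unfiltered search, so one function covers both cases.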
You can find more information on creating search filters on the AnnLiteIndexer Hub page. If you want to see more complex filters in action, you can check the Jina fashion search example, or see its frontend code.
Hosting on the cloud
Hosting a search on our own machine is fine and dandy, but what if we want to deploy it for real on the cloud? Jina has your back. We can do that with JCloud.
We’ll need to make a few changes to our Flows. You can refer to our previous post on that:
And that gives us the following YAML:
We can deploy our Flow with:
jc deploy flow.yml # make a note of the URL that comes up!
And (since we only indexed our data on our own machine), re-index (on the cloud) with the following code:
(instead of the flow.index(...) code we used earlier)
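In outline, re-indexing on the cloud means pointing a Jina Client at the deployed Flow's URL and sending it the same Documents. The sketch below is one way to write that, assuming `docs` is the DocumentArray built from the CSV. The function names and the batch size are hypothetical; `Client.index` with `request_size` and `show_progress` is part of the Jina 3 client.

```python
def reindex(client, docs, request_size=32):
    """Send docs to the remote Flow's /index endpoint in batches."""
    client.index(docs, request_size=request_size, show_progress=True)


def reindex_on_jcloud(host, docs):
    """Connect to the deployed Flow and index docs there (requires jina)."""
    from jina import Client  # imported lazily

    reindex(Client(host=host), docs)


# Typical usage, with the URL printed by `jc deploy`:
# reindex_on_jcloud("<your JCloud URL>", docs)
```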
Searching our knowledge base in the cloud
Now that all our data is indexed on the cloud, we can search through it:
With cURL:
curl -X POST https://8bae9ebaf9.wolf.jina.ai/search -H 'Content-Type: application/json' -d '{"data":[{"text": "will my teacher give me covid"}], "execEndpoint":"/"}'
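The same request can also be sent from Python using only the standard library. This sketch mirrors the cURL call above: it posts the same JSON body to the `/search` route. The function names are hypothetical, and the optional `parameters` key is an assumption about what the gateway accepts alongside `data`.

```python
import json
import urllib.request


def build_search_request(host, text, parameters=None):
    """Build the POST request that the cURL example sends."""
    body = {"data": [{"text": text}], "execEndpoint": "/"}
    if parameters:
        body["parameters"] = parameters  # e.g. a filter; assumed to be supported
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(host, text, parameters=None, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(host, text, parameters)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Typical usage, with the URL printed by `jc deploy`:
# response = search("<your JCloud URL>", "will my teacher give me covid")
```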
With Jina Client:
Alternatively, you can use jfc, which I created for rapid Flow tests (no, not the COVID-y kind):
jfc search "can I catch COVID from my pet tardigrade?" -h https://8bae9ebaf9.wolf.jina.ai
Be aware that jfc is just a small project I’m building and we don’t offer support. I don’t recommend it for production use, and the Happy Fun Ball disclaimer applies.
Next steps
- We’ll leave integrating Jina Client and your frontend as an exercise for our dear readers. Hint: You can look at what we’ve done with our fashion search example.
- Want to build your own knowledge base search? Join Jina’s Slack community and look for the #projects-knowledge-base channel
Issues? Questions?
- Found a bug or got a feature request? Leave an issue in our repo
- Got questions? Join our Slack and ask in our #support channel