How to

Build a neural-search-powered chatbot

Get chatty with DocArray, Jina, and Jina Hub

Alex C-G
Jina AI


Photo by Rock’n Roll Monkey on Unsplash

When most people think search, they think of a standard search box. Type words in, smack the search button, and pages of luscious results come back.

But search is buried elsewhere too. Those customer support chatbots you know and love? They’re doing search. But instead of returning pages of results, they only return the one most relevant hit, and do so in a conversational UI.

Today we’ll look at how we can build our own chatbot using the Jina ecosystem, and how to easily deploy it on the cloud.

This tutorial is based on our example chatbot repo, and you can play with a live demo. The repo in turn is based on Jina’s jina hello chatbot.

Note that the code here is a lot simpler than in the repo, since we’re not too interested in actual deployment or efficiency, just in building our chatbot. However, the repo does contain a docker-compose.yml for easy deployment on the cloud.

What problem do we want to solve?

Close your eyes and take yourself back in time. No, not that far — pogs will never be cool again.

Just kidding. They were never cool to begin with. Photo by Emre Turkan on Unsplash

But back in time to the misty year of 2020, just at the start of the COVID-19 pandemic (some people think we’re still there). Everyone had questions, and you’d hear a lot of answers being thrown about which miiight not have held up too well when we look back.

We’re going to build a chatbot using a dataset of COVID-19 questions and answers from 2020. Just, for the love of all that is holy, don’t trust random datasets you find on the internet for medical advice. This is just an example application, not a replacement for your doctor.

How will it work for the end user?

From an end user’s perspective, all they need to do is type a question into the UI and one answer gets returned in a conversational interface. Just like any other chatbot.

How will it work under the hood?

We can break our chatbot code down into several steps:

  1. Process our data for loading into our search app
  2. Build and run a Flow to index our data (encoding it and storing those embeddings and metadata on disk)
  3. Run the same Flow to let users search with questions as input
  4. Open a nice GUI in the browser

Let’s get building!

You can refer to the repo for all code we discuss here. We won’t go too much into the UI, except to show where we’re interfacing with the Jina ecosystem.

Download the data

If you clone the repo, you’ll have the data already there. Alternatively, you can find the COVID-QA dataset on Kaggle.

We’re going to use the community.csv file, which has a bunch of fields. We only really care about question and answer for our use case.

Basic setup

We’ll create a file called config.py and put some values in there for basic settings. This saves us having to dig through our main app file every time we want to change something:
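As a minimal sketch, config.py could look something like the following. DATA_FILE and NUM_DOCS are the names we refer to later in this post; PORT, MODEL, and WORKSPACE_DIR are hypothetical names, and every value is a placeholder assumption rather than what the repo ships with.

```python
# config.py: illustrative settings only, adjust to your own setup

# Path to the COVID-QA CSV file (assumed location inside the project)
DATA_FILE = "data/community.csv"

# How many rows to index; a small number keeps trial runs fast (assumed value)
NUM_DOCS = 1000

# Port for the Flow's HTTP gateway (hypothetical name and value)
PORT = 12345

# Model the encoder should load (hypothetical name; any sentence-embedding
# model from the Hugging Face hub could go here)
MODEL = "sentence-transformers/paraphrase-mpnet-base-v2"

# Folder where the indexer stores its data (hypothetical name)
WORKSPACE_DIR = "workspace"
```

Keeping these in one place means app.py only needs a line like from config import DATA_FILE, NUM_DOCS.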

We’ll also need to install DocArray and Jina:

pip install docarray jina

Convert the data into a DocumentArray

Pretty much all of the Jina ecosystem runs on Documents, Jina’s primitive data type. These can contain text, image, audio, video, or any other kind of data. A bunch of Documents can be combined into a DocumentArray. All of this functionality is in the DocArray package.

In our case, we’ll treat each row of the CSV file as a single Document, then combine all of those Documents into a DocumentArray which we can then further process.

DocArray contains a nice from_csv feature to process CSVs quickly, so let’s use that in app.py:

As you can see, we reference the DATA_FILE and NUM_DOCS that we set up before in config.py. We then map each Document’s text attribute to the question field in the relevant row. Other fields will automatically be added as metadata in the Document’s tags attribute.
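To make that mapping concrete, here is a plain-Python picture of what happens to each row. This is not DocArray's implementation: rows_to_records is a hypothetical helper that uses only the standard library, and the sample row is made up. The commented-out line at the end shows roughly what the DocArray call looks like, assuming a pre-0.30 release (the ones that ship DocumentArray).

```python
import csv
import io


def rows_to_records(csv_file, text_field="question", size=None):
    """Turn CSV rows into dicts shaped like Documents: one column becomes
    `text`, every other column ends up under `tags`."""
    records = []
    for i, row in enumerate(csv.DictReader(csv_file)):
        if size is not None and i >= size:
            break  # same idea as limiting the load to NUM_DOCS rows
        text = row.pop(text_field)
        records.append({"text": text, "tags": row})
    return records


# Made-up sample data standing in for community.csv
sample = io.StringIO(
    "question,answer,source\n"
    "What is this row?,A toy row for illustration.,example\n"
)
records = rows_to_records(sample, size=10)
print(records[0]["text"])
print(records[0]["tags"]["answer"])

# Rough DocArray equivalent (assumption: docarray < 0.30):
# docs = DocumentArray.from_csv(
#     DATA_FILE, field_resolver={"question": "text"}, size=NUM_DOCS
# )
```

The field_resolver argument is what points the question column at the Document's text attribute; everything else in the row is carried along as tags.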

Build our indexing Flow

We want to take our DocumentArray, and for each Document:

  1. Encode its content into a vector embedding
  2. Store it in an index for easy lookup later

We’ll do this with our Flow, which we build using Jina’s core framework.

You can see that we’re using a Flow().add(...).add(...) arrangement to build our processing pipeline out of Executors:

  • Encoding: The first .add() adds our encoder Executor. We’ll use TransformerTorchEncoder from Jina Hub, running in a Docker container so we don’t have to worry about dependencies. We’ll also tell it which model we want it to use.
  • Indexing: With the final .add() we’ll bring in our indexer Executor. We just need something simple like the aptly-named SimpleIndexer. Again, we’re pulling it from Jina Hub, but we won’t run it in Docker since it needs write access to the local filesystem (which we could arrange with a few uses_with and volume options, but that complicates the code for a simple example).

You can also see we’ve enabled an HTTP gateway and established the port in the Flow arguments.
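A Flow along those lines can be sketched as below, assuming a Jina 3.x install. build_flow, the Executor names, the default port, and the default model are all illustrative assumptions; only the two Hub Executors and the HTTP gateway come from the description above. Jina is imported inside the function so the snippet can be read and imported without it.

```python
# Hub URIs for the two Executors described above
ENCODER_URI = "jinahub+docker://TransformerTorchEncoder"  # runs in Docker
INDEXER_URI = "jinahub://SimpleIndexer"  # runs on the host so it can write to disk


def build_flow(port=12345, model="sentence-transformers/paraphrase-mpnet-base-v2"):
    """Assemble the encoder -> indexer pipeline behind an HTTP gateway.

    `port` and `model` are placeholder defaults, not values from the repo.
    """
    from jina import Flow  # assumption: Jina 3.x

    return (
        Flow(protocol="http", port=port)
        .add(
            name="encoder",
            uses=ENCODER_URI,
            # tells the encoder which pretrained model to load
            uses_with={"pretrained_model_name_or_path": model},
        )
        .add(
            name="indexer",
            uses=INDEXER_URI,
            # lets Jina install the Executor's Python dependencies locally
            install_requirements=True,
        )
    )
```

If your Jina version rejects the port argument, check its docs: some early 3.x releases reportedly used port_expose instead.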

Run our indexing Flow

We can open our Flow with:
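A minimal sketch of that step, assuming flow is the Flow we just built and docs is our DocumentArray. index_documents is a hypothetical helper name, and show_progress is an optional extra that we assume your Jina version supports.

```python
def index_documents(flow, docs):
    """Start the Flow, send every Document to its /index endpoint,
    and shut the Flow down again when the block exits."""
    with flow:
        return flow.index(docs, show_progress=True)
```

The with block matters: the Executors only exist while the Flow is open, so all requests have to happen inside it.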

To get it going, run app.py in your terminal of choice. The first run may take a while since it has to download our Executors from Jina Hub and then churn through all that data.

You may see a few warnings. Don’t worry about them. Lord knows I never do.

I only indexed a few Documents here, not the full batch

Afterwards you should see a workspace directory that contains your indexed data:

And inside that workspace folder you’ll see index.db, an SQLite database that stores your data:

If you’ve indexed the whole dataset, the workspace directory should take up around 5.4 MB.
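Since index.db is a regular SQLite file, you can peek inside it with Python's built-in sqlite3 module. The sketch below only lists table names; list_tables is a hypothetical helper, and we make no assumptions about which tables SimpleIndexer actually creates.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage: pass the path to the index.db inside your workspace folder
# print(list_tables(path_to_index_db))
```

Double-check the path before running this, because sqlite3.connect quietly creates an empty database file if nothing exists there.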

Ask our chatbot a question

This time we’ll use the same Flow, but instead of indexing, we’ll search for an answer to a question. As we said before, everything in the Jina ecosystem is a Document, so we’ll:

  • Wrap the question into a Document
  • Pass that Document to our Flow
  • Get the closest questions returned from the Flow
  • Print the answer field from the closest matching question

We use the same two Executors as we did for indexing, but this time they act a little differently:

  • Encoding: Encodes your question text string into a vector embedding
  • Indexing: Searches the index for the closest embedding to the encoded search string and returns matching data
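The last two steps of that list boil down to reading the best match off the search results. Here is a minimal sketch: top_answer is a hypothetical helper, and it assumes the results behave like a DocumentArray, where each Document has matches and each match keeps the CSV's extra columns in tags.

```python
def top_answer(results, field="answer"):
    """Return the `field` tag of the closest match to the first query,
    or None if the search came back empty."""
    if not results:
        return None
    matches = results[0].matches  # matches arrive sorted, closest first
    if not matches:
        return None
    return matches[0].tags.get(field)


# Usage with a Flow (assumption: a Jina 3.x release where search()
# returns the result Documents directly):
# with flow:
#     results = flow.search(Document(text="your question here"))
#     print(top_answer(results))
```

We only look at results[0] because we send one question at a time, so there is only one query Document to inspect.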

Now, if you run app.py again, you’ll get an answer like:

Covid-19 is the result of a zoonotic outbreak from bats to an intermediate ( and unidentified host ) thence to humans. It is thought that bats don’t suffer the disease themselves as they carry low levels of virus but the intermediate host modifies the virus, and amplifies it so that it vastly increases the amount of virus that can shed into the environment infecting humans. One possible scenario is that bat coronavirus combined with a fish coronavirus inside the intermediate host. This is based on the observation that the SARS-CoV-2 spike gene shares a 39-base insertion with a type of soldierfish that swims in the South China Sea….

If you keep reading, you’ll see the answer contains information about cats further down:

The SARS-CoV-2 infects humans via the ACE2 surface receptor, but this receptor is not only found in humans. Other animals have similar but not exactly the same ACE2 receptors and several dogs and one cat have returned positive swabs for SARS-CoV-2. The cat developed covid-19 symptoms.

In a more advanced example we would break the whole (long) answer down into sentence chunks and only return the most relevant chunk to the user.

Adding a UI

We won’t focus too much on the UI here, except for certain parts that connect to the Jina ecosystem. You can find the files here.

We’ll use Streamlit to build our frontend and the wonderful Streamlit-Chat module to add a chatbot-like interface:

pip install streamlit-chat streamlit

Once again we’ll set up a basic frontend_config.py:

The most important part of frontend.py is how we interface with our backend to get the top answer to a user’s question. We do this using Jina Client:
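A frontend helper for this could be sketched as follows. get_answer and FALLBACK are hypothetical names, the host and port in the usage comment are placeholders, and we assume a Jina 3.x Client whose search() returns the result Documents directly. The client is passed in as an argument so the function stays easy to test.

```python
FALLBACK = "Sorry, I couldn't find an answer to that."


def get_answer(client, query_doc, field="answer", fallback=FALLBACK):
    """Send one query Document to the backend and return the first
    non-empty `field` tag among its matches, or a fallback message."""
    results = client.search(query_doc) or []
    for doc in results:
        for match in doc.matches:
            answer = match.tags.get(field)
            if answer:
                return answer
    return fallback


# Usage (placeholder host and port, which would live in frontend_config.py):
# from jina import Client, Document
# client = Client(host="0.0.0.0", port=12345, protocol="http")
# print(get_answer(client, Document(text="your question here")))
```

Returning a fallback string rather than None means the chat window always has something to show, even when the index has no match.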

The rest of the frontend code is basically working to take the user’s input, send it to that function, and display the answer in a chat interface. Since it’s not super-relevant to Jina we won’t cover it here.

Connecting our UI to our Flow

Our search code from before is great for CLI prototyping, but now we need to open (and keep open) our RESTful interface in our backend app.py. We’ll comment out our initial search functionality since we’re no longer using that, and use flow.block() to keep a port open instead:
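In sketch form, with serve as a hypothetical helper name and flow as the Flow we built earlier:

```python
def serve(flow):
    """Open the Flow and keep its gateway running until interrupted."""
    with flow:
        # block() waits here, so the HTTP port stays open for the frontend
        flow.block()
```

Without block(), the with statement would exit straight away and take the gateway down with it. Stop the server with Ctrl+C when you're done.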

Then we’ll once again run our backend code:

python app.py

Followed by our frontend code in a new terminal:

streamlit run frontend.py

Once everything has spun up, there you have it, a neural-search-powered chatbot!

Next steps

  • Already done: If we were going to run in the real world, we’d want to break things out into functions, get it running on Docker, and so on. We’ve already done that in our repo, so running and deployment is a walk in the park.
  • Breaking our dataset into smaller chunks: No one wants to read loooong answers, so it would be good to Sentence-ize everything and search through sentences rather than full text.
  • Finetuning our model: We’re just using a pretrained model that isn’t specialized for medical questions. To improve performance we can finetune that with Jina’s aptly-named Finetuner.

Join the community

Got feedback? Want to learn more about the Jina ecosystem and neural search? Join us on Slack and swing by some of our events!
