Turn your FAQ pages into conversational AI

J William Murdock
IBM watsonx Assistant
11 min read · Apr 27, 2021

Authors: J William Murdock, Guy Lev, Michal Shmueli-Scheuer, Jaymin Desai, Anastas Stoyanovsky, Christophe Guittet, Jim Hurne, Ofir Florenz, and David Konopnicki

Image shows the checkbox to Apply FAQ extraction in Watson Discovery
Check the box to turn on FAQ extraction in Watson Discovery

Note: The beta test for this feature is ending in the first quarter of 2023, and this feature will be removed at that time. We are still quite happy with how well the feature worked, but it hasn’t been popular enough to justify continued maintenance and investment to move it beyond beta into a generally available release.

Sometimes your customers face problems or have questions. It happens — nobody’s perfect. You want to get those customers the answers they need. More and more customers try to find answers by themselves before reaching out to your support service (especially the introverts). That’s good for reducing your costs, if your customers can actually find the answers they are looking for!

A common way to let customers find answers is to have one or more FAQ pages or documents. It is often difficult to navigate through this jungle. Customers struggle to find where to look for specific information. They can benefit from a centralized search mechanism available across multiple channels.

The new FAQ extraction (beta) capability in IBM Watson Discovery processes FAQ documents. It splits each document into pieces, with each piece containing one question and one answer. IBM Watson Assistant can connect to Watson Discovery using its search skill. Together, these products let you turn an FAQ document into a working AI answering agent with very little effort.

Why does it matter?

Question-and-answer pairs in an FAQ document are typically self-contained chunks of information. Splitting the document into these chunks of information makes search more effective. It makes it more likely that end users will find exactly the information they want. This is particularly important in applications where you only have a small amount of space to show results to a user. For example, a mobile search application can often show only a small amount of text. Similarly, a relatively small widget within a larger web page will have limited space. In these cases, it is very important to find the right text and not a lot of irrelevant text.

A question mark lamp
Image from Jon Tyson on Unsplash

How does it work?

FAQ extraction works in three phases:

  • The rule-based phase works on the plain text of the documents only. It identifies question sentences, non-question sentences, and other text (e.g., copyright notices, which are generally not complete sentences). To do this, it uses language-specific heuristics, such as the fact that English questions generally end with a question mark. It associates question sentences with the non-question sentences that follow them. It outputs a set of candidate question-answer pairs.
  • The precision-oriented phase then seeks to discard invalid question-answer pairs. It uses the full document structure and looks for questions that do not fit into the same structural patterns as the other candidate question-answer pairs. For example, if all but one of the candidate questions are italicized section headings, then the one remaining question may be a false positive (e.g., a rhetorical question).
  • The recall-oriented phase then seeks to find additional question-answer pairs. It also uses the document structure. It can find questions and answers that the heuristics in the rule-based phase missed.
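
To make the rule-based phase more concrete, here is a minimal Python sketch of the idea. It is not the actual Watson Discovery implementation: the sentence splitter, the function name `candidate_pairs`, and the sample text are all assumptions made for illustration, and it only uses the English question-mark heuristic mentioned above.

```python
import re

# Naive sentence splitter (an assumption): break on whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def candidate_pairs(text):
    """Pair each question sentence with the non-question sentences that follow it."""
    pairs = []
    question, answer = None, []
    for sentence in _SENTENCE_END.split(text.strip()):
        sentence = sentence.strip()
        if not sentence:
            continue
        if sentence.endswith("?"):
            # A new question closes the previous candidate pair, if it has an answer.
            if question is not None and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = sentence, []
        elif sentence[-1] in ".!":
            # Complete non-question sentence: attach it to the open question, if any.
            if question is not None:
                answer.append(sentence)
        # Anything else (e.g., a copyright notice with no end punctuation)
        # is treated as "other text" and skipped.
    if question is not None and answer:
        pairs.append((question, " ".join(answer)))
    return pairs


faq = (
    "What is FAQ extraction? It splits an FAQ page into pairs. "
    "Each pair has one question and one answer. "
    "Is it free? A Lite plan is available. Copyright Example Corp"
)
for q, a in candidate_pairs(faq):
    print(q, "->", a)
```

The output of this step is only a set of candidates. The two later phases exist because a heuristic this simple will both accept some false questions and miss some real ones.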

What business problems does it solve?

Many enterprises have one or more frequently asked question (FAQ) documents. For example, IBM’s Customer Support FAQ page provides questions and answers about IBM Support. However, end users often want an interactive interface in which they ask questions and get answers.

Building and maintaining such an interactive system can consume a huge amount of time. With FAQ extraction, you can use a connector to an existing FAQ page or document and let the automatic synchronization keep the system up to date. You can also simply upload/update one or more FAQ documents.

If you want to quickly build a complete FAQ answering system, you should consider combining Watson Discovery and Watson Assistant. You can start by creating a search skill in Watson Assistant and then connect it to Watson Discovery and point Watson Discovery at your FAQs. For more details on how to do this, see the next section. Here are some reasons why this can be better than Watson Discovery alone:

  • Watson Assistant provides a fully operational web user interface. It is very quick and easy to deploy your system onto a web page.
  • Watson Assistant also provides integrations with other platforms such as Facebook and Slack.
  • IBM Watson Assistant lets you build complex dialog flows. Your virtual assistant can ask clarification questions, process transactions, perform computations, etc.

If you want a simple system to answer the FAQ questions, you might not need all that power right away. However, you may want this power in the future. Exposing the FAQ extraction through Watson Assistant makes it easy to add these things later.

How do I get started?

Clouds over Paris, used here to symbolize cloud computing
Image from J Shim on Unsplash

Here is an easy way to get started using FAQ extraction through Watson Assistant and Watson Discovery:

1. Log in to IBM Cloud.

2. You will need an instance of Watson Assistant (any plan except “Lite,” including the free “Plus Trial” plan) and Watson Discovery (any plan, including the free “Lite” plan). If you do not have these, click on the Create Resource button and follow the instructions to create each of them.

3. Click on your Watson Assistant instance (on the Resource List page) and then click the “Launch Watson Assistant” button to enter the Watson Assistant tooling.

Screenshot of Watson Assistant page. There is a left-hand menu with “Manage” highlighted. The main body of the page shows a highlighted button with “Launch Watson Assistant” on it and a section on Credentials with fields “API key” and “URL”
In Step 3, the “Launch Watson Assistant” button is emphasized in blue

4. Click on “Create Assistant,” enter a name for your new assistant, and click “Create assistant.” For example, if you want to create an assistant to answer questions about COVID-19, you could call it “COVID-19 Answering Agent.”

5. Click on “add search skill” and under the “create skill” tab enter a name for your search skill. For example, if you want to create a skill from the content on the site faq.coronavirus.gov you could name it “faq.coronavirus.gov search.” Hit “Continue.”

6. Select your Watson Discovery instance from the pull-down menu.

7. Click on the button to create a new collection or a new project (you will have one of these options available depending on the version of Watson Discovery that you created back in step 2).

Screenshot of Search Skill page, in this case for faq.coronavirus.gov search. There is a dropdown menu labeled “Choose a Discovery instance to connect to”, an option to create a new collection, and a list of collections to choose from with only one asset listed.
The “Create new collection” button is on the right side of the screen for Step 7 when using Watson Discovery v1

8. The details for step 8 are slightly different for Watson Discovery v1 and v2:

  • 8.a. In Watson Discovery v1, click “Web Crawl” and then enter a URL for a page or site with FAQs and click “Add”. If you pointed to a single FAQ page (e.g., https://faq.coronavirus.gov/vaccines) and you only want content on that one page, you should then click on the gear icon next to that URL and enter 0 for “Maximum hops” to prevent the crawler from following links to other pages. Alternatively, if you wanted to pull in a whole site (e.g., https://faq.coronavirus.gov/), then you do not need to click on the gear icon and Watson Discovery will crawl the whole site by default. Next, check the box next to “Apply FAQ extraction” and hit “Save & Sync.” Wait for it to process the data and report how many documents were ingested (this can take a few minutes or many hours depending on how big and complex your site is). Click on “Finish setup in Watson Assistant.”
  • 8.b. In Watson Discovery v2, click on any of the data sources or “Upload Data” and then provide your FAQ content. For example, you can click on “Web Crawl,” then “Next,” and then put a URL in the “Starting URLs” field and hit “Add”. As with v1, if you pointed to a single FAQ page (e.g., https://faq.coronavirus.gov/vaccines) and you only want content on that one page, you should then click on the settings icon (which is a pencil and paper in v2) next to that URL and enter 0 for “Maximum number of links to follow” to prevent the crawler from following links to other pages. Following this, check the box next to “Apply FAQ extraction” and hit “Finish.” Wait until the ingestion completes and then click “Back to Watson Assistant.”
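
To see why a limit of 0 keeps the crawl on a single page, here is a toy Python sketch of a hop-limited breadth-first crawl. It is only an illustration of the concept: the `crawl` function and the in-memory link graph are hypothetical, and the real Watson Discovery crawler may behave differently in its details.

```python
from collections import deque


def crawl(start, links, max_hops):
    """Return the pages reachable from start by following at most max_hops links."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget used up: do not follow links from this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, hops + 1))
    return order


# Hypothetical link graph standing in for a small FAQ site.
site = {
    "/": ["/vaccines", "/testing"],
    "/vaccines": ["/vaccines/safety", "/"],
    "/testing": [],
}
print(crawl("/vaccines", site, max_hops=0))  # only the starting page
print(crawl("/vaccines", site, max_hops=1))  # the starting page plus pages it links to
```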

9. Make sure the new collection or project you just created is selected and click “Next.”

10. This takes you to the search skill configuration page. You should see a checkbox and the word “Applied” in the “FAQ extraction” section because you checked the “Apply FAQ extraction” box in Step 8. You can adjust other settings on this page, but you probably do not need to. Hit the “Create” button.

11. Finally, click on the “Preview” link to try out your assistant right away. By default it will start by showing a few sample answers and then you can type a query and it will show other answers. If you want a different starting behavior, you can configure that in a dialog skill (see next step).

A screenshot of the assistant preview. On the left, there is a preview of a chatbot with a dialogue about COVID-19 vaccines. On the right, there is an option to copy a link and an option for integrating the web chat.
Preview in Step 11

12. (Optional) Close the preview and click on “Add dialog skill.” See the dialog skill section of the Watson Assistant tutorial for an introduction to dialog skills. They can complement extracted FAQs nicely by allowing you to train more complex conversations and expand the range of questions your assistant can answer.

When does it work well?

We have tested FAQ extraction on thousands of web pages. We see it working well in a variety of languages including English, Spanish, French, German, Dutch, Finnish, Norwegian, Polish, Romanian, Russian, and Hebrew. It is generally around 85% accurate on web pages in those languages, and we would expect similar accuracy on similar languages. However, accuracy varies significantly from page to page. Some FAQ pages are split perfectly into questions and answers. In a small percentage of the FAQ pages, the FAQ extractor fails to identify any question-answer pairs.

It is fairly common for the FAQ extractor to identify most of the questions on a page and get most of the text for most of the answers. If you really need all the text in your results to be 100% perfect, then the FAQ extractor might not be the right tool for you. An alternative is to manually copy and paste question-answer pairs into JSON files. You can then upload them to Watson Discovery as JSON. This can get you perfect text if done well, but it is a lot more work. FAQ extraction is a better solution for users who are willing to accept results that are often quite good but less than perfect. This can be valuable in cases where you don’t really know exactly which content is going to be critical to your application. You can build a working system very quickly with FAQ extraction. Then you can deploy it right away and observe what real users do with your system (e.g., using the Analytics feature of Watson Assistant). You can then invest time in getting flawless content for the information needs that many users really do have.
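
If you go the manual JSON route, a small script can take some of the tedium out of it. The following Python sketch writes one JSON file per question-answer pair, ready to upload. The field names `question` and `answer`, the file naming scheme, and the sample pairs are assumptions chosen for illustration. Watson Discovery does not require these particular names.

```python
import json
import tempfile
from pathlib import Path


def write_pairs_as_json(pairs, out_dir):
    """Write one JSON file per question-answer pair and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (question, answer) in enumerate(pairs, start=1):
        path = out_dir / f"faq_{index:03d}.json"
        # "question" and "answer" are hypothetical field names.
        document = {"question": question, "answer": answer}
        path.write_text(
            json.dumps(document, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        paths.append(path)
    return paths


# Sample pairs that you would copy by hand from your own FAQ page.
pairs = [
    ("How do I reset my password?", "Use the reset link on the sign-in page."),
    ("Can I change my email address?", "Yes, under Account settings."),
]
written = write_pairs_as_json(pairs, tempfile.mkdtemp())
print([p.name for p in written])
```

Keeping one pair per file mirrors what FAQ extraction produces: each document holds a single self-contained question and answer.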

Once you are ready to put more time and effort into fixing any issues with the content from the FAQ extractor, there are a couple of approaches to doing so:

  1. You can train an intent in the Watson Assistant dialog skill editor and provide a response for that intent in the dialog. This requires that you provide at least five sample queries for the intent, and you need to manually enter the response that you give (and update it when it changes). This is a lot more effort than FAQ extraction, but it is usually worth the effort for the most important queries (i.e., ones that are asked very often and/or have a big impact on your business when they are answered correctly). In addition to allowing you to get the content exactly the way you want it, the fact that you are training an intent will typically make Watson Assistant very precise at answering variations of this question.
  2. You can use the Watson Discovery update document API (in v1 and v2) to update the text of a particular question/answer pair. If you want to do this at scale, you should probably build a script or user interface for making these updates. If you just want to edit a few documents you can do that by directly calling the API using a tool such as cURL or Postman. If you do this, you should probably configure your collection to stop automatically recrawling. Otherwise, your edits will get overridden the next time the source page gets recrawled.
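
As a rough sketch of the second approach, the Python below prepares (but does not send) an update request for one question-answer pair in Watson Discovery v2. Treat every specific here as an assumption to check against the API reference: the path shape, the `version` date, the `question` and `answer` field names, and the example IDs are illustrative, and `build_update_request` is a hypothetical helper, not part of any IBM SDK.

```python
import json
from urllib.parse import quote, urlencode


def build_update_request(service_url, project_id, collection_id, document_id,
                         question, answer, version="2020-08-30"):
    """Return (url, body) for an update-document call; nothing is sent."""
    # Assumed v2 path shape; confirm it in the Watson Discovery API reference.
    path = "/v2/projects/{}/collections/{}/documents/{}".format(
        quote(project_id, safe=""),
        quote(collection_id, safe=""),
        quote(document_id, safe=""),
    )
    url = service_url.rstrip("/") + path + "?" + urlencode({"version": version})
    # Hypothetical field names for the corrected pair.
    body = json.dumps(
        {"question": question, "answer": answer}, ensure_ascii=False
    ).encode("utf-8")
    return url, body


url, body = build_update_request(
    "https://example.test/instances/abc/",  # placeholder service URL
    "p1", "c1", "d1",                       # placeholder IDs
    "Is the vaccine free?",
    "Yes, it is provided at no cost.",
)
print(url)
print(body.decode("utf-8"))
```

You could send the resulting body as the uploaded file with cURL, Postman, or an HTTP client of your choice, using the credentials shown on your Watson Discovery instance page.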

When does it not work well?

In our experience, the FAQ extraction beta test does not work well on Chinese, Japanese, or Korean, mainly because it is more difficult to detect questions in those languages. We are working on improvements for those languages and we expect them to be handled well before this feature finishes beta testing. You can try it now in these languages and see if it works better for you.

Many FAQ pages require that you click on a question to see an answer. In our experience, FAQ extraction works well on many of these pages. Questions and answers often appear as sequential text inside HTML structures. For example, IBM’s orders and delivery FAQ page works this way. The FAQ extractor is able to pull questions and answers from it. However, the FAQ extractor generally cannot handle pages that only link questions to answers via JavaScript code. It also often can’t handle FAQ sites that open separate pages in new tabs or windows to provide answers.
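
The difference between the two kinds of click-to-expand pages is easier to see in markup. The Python sketch below reads question-answer pairs from a page where each answer sits right after its question in the HTML. It assumes the page uses `<details>` and `<summary>` elements, which is just one common pattern and not necessarily what any particular IBM page uses. The class name `DetailsFaqParser` and the sample page are made up for this example, and this is not how the FAQ extractor itself is implemented.

```python
from html.parser import HTMLParser


class DetailsFaqParser(HTMLParser):
    """Collect (question, answer) pairs from <details><summary>Q</summary>A</details>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._in_details = False
        self._in_summary = False
        self._question = []
        self._answer = []

    def handle_starttag(self, tag, attrs):
        if tag == "details":
            self._in_details = True
            self._question, self._answer = [], []
        elif tag == "summary" and self._in_details:
            self._in_summary = True

    def handle_endtag(self, tag):
        if tag == "summary":
            self._in_summary = False
        elif tag == "details" and self._in_details:
            # Join the collected text and normalize whitespace.
            question = " ".join("".join(self._question).split())
            answer = " ".join("".join(self._answer).split())
            if question and answer:
                self.pairs.append((question, answer))
            self._in_details = False

    def handle_data(self, data):
        if self._in_summary:
            self._question.append(data)
        elif self._in_details:
            self._answer.append(data)


page = """
<details><summary>Can I change my order?</summary>
  <p>Yes, until it ships.</p></details>
<details><summary>Where is my invoice?</summary>
  <p>It is emailed <b>after</b> delivery.</p></details>
<a href="#" onclick="showAnswer(3)">How do I cancel?</a>
"""
parser = DetailsFaqParser()
parser.feed(page)
for question, answer in parser.pairs:
    print(question, "->", answer)
```

The third question in the sample page has no answer text anywhere in the markup. Its answer would only appear after the hypothetical `showAnswer` script runs, which is the kind of page that cannot be handled by reading the HTML alone.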

The FAQ extractor was originally developed for HTML documents (web pages) and works best on documents of this sort. The precision-oriented and recall-oriented phases of FAQ extraction need HTML tags to encode the structure. With that said, we’ve also seen cases where we get good results from FAQs in plain text and Microsoft Word (using the logic of the initial rule-based phase). We have not seen any cases where it works well with PDF documents.
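
To illustrate why structure matters to the precision-oriented phase, here is a simplified Python sketch of a structural filter. It assumes each candidate question comes with a string describing the HTML tags around it. That signature format and the function `filter_by_structure` are hypothetical, and the real feature is likely more nuanced than a simple majority vote.

```python
from collections import Counter


def filter_by_structure(candidates):
    """Keep the questions whose structural signature matches the most common one."""
    if not candidates:
        return []
    counts = Counter(signature for _, signature in candidates)
    dominant, _ = counts.most_common(1)[0]
    return [question for question, signature in candidates if signature == dominant]


# Each candidate is (question text, hypothetical tag signature).
candidates = [
    ("How do I sign up?", "h3/i"),
    ("What does it cost?", "h3/i"),
    ("Sounds simple, right?", "p"),  # rhetorical question inside an answer paragraph
    ("Can I cancel?", "h3/i"),
]
print(filter_by_structure(candidates))
```

With plain text there is no tag signature to compare, which is consistent with plain text and Microsoft Word inputs relying on the rule-based phase alone.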

What versions of IBM Watson have FAQ extraction?

Watsons Bay, Australia used here to symbolize IBM Watson, which is admittedly quite a stretch, but it is a very nice picture.
Image from Mitchell Wallace on Unsplash

FAQ extraction is available as a beta testing feature with the following restrictions on availability:

  • In IBM Watson Discovery v1 on the public cloud, you can only turn on FAQ extraction when creating a search skill in IBM Watson Assistant.
  • In IBM Watson Discovery v2 on the public cloud, you can turn on FAQ extraction whenever you create or update a collection, either while creating a search skill in IBM Watson Assistant or while using the Watson Discovery tooling on its own.
  • In IBM Watson Discovery for Cloud Pak for Data, FAQ extraction is not available during the beta testing period. If the capability passes beta testing, then we expect to deliver this capability on IBM Watson Discovery for Cloud Pak for Data soon after.

Conclusion

If you have an FAQ page with questions and answers, you can automatically make a virtual assistant that knows the answers to those questions. Create an assistant using IBM Watson Assistant and then add a search skill to connect to IBM Watson Discovery. The FAQ extraction feature of IBM Watson Discovery can then ingest your FAQ page and turn it into a set of questions and answers. This can get you a working assistant right away. You can deploy it immediately and start seeing what your users ask. You can then use other features of IBM Watson Assistant to further improve the assistant as you learn more about what your users are asking for.

Acknowledgement

Special thanks to Dakshi Agrawal for helping to guide the creation of this blog post and to Jana Thompson for editorial guidance and content suggestions.

J William Murdock
IBM watsonx Assistant

I am a computer scientist in the Watson Research Center at IBM. I work on IBM Watson cloud computing AI services. http://bill.murdocks.org/