Create a Natural Language Question Answering system with IBM Watson

Recently I started a new project to build QA system for Mindvalley. After a short discussion with my colleagues we decided to use the 2011 jeopardy winner (Watson). As you may know Watson is a question answering computer system capable of answering questions posed in natural language, it developed in IBM’s DeepQA project by a research team led by principal investigator David Ferrucci.

IBM also released it’s cloud platform IBM Bluemix in 2014. It supports several programming languages and services as well as integrated DevOps to build, run, deploy and manage applications on the cloud. And of-course Watson API’s are accessible in Bluemix platform.

Before you begin

Before you begin, you need the following pieces

Overview of Question Answer System

The final goal is simple; design and develop an application be able to:

  • Analyze unstructured data
  • Understands complex questions
  • Presents answers and solution

But unfortunately, it is not as easy as advertised by IBM. In past IBM offered IBM Watson Question and Answer (QA) but in summer of 2015 this service replaced by four new services to support customize and embed Q&A applications.
 The four replacement services recommended for different types of question and answer capabilities are:

Natural Language Classifier; allows you to Interpret and classify natural language with confidence.

Dialog; allows you to script a conversation and help walk a user through a process

Retrieve and Rank; allows information retrieval with a machine learning model

Document Conversion; takes documents and ‘chunks them up” into smaller answer units to return as passages

based on your input data and requirements, you can choose one or combinations of these services to build a comprehensive and robust system.

My Approach

I decide to use combination of Document Conversion and Retrieve and Rank Services. Document conversion service used to transform documents into Answer Units, and use those as input to train the Retrieve and Rank Service.

Bluemix platform preparation

Before we start you need to prepare and setup your server and services. I decided to deploy my code in Bluemix as well, however you can just create your API in bluemix and deploy your code wherever you like (AWS :))

Login into you Bluemix account, in dashboard click on Service and APIs you need to create two services Document Conversion and Retrieve and Rank and copy and paste the credentials. (it is pretty easy I am not going to explain it)

Document Preparation

The first and most important step is document preparation. The more disciplined you are in your handling of data, the more consistent and better results you are like likely to achieve. Watson Document Conversion service helps you in this process so you don’t need to do it manually but we need to prepare our document for DC service. You may ask how you can prepare a document? the answer is by providing related question as header of each paragraph. DC service is sensitive to headers, it transform document to json format which contains list of answer_units dictionary. look at following example;

  • Document Sample
What is Watson?    Watson is an artificially intelligent computer system capable of answering questions    posed in natural language, developed in IBM's DeepQA project by a research team led    by principal investigator David Ferrucci. Watson was named after IBM's first CEO and    industrialist Thomas J. Watson. The computer system was specifically developed to    answer questions on the quiz show Jeopardy! In 2011, Watson competed on Jeopardy!    against former winners Brad Rutter and Ken Jennings. Watson received the first place    prize of $1 million.      How did it Work?    Watson had access to 200 million pages of structured and unstructured content    consuming four terabytes of disk storage including the full text of Wikipedia, but was not    connected to the Internet during the game. For each clue, Watson's three most probable    responses were displayed on the television screen. Watson consistently outperformed    its human opponents on the game's signaling device, but had trouble responding to a    few categories, notably those having short clues containing only a few words.      And then, what happened?    In February 2013, IBM announced that Watson software system's first commercial    application would be for utilization management decisions in lung cancer treatment at    Memorial Sloan–Kettering Cancer Center in conjunction with health insurance company    WellPoint. IBM Watson's former business chief Manoj Saxena says that 90% of nurses    in the field who use Watson now follow its guidance.
  • Answer Units

Now we can use list of answer units for training Retrieve and Rank system.

Setup Retrieve and Rank System

This service helps users find the most relevant information for their query by using a combination of search and machine learning algorithms to detect “signals” in the data. Built on top of Apache Solr, developers load their data into the service, train a machine learning model based on known relevant results, then leverage this model to provide improved results to their end users based on their question or query.

Lets makes our hands dirty. if you have worked with Solr before, you may know we need our cluster and collection to index the data, so the let’s setup them.I used Node.js for this project before you start you need to install Node client library to use the Watson Developer Cloud services, a collection of REST APIs and SDKs that use cognitive computing to solve complex problems. npm install watson-developer-cloud --save

the next and last pre step is create conf file;

  1. Create Cluster
     To use the Retrieve and Rank service, you must create a Solr cluster. A Solr cluster manages your search collections, which you will create later.

2. Upload Solr Configuration
 The response includes cluster_id and the cluster availability status. The cluster must be ready before you can use it. it takes about few minutes to become ready but you can check the status through API.

After cluster become ready and before you can create collections you need to configure your cluster. The Solr configuration identifies how to index the documents so that you can search the important fields. you can download the sample solr config from here I may write separate post about solar config latter. you can upload one or many configuration to your retrieve and rank service.

3. Create Collection
 A Solr collection is a logical index of the data in your documents. A collection is a way to keep data separate in the cloud. you need to assign one of the uploaded configuration to your collection.

4. Index (Add) Documents
 Now everything is ready to index (add) the question and answers generated by Document Conversion service into the collection.

5. Create Ground Truth Ground truth is the collection of questions that are matched to answers. For the Retrieve and Rank service, you label the answers with their relevance to the question. The relevance label helps the ranker determine which features are the most useful. The questions, answers, and relevance labels are combined to create training data. we need to upload the training data to create and train a ranker.

The quality of training set is very important. Your training data need to have following quality criteria to be effective in improving the relevance of answers that are returned from the Retrieve and Rank service:

  • The file must contain at least 49 unique questions.
  • The number of records must be at least 50 times the number of fields that are identified in your Solr configuration. For example, if your collection defines five fields, you must have at least 250 records in your training data.
  • At least two different relevance labels must exist in the data and those labels must be well represented. A label is well represented if it occurs at least once for every 100 unique questions.

The relevance file that we need to use in next step it suppose to be in the following CSV format:

  • The answer ID is the unique key of a document which is indexed into your collection.
  • The relevance label is a non-negative integer (between zero and some upper limit). Higher numbers indicate higher relevance.
     For example:
    “what similarity laws must be obeyed.”,“184”,“3”,“29”,“3”,“31”,“3” “what are the structural and aeroelastic problems.”,“12”,“4”,“15”,“3” “material properties of photoelastic materials.”,“463”,“4”,“462”,“2”,“497”,“0”

6 Build Training Data & Ranker
 Good news is you are almost down :). IBM Watson team provide the python scripts which create a training set based on your ground truth csv. The train.py Python script file takes the following format as input:

train.py -u {username:password} -i {relevance_file} -c {cluster_id} -x {collection_name}\ [-r {solr_rows_per_query}] [-n {ranker_name}]

The script creates a file in your working directory called trainingdata.txt. That file is used to create the ranker. When the script is finished, the script window displays a new ranker ID and its status. We’ll need it when we rerank results at runtime.

7 Time for ask question

Let’s test the system. you can search your collection with or without ranker. searching without ranker is useful to create a ground truth.

  • search without ranker (search solar standard)
  • Search With Ranker

8 Help me to learn more
 you can collect the user satisfaction level, regarding to the system answers in order to train your system and improve the results. you can update your ground truth based on user input and repeat step 6.