Developing with IBM Watson Retrieve and Rank: Part 2 Training and Evaluation

IBM Watson Retrieve and Rank

In Part 1 we stood up a Retrieve and Rank cluster on Bluemix and configured Solr. In Part 2 we will build our ground truth, then train and evaluate our ranker.

Training the Ranker

If machine learning can be defined as a set of techniques to make predictions from data where the performance of those predictions improves with experience, then ground truth is the experience. In the case of Retrieve and Rank, ground truth comprises examples of input questions and output documents with associated relevance labels. A perfect answer should get a high relevance score and a bad answer, a low score. Ground truth is then segmented into train sets and test sets that allow us to teach and evaluate our model.

I am going to pull questions from the documents themselves. This is generally not a good strategy for real-world applications since those questions are probably not representative of real user questions. In practice, this can result in overfitting the model to the training data such that it doesn’t generalize for use in the wild, but it’ll work for the purposes of this blog.

Many of the topic headings in my documents are actually questions e.g. “How is Proctitis diagnosed?” Running groundtruth.js against solrdocs.json (the output of doc_conversion.js) will extract the topic and id fields which will form the basis of our ground truth.

node groundtruth.js -i solrdocs.json

Once we have our topic/id mappings extracted, we simply remove any topics that don’t represent a question (i.e. don’t start with How, What, Can) and add a relevance label. Relevance is normally represented on some scale, e.g. 1–5 where 5 is a perfect answer and 1 is poor. All documents with no label are given 0 by default and they are discarded. In our case we have a simple ground truth with only a single correct answer for each question, so we will just label each correct query/document mapping with a 1. Our final relevance CSV should look like the following, with about 200 query/document pairs.

How is proctitis diagnosed?,a605c109-07c5-4670-9b21-3b52fe01a53f,1
What is an inguinal hernia?,da55e178-bfec-4ad6-90f6-2476d1962fd1,1
What is lactose intolerance?,99f04d62-79bd-4775-89b6-a08e6f36385a,1
How are gallstones treated?,215f0682-7092-424c-ac85-572992b50b6e,1
What is diverticular disease?,73f591d0-d43f-4fd1-85d9-581bb811b206,1
How common is cyclic vomiting syndrome?,95bcf703-f186-4342-a94b-6fafbe8ab28b,1
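The filtering and labeling step above can be sketched in a few lines of Python. This is a minimal sketch, assuming the extraction step produces (topic, id) pairs; the question-word list is the illustrative one from above and would grow for a real application.

```python
import csv

# Illustrative question words; a real ground truth effort would curate this list.
QUESTION_WORDS = ("How", "What", "Can")

def build_ground_truth(topic_id_pairs):
    """Keep only topics that read as questions; label each pair with relevance 1."""
    return [
        (topic, doc_id, 1)
        for topic, doc_id in topic_id_pairs
        if topic.startswith(QUESTION_WORDS)
    ]

def write_relevance_csv(rows, path):
    """Write query,document_id,relevance rows in the format the ranker expects."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Topics that don’t start with a question word (section headings like “Proctitis”) simply fall out of the result, which matches the manual filtering described above.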

As you build ground truth for your real applications, make sure to follow the best practices in the R&R documentation.

Fig. 3 Training Data Best Practices

Next we split our ground truth into train and test sets. In this case I chose a random 70/30 split and those are represented by gt_train.csv and gt_test.csv. The split can impact how our model performs and when we experiment later we’ll want to make sure we haven’t overfit to our training data.
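The random 70/30 split can be sketched as follows. The fixed seed is just an assumption to make the split reproducible across experiments; any shuffle-then-cut approach works.

```python
import random

def split_ground_truth(rows, train_fraction=0.7, seed=42):
    """Randomly split labeled query/document rows into train and test sets."""
    shuffled = rows[:]
    # Fixed seed so the same split (gt_train / gt_test) can be regenerated later.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Keeping the two sets disjoint matters: any overlap would leak training examples into the evaluation and inflate the test numbers.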

To train our ranker, we run the training script against our relevance file. This script will query Solr and capture a set of feature scores for each document returned. It will do this for all queries in our relevance file. The features represent semantic overlap between a query and document. In Part 3 of this post, we’ll look at adding custom features. The output of the script is pushed to a file called trainingdata.txt and sent to train the ranker.

python ./ -u username:password -i gt_train.csv -c sc3689b816_2b07_4548_96a9_a9e52a063bf1 -x niddk_collection -n "niddk_ranker"

Capture the Ranker_ID <42B250x11-rank-2255>, then check the status of the ranker to confirm training is complete.

curl -u "username":"password" ""
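If you are polling the status endpoint from a script rather than curl, a small helper can interpret the response body. This is a sketch under the assumption that the ranker status JSON carries a top-level "status" field that moves from "Training" to "Available" (or "Failed"), with a "status_description" explaining failures; check the Retrieve and Rank API reference for the exact field names.

```python
def is_training_complete(status_response):
    """Interpret the JSON body returned by the ranker status endpoint.

    Assumes a top-level "status" field with values like "Training",
    "Available", and "Failed" (field names per the R&R API docs).
    """
    status = status_response.get("status")
    if status == "Failed":
        # Surface the failure reason rather than silently polling forever.
        raise RuntimeError(status_response.get("status_description", "training failed"))
    return status == "Available"
```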

Evaluating Retrieve and Rank

Once we’ve trained our ranker we need to test its performance compared to Solr. There are a number of metrics that can be used to evaluate ranking algorithms; in this case, we’ll look at relevance@n. Essentially we are measuring what percentage of queries return a relevant answer in the first response, the top 2 responses, the top 3, and so on. In general, the metrics used to evaluate performance should be aligned with the business success criteria of your application. For example, if I intend to show 5 answers to consumer health questions in my application interface, relevance@5 may be the most important metric. On the other hand, if I am ranking potential cancer treatments, I care a lot more about relevance@1.
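The relevance@n computation itself is simple, since our ground truth has exactly one relevant document per query. A minimal sketch (the dictionary shapes here are illustrative, not the output format of any particular tool):

```python
def relevance_at_n(results_by_query, ground_truth, n):
    """Fraction of queries whose relevant document appears in the top n results.

    results_by_query: {query: [doc_id, ...]} ranked lists from the system under test
    ground_truth:     {query: doc_id} the single relevant document per query
    """
    hits = sum(
        1 for query, relevant in ground_truth.items()
        if relevant in results_by_query.get(query, [])[:n]
    )
    return hits / len(ground_truth)
```

Running this for n = 1..5 against both the plain Solr results and the reranked results gives the two curves compared in Figure 4.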

I used a few tools to run these experiments which I will link to soon (once we’re able to clean them up and put them on cognitive catalyst). The results can be seen in Figure 4.

Fig 4. Relevance@n

We can see that the accuracy (relevance@1) improves from about 15% to 81%. Relevance@5 improves from about 51% to 84%. This looks like a huge jump on the surface, but part of it reflects how much progress we could make by simply optimizing our Solr configuration. For example, there are some noisy documents that appear to be scoring highly in Solr. The “On This Page” section of each source document contains a summary of everything in that document, so Solr out of the box returns these documents for almost every query even though they are not truly relevant. An obvious configuration improvement would be to remove “On This Page” during ingestion. Having said that, we have clearly demonstrated the ranker’s ability to learn what semantic signal is predictive of relevance and use that learning at runtime to rerank answers.

In our use case, the critical indicators in the query are the condition referenced and what we call the lexical answer type (LAT). In question answering, the LAT refers to the word in the question that indicates the type of answer that should be returned. In the question “What are the symptoms of Appendicitis?,” the LAT is “symptoms” because the answer should be a list of symptoms, and the condition is “Appendicitis.” Retrieve and Rank is not literally identifying a LAT yet (we’ll demonstrate how to build a LAT feature in Part 3), but the ranker is inferring the importance of “symptoms” (or “causes,” “treatment,” etc.) and the condition in predicting the relevance of an answer.
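As a toy illustration of the LAT idea (Part 3 builds the real feature), a naive extractor could just scan the question for a known answer-type keyword. The keyword list below is illustrative, not what a production feature would use:

```python
import re

# Illustrative answer types for this consumer-health domain.
ANSWER_TYPES = ("symptoms", "causes", "treatment", "diagnosis", "diet")

def naive_lat(question):
    """Return the first known answer-type word found in the question, else None."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in ANSWER_TYPES:
            return word
    return None
```

A real LAT feature would handle morphology (“diagnosed” vs. “diagnosis”) and synonyms, which is exactly the kind of signal the ranker is currently having to infer indirectly.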

In addition to looking at the performance of the test set, we want to compare the performance of our test and train sets. This will give us an indication of whether our model is overfit to our training data. In general, if the train set performance is significantly better than the test set, we are likely overfit and our model will not generalize. In Figure 5 we can see that our train set performs slightly better than our test set. Interpreting these numbers is often as much an art as a science. We may have an issue with overfitting here, but it’s hard to know without further testing. Cross-validation is a technique that repeats the experiment across multiple train/test splits so that we can be more confident in our results. I’ll cover cross-validation in more detail in a future post.

Fig 5. Train vs Test Set relevance@n


Building an effective ground truth is arguably the most important task in building a Watson application. Ground truth is the experience from which the machine learning algorithm learns and against which it is evaluated. It should be representative of the types of inputs and outputs the algorithm will see in the wild. Ground truth is divided into train and test sets, and it’s important that these sets do not overlap. We should choose evaluation metrics that align with our business objectives. Finally, we should expect to iterate on these activities for the life of our application. If I were releasing this NIDDK app in reality, I would want to capture user feedback (explicitly through strategies like thumbs up/thumbs down, or implicitly through clickstream and other interaction analytics). We can use this feedback to continually refine our ground truth and optimize our model.

In Part 3, we’ll look at how we can identify and add additional features to further improve Retrieve and Rank performance.
