Photo from Ron Dauphin on Unsplash: A sailboat, used here to symbolize discovery

IBM Watson Discovery: Relevancy training for time-sensitive users

J William Murdock
IBM Data Science in Practice
26 min read · Oct 28, 2020

--

IBM Watson Discovery has extensive documentation (for version 1 and version 2) on how to do relevancy training. There are also useful documents that provide helpful tips and tricks such as How to get the most out of relevancy training and Improve your natural language query results from Watson Discovery. Anyone interested in this topic should read those. This document is intended to complement those other documents by providing more specific guidance for users who want to move very quickly and have a limited amount of time to spend on this issue. It addresses users under different levels of time pressure and provides tips on how to make the most of that time.

Background about relevancy training

Here are some key things to understand about relevancy training:

  1. Most important: Watson Discovery has excellent out-of-the-box information retrieval technology. It delivers almost the best effectiveness Watson Discovery can provide as soon as you put documents into it. The incremental benefit of relevancy training is rarely more than a few percent. If you only learn one thing from this document, it should be this. Many users think that they must spend a huge amount of time on relevancy training to get decent results. This is not true — the results you get right away are quite good for the task of finding relevant documents.
  2. On the other hand, if you expect that Watson Discovery will find the right document nearly 100% of the time, then you will probably be disappointed. If you have designed an application that requires nearly perfect search, it is tempting to hope that relevancy training will do that. For example, you may want to only show your users one single search result for each query. That might be ideal if Watson Discovery could always rank the best document first. Often it does not. Relevancy training will not fully solve this problem. In this situation, the time you would spend on relevancy training would be better spent rethinking your application. If your application provides a good experience for users with untrained Discovery, then you can expect it to be even better with relevancy training. But if your application provides an unacceptable experience with untrained Discovery, then relevancy training is unlikely to solve that problem.
  3. If you do have the time available, relevancy training can be a good investment in Watson Discovery even when the benefits are modest. For example, if you get thousands of queries per month, then improving by a few percent can mean dozens more users getting the information they need. For many applications, this benefit justifies the effort required in the long run.
  4. Watson Discovery relevancy training learns a general model for all queries. It focuses on how important it is to match keywords in different fields of each document and how important it is to match keywords that are close to each other in the queries to keywords that are close to each other in the documents. It does not learn anything specific about particular queries or particular documents. Some Watson Discovery users feel they need to make sure their relevancy training data provides coverage for all their topics. They then skew their training data to provide this coverage. This is generally counter-productive. It is generally better to train on a random sample of real user queries as described later in this document.
  5. In some cases, Watson Discovery users know that there are some specific queries that are particularly important. They often want to make sure that their customers get the right documents for those queries. Because relevancy training is only used to train a general model, it is not helpful for this purpose. Watson Discovery v2 has a curation API to handle these cases. If you use that API to curate a response to a query, that response will always appear before any non-curated response to that query. Curation only learns about a specific response to a specific query and never generalizes to other queries or other responses. Relevancy training only learns a general model and has little or no impact on the specific query used to do that training.
  6. The Watson Discovery relevancy training API lets you specify a relevancy score from 0 to 100 for each search result. It is possible to say that some search result is slightly relevant and some other search result is more relevant. The relevancy training tooling (see image below) does not let you set specific relevancy levels. It treats anything marked “Relevant” as 10 and anything marked “Not Relevant” as 0. Watson Discovery also treats anything that is not marked at all as having a relevancy score of 0. The tooling does not allow other scores because most of the value that relevancy training provides is from contrasting documents that are not relevant with ones that are relevant. There is much less benefit to contrasting slightly relevant documents with more relevant ones. Making fine-grained judgements about how relevant a document is generally takes a lot more time than deciding if it is relevant or not. For most users, that time is better spent labeling more search results for more queries instead. (For those who prefer the API over the tooling, a sketch of a training call with graded relevance scores follows the image below.)
Tooling for Watson Discovery v2 Relevancy Training
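For readers who prefer the API over the tooling, here is a minimal sketch of submitting a single training query with graded relevance scores. It assumes the ibm-watson Python SDK's DiscoveryV2 client and its create_training_query method; the credentials, project, collection, and document IDs are placeholders, and the exact parameter names should be checked against the SDK documentation for your release.

```python
# Minimal sketch (not a definitive implementation): submit one training query
# with graded relevance scores via the Watson Discovery v2 API.
# Assumes the ibm-watson Python SDK; credentials and IDs below are placeholders.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import DiscoveryV2
from ibm_watson.discovery_v2 import TrainingExample

authenticator = IAMAuthenticator("YOUR_API_KEY")                 # placeholder
discovery = DiscoveryV2(version="2020-08-30", authenticator=authenticator)
discovery.set_service_url("YOUR_SERVICE_URL")                    # placeholder

# Unlike the tooling (which records only 10 or 0), the API accepts any
# relevance value from 0 to 100 for each example.
examples = [
    TrainingExample(document_id="doc-overview",  collection_id="YOUR_COLLECTION_ID", relevance=100),
    TrainingExample(document_id="doc-mention",   collection_id="YOUR_COLLECTION_ID", relevance=10),
    TrainingExample(document_id="doc-unrelated", collection_id="YOUR_COLLECTION_ID", relevance=0),
]

result = discovery.create_training_query(
    project_id="YOUR_PROJECT_ID",
    natural_language_query="how do I reset my Examplomatic",
    examples=examples,
).get_result()
print(result)
```

As discussed above, most users get little extra benefit from the graded scores; the sketch shows them only to illustrate what the API allows beyond the tooling.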

The following sections provide guidance depending on how much time you have available.

User Profile 1: I want to spend one hour or less on relevancy training

Photo from Henry & Co on Unsplash: A speed bump, used here to symbolize a small amount of effort

If you have an hour or less to spend on relevancy training, it is probably better not to do so. As noted in the “most important” point above, Watson Discovery has excellent out-of-the-box information retrieval technology. Discovery is highly accurate as soon as you put your documents into it. Your end-users will get good results nearly as often as they would if you invested a lot of time in relevancy training. For many customers, that is a better value than spending a lot of time to get a modest improvement.

Note that it is not impossible to do relevancy training in under an hour. Watson Discovery’s interactive tooling makes it fairly quick and easy to do relevancy training. You enter queries, see search results, and mark them as relevant or not relevant. If you work quickly, you can often do more than one query per minute. Watson Discovery is sometimes able to train a relevancy model with around 50 queries (it may take more, especially when there are no relevant documents for some of the queries). Training with only 50 queries or so might make the system better. However, with that little data, the probability of getting a substantial improvement from training is low.

If you have an hour or less and you want to make your Watson Discovery instance the most effective it can be, you should probably focus on content. Are there other documents available that have information that your users might want? If so, consider adding them to Watson Discovery. Are there documents that are in Watson Discovery that are irrelevant or outdated? If so, consider removing or replacing them with better documents. This is much more likely to be time well spent.

Once you have the right content, another quick activity is curation. If you know that some queries are likely to be very frequent or important, then curation can get the best possible results for those queries. For example, say your company sells a product called “Examplomatic”, and you have one document with a general overview of the product and hundreds of other documents that mention the product. Some users may issue a query that is just the product name, “Examplomatic.” You want those users to get the general overview document as the top search result. You can curate that overview document as the best response for that query via the Discovery v2 curation API. Relevancy training is not useful for this task because it only learns a general model for how to weigh term matches. There is no way for relevancy training to learn that one document is more ideal for this query than others that also mention the product name.

User Profile 2: I want to spend a few hours or days on relevancy training

Photo from Jakob Owens on Unsplash: A gentle hill, used here to symbolize a modest undertaking

If you have already done the best you can on getting the right content into the system and you have more time, then relevancy training may be a good fit. Generally, the best way to do this is to start with real queries from real users of your system to use as training data. Since you don’t have any real queries before you deploy the system to users, generally the best practice is:

  1. Deploy your application for real users without doing relevancy training.
  2. Log all the queries using a secure data storage capability such as IBM Cloud Databases or a reputable open source database.
  3. Once you have gotten at least a few hundred queries from real users, grab a random sample of them (a sketch of this logging-and-sampling step appears after this list). Get at least one hundred and preferably several hundred or more.
  4. Run these queries one at a time in the Discovery relevancy training tool. Mark which documents are relevant and which are not relevant.
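As a concrete illustration of steps 2 and 3, here is a minimal sketch that logs each incoming query and later draws a random sample for labeling. SQLite and the table name are assumptions for illustration only; any secure database, such as IBM Cloud Databases, works the same way.

```python
# Minimal sketch: log user queries and later draw a random sample for labeling.
# SQLite and the table/column names are assumptions for illustration only;
# in production you would use a secure managed database instead.
import random
import sqlite3

conn = sqlite3.connect("query_log.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS query_log ("
    "  query_text TEXT NOT NULL,"
    "  issued_at  TEXT DEFAULT CURRENT_TIMESTAMP)"
)

def log_query(query_text: str) -> None:
    """Step 2: record every query your application sends to Discovery."""
    conn.execute("INSERT INTO query_log (query_text) VALUES (?)", (query_text,))
    conn.commit()

def sample_queries(sample_size: int = 200) -> list[str]:
    """Step 3: draw a random sample of logged queries to label."""
    all_queries = [row[0] for row in conn.execute("SELECT query_text FROM query_log")]
    return random.sample(all_queries, min(sample_size, len(all_queries)))

# Example usage:
# log_query("how do I reset my Examplomatic")
# for q in sample_queries(200):
#     print(q)   # paste each into the relevancy training tool and label results
```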

(Side note: Once you have real queries from real users, you can also observe which queries are particularly frequent. Then you can use the curation API to curate ideal results for those queries, as recommended in the last paragraph of the previous section.)

If your search system is replacing an existing system, you can pull queries from the search logs of that system. This will let you start training before you deploy your Watson Discovery application. However, the queries that you get from those old logs may not be representative of what users of your system will do. For example, if your application looks different, is accessed in a different way, or has a different set of users, you may get different queries. If your old system had a lot of long, complicated queries, then Watson Discovery may learn to get much better at those queries at the expense of getting a little worse at short, simple queries. For the old query mix, that is a net improvement: long, complicated queries are frequent there, so a big improvement for those queries helps more than a small degradation for short queries hurts. However, a different design or context for your Watson Discovery application may cause users of that application to use short queries much more often. For that use case, the overall effect of this model can be worse than the default untrained behavior of Watson Discovery.

It is also possible to make up your own sample queries and train Watson Discovery with that made-up data. However, it is even less likely that this data will be truly representative of what real users will do. In the previous paragraph we talked about data having unrepresentative query lengths. Even if you exactly match the query lengths that you will get from real users, there are countless other traits that can have subtle effects. For example, queries can differ in:

  • Are there many documents in the corpus that are relevant to the query or just a few?
  • Are there complete phrases, clauses, or sentences in the query?
  • How many different parts of speech do they include (nouns, adjectives, verbs, etc.)?
  • How often are they asking about the main topic of the documents that the users want?

Relevancy training is about balancing different alternative options. That balance almost always accepts worse results for some questions in exchange for better results for a larger number of questions. There is no one model that is better than the default untrained behavior of Watson Discovery for all kinds of questions. Remember the “most important” point in the background section: the default untrained behavior of Watson Discovery is really quite good. It is easy to generate a training set where each query is quite realistic but the frequencies of specific kinds of queries are not consistent with what real users will ask. That would be a non-representative sample of data, and it is not a good idea to train with such a sample.

(Boring side note for skeptics: You may wonder why grammatical structure would matter. Is Watson Discovery using information about grammatical structure in the relevancy model? It is not. However, it is using information about how close the terms that match are to each other in the queries and the documents. Different kinds of grammatical constructs differ in how close they tend to be in relevant text. For example, in many languages adjectives are often very close to their nouns. Verbs are often much farther. Thus the importance of matching terms close together can depend on this structure. If you don’t know how frequent the uses of adjectives, nouns, and verbs will be then you are not ready to invent your own training queries. You should wait and see what queries your users issue. Do not try to make up queries and hope that the frequencies are close enough between your made-up training data and real user queries.)

Another way to get sample queries is to find sample users and get them to make up sample questions. This is also generally a bad idea. You may be able to find sample users who are quite similar to real users. If you ask them to make up queries, they will usually give you queries that are each quite plausible. However, the frequency of different kinds of queries is not the same as what you get when users actually try to use your system. I do not recommend this approach. Instead, make the system available for real usage and store all the queries that get asked (as described at the start of this section).

At the level of investment of User Profile 2, it is probably not a good idea to try to directly measure the benefits of relevancy training. This may seem kind of crazy at first. Why would anyone spend hours or days training Watson Discovery and not confirm that it actually got better? There are three key drivers of this recommendation:

  1. With a few hours or even a few days of work, it is generally not feasible to test accuracy on enough data. With that much effort, you cannot reliably measure whether relevancy training is having a positive impact. Many people try a few queries to see if they get better or worse after training. However, when training does work it generally makes the system worse on some queries, better on some more queries, and about the same on even more queries. So if you test on a few examples, the results are essentially random luck. The few you tried could be ones where the system is better or worse or about the same. For more details on how much testing you need to do to find out if relevancy training helped, see the “second boring side note for skeptics” below.
  2. IBM does test Watson Discovery on a wide variety of data sets. We see that relevancy training often improves accuracy significantly and never harms accuracy significantly. We only test on data sets where the training data is representative of the data we use to test. Typically we select train and test data randomly from a common pool. As noted above, training on non-representative data may actually make accuracy worse. If you have representative training data, relevancy training probably helps and almost certainly does not hurt. Any testing you can do in a few hours will not change this conclusion.
  3. You can observe the consequences of using the system by recording user behavior. I discuss this idea later in this article (in the “Measuring impact via click-through rate” section).

If you want to spend a few hours or days on relevancy training, you should plan to trust IBM that relevancy training is effective when it is trained on a representative sample of data, and/or measure its impact on the behavior of your users. Do not plan to directly measure the accuracy, because that sort of verification is too time consuming. If you are not comfortable with this approach, consider moving to another user profile. You can move to “User Profile 1” and skip relevancy training because Watson Discovery is already good without it. You can move to “User Profile 3” and make a much larger investment in relevancy training that includes a comprehensive assessment. In User Profile 2, measuring accuracy is not feasible.

(Second boring side note for skeptics: You may be wondering why it is not feasible to test whether relevancy training is having a positive impact. There is inherent uncertainty in a probability estimate based on testing. If you test an event with N samples, there is a roughly 95% chance that the probability is within one over the square root of N of the observed frequency. This rule applies for any binary random event with a probability that is static and not extremely high or low. For example, say I have a coin that I suspect is slightly unbalanced. I flip it 100 times, and it comes up heads 44 out of 100 times. I can conclude that there is a 95% chance that the actual probability that this coin comes up heads is 44/100 (44%) plus or minus 1/sqrt(100) (10%). So I would conclude that the actual probability that the coin comes up heads whenever it is flipped is probably somewhere between 34% and 54%. If my goal was to figure out whether the coin is weighted to heads or tails, then the experiment was mostly a failure. The only thing I really learned was that 100 flips is not enough to find out. I really could have guessed that without wasting time flipping the coin. Math tells me that flipping a coin 100 times can only give me a +/-10% estimate of the probability. That is not precise enough to assess a coin that is slightly weighted to one side or the other. The same math applies to Watson Discovery. Suppose I take 100 queries and check to see how often Watson Discovery is getting a relevant document in the top 5 search results. I will get an estimate of the probability of getting a relevant document in the top 5 search results that is within plus or minus 10%. Remember that relevancy training tends to help by several percentage points. It is not useful to get an estimate within 10% of how accurate the system is before relevancy training and then another equally imprecise estimate of how accurate the system is after relevancy training. This experiment is typically a pointless waste of time. It will generally not provide useful information about whether relevancy training was effective because the benefits are likely to be much smaller than the margin of error for the experiment. If instead of 100 queries you try this with 1,000 queries then the measurements are likely to be within plus or minus 1/sqrt(1,000), which is around 3%. That is precise enough that the effect of relevancy training might be evident. However, with that much uncertainty there is a pretty good chance that it will not be. If you try this with 10,000 queries instead, now the measurements are likely to be within plus or minus 1/sqrt(10,000). That is around 1%. So with 10,000 queries it is likely that any improvement in accuracy from relevancy training will be apparent. However, checking 10,000 queries to see if you are getting good results from them is an enormous amount of work and is not feasible for User Profile 2.)
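For readers who want to reuse the arithmetic in this side note, here is a tiny sketch of the 1/sqrt(N) rule of thumb; the helper function name is mine, and the numbers are the ones from the coin example.

```python
# Minimal sketch of the 1/sqrt(N) rule of thumb discussed above:
# with N test samples, the observed frequency is within about 1/sqrt(N)
# of the true probability roughly 95% of the time.
from math import sqrt

def confidence_interval(successes: int, n: int) -> tuple[float, float]:
    """Approximate 95% interval around an observed frequency."""
    observed = successes / n
    margin = 1 / sqrt(n)
    return observed - margin, observed + margin

# Coin example from the text: 44 heads in 100 flips -> roughly 34% to 54%.
print(confidence_interval(44, 100))   # (0.34, 0.54)

# Planning: how many test queries does a given margin require?
for n in (100, 1_000, 10_000):
    print(n, "queries -> margin of about", round(100 / sqrt(n), 1), "percentage points")
```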

(Third boring side note for skeptics, even deeper into the weeds: There are statistical significance tests that are more powerful than estimating a 95% confidence interval as 1/sqrt(N). You can see a very detailed overview of how to apply such tests to information retrieval in A Comparison of Statistical Significance Tests for Information Retrieval Evaluation by Mark D. Smucker, James Allan, and Ben Carterette. In practice the general rule-of-thumb of 1/sqrt(N) as the confidence interval is pretty good for planning purposes. If you are planning to test some change to your system with N test queries and you expect the impact of your change to be around 1/sqrt(N) or less then this experiment is not useful. In this case, you should expect that you do not have enough test data to measure the effect of this change. Thus you should either plan to get much more test data or you should not bother testing and use the configuration that should be better in principle. There is never enough test data to accurately measure every configuration change. In practice, some decisions need to be made on the basis of doing what seems right.)

User Profile 3: I want to spend a few weeks or months on relevancy training

Photo from Rohit Tandon on Unsplash: A mountain, used here to symbolize an enormous undertaking

First, seriously reconsider this plan. Do you really want to spend that much time on this? As noted in the background section of this article, the benefits of relevancy training are generally fairly small. They are often still worth investing in. However, the more time you spend adding training data, the smaller the benefit of each additional hour becomes. For example, a system that gets a relevant result in the top 5 search results 43% of the time with no training and 46% of the time with 200 training queries might plausibly get up to 47% with 2,000 training queries and 48% of the time with 10,000 training queries. This can still be worth it if you are getting a huge volume of traffic and the value of each good result is high. For example, say your system is getting 100,000 queries per month and each accurate search result provides an average of $2 to your business in increased sales from satisfied customers. Each percentage point of improvement in your search accuracy is worth about $2,000 per month to your business. Getting an extra percentage point or two from going from hundreds to thousands of training instances may well be worth it. However, it is very rare to have a search system with this many users that provides such a large financial return for each valid result. If you are considering investing this much effort into relevancy training, take the time to consider costs and benefits. Estimate how much value you will get from each percentage point in accuracy and ask whether it is really worth weeks or months of work to gain maybe a couple more points. If not, just go back to User Profile 1 or User Profile 2 and follow the guidance there.

Second (as noted in User Profile 2) you need a representative sample of what real users will do with your system. The best way to do that is to deploy the system to real users without any relevancy training. Then record the queries, and grab a random sample of those queries. For User Profile 3, you generally want to grab thousands of queries rather than the hundreds you would have gathered if you were in User Profile 2.

The relevancy training capability in Watson Discovery tooling can be useful for doing the labeling for User Profile 3, as it is for User Profile 2. However, we often find that users in User Profile 3 want advanced functionality not found in the Watson Discovery tooling. Instead they prefer to set up their own system for generating labels. They use the Watson Discovery relevancy training API to do the training. For example, when labeling thousands of queries it is common to have an organized team effort. A leader divides up the queries into batches. The leader assigns batches to team members who do the labeling. The leader measures the inter-annotator agreement by assigning some batches to multiple team members. When needed, the leader takes corrective action to improve that agreement. A powerful (but expensive) alternative is the Crowd Truth approach. With that approach, all the data is labeled by many people. This is generally done using a crowd-sourcing tool such as Appen or Amazon Mechanical Turk. With Crowd Truth, you can estimate degree of relevance by the fraction of the people who label the search result as relevant.
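As one concrete way to measure inter-annotator agreement on the batches that two team members both labeled, here is a minimal sketch of Cohen's kappa over binary relevant/not-relevant labels. This is a generic statistic rather than anything specific to Watson Discovery; the function and the example labels are my own illustration.

```python
# Minimal sketch: Cohen's kappa for two annotators labeling the same
# search results as relevant (1) or not relevant (0). This is a generic
# agreement statistic, not part of the Watson Discovery API.
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n                              # annotator A's rate of "relevant"
    p_b1 = sum(labels_b) / n                              # annotator B's rate of "relevant"
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)      # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Example: agreement on ten doubly-labeled search results.
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(round(cohens_kappa(a, b), 2))   # values near 1.0 mean strong agreement
```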

In User Profile 3, it is actually workable to measure the impact of the work you are doing on how accurate the system is. One popular approach to doing so is a train/test split: label a big batch of data and then divide it randomly into test data and training data. To determine how much test data you need, consider the formula from the “second boring side-note for skeptics” in the previous section. For a representative sample of N test queries, there is a 95% chance that the true probability of getting a good result will be within 1/sqrt(N) of the observed frequency. If you want to measure the effect of something like relevancy training then you want 1/sqrt(N) to be smaller than several percent. For example, 4,000 test queries would mean that your measurements were precise within 1/sqrt(4,000) or about 1.6%. That may be precise enough to draw some usable conclusions.
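Here is a minimal sketch of the train/test split just described, together with the 1/sqrt(N) check on how precise the resulting measurements can be. The data structures are placeholders; whatever holds a labeled query works.

```python
# Minimal sketch: randomly split labeled queries into training and test sets,
# then check how precise a measurement the test set can support (1/sqrt(N)).
import random
from math import sqrt

def train_test_split(labeled_queries: list, test_fraction: float = 0.5, seed: int = 42):
    """labeled_queries: whatever structure holds a query plus its relevance labels."""
    shuffled = labeled_queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]          # (train, test)

labeled = [f"query-{i}" for i in range(8_000)]     # placeholder for real labeled data
train, test = train_test_split(labeled)
print(f"{len(train)} training queries, {len(test)} test queries")
print(f"test measurements precise to about +/- {100 / sqrt(len(test)):.1f} percentage points")
# With 4,000 test queries the margin is about 1.6%, as discussed above.
```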

An alternative to splitting into a train and test set is to do a k-fold cross validation. You split the data into some number of folds (typically 10). Then for each fold train the model on all the data outside the fold and test on the data inside the fold. A 10-fold cross validation requires that you train the relevancy model 10 times. That is a significant hassle in Watson Discovery because each time you need to delete all the old training data from Discovery, push all the training data outside that fold into Discovery, wait an indeterminate amount of time for the model to finish training, run on the queries in the fold, and measure the quality of results. You can automate all of this with enough work, but it is a significant undertaking, so it only makes sense for User Profile 3. As with the train/test split above, it is important to estimate how precisely you can do measurements using the 1/sqrt(N) formula. With k-fold cross validation your N is all your labeled data.
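Here is a skeleton sketch of that k-fold procedure. The four helper functions are hypothetical placeholders (they are not Watson Discovery SDK calls); only the fold bookkeeping is shown concretely, and you would fill in the Discovery-specific steps for your own project.

```python
# Skeleton sketch of 10-fold cross validation for relevancy training.
# The four helpers below are hypothetical placeholders for Discovery-specific
# work (they are NOT real SDK calls); fill them in for your own project.
import random

def delete_all_training_data() -> None: ...              # hypothetical placeholder
def upload_training_data(train_data: list) -> None: ...  # hypothetical placeholder
def wait_for_model_to_train() -> None: ...               # hypothetical placeholder
def evaluate_queries(test_fold: list) -> float:          # hypothetical placeholder
    return 0.0                                           # e.g. would return Match@5 on the fold

def k_fold_evaluate(labeled_queries: list, k: int = 10, seed: int = 42) -> float:
    shuffled = labeled_queries[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]            # k roughly equal folds

    scores = []
    for i, test_fold in enumerate(folds):
        # Train on everything outside the fold, test on the fold itself.
        train_data = [q for j, fold in enumerate(folds) if j != i for q in fold]
        delete_all_training_data()
        upload_training_data(train_data)
        wait_for_model_to_train()                         # training time is indeterminate
        scores.append(evaluate_queries(test_fold))
        print(f"fold {i + 1}/{k}: score = {scores[-1]:.3f}")
    return sum(scores) / k                                # average score over all k folds
```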

An extremely counterproductive alternative to either a train/test split or a k-fold cross validation is test on train. That approach involves taking all your training queries and testing to see how accurate the system is on that training data. Testing on training is usually extremely misleading. We have seen very small changes to a system have very large impact on the observed effectiveness when we test on the train set. Those impacts generally do not generalize to other data at all. For example, say I have 4,000 training queries and I want to measure the impact on accuracy of adding a few extra documents to my Watson Discovery collection. I know that the window of uncertainty when testing on 4,000 queries is 1/sqrt(4,000) or about 1.6%. So I would normally expect that an impact that was much larger than this reflects a real impact that is predictive of how effective the system will be for real users. However, if I am taking the test on train approach, I should not have this expectation. I might see an improvement or decline of much more than 1.6% that is only an impact on how well the model fits the training set. Such an improvement is not predictive of whether the system will do better or worse on other queries. This issue is common in many kinds of applications that do machine learning, but it is particularly severe in Watson Discovery. Watson Discovery often winds up in a state where it is selecting among alternative model configurations that dramatically differ in how well they fit the train set. Those configurations are often about equally good on real data outside the train set. So a tiny change can lead to a different configuration and very different test on train results. Do not test your Watson Discovery effectiveness using the same data that you used to train it!

Regardless of whether you do a train/test split or a k-fold cross validation, you will need to choose metrics to measure whether your results are good. One option (mentioned earlier in passing) is the percentage of queries for which there is at least one relevant search result in the top X search results. Popular values for X are 1 and however many search results fit on a single page in your application (e.g., 5 or 10). That metric is sometimes referred to as Match@X (e.g., Match@5 is the percentage of queries for which there is at least one relevant search result in the top five). Match@1 is sometimes also called “accuracy,” but other times that term is used more generally to refer to a variety of metrics. Match@X is a popular metric because it is clear and intuitive. It is a little flawed because it only gives full credit or no credit for each query. In practice it is exceptionally valuable to get relevant results to the top of the search result list and also still fairly valuable to get relevant results near the top of the list. So metrics that give partial credit generally measure business value better. Such metrics include the following (a sketch of how to compute them, along with Match@X, appears after this list):

  • Mean Reciprocal Rank is generally preferred if there is rarely more than one relevant document for any query (because the documents have little overlapping content).
  • Mean Average Precision is generally preferred if it is common for a query to have multiple relevant documents and there is no information about degree of relevance.
  • Normalized Discounted Cumulative Gain is generally preferred when it is common for a query to have multiple relevant documents and you know which documents are more relevant than others.
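For concreteness, here is a minimal sketch of Match@X and the three partial-credit metrics above, computed from a single query's ranked list of relevance labels. The label convention and the simplified ideal ranking used for NDCG are my own illustration.

```python
# Minimal sketch of the ranking metrics discussed above. Each ranking is a
# list of relevance labels in rank order (index 0 is the top search result):
# 0 = not relevant, 1 = relevant; for NDCG, larger numbers mean more relevant.
from math import log2

def match_at_x(ranking: list[float], x: int) -> float:
    """1 if any of the top-x results is relevant, else 0."""
    return 1.0 if any(r > 0 for r in ranking[:x]) else 0.0

def reciprocal_rank(ranking: list[float]) -> float:
    """1/rank of the first relevant result (0 if there is none)."""
    for i, r in enumerate(ranking, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def average_precision(ranking: list[float]) -> float:
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, precisions = 0, []
    for i, r in enumerate(ranking, start=1):
        if r > 0:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def ndcg(ranking: list[float]) -> float:
    """Discounted cumulative gain divided by the DCG of the best possible ordering.
    Simplified: the ideal ordering is computed from the same labeled list."""
    dcg = sum(r / log2(i + 1) for i, r in enumerate(ranking, start=1))
    ideal = sum(r / log2(i + 1) for i, r in enumerate(sorted(ranking, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

# One query's top-5 results, with relevant documents at ranks 2 and 4:
labels = [0, 1, 0, 1, 0]
print(match_at_x(labels, 5), reciprocal_rank(labels), average_precision(labels), ndcg(labels))
# Mean Reciprocal Rank, Mean Average Precision, and mean NDCG are the averages
# of these per-query values over all test queries.
```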

In general, the same basic rules of thumb apply fairly well to all of these metrics:

  • If you have a representative sample of training queries, relevancy training will probably improve each of these metrics by several percent.
  • If you have a test set of N queries, you should expect that the value you observe for each of these metrics on that test set will probably be within 1/sqrt(N) of the value you would observe on a test set of unlimited size.

If you measure and report more than one of these metrics, what you tend to see is roughly the combination of these two effects. Systematic changes like relevancy training affect all of these metrics to roughly the same degree. However, the 1/sqrt(N) window of uncertainty tends to affect each of them differently. For example, say you do relevancy training and then test on 1,000 queries. Your confidence interval (1/sqrt(1,000)) is roughly 3%. Say that you see that Match@1 goes up by 5%, Match@5 goes up by 1%, and Mean Reciprocal Rank goes down by 1%. It is tempting to try to construct some sort of explanation for why relevancy training is helping your Match@1 but not helping your other metrics. That’s the wrong interpretation. Instead, it is much more reasonable to conclude from these numbers that relevancy training is probably helping a little bit on all the metrics, but the plus or minus 3% window of uncertainty for each measurement makes it hard to be confident that it is helping or to estimate how much it is helping.

Finally, consider one last technical detail for people who have gotten this far: It is common to evaluate information retrieval systems by manually labeling some highly ranked search results as good or bad. This is the approach that the Watson Discovery relevancy training tool takes. We see it works well for training data. However, for test data, it introduces a bias toward the behavior of the system that did the search. There are a variety of techniques for compensating for this issue. For example, you can evaluate on a condensed list as described in On information retrieval metrics designed for evaluation with incomplete relevance assessments by Tetsuya Sakai and Noriko Kando.

Recommended alternative to manual labeling: User labeling

The approaches above all assume that you will do your own relevancy training. An alternative approach is to get end users to do the training for you by storing records of what they do. There are two main approaches to doing so:

  1. Record which search results users clicked on and then use some or all of those clicks as relevancy judgements. For example, you can mark all documents that were clicked on as relevant because users click on relevant documents more often. Alternatively, you can mark the last document that was clicked on as relevant because users sometimes keep clicking on documents until they find what they are looking for. There are many other variants of this idea. In practice, we find that marking all documents that were clicked on for a query as relevant to that query works well for Watson Discovery training (see the sketch after this list). Of course, not all documents that are clicked on really are relevant. However, since documents that are clicked on are more likely to be relevant, the labels are still useful for a statistical model.
  2. Ask users to provide explicit feedback indicating whether the search results are relevant to the query. For example, you can use a thumbs-up/thumbs-down widget to let users say that a result is good or bad. This can get you more reliable labels than just checking if a user clicked on the documents. However, it also has several limitations. It adds clutter to your user interface and it asks your users to do more work. Casual users rarely use this feature, so the data you get tends to represent a tiny fraction of all queries. It can be a biased fraction of the queries. You could wind up learning a model that makes the system more accurate for the highly motivated users at the expense of making it less accurate for less motivated users. Also, in some applications, users will indicate that a relevant document is bad because they disagree with the content. Such a label will be counterproductive for helping Watson Discovery learn how to recognize relevant documents.
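Here is a minimal sketch of the first variant in the list above: every document a user clicked on for a query is marked as relevant to that query. The click-log format is an assumption for illustration; the resulting query/document pairs would then be submitted as training examples through the tooling or the relevancy training API.

```python
# Minimal sketch: convert a click log into relevancy training labels by
# marking every clicked document as relevant to the query it was clicked for.
# The log format is an assumption for illustration; the resulting
# (query, document_id) pairs would then become positive training examples.
from collections import defaultdict

click_log = [
    # (query_text, document_id clicked by the user) -- placeholder data
    ("reset examplomatic", "doc-123"),
    ("reset examplomatic", "doc-456"),
    ("examplomatic warranty", "doc-789"),
]

relevant_docs_by_query: dict[str, set[str]] = defaultdict(set)
for query_text, document_id in click_log:
    relevant_docs_by_query[query_text].add(document_id)

for query_text, document_ids in relevant_docs_by_query.items():
    # Each clicked document becomes a positive training example for the query.
    print(query_text, "->", sorted(document_ids))
```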

Most Watson Discovery applications will get better results by learning from clicks than by learning from explicit feedback. However, if you have a pool of highly motivated users, then explicit feedback might be more effective. For example, if your users are employees and the system helps them do their job then they might be very good at giving explicit feedback. In this case, you may want to use that feedback as labels for relevancy training. Also, you may want to use that feedback to drive the Watson Discovery v2 curation API. You can curate any results that someone labels as good and then uncurate them if someone else labels them as bad. The curation API can be a useful complement for relevancy training. The curation API guarantees results for a single specific query while relevancy training provides general improvements for all queries but does not learn much for any single specific query.

Measuring impact via click-through rate

In previous sections, we mentioned some challenges in trying to directly measure how accurate your system is. It is very time consuming to assess results on a large enough sample size to get a statistically useful measurement. An alternative approach is to measure the percentage of queries for which the user clicks on at least one search result. In general, as your system gets more effective, you expect more queries to provide search results that users click on.

If your system is improving, you can typically expect click-through rate to go up. However, do consider the 1/sqrt(N) window of uncertainty around any measurement of N queries. For example, if you are getting 400 queries per month, then you should treat the click-through rate as an estimate that is probably accurate within 1/sqrt(400) = 5%. So if you measure a click-through rate of 62% one month, then 64%, and then 61%, this variation is within the +/-5% range of uncertainty. In this case, you are not getting enough usage to tell whether the results are getting better or worse. You are getting enough usage to know that if the results are changing, they are probably changing by less than 5% or so, which is still useful to know.

Another issue with measuring click-through rate over time is that it can conflate multiple changes. For example, if you add more content and do relevancy training at the same time, you can only measure the combined effect of these changes. Click-through rate is also affected by changes outside your system: for example, your users’ interests may change and they may start asking easier or harder queries. One approach to addressing those concerns is A/B testing. With A/B testing, your application randomly selects some users to get version A of your system and others to get version B. For example, you might have a Watson Discovery project with no relevancy training as version A and one with relevancy training as version B. This can provide a direct measurement of the impact of specific changes without conflating them with other internal or external changes. Even with A/B testing, you need to be aware of the 1/sqrt(N) window of uncertainty around any measurement of N queries.
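Here is a minimal sketch of such an A/B test: users are assigned to a version by hashing a stable user identifier, and the two click-through rates are compared against the 1/sqrt(N) window of uncertainty discussed above. All names and numbers are placeholders.

```python
# Minimal sketch of an A/B test for relevancy training. Users are assigned to
# a variant by hashing a stable user id, and the two click-through rates are
# compared against the 1/sqrt(N) window of uncertainty. Numbers are placeholders.
import hashlib
from math import sqrt

def assign_variant(user_id: str) -> str:
    """Stable 50/50 assignment: the same user always sees the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return "A (untrained)" if int(digest, 16) % 2 == 0 else "B (trained)"

def compare_ctr(clicked_a: int, queries_a: int, clicked_b: int, queries_b: int) -> None:
    ctr_a, ctr_b = clicked_a / queries_a, clicked_b / queries_b
    margin = 1 / sqrt(queries_a) + 1 / sqrt(queries_b)   # rough combined uncertainty
    difference = ctr_b - ctr_a
    print(f"CTR A = {ctr_a:.1%}, CTR B = {ctr_b:.1%}, difference = {difference:+.1%}")
    if abs(difference) > margin:
        print("Difference exceeds the rough uncertainty window; likely a real effect.")
    else:
        print(f"Difference is within roughly +/- {margin:.1%}; not enough data to conclude.")

print(assign_variant("user-42"))
compare_ctr(clicked_a=620, queries_a=1_000, clicked_b=655, queries_b=1_000)
```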

Summary

IBM Watson Logo, used here to represent IBM Watson (of course)

Watson Discovery has excellent out-of-the-box information retrieval technology. It delivers close to the best effectiveness it can provide as soon as you put documents into it. However, if you can afford the time and effort, you can make Watson Discovery even more effective through relevancy training. Relevancy training requires a representative sample of user queries. Such a sample is best obtained by allowing real users to use your system. You can store user queries in a secure database, and then select a random sample of at least 100 (and preferably a few hundred) to label. Users who only have a few hours or days for this work should trust that relevancy training is effective as long as it is trained on a representative sample of data. Trying to measure how much it is helping takes too much time and effort for those users. Users who are willing to spend weeks or months can get some smaller additional improvements by labeling thousands of training queries instead of hundreds. Such users may also want to invest the time and effort to measure the benefits they are getting. An alternative to labeling training data manually is to gather relevancy labels from the behavior of your users. An alternative to directly measuring how accurate your search results are is to measure how often users click on those results.

To get started using IBM Watson Discovery to find information in documents, create an IBM Cloud account and provision an IBM Watson Discovery instance for that account. Then start adding content and searching for information!

About the author: J William Murdock is an IBM Principal Research Staff Member. He has been working on IBM Watson since its inception in 2007 in the IBM Jeopardy! Challenge, and he was the guest editor of This is Watson, the 2012 special journal issue explaining IBM Watson for Jeopardy!. Dr. Murdock currently works on making information finding more effective in IBM Watson Discovery and IBM Watson Assistant.

Acknowledgements: Thank you to Dakshi Agrawal, Mary Swift, Anish Mathur, Charles Rankin, and Will Chaparro for providing constructive feedback on earlier drafts of this article. Your input has really made this article much better than it was when I started.
