IBM Watson Factoid Assistant: A Customized Search Bar for Wikipedia

Raphael Sacks
Feb 7, 2019

In an era of crowdsourcing, nothing stands out in scale like Wikipedia; since 2001 it has accumulated almost 5.8 million articles, and is a go-to resource for questions and curiosity about history, geography, politics, and just about anything else you might want to learn.

What if you could type your questions for Wikipedia directly into a search bar, instead of jumping from page to page looking for a specific fact (which, admittedly, can be incredibly fulfilling yet time-wasting)?

Better yet, what if this “Wikipedia search bar” could be trained to understand specific types of questions with just a few dozen examples, and instantly surface relevant entities, concepts, and the answer from Wikipedia?

Enter the Factoid Assistant, a sample application built with IBM Watson’s Natural Language Classifier and Natural Language Understanding that surfaces information from DBpedia, a database of Wikipedia content. The full GitHub repo contains all the code needed to replicate the application.

So how is the user’s query categorized and answered? The process relies on a custom Natural Language Classifier model to determine the intent of the query. Fear not: this AI customization takes just a few minutes and a few dozen training examples (which we provide)!
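
To make the "categorize, then answer" flow concrete, here is a minimal Python sketch of what could happen once the classifier has returned an intent and an entity has been identified in the question. It is an illustration under assumptions, not the app's actual code: the `INTENT_TO_PROPERTY` mapping and the `build_sparql` helper are hypothetical, and the real application may query DBpedia in a different way.

```python
# Hypothetical mapping from NLC class labels to DBpedia ontology properties.
# These pairings are assumptions for illustration, not taken from the app's source.
INTENT_TO_PROPERTY = {
    "person-birthdate": "dbo:birthDate",
    "person-birthplace": "dbo:birthPlace",
    "person-spouse": "dbo:spouse",
    "place-capital": "dbo:capital",
    "place-population": "dbo:populationTotal",
}


def build_sparql(intent: str, entity: str) -> str:
    """Build a SPARQL query string for a classified intent and an entity name."""
    prop = INTENT_TO_PROPERTY.get(intent)
    if prop is None:
        raise ValueError(f"no DBpedia property mapped for intent {intent!r}")
    # DBpedia resource names follow Wikipedia titles, with underscores for spaces.
    resource = entity.strip().replace(" ", "_")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT ?answer WHERE {{ <http://dbpedia.org/resource/{resource}> {prop} ?answer }}"
    )


if __name__ == "__main__":
    # Example: the classifier said "person-birthplace", the entity is "Barack Obama".
    print(build_sparql("person-birthplace", "Barack Obama"))
```

The sketch only builds the query text; sending it to a DBpedia SPARQL endpoint and formatting the answer is left out.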

Using nlc_factoid_training.csv, follow the simple steps in the documentation or API reference to train the model. This training dataset contains the classes health-condition_cause, person-birthdate, person-birthplace, person-children, person-net_worth, person-schooling, person-spouse, place-areacode, place-capital, place-completion_date, place-governor_mayor, place-height, and place-population. Every time a user queries the Factoid Assistant, the question is categorized by the trained NLC model.
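
Before training, it can help to look at how the examples are spread across classes. The sketch below assumes the usual Natural Language Classifier CSV layout (the question text in the first column, one or more class labels in the columns after it, no header row). The sample rows are made up for illustration and are not copied from nlc_factoid_training.csv; `count_examples_per_class` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the "text,class" layout; the real
# nlc_factoid_training.csv may contain different questions.
SAMPLE_CSV = """\
Where was Marie Curie born?,person-birthplace
What is the birthplace of Nelson Mandela?,person-birthplace
Who is the mayor of Chicago?,place-governor_mayor
"How tall is the Eiffel Tower, in meters?",place-height
"""


def count_examples_per_class(csv_text: str) -> Counter:
    """Count training examples per class label in NLC-style CSV text."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Column 0 is the question text; every later column is a class label.
        for label in row[1:]:
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    for label, n in sorted(count_examples_per_class(SAMPLE_CSV).items()):
        print(f"{label}: {n}")
```

To run it against the real file, read the file's contents into a string and pass that in place of `SAMPLE_CSV`.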

You are also welcome to customize the Factoid Assistant by adding new classes to the model, such as person-occupation or place-state. Just make sure to add training data for each new class, and keep these best practices in mind as well. Our provided data set has 78 training examples across the 13 classes listed above. Watson Natural Language Classifier is a leader in the field when it comes to how little training data it needs: just 5 data points per class (10 is recommended), whereas competitors often require 10 or more and suggest even higher thresholds for better performance.
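
If you do add classes, a quick sanity check against the thresholds mentioned above (5 examples minimum, 10 recommended) can catch thin classes before you retrain. This is a small sketch; `audit_class_sizes` and its status strings are hypothetical names, and the counts in the example are invented.

```python
from typing import Dict

MINIMUM_PER_CLASS = 5       # minimum stated in this article
RECOMMENDED_PER_CLASS = 10  # recommendation stated in this article


def audit_class_sizes(counts: Dict[str, int]) -> Dict[str, str]:
    """Label each class as 'too few', 'ok', or 'recommended' by example count."""
    report = {}
    for label, n in counts.items():
        if n < MINIMUM_PER_CLASS:
            report[label] = "too few"
        elif n < RECOMMENDED_PER_CLASS:
            report[label] = "ok"
        else:
            report[label] = "recommended"
    return report


if __name__ == "__main__":
    # Invented counts for two new classes and one existing class.
    example_counts = {"person-occupation": 3, "place-state": 6, "person-spouse": 12}
    for label, status in audit_class_sizes(example_counts).items():
        print(f"{label}: {status}")
```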

Want to fork the code and build the app yourself? Check out our GitHub repo!

Watson Natural Language Classifier also has 10+ sample applications available online, with live demos, code to fork, and video walkthroughs.

Enjoying IBM Watson products? Give us a review at https://www.g2crowd.com!

For every review received, $10 will be donated to Girls Who Code!
