1. Creating a Q-A system (Introduction)

Puneet Singh
techpsl
4 min read · Nov 11, 2013

Wikipedia says, “Question Answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.”
I am taking the “Information Retrieval” course this fall at UB, and my final project will be a Q-A system. We (my three group partners and I) have not yet decided which features will make our project stand out, so for now we would like to start with a small goal: index the infoboxes of Wikipedia and query them using natural language. My initial research suggests that IBM has already built something similar on a much larger scale, and they call it Watson.
We are short on man-hours (neither men nor hours: 4 people on this project and 25 days to build it) and on computing resources: what IBM does with thousands of computers, we may have to do with 2–3.
Thus it becomes clear that:

  1. We have to limit ourselves to specific domains, e.g. indexing only people, sports, technologies, etc.
  2. We have to compromise on the precision and accuracy of our final result.

IBM Watson uses a very complicated architecture, and we cannot afford to do the same.

The architecture used by IBM Watson

For now, with my limited understanding of this subject, I believe that if we are not working with very big data, we can ignore a few of the components shown here. Since using Lucene/Solr is mandatory for this project, we have decided to use a few other standard open-source tools alongside it.

After reading and researching further, I have settled on five main components for our system:

1. NLP Engine
2. Query Engine
3. Database/Index
4. Retrieved result understanding
5. Answer generator
I will take the components one by one.

NLP Engine:
This component reads the question, understands it and breaks it down into a query the computer can process. We will use “Apache OpenNLP” or some other NLP toolkit to detect entities, parse questions, and generate query-able terms for Solr.
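
To make the NLP engine's job concrete, here is a minimal sketch in plain Python. It is not OpenNLP and not our actual implementation; the stopword list, the question-word mapping and the function name `analyze_question` are all assumptions made up for illustration. It only shows the shape of the output we want: an expected answer type plus a list of query-able terms.

```python
import re

# Hypothetical, hand-picked stopword list; a real NLP toolkit would supply a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "are", "of", "in", "did", "does", "do"}

# Hypothetical mapping from the leading question word to the kind of answer expected.
ANSWER_TYPES = {"who": "person", "where": "place", "when": "date", "what": "thing"}


def analyze_question(question):
    """Split a question into an expected answer type and query-able terms."""
    tokens = re.findall(r"[A-Za-z0-9']+", question)
    if not tokens:
        return {"answer_type": "unknown", "terms": []}
    # The first word of the question hints at what kind of answer is wanted.
    answer_type = ANSWER_TYPES.get(tokens[0].lower(), "unknown")
    # Keep only the words that are worth sending to the index.
    terms = [
        t for t in tokens
        if t.lower() not in STOPWORDS and t.lower() not in ANSWER_TYPES
    ]
    return {"answer_type": answer_type, "terms": terms}
```

Called as `analyze_question("Who is Tony Benn?")`, this keeps `Tony` and `Benn` as terms and tags the expected answer as a person. A real toolkit would replace the regular expression and the word lists with proper tokenization and named-entity detection.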

Query Engine:
Takes the semantic query generated by the NLP engine and pushes it to SIREn (Semantic Information Retrieval Engine).
SIREn is an open-source extension for Apache Lucene and Solr that can query RDF data indexed in Lucene. SIREn adds a new “Field Type” with a set of specific tools such as Analyzers, Query Operators and a Query Parser.
RDF is the Resource Description Framework, which, like XML, stores data and also stores semantic information about that data.
Given below is an example of the RDF model of an article about Tony Benn, which says that Tony Benn is a person’s name. RDF can also store relationships between two entities. This makes querying semantic information easier.

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
<dc:title>Tony Benn</dc:title>
<dc:publisher>Wikipedia</dc:publisher>
<foaf:primaryTopic>
<foaf:Person>
<foaf:name>Tony Benn</foaf:name>
</foaf:Person>
</foaf:primaryTopic>
</rdf:Description>
</rdf:RDF>
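
To show how this semantic information can be read back out, here is a small sketch that parses the RDF/XML above with Python's standard `xml.etree.ElementTree` module. The function name `describe` and the keys of the returned dictionary are my own choices for illustration; a real pipeline would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

# The RDF/XML example from above, indented for readability.
RDF_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
    <foaf:primaryTopic>
      <foaf:Person>
        <foaf:name>Tony Benn</foaf:name>
      </foaf:Person>
    </foaf:primaryTopic>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI map so the paths below can use the same prefixes as the XML.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def describe(rdf_xml):
    """Pull the resource URL, its title and the person's name out of the RDF/XML."""
    root = ET.fromstring(rdf_xml)
    desc = root.find("rdf:Description", NS)
    return {
        # Attributes are looked up by their fully qualified {namespace}name.
        "about": desc.get("{%s}about" % NS["rdf"]),
        "title": desc.findtext("dc:title", namespaces=NS),
        "person_name": desc.findtext(
            "foaf:primaryTopic/foaf:Person/foaf:name", namespaces=NS
        ),
    }
```

Calling `describe(RDF_XML)` returns the article URL, the title and the name of the person the article is about, which is exactly the kind of typed fact a Q-A system can match against a "who" question.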

Index:
This component would store RDF in Solr, using SIREn. We would like to use RDF data from DBpedia.
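
As a rough mental model of what the index has to do, here is a toy in-memory stand-in written in plain Python. It is emphatically not Solr or SIREn, and the class name `TripleIndex` and its methods are hypothetical. It only illustrates the idea of storing (subject, predicate, object) facts and finding subjects by the words in their values.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory stand-in for the RDF index; not Solr or SIREn."""

    def __init__(self):
        self._by_subject = defaultdict(list)  # subject -> [(predicate, object)]
        self._by_term = defaultdict(set)      # lowercased word -> {subjects}

    def add(self, subject, predicate, obj):
        """Store one fact and index every word of its object value."""
        self._by_subject[subject].append((predicate, obj))
        for term in obj.lower().split():
            self._by_term[term].add(subject)

    def subjects_matching(self, terms):
        """Return the subjects whose object values contain every given term."""
        sets = [self._by_term.get(t.lower(), set()) for t in terms]
        if not sets:
            return set()
        return set.intersection(*sets)

    def facts(self, subject):
        """Return all (predicate, object) pairs stored for a subject."""
        return list(self._by_subject.get(subject, []))


index = TripleIndex()
index.add("http://en.wikipedia.org/wiki/Tony_Benn", "foaf:name", "Tony Benn")
index.add("http://en.wikipedia.org/wiki/Tony_Benn", "dc:publisher", "Wikipedia")
matches = index.subjects_matching(["Tony", "Benn"])
```

Here `matches` contains the Tony Benn article URL, and `index.facts(...)` on that URL gives back both stored facts. The real index would add ranking, analyzers and persistence, which is the part we are relying on Solr and SIREn for.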


Retrieved result understanding (Answer Analyzer):
The retrieved results are ranked as well as possible, using disambiguation techniques.
For example, the query “What is Apple” would give two results: the fruit, and the software and hardware company. Which of the two would you choose?


Answer generator:

This component tries to form a human-readable answer. For example, for the question “What is Apple?” the answer should be:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software and personal computers.

This is because Apple here starts with a capital ‘A’, which indicates that the user more probably wants to ask about Apple Inc.
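
The capital-letter heuristic can be sketched in a few lines of Python. The candidate senses, their `kind` labels and the function name `pick_sense` are invented for this example; in the real system the candidates would come from the index, and capitalization would be only one signal among several.

```python
# Hypothetical candidate senses; in the real system these would come from the index.
CANDIDATES = {
    "apple": [
        {"label": "Apple Inc.", "kind": "proper_noun"},
        {"label": "apple (fruit)", "kind": "common_noun"},
    ]
}


def pick_sense(question, term, candidates=CANDIDATES):
    """Prefer the proper-noun sense when the term is capitalized mid-sentence."""
    senses = candidates.get(term.lower(), [])
    if not senses:
        return None
    words = [w.strip("?!.,") for w in question.split()]
    # The first word of a sentence is always capitalized, so it carries no signal.
    capitalized = any(
        i > 0 and w.lower() == term.lower() and w[:1].isupper()
        for i, w in enumerate(words)
    )
    wanted = "proper_noun" if capitalized else "common_noun"
    for sense in senses:
        if sense["kind"] == wanted:
            return sense["label"]
    # Fall back to the first candidate when no sense has the wanted kind.
    return senses[0]["label"]
```

With these candidates, `pick_sense("What is Apple?", "Apple")` picks the company, while `pick_sense("What is an apple?", "apple")` picks the fruit. The heuristic fails when the term opens the sentence or the user types everything in lowercase, which is why it can only be a tie-breaker.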

Tentative architecture of our Q-A system

I am really hopeful that we can create something purposeful in the next 25 days, even though that is very little time to build something as mammoth as a Q-A system. Let’s see where we land in the next 25 days; I will try to post more updates as we pass different milestones.
