Research Internship Recap: Neural Reading Comprehension Methods For Science QA

Hi! My name is Nelson Liu, and this past winter I had the pleasure of joining the Aristo team as a research intern, working with Matt Gardner to investigate the feasibility of using recent reading comprehension models to answer science questions. As my internship comes to a close, I thought I’d take the opportunity to talk a bit about what I did during the past 11 weeks.

I started off by implementing two multiple choice reading comprehension models, the Attention Sum Reader (AS Reader) and the Gated Attention Reader (GA Reader), in deep_qa, our library for training deep learning models for various question answering tasks. In the reading comprehension task, you're given a passage and a question about it, and you have to either pick the correct answer choice (in the multiple choice setting) or select a substring of the passage that answers the question (in the direct answer setting). As someone who came in with very little applied experience building deep learning models, I learned a lot from reimplementing these models in our codebase; there are many implementation and training details that seem inconsequential but are quite important for properly reproducing others' results. With the models I built, I managed to replicate the numbers reported in the Who Did What paper and the Gated Attention Reader paper.
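To give a flavor of what these models do, here is a minimal numpy sketch of the attention-sum idea at the heart of the AS Reader: a question vector attends over every document token, and each answer candidate is scored by the total attention mass on its occurrences in the passage. The encoder (bidirectional GRUs in the paper) is omitted, and all names here are illustrative, not deep_qa's actual API.

```python
import numpy as np

def attention_sum(doc_token_ids, doc_encodings, question_encoding, candidate_ids):
    """Score answer candidates with the attention-sum mechanism.

    doc_token_ids:     (doc_len,) token id at each document position
    doc_encodings:     (doc_len, dim) encoder output for each position
    question_encoding: (dim,) single vector summarizing the question
    candidate_ids:     token ids of the answer candidates
    """
    doc_token_ids = np.asarray(doc_token_ids)
    # Dot-product attention of the question over every document position.
    scores = doc_encodings @ question_encoding
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # A candidate's score is the summed probability of all positions
    # where that candidate's token appears ("pointer sum attention").
    return np.array([probs[doc_token_ids == c].sum() for c in candidate_ids])
```

So if a candidate answer appears three times in the passage, its score is the sum of the attention probabilities at those three positions, and the highest-scoring candidate is the prediction. The GA Reader refines the encoding side with multiple hops of gated attention between the question and the document, but its final answer-scoring step is this same aggregation.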

With these models available to us, we first applied them to the task of solving multiple choice science questions. Neural models are particularly data-hungry, and our existing set of professionally-written science questions was not enough to satisfy the Readers' appetite. To mitigate this, we used SciQ, a larger dataset of crowdsourced science questions, as additional training data. Indeed, models trained on both the real science questions and SciQ saw large performance gains over models trained on the professionally-written questions alone, validating the quality of SciQ as a dataset. We also ran transfer learning experiments, pre-training on larger reading comprehension datasets such as Who Did What before fine-tuning on the science questions. Despite these gains, both readers still failed to beat a competitive information retrieval baseline, so there's still a lot of work to be done in this area.

[Figure: Model accuracies on real science exam questions when trained on 4th/8th grade exam questions alone, and when adding SciQ.]
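For the transfer learning experiments mentioned above, the basic recipe is a standard pre-train-then-fine-tune schedule. The sketch below is a generic PyTorch version under assumed data loaders; it is not the actual deep_qa implementation, just an illustration of the two-stage training.

```python
import torch
from torch import nn, optim

def pretrain_then_finetune(model: nn.Module, wdw_loader, science_loader,
                           lr: float = 1e-3) -> nn.Module:
    # Hypothetical two-stage schedule: first train on the large
    # out-of-domain Who Did What data, then fine-tune the same weights
    # on the (much smaller) science questions augmented with SciQ.
    loss_fn = nn.CrossEntropyLoss()
    for loader in (wdw_loader, science_loader):
        optimizer = optim.Adam(model.parameters(), lr=lr)
        model.train()
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
    return model
```

The hope with this schedule is that the model learns general reading skills from the large cloze-style dataset that carry over to the science domain, even though the question styles differ.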

I also looked into using reading comprehension methods (specifically the recent Bidirectional Attention Flow (BiDAF) model) to answer direct answer questions. Since reading comprehension methods require both a passage and a question, we had to retrieve a relevant passage from a corpus (we used Elasticsearch) for the model to use in answering the question. This retrieval step turned out to be the main bottleneck in the system, so we're experimenting with more sophisticated retrieval methods, drawing some inspiration from answer sentence selection tasks.
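As a rough illustration of the retrieve-then-read pipeline, here is how fetching candidate passages with the elasticsearch Python client might look. The index name and document field below are hypothetical, not Aristo's actual setup; in practice the retrieved passage and the question would then be fed to BiDAF to extract an answer span.

```python
from elasticsearch import Elasticsearch

def retrieve_passages(question: str, index: str = "science-corpus",
                      top_k: int = 3) -> list:
    # Simple bag-of-words retrieval: score indexed passages against the
    # question text with a match query and keep the top_k hits.
    es = Elasticsearch()
    response = es.search(
        index=index,
        body={"query": {"match": {"text": question}}, "size": top_k},
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```

Since the reader can only answer from what gets retrieved, a miss at this step is unrecoverable, which is exactly why retrieval ends up being the bottleneck.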

I received a huge amount of support from my mentor and the other members of the Aristo team; when I needed assistance or was blocked, they were always happy to help. I also really appreciated the transparent culture encouraged at the company; it was always interesting to chat with people working on different projects or even in different fields and learn about the work they were doing. Overall, my research internship at AI2 was a rewarding experience; it was great to work on a project with direct applications to Aristo's goal of performing well on science exams, and I had the opportunity to meet, converse, and work with a lot of smart people on a range of topics.


Nelson Liu is a student at the University of Washington, where he works on NLP research with Noah Smith. He blogs occasionally at blog.nelsonliu.me.