Unsupervised Question Answering

How to train a model to answer questions when you have no annotated data

Kayo Yin
12 min read · Oct 17, 2019


Table of Contents


Generating the questions

1. Cloze Generation

  • Obtaining the context
  • Defining the answers
  • Obtaining cloze statements

2. Translating into natural questions

  • Identity mapping
  • Noisy clozes
  • Unsupervised Neural Machine Translation (UNMT)

Training the QA model

1. The XLNet model

2. Results


Question Answering

Question Answering models do exactly what the name suggests: given a paragraph of text and a question, the model looks for the answer in the paragraph. A subfield of Question Answering called Reading Comprehension is a rapidly progressing domain of Natural Language Processing. Indeed, several models have already surpassed human performance on the Stanford Question Answering Dataset (SQuAD).


Challenge of obtaining annotated data

These impressive results are made possible by the large amount of annotated data available in English. SQuAD, for instance, contains over 100,000 context-question-answer triplets. However, assembling such datasets requires significant human effort to determine the correct answers. As a result, companies face a major challenge in gathering enough relevant annotated data for their own use cases. What if we want a model to answer questions in another language? Or in a specific domain for which no annotated data exists?
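To make the idea of a context-question-answer triplet concrete, here is a minimal Python sketch of what one annotated example might look like. The `QAExample` class, its field names and the Eiffel Tower sentence are all illustrative assumptions, not taken from SQuAD itself. SQuAD's JSON nests the answer text and its character offset under an `answers` field, which is flattened here for readability.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One context-question-answer triplet (hypothetical, simplified layout)."""

    context: str       # paragraph in which the answer must be found
    question: str      # natural-language question about the paragraph
    answer_text: str   # the answer, copied verbatim from the context
    answer_start: int  # character offset of the answer inside the context

    def answer_span(self) -> str:
        """Slice the answer out of the context using the stored offset."""
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end]

    def is_consistent(self) -> bool:
        """Check that the offset really points at the answer text."""
        return self.answer_span() == self.answer_text


# Made-up example, for illustration only.
context = "The Eiffel Tower was completed in 1889 and is located in Paris."
example = QAExample(
    context=context,
    question="When was the Eiffel Tower completed?",
    answer_text="1889",
    answer_start=context.index("1889"),
)

print(example.answer_span(), example.is_consistent())
```

Every one of these fields has to be written or checked by a human annotator, which is where the cost of building a dataset like this comes from.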

Towards an unsupervised approach

Unsupervised and semi-supervised learning methods have led to drastic improvements in many NLP tasks. Language modelling, for instance, contributed to the…



Kayo Yin
PhD student at UC Berkeley researching AI. Now writing at kayoyin.github.io/blog