Our mission at Glose is to make reading better, which starts by ensuring that readers understand what they read. In this post, I will present a system that automatically generates a comprehension test for any input text. The automatic part is particularly important given the size of our corpus (1M+ books), which makes writing tests manually impossible.
Text comprehension tests
A text comprehension test is a set of questions about a text that evaluates the reader’s comprehension of it. For example, for this paragraph of The Little Prince, which does not lack irony:
a text comprehension test could be:
- Which job did the author originally choose?
- What did the grown-ups advise him to do?
- Which job did he finally choose?
- Was what he had studied useful for this job?
However, generating such questions and correcting the answers is very challenging and still an open research problem. For a start, we decided to only generate a specific kind of test, called a cloze test.
Cloze tests. A cloze test is a text where some words are hidden. Usually, 4 answer propositions (1 correct, 3 wrong) are given for every hidden word. Here is an example:
Generating and correcting cloze tests is easier than generating and correcting other types of tests, where questions may be open-ended, while still giving a good evaluation of text comprehension. However, some difficulties remain:
- Which words to hide? Ideally, we would like to hide a set of words that is small and contains the most important words for text comprehension.
- Which distractors (i.e. relevant incorrect words) to propose? We can neither propose words that are obviously wrong, nor words that could be correct in the context (e.g. synonyms).
Now that we have explained the objective and the main difficulties, let us dive into the technical solution that we have developed.
Step 1: Which words to hide?
The first step in making a cloze test consists in selecting which words to hide. Here is the process we follow:
Step 1.a. We have decided to only hide common nouns because, along with verbs, they contain most of the meaning of a text.
Step 1.b. We give an importance score
i to every common noun, based on the assumption that:
The more difficult it is to guess a hidden word, the more important it is.
or to put it differently:
The easier it is to guess a word after hiding it, the less important it is.
In practice, we use BERT, a deep neural network developed by Google, to predict the nouns after hiding them (more precisely, we use this PyTorch implementation). BERT was in part trained to fill in cloze tests, which makes it an algorithm of choice for the task we are describing here. Let us see how we can use it to give an importance score to every common noun in this sentence:
s = “I love playing tennis with my cat.”
The first common noun is tennis. If we hide (or mask) it, the sentence becomes:
s_tennis = “I love playing [MASK] with my cat.”
Then, we feed s_tennis to BERT, which tries to predict the masked word. To do so, it outputs a prediction score p_w for every word w in its vocabulary:
The higher the score of a word, the more BERT believes it is the correct one. From these prediction scores, BERT’s misprediction of tennis can be evaluated with the formula (max p_w) - p_tennis. Hence, following our previous assumption, this defines the importance of tennis in the sentence:
i_tennis = (max p_w) - p_tennis = 3.205
The same process can be done for cat, the second common noun of the sentence, i.e.:
- Hiding (or masking) cat, which gives s_cat = “I love playing tennis with my [MASK].”
- Feeding s_cat to BERT and retrieving its prediction scores for every word in the vocabulary.
- Computing i_cat = (max p_w) - p_cat = 8.327.
In this example, a higher importance score is given to cat than to tennis because it is a much more unexpected word.
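To make the computation concrete, here is a minimal sketch of the importance score i_w = (max p_w) - p_w. A real BERT forward pass is replaced by a dictionary of made-up prediction scores; all numeric values below are hypothetical illustrations, not actual BERT outputs.

```python
def importance(scores: dict, hidden_word: str) -> float:
    """Misprediction-based importance of a masked word.

    `scores` maps every candidate word to the model's prediction score
    for the [MASK] position. The harder the hidden word is to guess,
    the larger the gap between the best score and its own score.
    """
    return max(scores.values()) - scores[hidden_word]

# Hypothetical scores for the [MASK] in "I love playing [MASK] with my cat."
scores_tennis = {"games": 9.1, "chess": 8.4, "tennis": 5.9, "ball": 7.2}
# Hypothetical scores for the [MASK] in "I love playing tennis with my [MASK]."
scores_cat = {"friends": 9.8, "brother": 9.0, "cat": 1.5, "coach": 8.1}

print(importance(scores_tennis, "tennis"))  # larger gap = more important word
print(importance(scores_cat, "cat"))        # cat is more unexpected than tennis
```

With a real model, the dictionaries would be replaced by the scores BERT outputs over its vocabulary; the formula itself is unchanged.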
Step 1.c. We remove common nouns that are not important enough, i.e. common nouns with an importance score lower than a fixed threshold i_min. Indeed, being able to guess them is not a good indicator of text understanding. We experimented with different values of i_min and ended up taking i_min = 2.5 because it gave us the best filtering.
Step 1.d. We only keep the X best common nouns, and finally hide them. Because we do not want too many words of the text to be hidden, we limit the ratio by taking X = number of words × r_max. Again, we experimented with different values of r_max and ended up taking r_max = 0.05 because it gave us the best results.
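Steps 1.c and 1.d can be sketched as follows. The noun scores are hypothetical, and rounding X to the nearest integer is our own assumption (the post only states X = number of words × r_max):

```python
I_MIN = 2.5   # minimum importance for a noun to be worth hiding (step 1.c)
R_MAX = 0.05  # at most ~5% of the text's words get hidden (step 1.d)

def select_words_to_hide(noun_scores, n_words_in_text,
                         i_min=I_MIN, r_max=R_MAX):
    """noun_scores: {common_noun: importance}; returns the nouns to hide."""
    # Step 1.c: drop nouns that are too easy to guess.
    candidates = {w: s for w, s in noun_scores.items() if s >= i_min}
    # Step 1.d: cap the number of hidden words at X = n_words * r_max
    # (rounding to the nearest integer is an assumption).
    x = round(n_words_in_text * r_max)
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:x]

# Hypothetical importance scores for the nouns of a 60-word paragraph.
scores = {"cat": 8.3, "tennis": 3.2, "day": 1.1, "garden": 2.7}
print(select_words_to_hide(scores, 60))  # -> ['cat', 'tennis', 'garden']
```

Here "day" is filtered out by i_min, and the 5% cap on a 60-word paragraph keeps at most 3 nouns.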
Step 2: Which distractors to propose?
The second step consists in proposing distractors for every hidden word. We want them to be neither trivially wrong, nor correct in the context (e.g. synonyms). Let us use the sentence s_tennis = "I love playing [MASK] with my cat." again to present how we propose distractors for the hidden word tennis.
Step 2.a. We only keep predictions that are not the correct word or its singular/plural form. If the correct word were among the distractors, the same word would appear twice in the propositions, making it obviously the answer. Likewise, if its singular or plural form were among the distractors, the answer would become obvious.
Step 2.b. We only keep predictions with the same casing as the correct word. The casing of a word is lowercase when all its letters are lowercased, title when the first letter is uppercased and the others lowercased, and other otherwise. We do this to avoid distractors that are trivially wrong (e.g. when the hidden word is the first word of a sentence, distractors that are not in title casing would stand out).
In our example, tennis is lowercased, so we will only keep distractors in this casing.
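The casing rule can be sketched with Python’s built-in string predicates; this is a minimal illustration, not our production code:

```python
def casing(word: str) -> str:
    """Classify a word's casing as 'lower', 'title', or 'other'."""
    if word.islower():
        return "lower"
    if word.istitle():
        return "title"
    return "other"

def same_casing(candidates, hidden_word):
    """Keep only candidate distractors whose casing matches the hidden word's."""
    return [c for c in candidates if casing(c) == casing(hidden_word)]

print(casing("tennis"))  # -> lower
print(casing("Tennis"))  # -> title
print(casing("TENNIS"))  # -> other
print(same_casing(["Chess", "chess", "BALL", "ball"], "tennis"))  # -> ['chess', 'ball']
```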
Step 2.c. We order the remaining predictions by their prediction scores in decreasing order.
In our example, the best remaining predictions when hiding tennis are:
Step 2.d. We keep the 3 best-scored predictions whose scores are sufficiently spaced apart, i.e. we take the prediction with the best score, p_1; then the best-scored prediction p_2 such that p_2 < p_1 - p_gap; then the best-scored prediction p_3 such that p_3 < p_2 - p_gap. Spacing predictions prevents having distractors that are synonyms, because synonyms usually get almost the same prediction score. It also helps to have distractors with a higher diversity of meaning. We experimented with different values of p_gap and ended up taking p_gap = 2 because it gave the best results.
In our example, the 3 predictions with the best scores spaced by 2 are:
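The spacing procedure can be sketched as a greedy pass over the predictions sorted by decreasing score. The words and scores below are hypothetical stand-ins for BERT’s predictions:

```python
P_GAP = 2.0  # minimum score gap between two consecutive distractors (step 2.d)

def pick_spaced_distractors(predictions, n=3, p_gap=P_GAP):
    """predictions: list of (word, score) sorted by decreasing score.

    Greedily keeps the best-scored predictions while enforcing a gap of
    at least p_gap between consecutive picks, to avoid near-synonyms.
    """
    picked = []
    for word, score in predictions:
        if not picked or score < picked[-1][1] - p_gap:
            picked.append((word, score))
        if len(picked) == n:
            break
    return [w for w, _ in picked]

# Hypothetical predictions for "I love playing [MASK] with my cat."
preds = [("games", 9.1), ("chess", 8.9), ("football", 6.8),
         ("cards", 6.5), ("music", 4.0), ("hide", 3.9)]
print(pick_spaced_distractors(preds))  # -> ['games', 'football', 'music']
```

Note how "chess" and "cards" are skipped: their scores sit within p_gap of an already-picked word, which is exactly the situation where near-synonyms would slip in.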
To sum up, the process we follow to generate cloze tests is two-fold:
- Selecting which words to hide. The main idea is to assume that the more difficult it is to guess a word after hiding it, the more important it is.
- Proposing distractors for every hidden word. The main idea is to take the best-scored predictions while keeping their scores sufficiently spaced apart.
This process leads to cloze tests that are satisfactory, although not optimal. Here are possible improvements:
- A new deep neural network, XLNet, developed by researchers from CMU and Google Brain, was released a month ago. It outperforms BERT on numerous tasks, including cloze test infilling. Replacing BERT with XLNet could lead to more relevant predictions. However, XLNet is only available for English, unlike BERT, which has a multilingual version.
- A new paper from Facebook researchers presents a way to turn a cloze test into a series of questions. Adding it to our process could lead to an improved version of cloze tests where hidden words are replaced by questions.