Giving BERT an English exam

Open-cloze Exercises

Alison Davey
Analytics Vidhya
Sep 20, 2019 · 9 min read


Natural Language Processing (NLP) has advanced so rapidly in the last few months that simply by running a short Jupyter Notebook you can now get the answers to a grammar exercise in seconds. A typical exam exercise for language learners is open-cloze or gap-fill. Here a student has to find the missing word to complete a phrase. This is an ideal task for a language model; after all, you only need to correctly predict the next word. Spoiler: you can get over 80% on these exercises using an AI.

Cambridge English Qualifications

To explore using NLP on open-cloze exercises I used some graded exercises. The Common European Framework of Reference for Languages (CEFR) is an international standard for describing language ability. It grades language ability on a six-point scale, from A1 for beginners up to C2 for those who have mastered a language. Cambridge English Qualifications provides sample exam papers and teacher handbooks as resources for students preparing for their exams. These exams include open-cloze exercises for students at upper-intermediate level or higher (B2 First, C1 Advanced and C2 Proficiency). The task is too difficult for students with a lower level of English; students typically dislike this exercise because they feel lost, with no clues to help them find the missing word. The clues are, of course, in the surrounding text.

Forward-Looking Model (ULMFiT from fast.ai)

Given that a language model predicts possible next words, a first approach is to select the predicted next word with the highest probability. Take as an example the first sentence of sample paper 1 for the First exam:

‘I work __ a motorbike stunt rider — that is, I do tricks on my motorbike at shows.’ The missing word is ‘as’.

For ‘I work’, the language model predicts the most probable next word as ‘with’ (probability p=0.14) and ‘as’ only with p=0.03, so taking the predicted next word with the highest probability does not give the correct answer even in this simple example.

Decision Tree for ‘I work __ a’

What if we look forwards: does the text before the gap, plus the word predicted for the gap, then give the actual next word ‘a’ as a prediction? For ‘I work with’, it does, but only with probability p=0.04. ‘I work as’ predicts ‘a’ with p=0.78. Applying the rule of taking the candidate that gives the highest probability to the word that actually follows the gap proposes ‘as’ as the answer. So, in this example, testing whether the word predicted to fill the gap can be followed by the word that comes after the gap works well.
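Here is a minimal sketch of this forward-looking heuristic. It uses GPT-2 from the HuggingFace transformers library as a stand-in left-to-right language model rather than the ULMFiT model used in this article, so the probabilities it produces will not match the figures quoted above.

```python
# Sketch of the forward-looking heuristic: propose candidates for the gap,
# then keep the candidate that best predicts the word that actually follows.
# GPT-2 stands in here for the article's ULMFiT model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_probs(prefix, top_k=10):
    """Return the top_k (word, probability) pairs that could follow prefix."""
    ids = tokenizer.encode(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode(int(i)).strip(), p.item())
            for i, p in zip(top.indices, top.values)]

def fill_gap_forward(before, word_after_gap, top_k=10):
    """Score each candidate gap word by the probability it assigns to the
    word that actually follows the gap, and return the best candidate."""
    scored = []
    for candidate, _ in next_word_probs(before, top_k):
        follow = dict(next_word_probs(before + " " + candidate, top_k))
        scored.append((candidate, follow.get(word_after_gap, 0.0)))
    return max(scored, key=lambda s: s[1])

# 'I work __ a motorbike stunt rider ...'
print(fill_gap_forward("I work", "a"))
```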

This decision tree does, however, reveal a problem. Although less likely, ‘I work with a …’, ‘I work for a …’ and ‘I work on a …’ are all valid responses. It is not enough to simply look at the text before the gap and the first word after it. Testing this approach on the sample papers only reached about 45% correct. Looking at a few words before the gap and the first word after it is not enough; more context is needed.

Looking Left and Right

This is where BERT comes in. Bidirectional Encoder Representations from Transformers (BERT) is a language model that looks both to the left and to the right of a word to pre-train representations. All the heavy lifting has already been done for us, so we can use this model with pre-trained weights for our English exam exercise. Out-of-the-box, pre-trained BERT can be used to predict the missing words on the sample papers.

For a great explanation of just what BERT is, listen to the Data Skeptic podcast ‘BERT is magic’.
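Here is a minimal sketch of gap-filling with pre-trained BERT as a masked language model. It follows the standard HuggingFace transformers masked-LM recipe with bert-base-uncased; it is not the exact notebook linked below, but the idea is the same: replace the gap with the [MASK] token and read off BERT’s top predictions.

```python
# Sketch of open-cloze with BERT: mask the gap and take the top predictions.
# Assumes the standard HuggingFace transformers API and bert-base-uncased.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def predict_gap(text_with_gap, top_k=2):
    """Replace '__' with [MASK] and return BERT's top_k words for the gap."""
    text = text_with_gap.replace("__", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_positions[0]], dim=-1)
    top = torch.topk(probs, top_k)
    return [tokenizer.decode(int(i)) for i in top.indices]

sentence = ("I work __ a motorbike stunt rider - that is, "
            "I do tricks on my motorbike at shows.")
print(predict_gap(sentence))   # the article reports 'as' for this gap
```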

For each sample paper, an example is given and then there are eight gaps to fill in a text. In sample paper 1 for the First exam, shown below, the first word in each square bracket is the word predicted by BERT and the word or words after the colon are the accepted answers. The first square bracket is the example given to the students.

I work [ as : as ] a motorbike stunt rider — that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was [ where : where ] I first saw some guys doing motorbike stunts. I’d never seen anyone riding a motorbike using just the back wheel before and I was [ so : so ] impressed I went straight home and taught [ myself : myself ] to do the same. It wasn’t very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree [ in : in ] mechanical engineering; this helps me to look at the physics [ that : [‘which’, ‘that’] ] lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work [ on : [‘out’, ‘on’, ‘at’] ] every stunt I do. People often think that my work is very dangerous, but, apart [ from : from ] some minor mechanical problems happening occasionally during a stunt, nothing ever goes wrong. I never feel in [ any : any ] kind of danger because I’m very experienced.

BERT completes this exercise perfectly. The results on seven sample papers are:

B2 First: 1/2 on example phrases, 16/16 on test phrases
C1 Advanced: 1/2, 13/16
C2 Proficiency: 3/3, 17/24

It’s great that the model does so well (51/63 = 81%).

If, instead of asking for one prediction, we ask the model to offer two choices for the missing word and expect the student to select the correct one, then the results improve to 87% (55/63):

B2 First: 1/2 on example phrases, 16/16 on test phrases
C1 Advanced: 1/2, 13/16
C2 Proficiency: 3/3, 21/24
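As an illustration of this scoring, here is a tiny sketch that counts a gap as correct if BERT’s first prediction is an accepted answer, and separately if either of its top two predictions is. The handful of gaps below is hypothetical; the real sample papers contain 63 gaps in total.

```python
# Hypothetical scoring sketch: top-1 vs top-2 accuracy against accepted answers.
gaps = [
    {"predictions": ["as", "with"],            "accepted": ["as"]},
    {"predictions": ["capable", "incapable"],  "accepted": ["incapable"]},
    {"predictions": ["on", "out"],             "accepted": ["out", "on", "at"]},
]

top1 = sum(g["predictions"][0] in g["accepted"] for g in gaps)
top2 = sum(any(p in g["accepted"] for p in g["predictions"][:2]) for g in gaps)
print(f"top-1: {top1}/{len(gaps)}   top-2: {top2}/{len(gaps)}")
```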

Here are the four examples where the second prediction from BERT is the correct one, all from the same text about robots:

1. Although sophisticated [ [‘robots’, ‘enough’] : enough ] to assemble cars and assist during complex surgery, modern robots are
2. dumb automatons, [ [‘capable’, ‘incapable’] : incapable ] of striking up relationships with their human operators.
3. Engineers argue that, as robots begin to make [ [‘themselves’, ‘up’] : up ] a bigger part of society, they will need a way to interact with humans.
4. The big question is this: what does a synthetic companion need to have so that you want to engage [ [‘in’, ‘with’] : with ] it over a long period of time?

Even if there is no obvious way to tell a student to sometimes select BERT’s second choice, having the two choices and knowing that the language model favours the first option may be enough for a student to identify the correct answer.

‘Although sophisticated robots to assemble’ makes no sense, so the second choice is better. Talking about ‘dumb automatons’ is a negative sentiment so ‘incapable’ is a better choice, even though the Google N-gram Viewer shows that ‘capable of striking’ occurs more frequently than ‘incapable of striking’.

Arguably, ‘to make themselves a bigger part of society’ is grammatically valid, but ‘to make up a bigger part of society’ is better. If the language model had been fine-tuned on a body of Cambridge texts it might have picked up the strong preference for phrasal verbs. (Definition of a phrasal verb: an idiomatic phrase consisting of a verb and another element, typically either an adverb, as in break down, or a preposition, for example see to, or a combination of both, such as look down on.)

Although Google N-grams show that ‘engage in it’ occurs more frequently than ‘engage with it’, the correct expression here is to ‘engage with’ the ‘synthetic companion’.

That just leaves six test examples where the model fails to predict the correct word. Two occur in a text about mobile phones:

And [ [‘why’, ‘what’] : why ] should they want to? [ [‘after’, ‘above’] : after ] all, the ability to send and receive emails from a mobile device means they can stay in touch with colleagues, friends and family, whether they’re standing in a queue at the supermarket, downing a quick cup of coffee

1. in [ [‘business’, ‘planning’] : between ] meetings
2. or killing [ [‘someone’, ‘people’] : time ] before a flight.

The first error, proposing ‘in business meetings’ rather than ‘in between meetings’, is a mistake that even native speakers make. In the second error, we can see why AI gets such a bad name: instead of proposing ‘killing time’, the AI suggests ‘killing someone’ or, if you push it further, ‘killing people’. That’s AI for you: give it an English exam and it encourages murder! Again, this is a reflection of the training data. Had the model been fine-tuned on Cambridge’s style of idiomatic, phrasal-verb-rich, safe and bland language, which is necessary for international exams, it would probably not have made this suggestion.

Phones and computers have already shown the [ [‘extent’, ‘degree’] : [‘extent’, ‘degree’] ] to which people can develop relationships with inanimate electronic objects.

3. Looking further [ [‘on’, ‘further’] : [‘ahead’, ‘forward’] ],

Here I don’t know why the model proposes ‘Looking further further’; this does not even appear in Google’s N-grams. ‘Looking further ahead,’ is the best answer.

On the C1 Advanced sample paper there are three errors:

1. that life can feel very daunting [ [‘several’, ‘multiple’] : at ] times.
2. Apparently, many people faced [ [‘social’, ‘climate’] : [‘with’, ‘by’] ] change respond by considering two possible courses of action,
3. Something simple, [ [‘or’, ‘and’] : like ] taking another route to work

These last four examples show that there is still room for improvement.

Possible Use Cases

What is the purpose of this exercise? My main motivation was to use NLP in an area in which I have some experience. Seeing how well BERT does on the task shows that this code could help students work out how best to prepare for this exercise. It can also be useful for people preparing exercises of this style for students, by highlighting where there is ambiguity in the possible answers. Ambiguity at C2 Proficiency level is a desirable feature, since at this level students should be able to weigh up the possible answers and select the best one, safe in the knowledge that they are in the world of English exams and not the universe of language examples used to train BERT. Fortunately, this code will not help students to cheat in the Cambridge exams, since all electronic equipment is banned. You will even be asked to remove the label from your plastic water bottle.

It is amazing that it is now possible to do personal NLP projects using deep neural networks on a standard PC. The open-source availability of a wealth of resources, thanks to the work and compute of many smart people and big processing units, means this is now within reach of everybody.

The Technical Stuff

Initially I tried using the ULMFiT forward model from fast.ai, fine-tuned on MultiNLI sentences. This model has many fine qualities, especially for classification, but it is not best suited to predicting text.

The fast.ai NLP course from Rachel Thomas is an excellent introduction to NLP that makes the field more understandable and gives you a good grounding to go and explore uses of NLP for yourself.

The code I used with the pre-trained BERT model is available in a single, short Jupyter notebook running Python 3.

HuggingFace generously provided the pre-trained BERT model and advice on how to deploy the model.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

EDIT (25 September 2019): With AllenNLP Interpret announced today, you can now get BERT to do these exercises for you online at https://demo.allennlp.org/masked-lm
