Corpus data (Part 3: Advanced settings & classroom tools)

Scott Donald
A little more action research
18 min read · Sep 25, 2018

By Scott Donald

In the first part of this series, I introduced corpus data and highlighted its importance. So if you don’t know anything about corpus data, then click here to see what it’s all about. In my second article, I revealed the results of my survey on corpus data and provided a basic guide on how to use it (using sausages). While both of these articles included new insights into corpus data, they were primarily written for people new to the topic. This article is for those who know the basics, but want some ideas on how to use corpus data when planning lessons or in the classroom. Welcome to corpus data — advanced settings.

There is no shortage of corpora out there on the internet, but some of the websites hosting the data look like they were designed back when Windows 95 was cutting-edge technology. Their pale pink webpages and bare HTML remind me of websites I used to visit in the 90s to read dirty jokes or find cheats for my computer games (because that’s about all that was on the internet back then). Bearing this in mind, I’m going to spare you the boring stuff, and instead prioritise corpora and related tools which are clear, user-friendly, and free to access.

To ensure you are getting exactly what you need, I’ve split the article into two sections. In Back to BYU, I go further than the basic search tools covered in my previous article and explore some of the other features which I think are useful for language teachers. And in Related resources, I provide links and descriptions of other practical tools connected with corpora. This part should be more accessible to those who haven’t read the previous articles.

Essentially, the first section is the meaty turkey. The second is the tasty trimmings. Bon appétit!

Back to BYU (advanced settings)

In my previous demonstration, I used the BYU interface to scour the British National Corpus (BNC), so let’s be fair to my American friends and have a go on the Corpus of Contemporary American English (COCA).

This time, when I reach the main page I’m going to select compare, rather than list. I’m going to again stick in my favourite keyword sausage, and underneath, I’ll add its breakfast-rival bacon.

I hit compare and I can reveal that sausage is less common in American English than bacon (0.52:1.90). I double-checked a couple more corpora, and they preferred bacon to sausage too. So it seems people prefer talking and writing about bacon. As a sausage-lover, I’m going to assume this is a mistake, and that references to Francis Bacon have skewed the results. So let’s see which one collocates better with the word tasty. If I were looking at one keyword, I could click on the collocates tab, but as I’m looking at two keywords and their collocates, I’m going to stay on compare.

  • I clarify that I’m referring to sausage and bacon as nouns (not really necessary here, but it would be with words that have different forms).
  • I write tasty, the specific collocation I’m looking for, but I could also be more general and add N or ADJ here (there’s a help button for these codes).
  • I increase the position to 9, because there are fewer options if we only select 1, i.e. we only get examples where tasty and sausage are right next to each other. Like this one:

[Y]ou won’t want to think about big brown eyes and a certain song about you-know-who’s very shiny nose when you are munching this tasty sausage.

Poor Rudolph! Anyway, here are our results with the position set at 9:

Yes! The green box indicates that sausage collocates more with tasty than bacon does in this American English corpus. A victory for sausage supporters! The numbers also give you an idea of the frequency. Notice how tasty is in square brackets? That’s because I selected the options tab, then group by, then lemmas. This means I’ll also get results for related forms such as tastier and tastiest. The other tabs have other helpful functions, e.g. in sections we can filter the results by genre (spoken, fiction, newspapers) or by year of publication, which can be useful for spotting language trends. Similarly, Chart lets us compare how a single word is distributed across genres.
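For anyone curious about what the position setting is doing, here is a rough Python sketch of the idea. It is an illustration only, not how the BYU interface works internally: the function names are made up for this example, the first sample sentence is an abridged version of the reindeer quote above, and the other two are invented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_collocates(tokens, node, collocate, span):
    """Count how often `collocate` appears within `span` tokens either side of `node`."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words to the left and right of the node word, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        hits += window.count(collocate)
    return hits

def total_hits(sentences, node, collocate, span):
    """Sum the counts sentence by sentence so a window never crosses a full stop."""
    return sum(count_collocates(tokenize(s), node, collocate, span) for s in sentences)

# Hypothetical mini-corpus: sentence 1 is abridged from the quote above,
# sentences 2 and 3 are invented for illustration.
SENTENCES = [
    "You won't want to think about big brown eyes when you are munching this tasty sausage.",
    "That sausage we had at the market was really very tasty.",
    "The bacon was crisp.",
]

for span in (1, 9):
    print("span", span,
          "| sausage:", total_hits(SENTENCES, "sausage", "tasty", span),
          "| bacon:", total_hits(SENTENCES, "bacon", "tasty", span))
```

With a span of 1 only the first sentence counts, because tasty sits right next to sausage. Widening the span to 9 also picks up the second sentence, which is why a larger position value returns more examples.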

Apparently sausage isn’t a very academic topic - says who?

KWIC (Keyword in Context) gives some nice examples, in the form of concordance lines, of the patterns in which a word is used:

Notice how the different parts of speech are highlighted with different colours? This is practical information for students about how they can use a word in a given sentence, or even for teachers who aren’t sure. Funnily enough, language teachers aren’t the living dictionaries that our students sometimes expect us to be. We don’t actually have in our heads a complete knowledge of *all* the words in English and how exactly they are used.
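If you have some text of your own (a reading from the coursebook, say), concordance lines are simple enough to produce yourself. The sketch below is a minimal, assumption-heavy version: the kwic function is my own naming, it does no part-of-speech colouring, and the sample text is invented around the recommend example.

```python
import re

def kwic(text, keyword, width=20):
    """Return concordance lines with the keyword lined up in a central column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both sides to a fixed width so every keyword starts in the same column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text for illustration only.
TEXT = ("He recommended the fish. She recommended trying the soup. "
        "They recommended that we book early.")

for line in kwic(TEXT, "recommended"):
    print(line)
```

Lining the keyword up in one column is what makes the patterns to its right (noun, gerund, that-clause) easy to spot at a glance.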

Here’s a common situation. A student says ‘He recommended me to try the fish’. And you say ‘No, you can just say he recommended the fish’. The student nods, maybe writes it down, and then asks you ‘Always, after recommend is a noun?’ and your brain fizzles with all the possible combinations and anticipates your student’s follow-up question about which of those combinations sound the most natural, or which are wrong. It’s a minefield.

Personally, I tend to teach recommend+gerund and I do this for a couple of reasons:

  1. It’s the most natural sounding to my ears (a good starting point).
  2. It’s one students often avoid in favour of less natural sounding options.

However, the teacher-on-the-spot can now do some quick research using the tools above and dazzle their student with something more than their gut feeling about how words are used.

This gives a whole list of words.

The results show that ‘that’ gets pole position, likely because it occurs more than any one specific verb. Then, aside from 3 and 4, we have a clear winner: the gerund (which can also follow that). The top 100 list of collocates continues in this way, which means I can tell my inquisitive student with confidence that yes, it might be possible to follow recommend with a subject/object pronoun, but it is rarer and only used in more specific circumstances than the gerund (which I can check by clicking for examples).

In a nutshell, you can use these tools to back up your gut feelings with research. Trust me — students value certainty.

A more modern version

Now, as every good teacher should know, we can also learn from our students. For example, I recently heard my young students talking about creepers. I correctly assumed they weren’t discussing vegetation or anything more sinister, so let’s see what corpus data can tell us about it. To do this, we will have to ignore these carefully compiled American and British corpora of written and spoken texts which cover several decades, and instead find something more modern.

will take us to other corpora accessible using the BYU interface. If we click on the iWeb corpus, we find this description under the overview tab

The iWeb corpus contains 14 billion words (about 25 times the size of COCA) in 22 million web pages. It is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English.

Sounds like a good place to look for a creeper. A quick search reveals that a creeper is a character from the popular computer game Minecraft. But if I really want to be down wif da kidz, and use this word in a natural way, I could search for adj creeper to discover that one common adjective with creeper is pesky. As in:

Build houses to keep those pesky creepers out!

A Google Images search which did very little to further clarify what a ‘creeper’ is.

Another excellent feature of the iWeb corpus is that they have a special frequency list of the top 60,000 words. We can do some unique things with this.

Searching the iWeb corpus for sausage tells me that it ranks 6788 in the 60,000 most frequent words. (Indicating that it’s not just me who has an unhealthy fascination with the word.)

The next two boxes are also really interesting and can make you or your students instant lyrical geniuses: we can search words that rhyme with what we are searching for. (I had to switch to approx rather than exact to get rhymes for sausage and bacon: forage and maiden!) And/or we can search by syllables, double clicking to search for word stress.

Searching for four syllable words ending in -ation, with the stress on the third syllable

And if we simply click on search for word, we get loads of details about the word.

We get the definition, we can click underneath to be taken to some pictures, or click to hear it pronounced in different ways. Then we can select a language we want to translate it into, and choose from a range of different translators. Everything on the right and at the bottom is pretty self-explanatory, and underneath you have the top websites which use the word sausage, and some more concordance lines showing the word in use. Also, and very importantly, I’d say these pages are fairly student-friendly.

The benefits of incorporating the BYU interface into your teaching are numerous. In particular, it could help you avoid those awkward moments I’ve mentioned above when students are attempting to use you as an on-the-spot dictionary and you’re forced to try and think of all the contexts in which the word is used and all its collocates. BYU and its corpora are ideal for this purpose. All you need to decide is how and when to use it with your students.

You could pre-empt the on-the-spot feeling by looking up words while preparing the lesson, e.g. from a text that you think students will struggle with. Or if this isn’t possible, and a student has hit you with some emergent language that you hadn’t planned to teach, you could discreetly look up a word while your students are busy (I do this quite often). But of course, there’s no need to be using it secretly: you can show the students the BYU tool and encourage some learner autonomy by getting them to use it as a resource in class or in their own time.

Another practical use of the BYU is for gapfills. If you have ever tried to make one of these exercises yourself, you will probably be familiar with your brain’s tendency to come up with unnatural and/or boring sentences:

John doesn’t like __________ football. He prefers watching it.

Why not use the concordance lines for your gapfills? Simple, quick, and students might find it motivating to know that they are examples of real written or spoken English. (Obviously, you can use other corpora tools aside from the BYU for this as well; Netspeak is a particularly easy-to-use one.)
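If you are copying a batch of concordance lines into a worksheet, the blanking-out can be automated. Here is a small sketch, assuming you have already pasted the sentences into Python yourself. The make_gapfill function is hypothetical (my own naming), and the example sentence is the pesky creepers line from the iWeb search above.

```python
import re

BLANK = "_" * 10  # same length of gap as in the football example above

def make_gapfill(sentence, target, blank=BLANK):
    """Blank out the first whole-word occurrence of `target`.

    Returns a (gapped sentence, answer) pair for the answer key.
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", flags=re.IGNORECASE)
    match = pattern.search(sentence)
    if match is None:
        raise ValueError(f"{target!r} not found in sentence")
    gapped = sentence[:match.start()] + blank + sentence[match.end():]
    return gapped, match.group(0)

question, answer = make_gapfill("Build houses to keep those pesky creepers out!", "pesky")
print(question)
print("Answer:", answer)
```

Keeping the answer alongside the gapped sentence means the answer key writes itself.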

Teachers can also use the BYU tools to support their own knowledge of how language is used and add to existing language taught in coursebooks. In my most recent back to school lesson, I’ve been getting the students to use modal verbs to make up rules for the classroom, which is pretty standard stuff. However, I’ve got a couple of C1 classes and I was reluctant to change the entire lesson just because have to and must are a bit basic for advanced students. So I asked myself if there were any other modal/semi-modal expressions that I might use and which I could shoehorn into the lesson, and I came up with wouldn’t dare and needn’t bother. To check these weren’t just quirky little phrases specific to me, I asked a few friends if they’d say them. But to really feel confident about their usefulness, I decided to check the collocations using BYU, and while they were nowhere near as common as must and have to, I got decent hits for both these expressions. Taking the time to search for these expressions saved me time in the long run, and was rewarding in its own right when I got fun examples back from students like: I wouldn’t dare go skinny-dipping.

Thus concludes my ideas for the BYU, but there are of course other features of the tool and the possibility of many other ways we can use it in teaching. I’d love to hear any others that you have tried, but first let’s have a look at what else is out there.

Related resources (and classroom tools)

Google

Is there anything this internet giant cannot do? I’m the type of person who googles Facebook because I’m too lazy to type the wwws and the .com. I also use Google instead of dictionaries, often because it’s quicker than going to a specific online dictionary’s website. Sure, the results aren’t always as helpful in the classroom as, say, the Cambridge Learner’s Dictionary, but there are some other interesting features.

To see these, we just need to search Google by typing something like creeper definition:

So we get: some phonetics; an audio sample in received pronunciation; the word type(s); and four definitions. One of these definitions was new to me (4!?), but notice that Google is missing our trendy Minecraft definition of creeper (see BYU, above). We also have the option to quickly translate it, which is handy, but as you can see with a word like creeper, Google and its translate software often fall short of other more detailed online dictionaries. The graph of use over time is nice too, a very visual way of displaying to students how words have fallen in and out of fashion. This is good for a word like quarrel, which students often find in classic literature, but which has fallen out of favour. A final feature of Google’s searches is that they often give you the etymology of a word too, something the geekier students and teachers might find interesting.

Speaking of geek, let’s take it and search for it in Google’s actual corpora tool, Google Books. The corpus was also created by the BYU, and has many of the same functions. Its corpus is made from the millions of written texts available through Google Books. You can also choose from British English, American English or Spanish corpora. Here are the American English results for geek, which, unlike quarrel, has become more popular in recent years.

Sorry, nerds…

There are even more options for different languages available through Google’s Ngram viewer. This handy tool allows you to quickly compare different words like geek and nerd.

However, Google, despite its mighty power, also has its limitations: it doesn’t have the examples from spoken English like other corpora we’ve looked at, or the examples from websites like the iWeb corpus. What about other online dictionaries then?

Online dictionaries

Dictionaries are probably deserving of their own article, but as many respondents to the survey mentioned them, I think I need to address them here. There are some fantastic online dictionaries out there. In Spain, where I teach, Wordreference seems to be the most popular among students. It’s got plenty of great features, including the ability to listen to the Scottish/Irish/Yorkshire/Jamaican/US Southern pronunciation of a word, but a click on in context will take you to Google News, which just isn’t the same as corpus data.

There are other areas where Wordreference doesn’t quite measure up to the BYU too, and the same goes for other online dictionaries like Linguee and Glosbe. They have some great features and user-friendly pages, and I like the fact that they pull together information from all different resources, but there are issues: Wiki entries appear quite often, as do warnings about sources that are not reviewed or checked. I’m not trying to put people off using these dictionaries, but I’d recommend using them *as well as* corpus data. I’d personally like to see teachers get their hands a bit dirtier, and use corpora and related tools to probe a bit further than these online dictionaries.

Wordcount

If Google’s tools and online dictionaries aren’t snappy enough for you, why don’t you try Wordcount, a tool I found through the YouTube science series Vsauce and its video on The Zipf Mystery (an interesting phenomenon with its own implications for language learning). Wordcount is a minimalist tool which uses BNC data to quickly and sleekly tell you how common a word is in the English language by giving you its rank. I featured it on a recent post on the ALMAR Facebook group and suggested some fun comparisons you could quickly make, e.g. your name and your partner’s name (3032 vs. 13849: I win!). But for the classroom, it could again be used to show which words have fallen out of favour. Or, my particular favourite, it can show which words are extremely rare. This often comes up when practising Word Formation (FCE/CAE/CPE part 3). I have various classroom games which invariably lead to students trying their luck with really obscure words and affixes, e.g. disagreeably.

A quick search on Wordcount tells us that, although disagreeably certainly exists in the corpus, it’s ranked 75,473 in the list of 80,000 words. Useful for the students to know? I’m not convinced. Maybe for a C2 student, but for lower levels, or for the purposes of the game, we could simply choose to disqualify anything ranked beyond the 50,000 mark.

Fluiddata

Not strictly corpora, but definitely of the same ilk, Fluiddata is an advanced search tool which uses natural language processing techniques to search through podcasts for data. For our purposes, this can be used to find specific words and phrases being used in context. Take my wouldn’t dare example (from BYU, above), put it into their search bar, and we get 13,168 results. The first of these is from a fitness podcast by Coach Becks, a Geordie (from Newcastle, England).

We can listen to the whole podcast, or simply hit the skip button to be taken to the exact moment the phrase is used.

The main uses of Fluiddata for me, other than just giving the word in a real-life context, would be, firstly, to expose students to other varieties of English. I believe this is something that we should be doing more generally, i.e. making students aware of differences in things like accents. My Scottish pronunciation of dare sounds nothing like Coach Becks’! Secondly, and more specifically, whether you’re a ‘native’ or ‘non-native’ you are quite likely to use phonological features which differ from ‘standard’ forms of pronunciation (like RP). This can be an asset to your teaching, but it doesn’t mean you should ignore RP, or any other dialects that your students are likely to come across. In fact, in most cases, it is futile to try and predict what accents your learners will come across in the future, whether through real-life encounters or in exam listening texts. I think, therefore, that we should be giving them as much varied input as possible — like Coach Becks’ Geordie pronunciation of dare.

Final thoughts on Englishes, E-grammar and Spoken Grammar

As well as corpora allowing us to access American English and British English, we also have other popular forms of ‘native’ English corpora like Australian English. Who knows, maybe in the future we will have corpora for other Englishes like Spanglish, Denglisch or Chenglish. For now, however, there are other ‘Englishes’ which perhaps deserve more of our attention, namely e-grammar and spoken grammar.

The internet has provided an unprecedented shift back to written English through emails and instant messaging services. Consider for a moment just how much of your day you spend sending written messages to people compared to what would have been the norm 20 years ago. Then think about how different WhatsApp English is to letter writing (an almost extinct practice for many people): the words you use, the punctuation you use, the emojis, the dropping of pronouns. In many ways, your e-grammar is more similar to your spoken English. McCarthy mentions e-grammar in the talk (featured in my previous article), but a quick look online or through coursebooks for material on ‘e-grammar’ will leave you with very little information. You might get a few more hits for a similar term, netiquette, but not many, and few of these will be related to language teaching. Why aren’t we teaching our students this? Has technology moved too quickly for material writers? Perhaps…

We are going to need something like the iWeb corpus to help us understand how language has evolved online, and how it will continue to do so.

The other English that has been highlighted by corpus data is spoken grammar. It has been around a lot longer than e-grammar (millennia, in fact), but its importance was properly established around 20 years ago by Carter and McCarthy in their 1995 article. In the article, the authors discuss their early findings on spoken corpus data and what they had revealed: that the difference between what we say and what we write is much bigger than we had imagined. This is something I briefly covered in my first article regarding the modal verb must (but this was just scratching the surface). Since their original article, McCarthy, Carter and others have continued to explore what this difference between speaking and writing really means. All this despite some significant pushback from the grumpier ends of academia: prescriptivists and traditionalists unhappy about allowing language teaching to be driven by what people are actually saying (perish the thought!). But as well as the aspects of ‘traditional grammar’ which McCarthy suggests coursebooks are ignoring, his work also stresses the dialogic process of conversation, i.e. how we work together with other speakers in order to create meaning by finishing each other’s sentences (something I’d say we also find in our e-grammar).

In fairness to materials writers, they have done a better job of incorporating spoken grammar into their coursebooks than they have with e-grammar. Books like the Speakout series have pages with language which better resemble spoken English, e.g. through fillers and back-channeling like really, not at all, no problem. But during a recent email exchange and webinar with McCarthy, I asked him if he thought that coursebooks weren’t going far enough to utilise corpora (aside from his own series, of course) and he agreed.

My hope is that by raising awareness of corpora and tools related to them, teachers won’t need to wait around for coursebook materials to provide realistic descriptions of language. Through these tools we can get our hands dirty and find out more about real language ourselves and start using it to better plan our lessons and help our students in our class. We can even use corpora research and tools to develop our own activities. Here is an example of some activities I adapted from an article on spoken grammar by Amanda Hilliard. It focuses on the use of fillers like I see, uh-huh, er as well as several other important features (answer key and teaching notes included). For those who came to my ALMAR talk, it’s the one I featured in the session.

Here’s the pdf. If you use it in class, let me know how it goes!

I hope you’ve found this ALMAR series on corpus data useful: let me know by clapping, commenting, or following the blog on Facebook. Until next time, keep your ears to the ground, because if the webinars, conference talks and recently published books are anything to go by, I’d say the rise of the corpus is well on its way.

Thanks

Thanks again to the people who responded to the survey, and my friends/colleagues for their support, ideas and resources. Thanks also to McCarthy for his friendly advice and suggestions for additional materials.

Bibliography and further reading

Carter, R. & McCarthy, M. (1995). Grammar and Spoken Language. Applied Linguistics, 16 (2), 141–158.

Kilgarriff, A. (12/3/2014) British Council: Corpora in English Language Teaching. Available at: https://www.britishcouncil.org/voices-magazine/corpora-english-language-teaching

McCarten, J. & McCarthy, M., Edited by Jones, C. (2018) Practice in Second Language Learning. Cambridge: CUP *Recommended to me by McCarthy*

McEnery, T. & Xiao, R. (2010) In: Handbook of Research in Second Language Teaching and Learning. London & New York : Routledge p. 364–380. 17. Available at: http://www.lancaster.ac.uk/fass/projects/corpus/ZJU/xpapers/McEnery_Xiao_teaching.PDF

O’Keeffe, A. & McCarthy, M. (2010) The Routledge Handbook of Corpus Linguistics. London: Routledge

O’Keeffe, A., McCarthy, M. & Carter, R. (2007) From Corpus to Classroom. Cambridge: CUP *Particularly useful for teachers*

Timmis, I., (2015) Corpus Linguistics for ELT. London: Routledge *Recommended to me by McCarthy*

https://corpus.byu.edu/

http://www.corpora4learning.net/resources/materials.html

www.fluiddata.com

www.wordcount.org

http://www.netspeak.org/#i

https://www.wordreference.com

https://www.linguee.com/

https://glosbe.com

https://www.cambridge.org/us/cambridgeenglish/catalog/adult-courses/touchstone

googlebooks.byu.edu

https://books.google.com/ngrams

Scott Donald
A little more action research

EFL teacher and CELTA trainer, always eager to learn. His main motivations are his love of teaching, training and stealing other people’s ideas.