Corpus data (Part 1 — What is it, and what can we learn from it?)

Published in

A little more action research

5 min read · Apr 12, 2018

Corpus data may sound like something from a CSI series, but it’s not. It’s actually a collection of written or spoken language, which can be used for a variety of reasons, from helping to compile dictionaries, to providing insight into how language is actually used.

Now, if you are the type of person who believes that language comes from a strict set of grammar rules which people should all obey (a prescriptivist), then you’re probably wondering why on earth anyone would be interested in a collection of what people say. People are either following standard grammar rules, or they are making mistakes, right? Well, I hope it comes as a shock for you to realise that some of the earliest grammar rules that we know of were written for the Sanskrit language (presumably on palm leaves) around the 6th–5th century BCE, and they didn’t appear out of thin air, but rather, were created through the analysis of lexical lists, i.e. corpus data. This should be a refreshing reminder for prescriptivists that language preceded grammar rules, and not the other way around.

We have now swapped the palm leaves for computers, which is obviously a far more convenient way to store large lexical lists, but what is it about corpus data that continues to make it relevant? Haven’t we more or less figured out all the grammar rules and decided what all the words are by now? Well, not exactly. Because, whether the prescriptivists like it or not, language continues to evolve. And if language is evolving, then the descriptions of language must evolve too.

All very interesting, I hear you say. That must be heaps of fun for the writers of dictionaries and grammar textbooks, but it’s not really relevant to me. Here is why I disagree. Let’s take the example of must from the final line of the last paragraph. ‘[D]escriptions of language must evolve too.’ If you are an English language teacher, you will recognise that I’ve used the modal verb must to indicate a strong sense of obligation. You can probably think of several coursebooks that you’ve used for your lessons which state ‘must is used for strong obligation.’ It probably appeared on the page next to have to and maybe even said something about must being used more for internal obligation (I personally feel obligated to do something) and have to for external obligation (e.g. something your boss has told you to do).

One of my favourite grammar lessons is on modals of obligation and usually ends in a game where students are arguing over things they mustn’t and have to do in situations like being in a library, church, school, or on a nudist beach. Over time, I began to drop the internal/external bit from my teaching. I was finding that the examples I was using often didn’t feel very natural. In fact, I’d also recently started to have the suspicion that I didn’t use must anywhere near as much as I used have to, and I noticed my students didn’t either. I started asking them about it and forming theories: was it because they don’t have pure modal auxiliary verbs like must in their language, whereas they have an equivalent of have to; was it because have is one of the first words students learn and it feels more familiar to them; or is it a phonetic thing, is have to easier to say? Well, whatever the reason, it turns out that I was right to be suspicious (and I absolutely love being right about things, so I’m going to take the rest of this parenthesis just to savour the fact I was right… OK, I’m done).

The main man when it comes to analysing corpus data is Michael McCarthy, and here is what he and his colleagues had to say about must, based on an analysis of a spoken corpus.

[O]n average, only 5 percent of all its uses are connected with obligation…Another 5 percent are in expressions such as I must admit and I must say. But the overwhelming majority of uses of must are in the “predictive” statements such as that must be nice, you must be hungry, etc. (McCarthy, 2013, p.6)

Only 5 percent! My final externally assessed lesson on my advanced teaching course had bloody must for obligation in it. I’m glad I didn’t come across that statistic while writing my background essay, or I might have had a meltdown.

So what are the implications for teaching it? Well, that is for you to decide. Data is data. Personally, I think it’s clear that, as must is such a common English word (the 224th most common), we should definitely be teaching it, and it seems logical to teach its three common uses, but perhaps the frequency of these uses is important information for the students. Must may still be relevant to a lesson on modals of obligation, but we might not want to spend an inordinate amount of time forcing students to use it, or have them doing exercises where they have to decide whether they should be using must for internal obligation and have to for external obligation. Because that distinction (if it is even a thing) can only apply to a small fraction of the must we actually hear in spoken English.

The above analysis of must is a good example of why corpus data is relevant to our teaching. It comes from an informative little book called From Corpus to Coursebook, which you can download for free. McCarthy and others have also published their corpus data findings in many papers, and they have even created a blended course for students based on the Cambridge English Corpus.

But the ALMAR blog isn’t just about keeping you updated on current research related to English teaching, it’s about encouraging you to get your hands dirty and conduct your own research. Several corpora out there are free and easy to use and I think we — as teachers — should be doing this for ourselves. So…

Join me for part 2, where I present a how-to guide on searching corpus data and reveal the results of my corpus data-related poll.
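If you fancy a head start before then, here is a minimal Python sketch of the kind of counting a corpus search does for you. Treat it as a toy under stated assumptions: the sample sentences are ones I invented for illustration (they are not real corpus data), the function names are hypothetical, and the rough pattern matching is my own simplification rather than the method McCarthy and his colleagues used.

```python
import re
from collections import Counter

# Invented sample sentences, for illustration only. NOT real corpus data.
SAMPLE = [
    "You must be hungry after that walk.",
    "I must admit I was surprised.",
    "That must be nice.",
    "You must wear a helmet on site.",
    "I have to finish this report today.",
    "She has to leave early.",
    "We had to wait for an hour.",
]

MUST = re.compile(r"\bmust\b", re.IGNORECASE)
# Crude: also matches non-obligation uses such as "what I have to offer".
HAVE_TO = re.compile(r"\b(?:have|has|had|having) to\b", re.IGNORECASE)

# Hypothetical, very rough heuristics for the three uses of "must".
FIXED = re.compile(r"\bI must (?:admit|say)\b", re.IGNORECASE)
PREDICTIVE = re.compile(r"\bmust (?:be|have)\b", re.IGNORECASE)


def count_forms(sentences):
    """Count occurrences of 'must' and of 'have to' (any tense)."""
    counts = Counter()
    for sentence in sentences:
        counts["must"] += len(MUST.findall(sentence))
        counts["have to"] += len(HAVE_TO.findall(sentence))
    return counts


def classify_must(sentence):
    """Return a rough label for how 'must' is used, or None if absent."""
    if not MUST.search(sentence):
        return None
    if FIXED.search(sentence):
        return "fixed expression"
    if PREDICTIVE.search(sentence):
        return "predictive"
    return "other (possibly obligation)"


if __name__ == "__main__":
    print(count_forms(SAMPLE))
    labels = Counter(filter(None, (classify_must(s) for s in SAMPLE)))
    for label, n in labels.items():
        print(f"{label}: {n}")
```

A real corpus has millions of words and far messier sentences, so a pattern this crude will mislabel plenty of examples. It is only meant to show that the underlying idea (search, count, compare) is simple enough to try for yourself.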

Bibliography

Coward, H. G., Raja, K. K. and Potter, K. H. (1990) The Philosophy of the Grammarians, Volume 5. Motilal Banarsidass Publishers

McCarthy, M. (2013) From Corpus to Coursebook, Touchstone

Written by Scott Donald