Corpus data (Part 2 — How to use corpus data)

Published in A little more action research, Apr 19, 2018

In the last article, I introduced corpus data and gave some insight into why it can be useful for English teachers. In this article, I’m going to give you a step-by-step guide to using corpus data, as well as revealing the results of a survey I conducted on the topic.

For the purposes of the guide, I’ve chosen the British National Corpus (BNC). (Fear not, American friends, you have your own version, full of all your favourite incorrect spellings and bastardisations.) I’m also going to use the Brigham Young University interface to search the corpus, because it’s the tool I’m most familiar with.

In order to demonstrate the search tools, I needed a word and I’ve chosen sausage. Why? Firstly, sausages are delicious. And the other reason is simply that it was the first word to come into my head when I realised I needed a random word (I wonder what a psychologist would say about that…)

Let’s start with a simple search for sausage. I type it into the BNC webpage and hit ‘Find matching strings’.

Lo and behold, I receive two sausage strings! Sausage and sausages.

Right away I discover something about the word: according to the corpus, the singular form is slightly more common than the plural form. So let’s find out more information about the singular form. If we click on the sausage link, we get a list of 497 examples of the word being used in context. Here are the first 6:

The list looks a bit complex at first, but it’s easily broken down. If we look at number 2 on our list, we can see a code, ‘FY0’, in the second column, which relates to the source of the example; in this case, the example is from the Nottingham Oral History Project. The third column indicates the genre: ‘S’ refers to spoken English and ‘W’ refers to written English. The ABC is just an organisational tool, so we can ignore that.
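A list like this is often called a concordance, or a keyword-in-context (KWIC) view. For anyone curious about how such a list can be built, here is a minimal sketch in Python. It is not how the BYU interface works internally; the `kwic` function and the sample text are invented purely for illustration, and the sample is not taken from the BNC.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line for every whole-word match."""
    # \b stops 'sausage' from matching inside 'sausages'
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on either side of the match
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text (an assumption for this sketch, not corpus data)
sample = (
    "We had sausage and mash on Fridays. The butcher sold sausages by the pound, "
    "but a sausage roll cost extra."
)

for line in kwic(sample, "sausage"):
    print(line)
```

Searching for the singular and the plural separately, as the corpus interface does, gives a separate list for each form.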

The second quote begins by talking about cod, which feels pretty random, but this randomness is to be expected because we are seeing a small stretch of language taken out of context. The context is clearly a spoken one (we don’t need the ‘S’ to tell us that), as it contains some typical features of spoken English, like er and oh. It also has an (unclear) utterance where the person transcribing the quote to written English has been unable to guess what has been said. This can be a surprisingly common occurrence when transcribing natural speech.

If we click on ‘S_Interview_oral_history’, we get some more context, which helps us understand the utterance a bit more.

Occasionally it can still be tricky to get the gist of what’s being said, but I think we can agree here that a woman is talking about her past. She seems to be recalling trips into town to buy food with her savings, before the interviewer moves her on to a discussion about her marriage. As this is an interview from the Nottingham Oral History Project, we should not be surprised to see examples of British English ‘tuppence’ and the much-loved British cuisine, ‘spotted dick’. Also, judging by the price of fish (tuppence), this interview happened quite a while ago. We can see further features of spoken English, including dashes, which may indicate the speaker has started a word, but not finished it (other dashes may be for the purposes of anonymity, i.e. we don’t want everyone knowing the location of this bargain cod).

So let’s summarise what we have learned from the corpus:

Sausage is a fairly common English word. (954 entries)
It’s a countable noun with a singular form which is slightly more common than its plural form.
People use the word in a variety of spoken and written contexts, mostly about food. Ticking the box labelled ‘sections’ when we search and doing a quick bit of arithmetic reveals a speaking to writing ratio of about 1:4. (It’s important to note that around 90% of the corpus data is written, due to the difficulty involved in obtaining spoken data. So while we can’t take this as the actual ratio of how the word is used, it does allow us to make valid comparisons.)
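The ‘quick bit of arithmetic’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit counts below are made-up round numbers chosen to give a 1:4 ratio, not the real BNC figures, and the section sizes are approximations based on the BNC containing roughly 100 million words, about 90% of them written. The `per_million` helper is a hypothetical name. The second half of the sketch goes one step further than the article and shows one common way of allowing for the uneven section sizes, by expressing each count per million words.

```python
def per_million(count, section_size):
    """Express a raw hit count as hits per million words of a section."""
    return count * 1_000_000 / section_size

# Approximate section sizes (assumption: ~100 million words, ~90% written)
SPOKEN_WORDS = 10_000_000
WRITTEN_WORDS = 90_000_000

# Hypothetical hit counts, chosen only to reproduce a 1:4 ratio
spoken_hits = 200
written_hits = 800

# Raw ratio: how many written hits there are for each spoken hit
raw_ratio = written_hits / spoken_hits

# Normalised rates: these take the size of each section into account
spoken_rate = per_million(spoken_hits, SPOKEN_WORDS)
written_rate = per_million(written_hits, WRITTEN_WORDS)

print(f"Raw speaking to writing ratio: 1:{raw_ratio:.0f}")
print(f"Spoken:  {spoken_rate:.1f} hits per million words")
print(f"Written: {written_rate:.1f} hits per million words")
```

With these invented numbers, the word turns up four times as often in the written section in raw terms, but more often per million words in the spoken section. That is why the raw ratio is best used for comparing one word against another rather than as a picture of actual usage.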

All of this is probably rather uninspiring to proficient English-speaking sausage experts such as yourselves. So let’s replace sausage with a more useful word, one we can imagine might be of interest to a student learning English. What about the word ‘alright’, which the woman used in the interview to describe the quality of something? A search in the corpus data reveals:

Alright is a very common English word. (8315)
Whereas sausage was a nice solid noun, alright can be used as an adjective (as above), an adverb (‘can you hear alright?’), and an exclamation (‘Alright, yeah, ok.’).
Unlike sausage, it is used far more in speaking than in writing, at around 24:1. It is also used in a variety of different contexts: to say that the quality of something is satisfactory, or that something is done in a satisfactory way; to say something is permissible; to emphasise certainty; or to ask for agreement.

Now we are getting somewhere and can see some of the potential benefits of using corpus data. It’s easy to imagine students who would have a receptive understanding of the word ‘alright’, but how often do they use the word? Do they use it in the various ways described above? Perhaps a dictionary, or Google, could give us much of the above information, but remember that these results are real-life examples of spoken or written English. We can see them contextualised beyond just the sentence level. Basic dictionaries don’t offer this.

Why not just google it? Well, Google is used by non-proficient English speakers. It’s also used by kidz wot mite uz kool spellingz like ‘alrite’. I’m not suggesting that either of those is a problem; alright has already undergone a spelling change from all right (and the corpus data suggests that the one-word, single-l spelling is now more common, by about 4:3). But we want to be careful when selecting what we choose to teach our students. They need practical information about which spelling is more common, and corpus data can provide this. For example, given that 4:3 ratio, I would probably teach my students ‘alright’ and ‘all right’, but not ‘alrite’, which has no entry. Considering how infrequently we write the word, perhaps the spelling doesn’t really matter. Even when we are writing it, we are doing so in informal contexts, probably using instant messaging, where there is less pressure to conform to spelling rules.

Another benefit of corpora is that they include spoken English (which search engines like Google currently don’t). People often fail to appreciate how important this is and just how different spoken English and written English are. The ers and ohs from our interviewee are just one small example of these differences. Corpus data has revealed even more significant differences, which might suggest that there are changes to be made to how we teach the speaking skill. After the last article, Anna Blas posted this great McCarthy video on the ALMAR Facebook group, where he gives some nice examples of this, as well as introducing the term ‘e-grammar’ to refer to the unique way we talk to each other online.

In the next article, I’m going to look at some of these differences, along with some more discussion about dictionaries, search engines, and tools related to corpus data. We will also look at practical applications for these in the classroom. Remember, here we have only looked at the very basic use of one corpus (the BNC). There are many other ways in which it can be used and many other corpora out there, and that’s something we’ll look at in the next article too.

I’d like to leave you with the results of an informal questionnaire I recently conducted on the ALMAR Facebook group.

Hopefully, this article, and the previous one, have addressed the brave few who responded ‘A’. They should now be clear on what corpus data is, why it might be useful, and how to use it.

If we look at the delicious battered sausage ‘D’, compared to the spiral sausage ‘C’, we can see that the people who use corpus data for teaching have found ways of applying it to their classroom practice, as well as using it in lesson preparation, and I’m going to share some of their ideas in the next article. All that remains is the towering dry-cured sausage ‘B’, which indicates that the vast majority of teachers who responded to the survey know about corpus data, but don’t use it. My aim is to cut that sausage down to size.

A huge thanks to everyone who participated in the survey. Click here for part 3, where I present ways to use corpus data in lesson planning and in the classroom.

Written by Scott Donald