My first disclaimer

Allie Hall
Oct 8, 2015


Let me tell you a little bit about the tools that I’m working with and why they’re inherently flawed.

You’ll see me start to play around with two things:

  1. Google n-grams. For those of you unfamiliar with the Google Ngram Viewer, it’s a tool that charts the frequency of words or phrases based on the number of times they appear in Google Books. The issue is that the sources pulled into the corpus were published between 1800 and 2012 (and the data is sometimes even more limited after 2000).
  2. The Brown corpus. Compiled in the 1960s, the Brown University Standard Corpus of Present-Day American English is a collection of 500 text samples, making up a corpus of about 1 million total words.
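Both tools boil down to the same basic operation: counting how often a word appears in a corpus, then normalizing by corpus size so counts are comparable. Here is a minimal sketch of that idea in plain Python; the sample sentence is invented purely for illustration.

```python
from collections import Counter

# A tiny invented "corpus" standing in for Google Books or Brown.
sample = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(sample)
total = len(sample)

def per_million(word):
    """Relative frequency of `word`, scaled per million words --
    the usual way to compare across corpora of different sizes."""
    return counts[word] / total * 1_000_000

print(counts["the"])              # raw count: 4
print(round(per_million("sat")))  # relative frequency: 153846
```

The normalization step is the important part: a raw count of 2 means something very different in a 13-word sample than in the Brown corpus's million words.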

Do you see a common theme here? I’m stuck dealing with old data for now. As with any research, the hypotheses I test and conclusions I draw are going to be inherently limited by the data and tools that I can use.

I’m taking a course on Corpus Linguistics through Lancaster University. More on this to come later, but basically I’m slowly stocking up on research ammo, bullet by bullet.

On the bright side, it’s been an amazing experience so far. The level of engagement and investment that the mentors and Professor Tony McEnery show for the students has been incredible.

It’s like I’ve said before… this will be a slow journey, but I have to start somewhere.

