The Most Frequent Syllables in English

Estimations with Syllabified CMU Dictionary and COCA

David Rosson
Linguistic Curiosities
Oct 30, 2014


Frequency data at the word level (of varying specification and quality) are commonplace. There are many resources on letter frequency and n-grams at the grapheme level. It’s also easy to find analyses on the level of phonemes (e.g. Kessler & Treiman, 1997). Similar data for syllables, however, turned out to be quite elusive. So I decided to run my own stats on syllables.

What’s the use of this? The initial motivation was related to SLA research, where lexical frequency is already used to direct vocabulary building. Observing the difficulty many learners have with phonotactics, I thought, why not go one step further and start from smaller “building blocks”? Then we can iteratively build up speech from phonemes to syllables to words to phrases. It would make sense to have the “basics” mastered before moving on to more complex productions. If so, the training should focus on the syllables that are most common.

Research in apraxia of speech lends some support to this idea, suggesting a substrate-specific “mental syllabary” system for speech motor planning and execution that’s impaired in patients with apraxia (Aichert & Ziegler, 2004). I also vaguely remember an intervention in which repeatedly reading out syllables improved reading skills. At least it worked in Finnish (Huemer et al., 2010). The theory is that the fast recognition of syllables facilitates access via the phonological route, which connects to meaning.

Since Finnish orthography is entirely transparent, even I can read out a long Finnish word I’ve never seen before without knowing what it means. But for a Finnish speaker, seeing the word leads to recognition and to meaning, and only then does it come back out as partly pre-stored and phonologically coordinated utterances, which is also why their production sounds more natural. It would be great if these syllable-based interventions transferred to listening comprehension in SLA. Who knows…

Method

The phonemic transcriptions come from the CMU Pronouncing Dictionary, which lists unlemmatised word forms (nearly 130k entries) alphabetically. When you look up a word form, you see the corresponding sequence of phonemes in Arpabet, with phoneme boundaries marked by spaces.
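To make the entry layout concrete, here is a minimal sketch of splitting one line into a word form and its phonemes. The function name `parse_cmu_line` is hypothetical, and the sample line is only an illustration of the layout, not a quotation from the dictionary file.

```python
def parse_cmu_line(line):
    """Split one CMU-style entry into (word_form, [phonemes]).

    Assumes the word form comes first and every following
    whitespace-separated token is one Arpabet phoneme.
    """
    word, *phonemes = line.split()
    return word, phonemes


# Illustrative line in the CMU layout (an assumption, not copied from the file).
word, phonemes = parse_cmu_line("SYLLABLE  S IH1 L AH0 B AH0 L")
print(word, phonemes)
```

The digits on the vowels are Arpabet stress marks: 1 for primary stress, 2 for secondary, 0 for unstressed.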

The syllable boundaries were determined by automatic syllabification (Bartlett, Kondrak & Cherry, 2009). The authors report 98% agreement with a “gold standard” (which I imagine has been manually checked). Their syllabification was based on CMU 0.6, so that’s the version I used.

The frequency data come from the Corpus of Contemporary American English (COCA). It appears that you would have to register, and in some cases pay, to get the actual data, but some time ago Mark Davies himself sent out a “free download code” for a dataset that contains around half a million unlemmatised word forms and their frequency counts. The counts sum up to 404 million words, which was the size of the corpus at that point.

Then I wrote a script to do the following:

  • Load the CMU data into a dictionary, mapped by the word form.
  • Read each line of the COCA file, and look up the word form in the CMU dictionary. If an entry is found, then each of the syllables for that word form gets a top-up of the frequency count of that word form.
  • At the end, sort all syllables by frequency.
  • Convert the Arpabet coding into IPA.
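The steps above can be sketched roughly as follows. This is not the original script: the function names, the toy data, and the in-memory data shapes are all assumptions made for illustration, and the Arpabet-to-IPA table is deliberately partial, covering only the phonemes in the toy data.

```python
from collections import Counter

# Partial Arpabet-to-IPA table, covering only the toy data below.
ARPABET_TO_IPA = {"DH": "ð", "N": "n", "T": "t", "IH": "ɪ", "UW": "u", "AH": "ʌ"}
STRESS_MARKS = {"0": "", "1": "ˈ", "2": "ˌ"}


def count_syllables(pronunciations, word_counts):
    """Add each word form's corpus count to every syllable it contains.

    pronunciations: {WORD_FORM: [syllable, ...]}, each syllable a string
                    of space-separated Arpabet phonemes (assumed shape).
    word_counts:    iterable of (word_form, count) pairs (assumed shape).
    """
    totals = Counter()
    matched_tokens = 0
    for word, count in word_counts:
        syllables = pronunciations.get(word.upper())
        if syllables is None:
            continue  # no pronunciation known for this word form
        matched_tokens += count
        for syllable in syllables:
            totals[syllable] += count
    return totals, matched_tokens


def to_ipa(syllable):
    """Convert one Arpabet syllable to IPA, with the stress mark in front."""
    mark, symbols = "", []
    for phoneme in syllable.split():
        base = phoneme
        if phoneme[-1].isdigit():
            base = phoneme[:-1]
            mark = STRESS_MARKS[phoneme[-1]]
        # Unstressed AH is conventionally written as a schwa.
        symbols.append("ə" if phoneme == "AH0" else ARPABET_TO_IPA[base])
    return mark + "".join(symbols)


# Toy data standing in for the syllabified dictionary and the frequency list.
pronunciations = {"THE": ["DH AH0"], "INTO": ["IH1 N", "T UW0"], "IN": ["IH1 N"]}
word_counts = [("the", 100), ("into", 10), ("in", 50), ("zzz", 3)]

totals, matched = count_syllables(pronunciations, word_counts)
ranked = totals.most_common()  # sorted by descending frequency
for syllable, freq in ranked:
    print(to_ipa(syllable), freq)
```

A `Counter` keeps the "top-up" step to one line, and `most_common()` does the final sort.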

Results

The whole script took only a few seconds to run on my Mac.

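It matched 389,143,519 of 404,253,213 tokens (166,585 of 497,187 types) in COCA to CMU entries. In other words, about 96% of the running words in the corpus have a known pronunciation, even though only about a third of the distinct word forms do. For typical English text, then, there is roughly a 96% chance that any given word is covered.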
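The token and type figures measure different things, so it helps to compute both ratios side by side. The snippet below only redoes the arithmetic on the counts reported above; the variable names are mine.

```python
# Counts as reported in the text.
matched_tokens, total_tokens = 389_143_519, 404_253_213
matched_types, total_types = 166_585, 497_187

# Token coverage: share of running words with a known pronunciation.
token_coverage = matched_tokens / total_tokens
# Type coverage: share of distinct word forms with a known pronunciation.
type_coverage = matched_types / total_types

print(f"token coverage: {token_coverage:.1%}")
print(f"type coverage:  {type_coverage:.1%}")
```

The gap between the two ratios is what you would expect if the unmatched word forms are mostly rare ones.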

It found 18,517 unique syllables — including those contrasted by stress (possibly why the number looks higher than Chris Barker’s results).
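To see how much stress contrasts can inflate the inventory, one could strip the Arpabet stress digits (0, 1, 2) and count again. The following is a small sketch on made-up syllables, not a recount of the real data; `strip_stress` is a hypothetical helper.

```python
def strip_stress(syllable):
    """Remove trailing Arpabet stress digits from every phoneme."""
    return " ".join(phoneme.rstrip("012") for phoneme in syllable.split())


# Toy inventory: three stress variants of one syllable, plus one other.
syllables = ["IH1 N", "IH0 N", "IH2 N", "DH AH0"]

with_stress = set(syllables)
without_stress = {strip_stress(s) for s in syllables}

print(len(with_stress), len(without_stress))
```

On this toy set, four stress-marked syllables collapse to two once stress is ignored.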

The most frequent syllables:

The 200 most frequent syllables in General American English, with non-word syllables highlighted.

The full list of English syllables in COCA: https://gist.github.com/gartenfeld/597d2d8e750d4748115da02784a9eb8e

Discussions

Primary and secondary stresses are retained. For example, #109, ‘in’ the preposition vs. #6 or #132, ‘in-’ the prefix. But then, #192, ‘im-’, which is actually an allomorph of ‘in-’, has no stress information. These subtle inconsistencies make me question the consistency of the CMU data.

Another example: the clitic “’s” should be part of the coda of the preceding syllable, yet CMU gives ‘ehs’ as its pronunciation, and I have no idea why. In any case, entries with leading punctuation are excluded.

COCA differentiates between lexical classes, e.g. “rose” the flower and “rose” the past tense of getting up are listed as separate entries. In this example they also happen to be homophones, but ‘content’ the noun and ‘content’ the adjective are not. There are 34,770 duplicates in total.

The CMU dictionary also lists multiple pronunciations on separate lines for homographs. There are 10,039 of them! There is, however, no easy way of matching them to anything, since CMU only gives each a sequential number, which says nothing about which sense it belongs to.

The article’s subtitle says “estimations” because I ignored these hurdles and simply used the first (non-numbered) pronunciation for every entry, and counted the constituent syllables as usual even if a homograph had been processed before.

A better method would be to use curated corpus materials that are coded with phonemic transcription as well as syllabification. That way we can count the syllables in situ as they occur in context. One such database may be CELEX-2. Sadly it’s shrouded in a dense funk of proprietary zeal.
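A minimal sketch of that selection rule might look like this. It assumes that alternative pronunciations carry a parenthesised number after the word form (e.g. `WORD(2)`); the sample lines are made up to illustrate the layout, and `first_pronunciations` is a hypothetical name rather than a function from the original script.

```python
import re

# Assumed marker for alternative pronunciations, e.g. "CONTENT(2)".
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def first_pronunciations(lines):
    """Keep only the first, non-numbered pronunciation of each word form."""
    entries = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, *phonemes = line.split()
        if not word[0].isalnum():
            continue  # skip entries (and comments) with leading punctuation
        if VARIANT_SUFFIX.search(word):
            continue  # skip numbered alternative pronunciations
        entries.setdefault(word, phonemes)
    return entries


# Illustrative lines in the assumed layout, not copied from the dictionary.
sample = [
    "CONTENT  K AA1 N T EH0 N T",
    "CONTENT(2)  K AH0 N T EH1 N T",
    "'S  EH1 S",
]
entries = first_pronunciations(sample)
print(entries)
```

With this rule, every corpus entry spelled “content” is counted with the first pronunciation, whichever lexical class it belongs to, which is why the results are only estimates.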

But now, for something fun!

Try this interesting experiment: if you take a stressed syllable from the list, then append an unstressed syllable to it, the resulting bi-syllabic trochee will either be an English word or sound like a plausible one, e.g. ‘prockle’. The same principles are observed in creating “fake English”. It’s very likely that you can also do such impressions with “other languages”.
