How many rhymes are there in English?
I’ve always enjoyed word play: puns, double dactyls, spoonerisms, but especially rhymes. I’ve actually written 2 books of poetry for my wife. Something you don’t really think about until you are writing your 236th poem is that coming up with a rhyme you haven’t used before is simply impossible. There are only so many rhymes in the English language. Pretty much all you can try for is not to use the same ‘type’ of rhyme too much.
I’m defining a ‘type’ of rhyme or rhyme group as a group of words that all rhyme with each other. An example would be:
plaid, sad, mad, dad, bad
When you are writing poetry in the couplet form, AABBCCDD, you don’t want to use too much from the same rhyme group because it can sound odd. I wondered how many groups of rhyming words there were in English. To get a computer to figure this out for me, though, I needed a precise definition of a rhyme.
The Wikipedia article on perfect rhyme says there are 2 requirements:
1. The stressed vowel and everything after must be the same.
2. The articulation before it must be different.
The first rule is there because, to rhyme, two words must end with the same sound. So, for example, “tricky” and “icky” rhyme because everything from the ‘i’ sound onward is the same.
The second rule is there because “leave” and “believe” do not sound like a rhyme to most native English speakers. This is because the stressed vowel is preceded by the ‘l’ sound in both cases.
Python has the Natural Language Toolkit, which among its many packages includes the cmudict. This has a dictionary that maps an English spelling to its pronunciation represented as a list of phonemes. Phonemes are the sounds that make up words in English. A nice property of this dictionary is that the vowel phonemes are already marked as stressed or unstressed.
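To make the shape of the data concrete, here is a minimal sketch that uses a small hand-typed sample instead of the real corpus. The sample entries and the helper names `is_vowel` and `primary_stress_index` are illustrative assumptions, not the original code. With NLTK installed and the corpus downloaded, `nltk.corpus.cmudict.dict()` should return the full mapping in this same shape.

```python
# Hand-typed sample in the shape of the CMU dictionary:
# word -> list of pronunciations, each a list of phonemes.
SAMPLE = {
    "tricky": [["T", "R", "IH1", "K", "IY0"]],
    "bee": [["B", "IY1"]],
    "unconscious": [["AH2", "N", "K", "AA1", "N", "SH", "AH0", "S"]],
}


def is_vowel(phoneme):
    # Vowel phonemes end in a stress digit:
    # 1 = primary stress, 2 = secondary stress, 0 = unstressed.
    return phoneme[-1].isdigit()


def primary_stress_index(pron):
    # Position of the first primary-stressed vowel, or None if none is marked.
    for i, phoneme in enumerate(pron):
        if phoneme.endswith("1"):
            return i
    return None


for word, prons in SAMPLE.items():
    pron = prons[0]
    vowels = [p for p in pron if is_vowel(p)]
    print(word, vowels, primary_stress_index(pron))
```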
From this dictionary I generated two additional dictionaries. One was a simple reverse dictionary that mapped a pronunciation to the English spellings it represented. An interesting point is that one pronunciation can represent multiple written words, i.e. homophones such as “two”, “too”, “to” all have the same pronunciation which the cmudict gives as
The second dictionary I created was really just a way of grouping words by their ending sound, so the phonemes
['B', 'EH1', 'R']
which is the pronunciation of “bear” and
['DH', 'EH1', 'R']
which is the pronunciation of “there” are both in the same list of values with the key
The rule for the grouping of this dictionary is rule 1 of what makes a rhyme from the Wikipedia article. What this means in practice is that the words “tricky” and “picky” which have the pronunciations
['T', 'R', 'IH1', 'K', 'IY0'], ['P', 'IH1', 'K', 'IY0']
would be in one group and that “see” and “bee” which are represented by
['S', 'IY1'], ['B', 'IY1']
would be in a different group, which is what we want because “tricky” doesn’t rhyme with “bee”. I called this dictionary rhymeToPros since it maps a rhyme to pronunciations.
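The two derived dictionaries can be sketched as follows. This is an illustration under assumptions, not the original code: the sample entries are hand-typed, the names `rhyme_key`, `pro_to_words` and `rhyme_to_pros` are my own, and the key is taken from the first primary-stressed vowel in each pronunciation.

```python
from collections import defaultdict

# Hand-typed sample in the cmudict shape (word -> list of pronunciations).
SAMPLE = {
    "tricky": [["T", "R", "IH1", "K", "IY0"]],
    "picky": [["P", "IH1", "K", "IY0"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
    "bee": [["B", "IY1"]],
}


def rhyme_key(pron):
    # Rule 1: the primary-stressed vowel and everything after it.
    for i, phoneme in enumerate(pron):
        if phoneme.endswith("1"):
            return tuple(pron[i:])
    return None  # no primary stress marked


pro_to_words = defaultdict(list)  # pronunciation -> spellings (homophones)
rhyme_to_pros = defaultdict(set)  # rhyme key -> pronunciations sharing it

for word, prons in SAMPLE.items():
    for pron in prons:
        # Lists are not hashable, so pronunciations are stored as tuples.
        pro_to_words[tuple(pron)].append(word)
        key = rhyme_key(pron)
        if key is not None:
            rhyme_to_pros[key].add(tuple(pron))

for key, pros in rhyme_to_pros.items():
    print(key, sorted(pros))
```

Storing pronunciations as tuples is only a convenience so they can serve as dictionary keys and set members.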
At this point I had 3 dictionaries whose values I needed to combine: the CMU pronunciation dictionary, the reverse dictionary and the rhymeToPros dictionary. To make them easier to work with I added a rhymeGroup class which has 3 properties: the ending part of the word that matched, the pronunciations associated with it, and the words associated with those pronunciations. So for “tricky” and “picky” the properties of the rhymeGroup would be:
rhyme: ['IH1', 'K', 'IY0']
pronunciations: ['T', 'R', 'IH1', 'K', 'IY0'], ['P', 'IH1', 'K', 'IY0']
words: tricky, picky
After filtering out any rhyme groups that had only one pronunciation or word associated with them, Rule 1 is enforced.
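A grouping class along these lines, plus the filtering step, might look like the sketch below. The dataclass layout, the `build_groups` helper and the sample inputs are assumptions made for illustration; the original rhymeGroup class may be organised differently.

```python
from dataclasses import dataclass, field


@dataclass
class RhymeGroup:
    rhyme: tuple  # shared ending, e.g. ("IH1", "K", "IY0")
    pronunciations: list = field(default_factory=list)
    words: list = field(default_factory=list)


def build_groups(rhyme_to_pros, pro_to_words):
    groups = []
    for rhyme, pros in rhyme_to_pros.items():
        pros = sorted(pros)
        words = sorted({w for p in pros for w in pro_to_words.get(p, [])})
        # Drop groups with only one pronunciation or only one word:
        # a single entry has nothing to rhyme with.
        if len(pros) >= 2 and len(words) >= 2:
            groups.append(RhymeGroup(rhyme, pros, words))
    return groups


# Hand-typed inputs in the shape of the two derived dictionaries.
pro_to_words = {
    ("T", "R", "IH1", "K", "IY0"): ["tricky"],
    ("P", "IH1", "K", "IY0"): ["picky"],
    ("AO1", "R", "AH0", "N", "JH"): ["orange"],
}
rhyme_to_pros = {
    ("IH1", "K", "IY0"): {("T", "R", "IH1", "K", "IY0"), ("P", "IH1", "K", "IY0")},
    ("AO1", "R", "AH0", "N", "JH"): {("AO1", "R", "AH0", "N", "JH")},
}

groups = build_groups(rhyme_to_pros, pro_to_words)
for group in groups:
    print(group.rhyme, group.words)
```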
Being “conscious” of Rule 2
The second rule was that the articulation before the stressed syllable has to be different. A naive version of this would be just to check that the phonemes before the stressed vowel are different. Unfortunately, just looking at the phonemes before the stressed sound leads to false positives.
“conscious” is represented by the phonemes
['K', 'AA1', 'N', 'SH', 'AH0', 'S']
“unconscious” is represented by
['AH2', 'N', 'K', 'AA1', 'N', 'SH', 'AH0', 'S']
The phoneme ‘AA1’ is the stressed vowel and the phonemes before it are different, yet a person would not describe this as a rhyme. These are what the Wikipedia article describes as identical rhymes.
Humans hear words as syllables, not as phonemes.
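The false positive is easy to reproduce. The sketch below implements the naive check directly on phoneme lists; the helper names `split_at_stress` and `naive_rhyme` are illustrative, not taken from the original code.

```python
def split_at_stress(pron):
    # Split a pronunciation into (phonemes before the primary-stressed vowel,
    # the stressed vowel and everything after it).
    for i, phoneme in enumerate(pron):
        if phoneme.endswith("1"):
            return pron[:i], pron[i:]
    return pron, []


def naive_rhyme(a, b):
    before_a, tail_a = split_at_stress(a)
    before_b, tail_b = split_at_stress(b)
    # Rule 1: same tail. Naive rule 2: anything at all differs before it.
    return bool(tail_a) and tail_a == tail_b and before_a != before_b


CONSCIOUS = ["K", "AA1", "N", "SH", "AH0", "S"]
UNCONSCIOUS = ["AH2", "N", "K", "AA1", "N", "SH", "AH0", "S"]
TRICKY = ["T", "R", "IH1", "K", "IY0"]
PICKY = ["P", "IH1", "K", "IY0"]

print(naive_rhyme(TRICKY, PICKY))           # a genuine rhyme
print(naive_rhyme(CONSCIOUS, UNCONSCIOUS))  # accepted too: the false positive
```

Both calls return True, because the prefixes `['K']` and `['AH2', 'N', 'K']` differ as lists even though the sound immediately before the stressed vowel is the same.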
So to handle this type of issue you need to look at how a word is syllabified. Syllabification is how words are broken into syllables. An English syllable has 4 properties:
- A nucleus, which is the core part of the syllable
- An optional sound before the nucleus, called an onset
- An optional sound after the nucleus, called a coda.
- Whether the syllable is stressed or unstressed
Fortunately, code from UPenn is available that breaks a list of phonemes into syllables and records the stress of each.
Running it on the previous example:
The word ‘conscious’ only has two syllables:
(1, ['K'], ['AA'], ['N']) 1 means this is the stressed syllable, ‘K’ is the onset, ‘AA’ is the nucleus and ‘N’ is the coda
(0, ['SH'], ['AH'], ['S']) 0 means it is the unstressed syllable, ‘SH’ is the onset, ‘AH’ is the nucleus and ‘S’ is the coda
‘Unconscious’ has three syllables:
(2, [], ['AH'], ['N']) 2 means this is a secondary stressed syllable, there is no onset, ‘AH’ is the nucleus and ‘N’ is the coda
(1, ['K'], ['AA'], ['N']) stressed syllable, ‘K’ is the onset, ‘AA’ is the nucleus and ‘N’ is the coda
(0, ['SH'], ['AH'], ['S']) unstressed syllable, ‘SH’ is the onset, ‘AH’ is the nucleus and ‘S’ is the coda
By looking at this example it becomes obvious that the onset of the stressed syllable is the same for ‘conscious’ and ‘unconscious’, which is why most people would not think they rhyme.
So a better rule would be to check that the onsets of the stressed syllables are different, but that the nucleus, the coda and everything after them are the same. In this case the stressed syllable of both “conscious” and “unconscious” has the onset ‘K’, so they do not rhyme. The next step was to filter out rhyme groups that did not have any rhymes in them.
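Expressed on syllables, the onset comparison could look like the sketch below. The syllable tuples follow the (stress, onset, nucleus, coda) layout shown above, but they are typed in by hand here rather than produced by the UPenn syllabifier, and the tricky/picky split is my own assumption. The function names are illustrative.

```python
def stressed_index(syllables):
    # Index of the primary-stressed syllable (stress == 1), or None.
    for i, (stress, _onset, _nucleus, _coda) in enumerate(syllables):
        if stress == 1:
            return i
    return None


def onset_rhyme(a, b):
    i, j = stressed_index(a), stressed_index(b)
    if i is None or j is None:
        return False
    _, onset_a, nucleus_a, coda_a = a[i]
    _, onset_b, nucleus_b, coda_b = b[j]
    # Rule 1: same nucleus and coda, and identical syllables afterwards.
    same_ending = (nucleus_a, coda_a) == (nucleus_b, coda_b) and a[i + 1:] == b[j + 1:]
    # Rule 2: the onsets of the stressed syllables must differ.
    return same_ending and onset_a != onset_b


CONSCIOUS = [(1, ["K"], ["AA"], ["N"]), (0, ["SH"], ["AH"], ["S"])]
UNCONSCIOUS = [(2, [], ["AH"], ["N"])] + CONSCIOUS
TRICKY = [(1, ["T", "R"], ["IH"], []), (0, ["K"], ["IY"], [])]
PICKY = [(1, ["P"], ["IH"], []), (0, ["K"], ["IY"], [])]

print(onset_rhyme(TRICKY, PICKY))
print(onset_rhyme(CONSCIOUS, UNCONSCIOUS))
```

With these inputs tricky/picky is accepted and conscious/unconscious is rejected, since both stressed syllables start with `['K']`.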
Comparing the stressed syllable’s onset seems to work (in fact the number is very close to our final count), but I decided to look at what was being filtered out. Most of them were non-rhymes, but one of the rhyme groups being filtered was paternity/fraternity/maternity, which sounds like it rhymes to my ears. Looking at the syllabification:
fraternity:
(0, ['F', 'R'], ['AH'], [])
(1, ['T'], ['ER'], [])
(0, ['N'], ['IH'], [])
(0, ['T'], ['IY'], [])
paternity:
(0, ['P'], ['AH'], [])
(1, ['T'], ['ER'], [])
(0, ['N'], ['IH'], [])
(0, ['T'], ['IY'], [])
The stressed syllables have the same onset, which is why the words were treated as not rhyming. However, the syllables before the stressed syllable, (0, ['F', 'R'], ['AH'], []) and (0, ['P'], ['AH'], []), would rhyme if they were stressed: the nucleus and coda are the same and the onsets ['F', 'R'] and ['P'] are different. That is why the words sound like they rhyme. They were not counted as a rhyme because the match was not on the stressed syllable.
So the rule for rhyming needs to be amended: if the stressed syllable is identical, but a preceding syllable has an identical nucleus and coda and a different onset, then the words rhyme.
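One way to write the amended rule is sketched below. It is an interpretation, not the original code: "a preceding syllable" is read here as the syllable immediately before the stressed one, and the syllable tuples are typed in by hand in the (stress, onset, nucleus, coda) layout.

```python
def stressed_index(syllables):
    # Index of the primary-stressed syllable (stress == 1), or None.
    for i, syllable in enumerate(syllables):
        if syllable[0] == 1:
            return i
    return None


def rhymes(a, b):
    i, j = stressed_index(a), stressed_index(b)
    if i is None or j is None:
        return False
    # Rule 1: nucleus and coda of the stressed syllable, and every
    # syllable after it, must match.
    if a[i][2:] != b[j][2:] or a[i + 1:] != b[j + 1:]:
        return False
    # Rule 2: different onsets on the stressed syllable is a rhyme.
    if a[i][1] != b[j][1]:
        return True
    # Amendment: identical stressed syllables still rhyme when the
    # syllables just before them share nucleus and coda but not onset.
    if i == 0 or j == 0:
        return False
    prev_a, prev_b = a[i - 1], b[j - 1]
    return prev_a[2:] == prev_b[2:] and prev_a[1] != prev_b[1]


ERNITY = [(1, ["T"], ["ER"], []), (0, ["N"], ["IH"], []), (0, ["T"], ["IY"], [])]
FRATERNITY = [(0, ["F", "R"], ["AH"], [])] + ERNITY
PATERNITY = [(0, ["P"], ["AH"], [])] + ERNITY
CONSCIOUS = [(1, ["K"], ["AA"], ["N"]), (0, ["SH"], ["AH"], ["S"])]
UNCONSCIOUS = [(2, [], ["AH"], ["N"])] + CONSCIOUS

print(rhymes(FRATERNITY, PATERNITY))
print(rhymes(CONSCIOUS, UNCONSCIOUS))
```

Under this reading fraternity/paternity is accepted, while conscious/unconscious is still rejected because "conscious" has no syllable before its stressed one.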
Running this code on the words in the cmudict got me 10,762 rhyme groups. So, barring any other edge cases, that’s the number of rhymes in English.
As with most programming problems, properly defining it took the majority of the time and effort. To get a computer to calculate it required that I precisely define what a rhyme was, since the rules from Wikipedia were meant for humans, not computers. The actual code is only ~70 lines with another ~50 lines of unit tests on checking if two syllables rhyme.
Some other interesting results
- Rhymes do not need to be on the stressed syllable (paternity, fraternity)
- Rhymes are not transitive. “Leave” rhymes with “Steve”. “Steve” rhymes with “believe”. But “leave” does not rhyme with “believe”.
- The largest rhyme group is words that end in ‘nation’
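The non-transitivity point can be checked with a very small sketch. Each word is reduced by hand to its stressed syllable, written as (onset, nucleus plus coda); this reduced representation and the `rhymes` helper are assumptions made for illustration only.

```python
from itertools import combinations

# Stressed syllable of each word as (onset, (nucleus, coda...)), typed by hand.
WORDS = {
    "leave": (("L",), ("IY", "V")),
    "Steve": (("S", "T"), ("IY", "V")),
    "believe": (("L",), ("IY", "V")),
}


def rhymes(x, y):
    onset_x, rest_x = WORDS[x]
    onset_y, rest_y = WORDS[y]
    # Same nucleus and coda, different onset.
    return rest_x == rest_y and onset_x != onset_y


for x, y in combinations(WORDS, 2):
    print(x, y, rhymes(x, y))
```

Because the relation is not transitive, a rhyme group built from a shared ending can contain pairs that do not rhyme with each other.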