The Vowel Trapezoid

Celine Y Lee
9 min read · Dec 26, 2021


If you are like me and did not think much about vowels until this blog post, you may have also assumed that vowels exist as discrete entities: ‘a,’ ‘e,’ ‘i,’ ‘o,’ ‘u,’ and sometimes ‘y.’ Then, based on the language, dialect, or even accent, each one takes on slightly different pronunciations or behaviors depending on the word (or context) in which it appears. (For example, ‘u’ appears twice in the word ‘cucumber,’ both times after the letter ‘c,’ but each time it is pronounced differently: /kyo͞oˌkəmbər/ (kYOO-kUHm-brr).)

Actually, as it turns out, vowels exist on a spectrum. In this article, I will refer to this spectrum as the vowel trapezoid; in linguistics, it is also broadly referred to as the IPA vowel trapezium, vowel diagram, or vowel chart. Vowel pronunciations exist as points (or more generally, regions) on this trapezoid:

Vowel trapezoid for American English, with example words. (source: allthingslinguistics)

The trapezoid serves as a 2D representation of our mouths in profile (viewed from the side); the shape represents the space of tongue movement as we speak. The longer top edge corresponds to the roof of the mouth from front (teeth) to back (throat), and the shorter bottom edge corresponds to the bottom of the mouth, also from front to back. Note that the left and right sides of our mouth are not dimensions on the trapezoid; generally, we are pretty symmetric in that regard.

In the vowel trapezoid shown above for American English, the top-left /i/ vowel sound (the third ‘i’ in “bikini” or the ‘ee’ in “beet”) corresponds to your tongue moving to the top front of your mouth. Likewise, when making the /u/ (“prudent” or “boot”) vowel, your tongue moves to the top back of your mouth. When making the /ɑ/ (“law” or “bot”) sound, your tongue moves to the bottom back of your mouth.

Our tongue movements have become so familiar and second-nature that they are difficult for us to observe (and to realize how weird they are). One fun exercise is to stick a lollipop to the bottom of your tongue, then observe how the stick moves around as you sound out the vowels. You can also reference this handy Wikipedia vowel diagram with audio clips. Pink Trombone is another really neat website that lets you play around with a simulated vocal tract to make vocal (mostly vowel) sounds.

Non-American-English

If you have an American accent, your vowel phonetic mouth movements may map closely to the diagram shown earlier. If not, you may find that your vowels map elsewhere on the trapezoid. Different languages do have somewhat different vowel trapezoids. A couple are shown below:

Vowel trapezoids for French, English, and German. V=vowels. (source: Cross-Linguistic Influences)
Vowel trapezoids for Arabic, Greek, Spanish, and Chinese. V=vowels. (source: Cross-Linguistic Influences)

As an aside, you may notice that the number of vowels on the vowel trapezoid also differs by language. This variance by language extends to all phonemes. (Phonemes are the units of sound combined to build words, including consonants as well as vowels.) Lingthusiasm hosts an episode on this very topic, which I highly recommend bookmarking to listen to later. The episode discusses how babies learn to distinguish phonemes (as it turns out, they actually learn to differentiate fewer distinct phonemes as they acquire a given language) and how learned phonemes change the way we hear language.

As for the vowel trapezoid, I will illustrate the variance in vowel diagrams by comparing Australian English to (Californian) American English. Their respective vowel trapezoids, with word examples, are shown below:

Vowel trapezoids, with examples, for Australian English (left) (source: Macquarie University) and for American English (right) (source: allthingslinguistics).

If you are most familiar with American English (like myself), this Wikipedia article does a great job of explaining Australian English phonology. Likewise, this Wikipedia article does a great job of explaining General American English phonology. (Diphthongs are combinations of adjacent vowel sounds within the same syllable, as in “coin,” “loud,” and “side.” I included a few resources at the end of this article to learn more about them.)

If you are not interested in parsing through those somewhat lengthy and jargon-y articles, you can borrow my main takeaway: most dialects of English are distinguished by their vowel phonology; depending on the dialect, certain vowels are “closer” to each other than others.

Additionally, different dialects treat monophthongs, diphthongs, and triphthongs differently. For example, Australian English will often pronounce monophthongs as diphthongs and diphthongs as triphthongs. Diphthongs are shown in the vowel diagrams as arrows. Notice the diphthong in the top left of the Australian trapezoid: the ‘i’ in “hid” is pronounced with a gliding vowel sound from [ih] to [ee], making the Australian pronunciations of “hid” and “heed” quite similar to my American ear. The American pronunciation of “hid,” on the other hand, maintains the [ih] monophthong (/I/ in the trapezoid above, as in “bit”), and the American pronunciation of “heed” maintains the [ee] monophthong (/i/ in the trapezoid above, as in “beet”). This difference in the distances between vowels and diphthongs is one of many that may exist between the vowel trapezoids of different languages and dialects.

(A fun little Australian vs. New Zealand pronunciation fact that I learned while listening to Lingthusiasm’s episode on vowels is that New Zealanders have taken to pronouncing the word “dip” as “duhp.” So that’s an easy way to tell if somebody is from New Zealand or from Australia: get them to say the word “dip,” then listen closely.)

Formants — a signal processing backing for the trapezoid

The trapezoid as a shape for vowel phonology may have come about intuitively — with the general idea that there is more “space” for your tongue to move around in the top of your mouth vs. in the bottom of your mouth — but the shape is additionally backed by the formant frequencies. (I highly recommend this resource to get a basic idea about formants and their role in linguistics.)

You may be familiar with the fundamental frequency, the basic pitch at which any sound (or any vibration in general) occurs. We will call the fundamental frequency F0. Harmonics layered atop this fundamental frequency are how we distinguish the same general pitch played by a piano versus by a flute versus by human vocal cords. Formants are the louder (read: more energetic) groups of harmonics, reinforced by the resonances of the vocal tract. They provide phonetic value to the vibrations we make, i.e. vocal speech.

We can use our vocal cords to change the fundamental frequency F0, and our mouth, tongue, and throat movements to change which harmonics are emphasized. Using Fourier transforms, we can actually process and analyze these groups of harmonics, the formants, to identify their role in — you guessed it — vowel pronunciation.
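To make the harmonic picture concrete, below is a minimal Python sketch (using only numpy; the 120 Hz fundamental and the decaying harmonic amplitudes are arbitrary choices for illustration). It synthesizes a crude “voiced” signal and reads the harmonic peaks off its magnitude spectrum:

```python
import numpy as np

# Synthesize one second of a crude voiced signal: a fundamental at
# 120 Hz (F0) plus progressively weaker harmonics at integer multiples.
sr = 16_000                       # sample rate, in Hz
t = np.arange(sr) / sr            # one second of timestamps
f0 = 120.0
signal = sum((0.5 ** k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 6))

# Fourier transform: the magnitude spectrum shows energy spikes at F0
# and its harmonics (120, 240, 360, 480, 600 Hz).
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
top_bins = np.argsort(spectrum)[-5:]
print(sorted(freqs[top_bins]))    # -> [120.0, 240.0, 360.0, 480.0, 600.0]
```

In real speech, the vocal tract boosts some of these harmonics and damps others; the boosted clusters are the formants.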

As it turns out, F1, the first formant, maps roughly to the height of the tongue in the mouth when pronouncing vowels; a higher-valued F1 corresponds to a lower tongue position (a more open vowel), so high vowels like /i/ and /u/ have low F1 values. F2, the second formant, maps to the front-back placement of the tongue, with a higher-valued F2 corresponding to the tongue sitting closer to the front of the mouth. F3 has some rough approximate mapping to “lip rounding,” though this apparently has less relevance in the English language. F0, since it is just the fundamental frequency, carries no information about vowel identity, though it does allow for the use of tone and pitch in interlocution.
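As a rough illustration of how F1 and F2 can be measured, here is a sketch of the classic linear-predictive-coding (LPC) approach, assuming librosa for the LPC fit. The filename, frame slice, and order=12 are illustrative choices; real formant trackers also add bandwidth checks and smoothing across frames.

```python
import numpy as np
import librosa  # assumed available; any LPC implementation works

def estimate_formants(frame, sr, order=12):
    """Rough F1/F2 estimate for one voiced frame: the complex roots of
    the LPC polynomial correspond to resonances of the vocal tract."""
    frame = frame.astype(float) * np.hamming(len(frame))  # taper the edges
    a = librosa.lpc(frame, order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # root angle -> frequency, Hz
    freqs = np.sort(freqs[freqs > 90])           # discard near-DC artifacts
    return freqs[:2]                             # two lowest resonances: F1, F2

# Usage sketch, on a steady vowel portion of a (hypothetical) recording:
# y, sr = librosa.load("hello.wav", sr=16000)
# f1, f2 = estimate_formants(y[4000:4400], sr)
# print(f"F1 ~ {f1:.0f} Hz, F2 ~ {f2:.0f} Hz")
```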

Formant frequency chart for American English. (source: https://ccrma.stanford.edu/~jmccarty/formant.htm)

The formant frequency chart above shows which F1/F2 regions map to which vowels. Observe the broad shape of the total region: a close approximation of the vowel trapezoid!
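This mapping suggests a toy vowel classifier: measure (F1, F2) and snap it to the nearest reference point. The centers below are illustrative, roughly the adult male averages from the classic Peterson and Barney (1952) study; a real system would model per-speaker variation and would likely measure distance on a perceptual scale rather than in raw Hz.

```python
import math

# Illustrative (F1, F2) centers, in Hz, for a few American English vowels.
VOWEL_CENTERS = {
    "i (beet)": (270, 2290),
    "I (bit)":  (390, 1990),
    "ae (bat)": (660, 1720),
    "a (bot)":  (730, 1090),
    "u (boot)": (300, 870),
}

def nearest_vowel(f1, f2):
    """Classify a measured (F1, F2) pair by its nearest reference center."""
    return min(VOWEL_CENTERS,
               key=lambda v: math.dist((f1, f2), VOWEL_CENTERS[v]))

print(nearest_vowel(290, 2200))  # -> i (beet)
print(nearest_vowel(700, 1150))  # -> a (bot)
```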

The impact on automatic speech recognition

While this is interesting purely on a theoretical level, it also has interesting impacts on automatic speech recognition (ASR). (This is not intended to be a machine learning blog entry, so I defer to other surveys on the history and state of ASR. I will do my best to link in-text references to approachable and helpful supplemental articles that provide some background on the concepts at hand.) One component common to many ASR systems is feature extraction. In feature extraction, the (pre-processed and cleaned) input audio signal undergoes some (often mathematical, in this case) processing to extract the information that the system designers deem relevant to speech recognition. One such feature, as you may surmise, is the formants.

As an aside, collecting formants is far from straightforward. Consider even a simple audio signal of somebody saying ‘hello.’ If we want to recognize the ‘e’ sound, even if we know that we want to extract the formants from that section of the audio clip, how is a system to know when that ‘e’ sound occurs and how long it lasts? ASR system designers have worked with many different ways to represent such frequency information, including but not limited to Mel-frequency cepstral coefficients (MFCCs), linear predictive coding, and discrete wavelet transforms. These are imperfect measures, so most ASR systems augment the end of the speech recognition pipeline with a language model.
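For a sense of what feature extraction looks like in code, here is a minimal MFCC sketch, assuming librosa (“hello.wav” is a placeholder filename; the 25 ms window and 10 ms hop are common framing defaults):

```python
import librosa

# Load a clip (resampled to 16 kHz) and compute 13 Mel-frequency
# cepstral coefficients per frame: 25 ms windows, hopped every 10 ms.
y, sr = librosa.load("hello.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=400, hop_length=160)
print(mfccs.shape)  # (13, n_frames): one 13-dim feature vector per frame
```

Framing the signal this way sidesteps the “when does the ‘e’ start?” problem: the recognizer receives a sequence of per-frame feature vectors and leaves the alignment to later stages (e.g., an HMM or neural decoder).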

Consider the different language vowel trapezoid charts shown earlier in the blog entry. I have copied some again below.

Vowel trapezoids for French, English, and German. V=vowels. (source: Cross-Linguistic Influences)

As humans practicing language, when we exchange dialogue with another person speaking in an accent different from our own, we likely mentally map their vowel space onto ours so that we can recognize vowels pronounced in a different way. As learners (and practitioners) of a new language, or when we try to pronounce a word in a foreign language, we may find a way to map the foreign vowel sound onto our own vowel trapezoid, electing to cluster the new sound into our existing, familiar vowel space. Moving forward, as we become more practiced in this new language, we may often retain our original vowel clusterings while learning new words, thus producing what others may perceive as our accent in that language. (I have linked some more in-depth studies of accents and second languages at the end of this article.)

Machines, however, may not do this. Instead, a speech recognition system trained on American English may observe audio input spoken in a French accent and map the French ‘y’ vowel sound onto a nearby English vowel, disrupting the acoustic recognition module of the ASR system. This would hopefully be corrected by the language model, but in many cases it may slip through the cracks, and in the worst case it may completely disrupt the language system, incorrectly adjusting the language model’s expectations for the subsequent input words.
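Revisiting the toy classifier from earlier makes this failure concrete. French /y/ (as in “tu”) sits at very roughly F1 = 250 Hz, F2 = 1800 Hz (approximate values, for illustration), a region with no close American English counterpart:

```python
# Reusing nearest_vowel() and VOWEL_CENTERS from the earlier sketch:
# a French /y/ gets snapped to an English vowel it does not match.
print(nearest_vowel(250, 1800))  # -> I (bit)
```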

Summary and takeaways…

To summarize the overall takeaways that I had from this mini-venture into linguistics:

  • vowels exist on a continuous spectrum: the vowel trapezoid
  • differences in language, dialect, and accent correspond to different clusterings of vowels in the vowel trapezoid
  • one key struggle of automatic speech recognition systems can be attributed to the continuity of vowels in the vowel trapezoid and to the overlapping, varying clusterings of vowels in the trapezoid across different languages and dialects.

(Approachable) Follow-up readings:

If you are hesitant to accept the vowel trapezoid as the vowel diagram, or if you are interested in reading more about vowel spaces, you are not alone! Some (of the many) interesting, accessible readings on alternative vowel diagrams and explanations are:

  • This blog post provides an explanation of the continuous vowel space, using the color spectrum as a helpful metaphor.
  • This book chapter provides a summary of the Vietor triangle, a schematic representation that shows the position of the tongue and jaw when creating different vowel sounds.
  • This Quora answer explains the number of vowel sounds in the English language. (It’s closer to ~15 than 5!)
  • The author of this blog entry challenges the “strict geometrical” aspect to the quadrilateral vowel diagram, proposing an alternative modified vowel chart.

The promised readings on diphthongs:

On accents:

  • This linguistics article discusses the difficulty in pronouncing or even detecting the difference between sounds in a foreign language.
  • This psychology article discusses the evolutionary and neurological origin of accents.
  • This paper examines the perception of accented speech across the lifespan, from early infancy to late adulthood.
