I Would Like to Buy a Hamburger

Notes on Auditory Training

David Rosson
Linguistic Curiosities
Feb 28, 2019

--

McGurk Induction

June 30, 2014

Notes: The title refers to the McGurk Effect.

Parent: Say “Tur”
Child: Tur
Parent: Say “Tle”
Child: Tle
Parent: Say “Turtle”
Child: Kurka

I'm not kidding: this is exactly what happens when my parents try to learn English, and they do put in the effort, with many repetitions. It’s a marvellous curiosity what they actually hear, and what happens in their psychoacoustic and psycholinguistic circuits to produce such virtually unfathomable results when they try to replicate an audio input.

The errors are not constrained by their native language; in fact, they recalcitrantly produce the whole range of speech errors. The errors also show the same kind of long-term resistance as in child acquisition: correction through instruction has, at best, a fleeting effect.

So I was wondering whether it would be helpful to employ a phone-by-phone drilling regime… More importantly, the demos should feature videos of a mouth model (yes, this is a profession)…

1. Clear close-up views from multiple angles. Some examples that come close:
http://youtu.be/SL9hkU1y3zU?t=1m15s
http://youtu.be/rcbBd8-In40?t=1m37s

2. This is best coupled with step-by-step demonstrations, preferably with a lot of exaggeration and some almost-psychotic fervour, e.g.

http://youtu.be/h5LO0hHGfQg

3. Example words are also very helpful:

“American English Pronunciation Dictionary”

4. Then, there should be profile-sections of the anatomy, for what you couldn’t see on a video.

Image from doi.org/10.1016/j.csl.2014.12.003

5. Next, the sections, along with the videos, should be animated:

6. And an additional fun feature should be some kind of “tweakable feedback” where the user can adjust the 3D positioning of the configurations on a mannequin, then hit play and hear what sound comes out accordingly…

http://youtu.be/Cwy2RZPzVgk
Pink Trombone

Awareness and Sensitivity

July 18, 2014

Auditory Awareness:

Which one sounds like a bear / bird?

Phonological Awareness:

Which words rhyme with ‘bird’?

Segmentation, Blending, Insertion, Substitution, Reversal

(Parallel to Production Errors)

Boundary Detection:

Words — Syllables — Idealised Components

Quality Sensitivity:

‘s’ vs. ‘sh’ vs. ‘tch’

Quantity Sensitivity:

‘Stadt’ vs. ‘Staat’

Minimal Pairs

Multi-Foil, Cycle-Until-Pass Leitner Boxes

Surround Sound
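
The "Multi-Foil, Cycle-Until-Pass Leitner Boxes" note above could be sketched roughly as follows. This is only one reading of the idea: "multi-foil" is taken to mean that each trial shows the target among several distractors, and "cycle-until-pass" to mean that a missed item is re-queued until it is answered correctly. The names `make_trial` and `run_session`, the three-box limit, and the word lists are all hypothetical.

```python
import random

def make_trial(target, pool, n_foils=3, rng=random):
    """Build one multi-foil trial: the target plus n_foils distractors, shuffled."""
    foils = rng.sample([w for w in pool if w != target], n_foils)
    options = foils + [target]
    rng.shuffle(options)
    return options

def run_session(boxes, answer, max_box=3):
    """Cycle through the items until each one has been answered correctly.

    boxes:  dict mapping item -> current Leitner box (1 = reviewed most often)
    answer: callable(item) -> bool, True when the learner picks the target
    A miss sends the item back to box 1 and re-queues it. An item is promoted
    one box only if it was never missed during this session.
    """
    queue = list(boxes)
    missed = set()
    while queue:
        item = queue.pop(0)
        if answer(item):
            if item not in missed:
                boxes[item] = min(boxes[item] + 1, max_box)
        else:
            missed.add(item)
            boxes[item] = 1
            queue.append(item)
    return boxes
```

In use, `answer` would present the options from `make_trial` and check the learner's choice; items in higher boxes would then be scheduled less often in later sessions.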

Looped Feedback Experiment

July 19, 2014

For the L2 subjects to learn to produce a new set of allophones, it may be helpful for them to hear what they are producing in real time.

The setup is similar to that of a voice-over studio, with two key elements: the mic feeding into the computer, and enclosed headphones.

First, the demonstration should include two parts:

  • Videos and 3D diagrams from multiple angles, perhaps with slow motion, to explain the anatomical construction of the phone.
  • High quality audio samples from different actors, preferably matched to vocal characteristics (sex, base pitch) of the learner.

Each cycle will go through these steps:

  • A sample phone is played, along with visual cueing by waveforms (of MFCC or whatever).
  • Learner tries to produce the same sound.
  • The production is fed in and played back through the headphones instantaneously.
  • Visualisation of the learner’s production is also drawn and shown on the screen, with analytics and colour cues.
  • With further speech analysis modules, more detailed analyses and feedback can be given about the formants, timing, and other acoustic features. These can also be summarised as some simple indication of accuracy (that is, “whether the learner has hit the target”).
  • Each phone will be repeated through many cycles, until and even after reaching consecutive sessions of success, and over many days.
  • The end result should be easily elicited, accurate production of the phone, and rapid self-correction when deviation occurs (“allophonic awareness”).
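
The "simple indication of accuracy" and the repeat-until-success step could look something like the sketch below. It assumes a separate speech analysis module has already measured the first two formants of each production. The `Formants` class, the log-ratio distance, the 0.15 tolerance, and the streak of three are illustrative assumptions, not tested values.

```python
import math
from dataclasses import dataclass

@dataclass
class Formants:
    f1: float  # first formant, in Hz
    f2: float  # second formant, in Hz

def hit_target(target, produced, tolerance=0.15):
    """True if the production lies within `tolerance` of the target.

    Distance is measured on log frequency ratios, so the same relative
    deviation counts equally for F1 and F2.
    """
    d1 = math.log(produced.f1 / target.f1)
    d2 = math.log(produced.f2 / target.f2)
    return math.hypot(d1, d2) <= tolerance

def passed(attempts, target, streak_needed=3, tolerance=0.15):
    """True once the learner hits the target `streak_needed` times in a row."""
    streak = 0
    for attempt in attempts:
        streak = streak + 1 if hit_target(target, attempt, tolerance) else 0
        if streak >= streak_needed:
            return True
    return False
```

A pass within one session would not end the drilling, since each phone is meant to be repeated even after consecutive successes and over many days.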

The training will probably start with monophthongs, since they are less ambiguous. As the McGurk Effect demonstrates, the difference between stops (e.g. ‘da’ vs. ‘ga’) can be very small. When played as isolated sounds, there is the closure, and then there is the release; it is difficult for both machines and novice learners to discern the acoustic differences (which are mostly found in context). Therefore, these phones should be trained in conjunction with their co-articulatory neighbours, e.g. ‘down’ vs. ‘gown’.

After the individual sounds, we might as well move onto clusters, going through the probabilistic list of phonotactically plausible onset and coda clusters, and rhymes.

Auditory awareness exercise

We should also go to very busy places (malls, train stations, etc.) and record hi-fi binaural samples, then ask the listener: how many different sounds can you hear in the environment?

By the way, I just found some really weird stuff on binaural recording [the internet phenomenon of ASMR] …

The problem with L2 imitation tasks

July 24, 2014

What we hear does not represent a hi-fi rendering of the real sounds. When the sound reaches our perception, it has already been processed and distorted. It is the same kinds of ‘distortion’ and ‘optimising’ that give us the amazing ability to extract meaning in a noisy, cocktail party environment.

When novice learners try to ‘imitate’ and re-produce the L2 phonemes, the input signals must be squeezed through a special, convenient filter that removes details and adds some stuff that is not really there, essentially to facilitate the production process. Therefore, this task adds pressure to make the input more heavily “processed”.

Whereas, if you imagine a mindfulness exercise or some meditation-like task, where you sit in a forest or a busy market, trying to discern the dozens of different sound sources, from wind to trees to multiple kinds of birds, without having to do anything about them — in that situation, the input is probably left more in its original, raw form.

And because of the selective nature of processing, stimuli can often be attenuated due to de-sensitisation, hence leading to loss of the range of signals that would be “perceivable” to begin with. It is the observer’s paradox in the introspective sense.

Children’s resistance to corrections may have a different cause…
