Sound and Speech Acoustics

Muhammad Rizqi Nur
5 min read · Sep 24, 2022


The Nature of Sound

Sound is a longitudinal wave. It is made when a sound source vibrates. This vibration creates a wave through a medium, usually air. The vibration pushes air particles together, forming a compression: a region where the particles are packed tightly and the pressure is high. The area the particles just left becomes emptier, creating a rarefaction. A pushed particle, call it particle A, collides with another particle, call it particle B. The collision transfers energy and momentum, pushing particle B forward and bringing particle A to a stop. This happens continuously and very fast, and that chain of collisions is the wave. When the wave reaches our ears, it makes the eardrum vibrate too. Our ears pick up that vibration and turn it into signals that the brain interprets, and finally we hear the sound.

The difference from a transverse wave is that in a longitudinal wave the particles vibrate along the direction the wave travels, not perpendicular to it. The ups and downs you see in a sound waveform, the amplitude, aren't the directions the air vibrates between. They are the phases of compression and rarefaction: high for compression, low for rarefaction. So when the amplitude of a sound signal is high, it tells the speaker (sound system) to push, creating a compression, and when it's low, it tells the speaker to pull, creating a rarefaction.

How a sound is heard comes down to the frequency and amplitude of the wave. Amplitude, measured in decibels (dB), corresponds to a sound's loudness, while frequency, measured in hertz (Hz), corresponds to its pitch. But what about speech? Musical instruments? How do we distinguish sounds? We'll save that for later.
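Just to make that concrete, here's a small Python/NumPy sketch (my own illustration, not part of the original article) that generates a pure tone. The sample rate and the "dB relative to full scale" convention are assumptions I picked for the example.

```python
import numpy as np

def pure_tone(freq_hz, level_db, duration_s=1.0, sample_rate=44100):
    """Generate a sine tone: freq_hz sets the pitch, level_db the loudness.

    level_db is relative to full scale (0 dB = amplitude 1.0),
    so -6 dB is roughly half the amplitude.
    """
    amplitude = 10 ** (level_db / 20)                    # dB -> linear amplitude
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

quiet_low = pure_tone(220, -20)   # low pitch, quiet
loud_high = pure_tone(880, -3)    # higher pitch, louder
```

Doubling the frequency raises the pitch by an octave, and raising the level by about 6 dB roughly doubles the amplitude.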

A little bit about the human hearing

The human ear can hear sounds between roughly 20 Hz and 20 kHz. This range varies between people, especially with age: as we grow older, we lose the ability to hear high-frequency sounds. But don't worry, most everyday sounds don't go that high, not even half of that maximum. Speech in particular sits low; its fundamental frequency is typically only a few hundred hertz, and most of what we need to understand it fits within a few kilohertz. This is why voice calls over the internet can be really cheap in bandwidth.

Because of how sound waves travel, sounds mix with each other by addition. Two waves travelling in a similar direction push the same air particles, so their pressures simply add up. This makes the sound waveform we actually hear very complex. Even music, which is specifically produced to be clear, is still very complex. Yet we humans can easily distinguish between sounds.
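To see the "mixing by addition" part in code (again my own sketch, reusing the hypothetical pure_tone helper from the example above): mixing two sounds is just sample-wise addition of their waveforms.

```python
# Two unrelated tones mixed the way air mixes them: by adding pressures.
tone_a = pure_tone(300, -6)
tone_b = pure_tone(470, -6)
mixed = tone_a + tone_b   # the waveform our ear would actually receive
```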

Pitch and Fundamental Frequency

A sound, even a single sound from a single source, consists of a lot of frequencies. The pitch we hear comes from the fundamental frequency. It is the first harmonic (F0), the lowest one. Harmonics are frequencies that are integer multiples of the fundamental frequency. Say the F0 is 220 Hz; then the harmonics are 220 Hz, 440 Hz, 660 Hz, 880 Hz, and so on. Harmonics higher than F0 are called overtones. When you mix the harmonics into one wave, the overall wave still repeats at the rate of F0, while the overtones become small details riding along it. So when we hear or play a musical note, what determines the note is the F0.
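A quick way to check this (my own sketch, with arbitrary harmonic amplitudes): build a wave out of a 220 Hz fundamental plus three overtones and verify that the whole mixture still repeats every 1/220 of a second, which is why the perceived pitch stays at F0.

```python
import numpy as np

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate          # one second of samples

f0 = 220.0                                        # fundamental frequency (Hz)
harmonics  = [1, 2, 3, 4]                         # 220, 440, 660, 880 Hz
amplitudes = [1.0, 0.5, 0.3, 0.2]                 # arbitrary choices

def harmonic_mix(times):
    return sum(a * np.sin(2 * np.pi * f0 * k * times)
               for k, a in zip(harmonics, amplitudes))

wave = harmonic_mix(t)

# The mixture still repeats every 1/f0 seconds, so the perceived pitch is F0,
# even though the wave also contains energy at 440, 660 and 880 Hz.
print(np.allclose(wave, harmonic_mix(t + 1 / f0)))   # True
```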

Fourier Transform

There's this thing called the Fourier transform. The idea is that a complex wave can be reproduced by adding up a whole lot of simple sine and cosine waves. These are what all those frequencies are: the frequencies of the sine waves that would make up the complex sound wave. They aren't easy to determine exactly for a real recording though, so in practice we use windowing and approximations to get something that's close enough.
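For example, here's a rough sketch (mine, not the author's) of that idea using NumPy's FFT on the harmonic mixture from before. The Hann window is one common choice for the windowing step, and the exact numbers only hold for this made-up signal.

```python
import numpy as np

# Rebuild the 220 Hz harmonic mixture from the earlier sketch.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
f0 = 220.0
wave = sum(a * np.sin(2 * np.pi * f0 * k * t)
           for k, a in zip([1, 2, 3, 4], [1.0, 0.5, 0.3, 0.2]))

window = np.hanning(len(wave))                    # windowing to reduce leakage
spectrum = np.abs(np.fft.rfft(wave * window))     # magnitude per frequency bin
spectrum /= spectrum.max()                        # normalize to the strongest bin

# One second at 44.1 kHz gives 1 Hz resolution, so bin i corresponds to i Hz.
for f in (220, 440, 660, 880, 330):
    print(f"{f} Hz: {spectrum[f]:.2f}")
# -> about 1.00, 0.50, 0.30, 0.20 for the harmonics, and ~0.00 at 330 Hz
```

The spectrum directly recovers which sine waves, and how much of each, the complex wave was built from.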

Timbre and Formants

So, back to harmonics. A sound consists of a lot of harmonics, which vary in amplitude. The amplitudes peak at certain harmonics, and F0 isn't necessarily the loudest. This pattern of harmonic amplitudes is what we call timbre. It makes a sound unique and lets us distinguish one sound from another.
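As a toy example (my own, not from the article): two sounds can share the same F0 and the same set of harmonics but weight those harmonics differently, and that difference in weighting is the timbre.

```python
import numpy as np

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
f0 = 220.0

def with_timbre(amplitudes):
    """Same pitch (F0 = 220 Hz); the harmonic weighting sets the timbre."""
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amplitudes))

bright = with_timbre([0.3, 0.5, 1.0, 0.6, 0.2])   # energy peaks at the 3rd harmonic
mellow = with_timbre([1.0, 0.4, 0.1, 0.05, 0.0])  # energy concentrated at F0
# Both are heard as the same note; they just "sound different".
```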

The peaks in the harmonics are called formants, especially in speech acoustics. Formants are largely what allow us to distinguish speech sounds, especially vowels: different vowels have their peaks at different frequencies. They also differ slightly between speakers, because different speakers have different timbres. Why are they different though?

A sound source on its own is usually not loud at all. To make the sound louder, we need a resonance chamber. The chamber lets the sound resonate and become louder. This resonance differs between chambers, which produces different sounds. A musical instrument usually has some hollow space for this, such as the sound hole and body of a guitar or violin, or the large case of a piano. These are resonance chambers.

We humans have our vocal tract, roughly the throat and the mouth, as a resonance chamber. A common simplification is that the throat (the pharynx) shapes the first formant (F1, the first peak) and the mouth shapes the second formant (F2). There are third and fourth formants too, but generally they don't matter much for distinguishing what is being said. They still contribute to the timbre though.

When we speak, we mostly just shape the mouth, with some exceptions in certain languages. The position of the tongue and the shape of the mouth affect the formants a lot. When we sing, though, we also want to adjust the throat so the first formant matches the fundamental frequency. Remember that a formant is a peak in the harmonics: when the first formant sits close to the fundamental, the pitch, the note we actually want to hear, gets a higher amplitude and sounds louder. Tuning the first formant like this can also make your singing sound better.

Sound Envelope

A sound envelope is the shape of the waveform's amplitude over time. When you speak, the sound doesn't hold a constant amplitude. It goes through 4 sequential phases: attack, decay, sustain, and release (ADSR). The same is true for musical instruments. Attack is the phase when the sound is first made and the amplitude rises to its peak. Decay is the phase when the amplitude decreases until it reaches the sustain level. Sustain is the phase when the amplitude stays relatively constant. Release is the phase after sustain when the amplitude dies away until the sound stops.

The sound envelope also helps us distinguish sounds, so it's part of timbre as well. For example, a drum generally has a very fast attack, while a violin has a slow (long) attack. The same goes for speech. When you say something, you usually don't just produce a simple vowel; you have consonants, which come in many types, and the movement of your mouth and tongue shapes the sound envelope a lot.
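Here's a minimal sketch of such an envelope (my own illustration; the phase durations and the drum/violin parameter values are made up just to show the contrast in attack):

```python
import numpy as np

def adsr_envelope(attack_s, decay_s, sustain_level, sustain_s, release_s,
                  sample_rate=44100):
    """Piecewise-linear ADSR envelope: rise to the peak (attack), fall to the
    sustain level (decay), hold it (sustain), then fade to silence (release)."""
    a = np.linspace(0.0, 1.0, int(attack_s * sample_rate))
    d = np.linspace(1.0, sustain_level, int(decay_s * sample_rate))
    s = np.full(int(sustain_s * sample_rate), sustain_level)
    r = np.linspace(sustain_level, 0.0, int(release_s * sample_rate))
    return np.concatenate([a, d, s, r])

# A drum-like envelope: almost instant attack, no real sustain.
drum = adsr_envelope(attack_s=0.005, decay_s=0.2, sustain_level=0.0,
                     sustain_s=0.0, release_s=0.05)
# A bowed-violin-like envelope: slow attack, long sustain.
violin = adsr_envelope(attack_s=0.4, decay_s=0.1, sustain_level=0.8,
                       sustain_s=1.0, release_s=0.3)
```

Multiplying a tone by an envelope like this, sample by sample, is what gives it the shape we hear.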

Voice Onset Time (VOT)

Speech isn't necessarily voiced. By voiced I mean vibrating your vocal cords, which is what usually happens for vowels. You can say "ssshhh" without doing that. This matters a lot, especially for consonants. Voice onset time is the time between when you start saying a sound and when you actually start voicing it. For example, the "t" in "tee" is not voiced; when you voice it, it becomes a "d". A voiced consonant doesn't necessarily have exactly 0 VOT though; it can be very short, but it's there. There's also negative VOT, meaning voicing starts even before the consonant is released. VOT also differs between languages and dialects.
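Just to pin the definition down with made-up numbers (this tiny example is mine, not from the article):

```python
# Hypothetical annotation times, in seconds, for one stop consonant.
burst_release = 0.120   # the moment the consonant is released (the "t" burst)
voicing_onset = 0.180   # the moment the vocal folds start vibrating

vot = voicing_onset - burst_release
print(f"VOT = {vot * 1000:.0f} ms")   # +60 ms: clearly positive, like an unvoiced "t"
# A voiced "d" would give a much smaller (or even negative) number instead.
```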

Closing

Well, that's about how sounds are created and perceived, as I understand it. It's a very complex topic. This isn't my field of expertise though; I only learned this through YouTube videos. So please do correct me if I'm wrong.

Thanks for reading.

References

It’s a playlist
