Transfer Functions — ITD, ILD, and HRTFs with Applications in Swift

A deep dive into the algorithms behind spatial audio, their applications for VR headsets, and how we can code such algorithms.

Piram Singh
14 min read · Jan 30, 2024

Ignoring the math behind spatial audio for VR is like ignoring user research for great product design — let’s dive deep into spatial audio!

Now to really understand the math behind HRTFs and spatial audio, I explored the realm of audio and came to these key ideas:

  • Spatial Audio or 3D Audio is recorded as normal (2D) audio and then converted into spatial audio through a number of post-production processes.
  • The main algorithms behind these post-production processes are:
    - Interaural Time Difference (ITD)
    - Interaural Level Difference (ILD)
    - Head Related Transfer Functions (HRTFs)

⌛️ Interaural Time Difference (ITD)

The first method, Interaural Time Difference (ITD), is a sound localization cue measured in terms of time. In simpler terms, the ITD is the difference in arrival time of a sound between the two ears.

With ITD, we can use that number to delay a sound from one ear to the other, creating a 3D environment in the listener’s head.

More on its applications later. Let’s dive into the math 👇

ITD = (rθ + r sin θ) / c

r = radius of the listener’s head

θ = the angle the sound source is coming from, also known as the azimuth (in radians)

c = speed of sound (typically measured in meters/second)

With this formula in mind, let’s do a quick example just to show you how this could be applied in post-production or coding.

Step 1: Setting θ

In the example above, the listener’s sound source is coming from behind them, so if you go back to trig and the unit circle, the sound source sits directly at 180°, which needs to be converted into radians, so θ = π. Here’s a list of the main sound angles you can memorize when solving for ITD:

  • 0 Degrees (0 Radians) Azimuth: Directly in front of you.
  • 90 Degrees (π/2 Radians) Azimuth: To your right.
  • 180 Degrees (π Radians) Azimuth: Directly behind you.
  • 270 Degrees (3π/2 Radians) Azimuth: To your left.

Step 2: Setting r & c

The next step is to set our r & c variables. I took the average diameter of the human head, 18 cm, and divided it by 2 to get a radius of 9 cm. After that, we need to convert centimeters to meters, because our c variable is in meters/second. That being said, r = 0.09 m.

In setting our c, the average speed of sound is 343 meters/second, so c = 343.

Step 3: Solving the equation

The last step is solving the equation. It should be pretty simple to follow, but here’s a quick explanation of the work:

  1. Solve 0.09 sin(π). sin(π) = 0, so that term equals 0.
  2. Solve 0.09π, which equals 0.282743.
  3. Divide by 343 (our speed of sound). Meters divided by meters/second gives seconds, so the result is ≈ 0.00082 seconds (about 0.82 ms).
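To make the formula concrete in code, here’s a minimal Swift sketch. The function name and default parameter values are mine (9 cm radius, 343 m/s), not from any audio library; it just evaluates (rθ + r sin θ) / c.

import Foundation

// A sketch of the ITD formula used above: ITD = (rθ + r·sin θ) / c, returning seconds.
func interauralTimeDifference(azimuth theta: Double,                      // radians
                              headRadius r: Double = 0.09,                // meters (9 cm)
                              speedOfSound c: Double = 343) -> Double {   // meters/second
    return (r * theta + r * sin(theta)) / c
}

// The worked example: a sound source directly behind the listener (θ = π).
print(interauralTimeDifference(azimuth: .pi)) // ≈ 0.00082 seconds (about 0.82 ms)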

Just from this solution alone, we can say that the ITD at this point in space (directly behind the listener’s head) is so small that the listener’s ability to hear the delay is probably minimal, and it will make only a marginal difference to the overall listening experience.

… But if we combine ITD numbers with ILD numbers, the combination makes for a solid, cost-effective post-production process that allows for some pretty good audio experiences. Let’s dive into ILD!

🔊 Interaural Level Difference (ILD)

ILDs, unlike ITDs, involve far more complex and varied equations that rely heavily on anatomical knowledge of the ear (we’ll get to one of those methods with HRTFs). But if you need to know one thing about ILDs, it’s this: the ILD is the difference in loudness and frequency distribution between the two ears.
ILDs essentially tell you in which ear a sound is louder or quieter, helping you determine where the sound originated.

With that in mind, let’s get to the math 👇

ILD functions have many different interpretations and variations, since there are countless factors involved in calculating sound intensity. From my research, the best formula that captures the overall phenomenon, including the sound angle (azimuth) and the frequency, is:

ILD = 0.18 √f sin(θ)

Let’s take the same model from above and find the ILD under those conditions:

Step 1: Setting θ & f

In this example, we use a slightly modified version of the last model: instead of the speed of sound, we look at the frequency of the sound, measured in Hz.

For θ, we keep it at π radians since the position of the source has not changed. For f, I went with 432 Hz, as that frequency is said to allow for “Enhanced Clarity” in the mind.

Step 2: Solving the equation

Plugging our values into ILD = 0.18 √f sin(θ): since sin(π) = 0, the whole expression evaluates to 0. An ILD of 0 dB? I was confused too, but here’s a property of ILDs you should know:

  • There is no level difference when the sound source is directly in front of or behind you. Meaning when θ = 0° (0 radians) or 180° (π radians), the ILD will always equal 0.
  • The effect of an ILD increases once your azimuth (θ) hits 45° (π/4).

That being said, let’s see what happens when we replace π with π/4 👇
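Here’s a small Swift sketch of that comparison. The helper name is mine; it just evaluates 0.18 √f sin(θ) for both azimuths:

import Foundation

// A sketch of the ILD approximation used above: ILD = 0.18 · √f · sin(θ), in decibels.
func interauralLevelDifference(frequency f: Double, azimuth theta: Double) -> Double {
    return 0.18 * sqrt(f) * sin(theta)
}

print(interauralLevelDifference(frequency: 432, azimuth: .pi))     // 0 dB, source directly behind
print(interauralLevelDifference(frequency: 432, azimuth: .pi / 4)) // ≈ 2.65 dB, source at 45°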

Now look how useful this is: just by changing the position of the sound source, we get a noticeable difference in intensity between the two ears, creating a more convincing environment.

That being said, ILDs and ITDs are very good algorithms to use in audio post-production, but they only scratch the surface and only tell us what happens at the ears. If we want more precise spatial tracking, we need to study HRTFs, or Head Related Transfer Functions.

👦🏽 Head Related Transfer Functions (HRTFs)

The last method, and the most important to understand, is the Head Related Transfer Function. To understand this function, we need to understand the following parts:

  • Understanding Laplace Transforms
  • Understanding Transfer Functions

Laplace Transforms

Laplace transforms are the key to understanding how a transfer function works. The Laplace transform lets us convert a function in the time domain (t) into a function in the frequency domain (s). It’s also heavily used to convert differential equations into algebraic expressions (it’s a tool that makes equations easier to work with). Here’s the formula to focus in on:

L{f(t)} = ∫₀^∞ f(t) · e^(-st) dt

Essentially, the Laplace transform L{f(t)} is an improper integral (the upper limit is infinity) where you multiply the function f(t) by e^(-st). That e^(-st) factor is what converts the function from the time domain into the frequency domain.
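If you want to sanity-check the definition numerically, here’s a rough Swift sketch. The helper name and the finite upper limit (standing in for infinity) are my own choices; it approximates the integral with a simple Riemann sum and compares the result against the table rule L{e^(at)} = 1/(s - a).

import Foundation

// A rough numerical approximation of L{f(t)}(s) = ∫₀^∞ f(t) · e^(-st) dt,
// truncating the infinite upper limit at a finite cutoff (a sketch, not production code).
func numericalLaplace(_ f: (Double) -> Double,
                      s: Double,
                      cutoff: Double = 50.0,
                      step: Double = 0.0001) -> Double {
    var sum = 0.0
    var t = 0.0
    while t < cutoff {
        sum += f(t) * exp(-s * t) * step
        t += step
    }
    return sum
}

// Example: f(t) = e^(-2t). The table rule L{e^(at)} = 1/(s - a) with a = -2 and s = 1
// predicts 1/(1 + 2) ≈ 0.333, and the numerical estimate lands right next to it.
print(numericalLaplace({ t in exp(-2.0 * t) }, s: 1.0)) // ≈ 0.333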

That being said, I needed to learn more and actually do some practice problems to understand the concept. As mentioned in the definition up top, one of its uses is converting differential equations into algebraic expressions. Here’s an example I worked through:

In this example, we start out with a basic differential equation you might find in a calculus class. We then apply the Laplace transform notation by isolating the y terms and pulling the coefficients out. Next, we use the differentiation rule to replace the first and second derivatives with simpler expressions.

In the second and third steps, we rewrite every part of the equation in Laplace form so we can then manipulate it algebraically.

Now that we have it in Laplace form, remember the two values from the first step: y(0) = 2 and y’(0) = 3. Take a look at the equation, where we can now substitute those values in and simplify further.

In the fifth and sixth steps, the second part of doing these Laplace transforms is partial fractions (something you probably learned in Algebra 1/2). Essentially, we’ve reached the most simplified version of the equation, where we have to factor the denominator and find A and B.

In this step, we can plug A and B, 9 and -7, back into the simplified equation. Now here’s where the Laplace table flows back in: L{e^(at)} = 1/(s - a). We essentially have that form in our answer, just with negative exponents. We can use this rule, do a little pattern matching, and land on L{y} = 9L{e^(-2t)} - 7L{e^(-3t)}.

In the last step, we just solve for y; the Laplace transform essentially cancels out and gives us the expression y = 9e^(-2t) - 7e^(-3t).
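The worked slides aren’t reproduced here, but the values quoted above (y(0) = 2, y’(0) = 3, A = 9, B = -7, and the final answer 9e^(-2t) - 7e^(-3t)) are all consistent with the differential equation y’’ + 5y’ + 6y = 0, so here is the chain of steps written out under that assumption:

  1. Start: y’’ + 5y’ + 6y = 0, with y(0) = 2 and y’(0) = 3
  2. Apply the transform: (s²·L{y} - s·y(0) - y’(0)) + 5(s·L{y} - y(0)) + 6·L{y} = 0
  3. Substitute the initial conditions and collect terms: (s² + 5s + 6)·L{y} = 2s + 13
  4. Factor the denominator and use partial fractions: L{y} = (2s + 13)/((s + 2)(s + 3)) = 9/(s + 2) - 7/(s + 3)
  5. Match against L{e^(at)} = 1/(s - a): L{y} = 9L{e^(-2t)} - 7L{e^(-3t)}, so y = 9e^(-2t) - 7e^(-3t)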

With the fundamentals of the Laplace transform understood, we can get a deeper understanding of the mechanics of transfer functions.

Understanding Transfer Functions

Transfer functions are simply this: take any engineering system you know, and a transfer function is a comparison of its output to its input that helps you understand how the system works. Let’s use an example to illustrate this:

Let’s say we have an airplane. An airplane is a pretty complex system with multiple sub-systems. A transfer function lets us take one input and one output and find the causality between only those two variables. For example, we could use a transfer function to relate the rudder of the aircraft (input) to the angle of attack (output). This is a really good example to help you learn more about the system as a whole; see the full video for more info.

Let’s go through a quick example problem of a transfer function:

In this example, we look at a very common problem that many engineers know well: the mass and spring problem. We compare our input u(t), the force applied to the mass, with our output y(t), the position the mass ends up at. There are also the constants k and c for the spring and the damper. We use a basic equation of motion to derive the differential equation for the system.

In the second step, we pull out the constants, apply the Laplace transform notation, and get ready to substitute the y terms using the differentiation theorem. There are two key things to note here:

  • Transfer functions convert variables in the time domain (t-domain) to the frequency domain (s-domain)
    - Here’s a quick explanation of the two domains: the time domain is where most of our reality plays out, the things we can observe over time. The frequency domain describes things we can’t directly see, namely the frequencies of sounds and their amplitudes. More on this here.
  • The initial conditions of the system always need to equal zero, so that the mathematical model we get is an objective view of the relationship between output and input.
    - These are denoted in the equation by y(0) and y’(0)

In the third step, we convert every part of the equation into the s-domain using the differentiation theorem.

In the fourth step, we set the initial conditions to zero and simplify to solve for G(s), which is Y(s)/U(s) → 1/(ms² + cs + k). Now this is cool and all, but we really need to visualize this transfer function to see what a hypothetical situation could look like. To do this, let’s use Mathematica, a tool by Wolfram!

To create a more realistic solution, I set some constants for m, c, and k and solved the system with those values in mind.

The next step is to set our U(s); I take the Laplace transform of a unit step (a concept from physics), which is 1/s. We use that as the input to solve for Y(s), and the result is an output equation in the s-domain.

Everything so far lives in the frequency domain. To visualize the graph, we need to convert back into the time domain so we (humans) can explicitly see the causality. To do this, we use the InverseLaplaceTransform call to convert the equation, and we are left with a graphable expression.
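If you don’t have Mathematica on hand, here’s a hedged Swift sketch that reaches the same kind of picture numerically. It integrates the time-domain system m·y’’ + c·y’ + k·y = u(t) for a unit step input using simple Euler steps; the constants m, c, k and the step size are values I picked for illustration, not the ones used above.

import Foundation

// A sketch: numerically simulate the mass-spring-damper step response instead of
// solving it symbolically. m, c, k are placeholder values chosen for illustration.
let m = 1.0, c = 1.0, k = 5.0
let dt = 0.001
var y = 0.0        // position (initial condition zero)
var v = 0.0        // velocity (initial condition zero)

for stepIndex in 0...Int(10.0 / dt) {          // simulate t = 0…10 seconds
    let t = Double(stepIndex) * dt
    let u = 1.0                                 // unit step input
    let a = (u - c * v - k * y) / m             // y'' = (u − c·y' − k·y) / m
    v += a * dt
    y += v * dt
    if stepIndex % 1000 == 0 {
        print("t = \(t), y = \(y)")             // sample the response once per second
    }
}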

Plotting the result for time in the range 0 to 10, we can see the causality between our force input and the end position. That being said, I was curious what the graphs look like for HRTFs, and here’s what I found:

Here are some conclusions we can draw from these graphs:

  • These graphs effectively capture four dimensions. The two main axes compare frequency with the normalized magnitude of the sound wave.
  • The changing variables across the graphs are the azimuth of the sound source and which ear the sound reaches.
  • These graphs are the result of very expensive testing that is highly specific to a listener’s ear size, head size, and general upper body shape.

HRTFs have no single equation that can be stated as fact, because the process is so highly specific and customized to the listener. Many companies avoid specialized HRTFs and instead average lower-cost solutions such as ILDs and ITDs to create immersive audio experiences. With this in mind, I wondered whether there are companies creating HRTF innovations to help programmers integrate spatial audio into their applications…

and that’s when I came across this patent by Apple, which builds out innovative virtual HRTF maps for programmers and users to create better spatial audio. Signs of this are already showing up in Swift libraries.

👨🏽‍💻 HRTFs and Spatial Audio in Swift

We’ve learned about all these algorithms, and they are applied in multiple ways: music production, speaker production, software applications (Spotify, Apple Music), and VR audio experiences.

As someone who is passionate about programming in Swift, I’ll explain how we can see some of these in Swift:

AVAudioEnvironmentNode

class AVAudioEnvironmentNode : AVAudioNode

AVAudioEnvironmentNode inherits from the AVAudioNode class, which connects back to older Apple frameworks originally built in Objective-C (an older Apple programming language).

There are some structures and variables directly related to this class that map onto the three algorithms.

To set the positional properties for a spatial audio environment, we can use the following code:

var listenerPosition: AVAudio3DPoint { get set }

This sets the listener’s position in the space (in meters).

var listenerAngularOrientation: AVAudio3DAngularOrientation { get set }
var listenerVectorOrientation: AVAudio3DVectorOrientation { get set }

These two variables allow for a more detailed description of how the listener is oriented in the space. AngularOrientation expresses the orientation as angles (yaw, pitch, roll), while VectorOrientation expresses the same orientation as forward and up vectors in x-y-z coordinates. A change in one of them corresponds to a change in the other.

The default value of AngularOrientation is (0°, 0°, 0°), which points the listener down the -z axis.

The default value of VectorOrientation also points down the -z axis, with a forward vector of (0, 0, -1) and an up vector of (0, 1, 0).
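Putting those positional properties together, here’s a minimal sketch (just the environment node, not a full engine graph) that sets the listener’s position and both orientations to the defaults described above:

import AVFoundation

// A minimal sketch of configuring the listener on an AVAudioEnvironmentNode.
let environment = AVAudioEnvironmentNode()

// Listener position in the 3D space, in meters.
environment.listenerPosition = AVAudio3DPoint(x: 0, y: 0, z: 0)

// Angular orientation: yaw, pitch, roll in degrees (the default faces down the -z axis).
environment.listenerAngularOrientation = AVAudio3DAngularOrientation(yaw: 0, pitch: 0, roll: 0)

// Vector orientation: the same default facing expressed as forward and up vectors.
environment.listenerVectorOrientation = AVAudio3DVectorOrientation(
    forward: AVAudio3DVector(x: 0, y: 0, z: -1),
    up: AVAudio3DVector(x: 0, y: 1, z: 0)
)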

While AVAudioEnvironmentNode helps in setting up positional properties, there are other APIs in Swift that directly integrate HRTFs.

AVAudio3DMixingRenderingAlgorithm

enum AVAudio3DMixingRenderingAlgorithm : Int, @unchecked Sendable

One enum in the AVAudio framework is AVAudio3DMixingRenderingAlgorithm, which has multiple cases that users can choose from in their apps:

case auto
case equalPowerPanning
case HRTF
case HRTFHQ
case soundField
case sphericalHead
case stereoPassThrough

We really care about three cases in this enum: HRTF, HRTFHQ, and sphericalHead.

The first one, case HRTF (backed by an Int raw value, like every case in this enum), tells the mixer to render the source through an HRTF filter, mapping the listener into an HRTF environment. This is integral to creating more personalized HRTF maps for VR environments.

case HRTFHQ allows for more accurate mapping and sound localization in audio rendering. Like the previous case, it is backed by an Int raw value.

The last case is case sphericalHead, which simulates a 3D environment for headphone users and can implement interaural time delays. More likely than not, algorithms such as (rθ + r sin θ) / c are built into the back end here.

What’s the main difference between these three? It’s pretty simple: for building an HRTF map of a user, the HRTF and HRTFHQ cases are more CPU-intensive and more customizable, but they cost more.

Most VR companies tend to go with programs that use parameters along the lines of case sphericalHead because they are easier to calculate and cost-effective.
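To see how these cases get used, here’s a hedged sketch of wiring a player node through an environment node and picking a rendering algorithm. The mono format and the source position are placeholder values; AVAudioPlayerNode adopts AVAudio3DMixing, so renderingAlgorithm is set on the source node rather than on the environment:

import AVFoundation

// A sketch of a minimal spatial-audio graph: player → environment → main mixer.
let engine = AVAudioEngine()
let environment = AVAudioEnvironmentNode()
let player = AVAudioPlayerNode()

engine.attach(environment)
engine.attach(player)

// 3D-mixed sources are typically mono; the environment node handles the spatialization.
let monoFormat = AVAudioFormat(standardFormatWithSampleRate: 44_100, channels: 1)
engine.connect(player, to: environment, format: monoFormat)
engine.connect(environment, to: engine.mainMixerNode, format: nil)

// Pick the rendering algorithm on the source node.
player.renderingAlgorithm = .HRTFHQ                   // or .HRTF, or .sphericalHead for a cheaper option
player.position = AVAudio3DPoint(x: 2, y: 0, z: -1)   // place the source in the environment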

If you read all the way through and are still confused, here’s a summary of what we just learned:

  • Spatial Audio is the key to building immersive & amazing VR applications
  • There are three key algorithms behind Spatial Audio: ITDs, ILDs, and HRTFs
  • ITDs measure the difference in arrival time of a sound traveling from one source to each of your two ears
  • ILDs are more complex algorithms that measure the decibel difference between the sound reaching each of your ears
  • HRTFs are highly specific functions that are graphed to explain where and how hard a sound is hitting you. Apple is already building innovative ways for developers to include HRTF maps in their code.

👋 Hey, I’m Piram, and I’m an aspiring UI/UX designer exploring how AI is changing the design world. Connect with me on LinkedIn to follow me on this journey!
