How Airtime Utilizes Objective Audio Quality Analysis

Tech @ Airtime
Published in Airtime Platform
Apr 30, 2021

by Justin Wong

In a previous post, I discussed my research on the most effective way to objectively analyze audio quality at Airtime. But how can we use this analysis to actually maintain and improve Airtime’s audio quality? Simply copying audio files that went through Airtime’s encoding process into ViSQOL, our audio quality analysis API, would not be sufficient. To make beneficial and impactful changes to Airtime, we need a robust and flexible tool built on top of ViSQOL.

Huron

Huron is an application that takes in an original audio file and its degraded counterpart as inputs and outputs the results in JSON. The JSON file contains the MOS (mean opinion score) indicating the quality score of the degraded file relative to the original. The workflow of Huron is as follows:

Huron Workflow Diagram

At the beginning of execution, the Huron main application creates an AudioController, which is responsible for returning the MOS generated by ViSQOL’s API for the original and degraded audio files passed in from the main app. The AudioController then creates two WavParsers: one for the original audio sample and one for the degraded audio sample. Each WavParser asynchronously decodes its audio into a discrete array of numeric values. The WavParsers are also responsible for resampling the audio and converting it from stereo to mono if necessary. When a WavParser finishes decoding its audio, it sends a signal back to the AudioController to indicate that it is finished. Once the AudioController has received a finished signal from both WavParsers, it trims the parsed audio data so that only the parts common to both remain. Finally, the AudioController passes the trimmed data to ViSQOL’s API and receives the MOS as a promise. The AudioController outputs the result as JSON and then fires a signal to the main app, letting it know that everything is finished and that it can exit safely.
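The post-decoding steps described above (stereo-to-mono conversion, trimming, and JSON output) can be sketched roughly as follows. This is a minimal illustration in Python, not Huron’s actual code. The function names and JSON field names are hypothetical, and the trimming step assumes both recordings start at the same instant, so the common part is simply the shorter of the two lengths. Resampling and the ViSQOL call itself are left out.

```python
import json
from typing import List, Sequence, Tuple


def downmix_to_mono(interleaved: Sequence[float], channels: int) -> List[float]:
    """Average the channels of each frame of interleaved samples into one sample."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    frame_count = len(interleaved) // channels
    return [
        sum(interleaved[i * channels:(i + 1) * channels]) / channels
        for i in range(frame_count)
    ]


def trim_to_common(
    reference: Sequence[float], degraded: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Keep only the span both signals cover.

    Assumption: both signals start at the same instant, so trimming
    reduces to cutting each one to the shorter length.
    """
    n = min(len(reference), len(degraded))
    return list(reference[:n]), list(degraded[:n])


def result_as_json(reference_path: str, degraded_path: str, mos: float) -> str:
    """Serialize the score. The field names here are hypothetical."""
    return json.dumps(
        {"reference": reference_path, "degraded": degraded_path, "mos": mos}
    )
```

A real implementation would likely need to align the two signals before trimming, since a recording made through an encoding pipeline can start with an offset relative to the original.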

Now that we have a tool that allows us to generate a MOS given an original audio file and a degraded audio file as inputs, we can see how network and CPU constraints affect audio quality at Airtime.

Results

I gathered data sets using an iPhone 8, recording audio that went through Airtime’s encoding process while network and CPU constraints were applied to the iPhone. Different results are expected under different constraints, because the encoding process takes the available CPU and network resources into account before generating the encoded output. Both one-variable and two-variable tests were used to observe how constraints impact audio quality, both alone and in conjunction with other constraints. Additionally, test scenarios for both speech and music were covered.

Although the score decreases as the available bandwidth decreases, the scores are all very similar until the last column. Because of variance, a run under a moderate constraint can even score a higher MOS than a run with no constraints. The audio quality only starts to decrease noticeably once the bandwidth drops below a certain threshold, but below that threshold it drops extremely quickly, as shown in the 100Kbps example.

Initially, the score seems to decrease as packet loss increases. However, the moderate and severe packet loss cases have a similar average, which is likely due to variance. The 1.53 outlier in the moderate packet loss example came from a run where a very large portion of the beginning of the file was completely missing. The outliers in the severe packet loss example (3.71 and 3.27) came from runs where only insignificant parts of the audio were cut off. Since the lost packets are random, it is unknown whether significant or insignificant parts of the audio will be lost, so the MOS varies a lot. However, there is still an overall trend where MOS decreases as packet loss increases.

There appears to be only a weak correlation between CPU utilization and MOS when only audio is being transmitted. However, as CPU utilization increased, there were very occasional short hiccups in the degraded audio file, which probably explains the cases where the score was 3.32 and 3.21. Overall, even at relatively severe CPU utilization, the audio quality is only sometimes affected, and when it is, the degradation is not severe.

The above table is an example of a two-variable test where both bandwidth and packet loss constraints were applied. We can observe that adding a moderate bandwidth constraint has essentially no effect on the MOS. The results for severe packet loss combined with a moderate bandwidth constraint are very similar to the severe packet loss one-variable table. The MOS appears to gravitate strongly towards whichever of the one-variable test cases produced the lower score. For the bottom-right cell of the above example, one might expect the two-variable score to be significantly lower than 1.9. However, the average score is still around 1.9, because the score gravitates towards the lower one-variable score. This is likely because MOSs are not calculated on a linear scale, so the variable that individually affects the MOS less seems almost insignificant compared to the other. The other two-variable tests that were performed further reinforce this idea.
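As a rough rule of thumb, this observation amounts to estimating a two-variable score as the lower of the two one-variable scores. The sketch below is only an illustration of that heuristic, not a model used by Huron or ViSQOL. The function name is hypothetical, and the scores are placeholders rather than measured results.

```python
from typing import Mapping


def estimate_combined_mos(single_variable_mos: Mapping[str, float]) -> float:
    """Estimate the MOS under several constraints as the lowest single-constraint MOS.

    This encodes the observed tendency only; it is not a fitted model.
    """
    if not single_variable_mos:
        raise ValueError("need at least one single-variable score")
    return min(single_variable_mos.values())


# Placeholder scores for illustration only, not measured results.
scores = {"constraint_a": 4.0, "constraint_b": 1.9}
print(estimate_combined_mos(scores))  # prints 1.9
```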

We can observe from the above table that publishing a 180x180 video has essentially no impact on the MOS. Even a 1920x1080 video has minimal impact. For the moderate bandwidth and packet loss values, we can see that the scores are very similar to the constraint’s 1-variable MOS values. The exception here seems to be the CPU constraint. Publishing a video at the same time with a heavy CPU constraint seems to lower the score further than without the video. This is likely because when there is a video, the already limited CPU has to both render the video and process the audio. The reason that publishing a video does not affect the other constraints to nearly the same degree is probably due to the fact that bandwidth and packet loss constraints don’t strain the CPU.

Music:

It appears that the MOS significantly drops even with no constraints when a complex music sample is passed through the encoder. For the bandwidth constraints, similar to speech, the bandwidth only seems to noticeably affect the MOS when it is a severe constraint.

Overall, the music MOSs show that the score doesn’t drop nearly as much given constraints. This could partially be due to the fact that the audio quality score is already quite low with no constraints, and adding some constraints wouldn’t make it much worse. Despite having a smaller effect, the scores follow the same pattern as the speech examples. We can observe that the MOS also gravitates to the more affected single variable constraint score when there are 2 variables.

Because we learned how different constraints affect audio quality at Airtime in different ways, we now have a baseline to refer to, allowing us to evaluate whether our encoding process could be improved under specific conditions! Another application of Huron is ensuring that changes to our encoding process do not negatively impact the audio received by users.

Using Huron in Automation Tests

To ensure that we maintain our current level of audio quality, Huron was added to Airtime’s automation environment. When features are added to Airtime’s encoding process, the tests in the automation environment are run to verify that everything is working as intended. The new tests recorded an audio file that went through our encoding process, and passed it into Huron with the original file to receive a MOS. The MOS was then checked against a pre-defined value based on the results gathered from the tables above. The tests added were as follows:

Huron Automation Tests

The goal is for these tests to be accurate without having an excessively long runtime. To achieve this, we ran each test case three times, and if the median score was above the expected MOS, the test was considered passing.
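The median-of-three check can be sketched as below. This is a minimal Python illustration rather than the actual automation code. `measure_mos` is a hypothetical callable standing in for one full run (record the encoded audio, pass it to Huron with the original file, read back the MOS).

```python
from statistics import median
from typing import Callable


def passes_quality_gate(
    measure_mos: Callable[[], float], expected_mos: float, runs: int = 3
) -> bool:
    """Run the measurement `runs` times and pass if the median MOS is above the threshold."""
    scores = [measure_mos() for _ in range(runs)]
    return median(scores) > expected_mos
```

Using the median rather than the mean means a single outlier run, such as one where random packet loss happens to remove a large part of the audio, does not fail the test on its own.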

These tests serve as a baseline for Airtime’s expected audio quality under different circumstances. With these tests in place, we can ensure that our level of audio quality is maintained when any changes are made to Airtime’s encoding process. Additionally, if we are looking to improve audio quality under certain constraints, these test cases are a simple and effective way to check whether the optimizations are working as intended!
