First Principle Audio Quality Assessment

Published in ShareChat TechByte · Aug 31, 2021

Written by Rini Sharon, Vikram Gupta

Providing a best-in-class user experience is a primary objective at ShareChat and Moj. One major component that affects the viewing experience is the quality of the content. However, given the multi-modal nature of the content, one might wonder what qualifies as a “high quality” post. Different facets, such as video quality, audio quality, and post content, contribute individually and collectively towards grading the quality of a post.

Relevance vs. Quality

It is also important to appreciate the difference between the relevance of content and its technical quality.

A cricket lover may enjoy a clip featuring Virat Kohli, but for someone who is not a sports enthusiast, the same clip might feel low-quality because of its lower entertainment value, irrespective of its technical quality.

The reverse can also be true: a cricket enthusiast might still enjoy a cricket-related clip with some pixelation because it is highly relevant to them. From the quality point of view, however, a high-quality visual presentation with muffled audio would not be an interesting watch, and in the same vein, high-quality audio with pixelated visuals would not be entertaining.

At ShareChat and Moj, we compute the relevance of content using recommendation systems, which can exploit higher-order signals like network effects, the popularity of the creator, trending topics, etc. In this blog post, we focus on evaluating the quality of posts from the audio analysis perspective.

The audio component of the posts uploaded on our platform can contain monologues, dialogues, instrumental music, songs, etc. Moreover, these can be either user-generated content (UGC) or professionally generated content (PGC). With such a variety of content, building a robust and reliable audio grading pipeline becomes incredibly difficult.

What is High Quality Audio?

Generally, we consider audio snippets that are not noisy and are clearly understandable to be high-quality. Unlike meaningful signals like speech, songs, or music, noise is less constrained and often uncontrolled. In common usage, it could involve some of the following:

  • Loud sounds, such as those arising from a boombox or firecrackers, with distinct amplitude characteristics.
  • Consistent background sounds, like the hum of running machinery.
  • Random and uncorrelated processes and events, which may or may not reflect in the overall signal’s amplitude or frequency composition, such as the sounds of city traffic (horns/sirens), dogs barking, and so on. These noises generally exist only for short periods of time.

Keeping these characteristics in consideration, we explore approaches that are able to capture these varied audio and noise properties in the temporal, spectral and amplitude domains.

Deep Learning Based Methods

Conventionally, audio quality assessment involves perceptual evaluation by individuals to manually generate a mean opinion score (MOS) [Robert Streijl et al., 2014, ITU P.800.1, 2016], which rates the audio quality on a 5-point scale. With the evolution of deep neural networks, architectures have been proposed to generate MOS predictions for an audio signal directly [Chen-Chou Lo et al., 2021]. However, these architectures require a significant amount of annotated data, and since the annotation is subjective, the problem becomes even more challenging [Flavio Ribeiro et al., 2011].

First Principle Methods

In the low-annotated-data scenarios, first principle methods can be very effective.

These methods eliminate the need for large amounts of labelled data and also provide better explainability.

More often than not, such protocols also happen to be computationally cost-effective and memory-optimized when compared to resource-intensive deep learning models.

In the rest of this blog, we explore audio-based post quality assessment and present an analysis of various acoustic features based on first principles that provide discrimination in the context of audio quality.

Why is audio quality assessment difficult?

Let us start by visualizing the amplitude of some high-quality and low-quality audio waveforms across time.

Figure 1: Clean vs. Noisy vs. Real-life post audio

In Figure 1, we plot the three most common waveform types that we encounter in our posts:

  • Clean Speech: This is a speech signal without any background noise. Two sentences are read with a gap of 2 seconds between them, exhibiting a region of “silence” in the middle.
  • Noisy speech: This is a recording of a song playing on a television. Due to the low-quality output from the television speakers and the re-recording, it contains a considerable amount of jarring distortion, making the lyrics hard to comprehend.
  • Clean speech audio with background noise: This is a cleanly recorded dialogue merged with a low-quality background audio track (without lyrics). The track’s volume is intensified at the beginning and end of the audio file. Although the background track is present, the speech content is clearly understandable.

From these plots (Figure 1.a and 1.b), we can observe that the pattern of high quality clean speech audio can be easily distinguished from noisy low quality audio.

The clean speech audio has many high-amplitude peaks and low-amplitude flat regions, which align with the way humans speak. The peaks represent areas where words are spoken, while silence is characterized by near-zero amplitude. In contrast, noisy audio generally does not exhibit consistent patterns and can be very random depending on the kind of noise.

However, the majority of the content uploaded on our platforms lies between these two extremes, where cleanly spoken words are amalgamated with background noise. In such situations, the waveform resembles Figure 1.c: the temporal characteristics are contaminated by the background noise or music, curtailing the discriminative information contained in the temporal domain. This makes it very challenging to grade audio content by looking at temporal properties alone. Thus, simply analyzing the audio waveform across time is not sufficient for audio assessment.

Frequency Analysis

Let us look at the frequency domain analysis of these waveforms. The Fast Fourier Transform (FFT) is one of the most popular approaches for performing frequency analysis.

FFTs decompose a signal into its constituent spectral components, providing information about the frequency content of the audio signal.

These features are interesting because, in some scenarios, the signal and the noise can be distinguished on the basis of frequency components alone. For instance, the signal may be composed mostly of low-frequency components while the noise sits at higher frequencies. In such cases, a signal with strong high-frequency components can be classified as noisy, and it may even be possible to de-noise the audio using low-pass filtering.
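As a minimal sketch of this idea (in Python with NumPy; the 4 kHz cutoff is a hypothetical value chosen for illustration, not a tuned threshold), we can measure what fraction of a clip’s spectral energy lies above a chosen cutoff:

```python
import numpy as np

def high_frequency_energy_ratio(audio, sample_rate, cutoff_hz=4000):
    """Fraction of spectral energy above cutoff_hz (hypothetical cutoff)."""
    # Magnitude spectrum of the real-valued signal via the FFT.
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    energy = spectrum ** 2
    # Share of the total spectral energy located above the cutoff.
    return energy[freqs >= cutoff_hz].sum() / energy.sum()
```

A clip whose high-frequency energy ratio is unusually large could then be flagged as potentially noisy.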

Figure 2: FFT plots for High Quality Posts. Clean speech dialogues with intermittent background music tracks.
Figure 3: FFT plots for Low Quality Posts having muffled lyrics. 3(a) Recording by a mobile device playing a song snippet. 3(b) Low volume distant recording of a song played on the television
Figure 4: FFT plots for Low Quality Posts with high frequency components

In Figures 2, 3, and 4, we plot the audio waveforms and their respective FFT plots for high- and low-quality posts for contrast. These plots suggest that while the noisy low-quality posts do not contain very high frequency components, the noisy samples do have higher peaks in the higher frequency bands than the high-quality posts.

Are all the noise signals high frequency?

At this point, however, it is important to highlight that the frequency spectrum of a noise signal is characterized by its noise color.

While pink and brownian noise have more power at low frequencies, blue and violet noise have more power at high frequencies. White noise, on the other hand, has equal power spread across all frequencies.

Figure 5: Spectrum of the colors of noise (reference)
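To make the idea of noise color concrete, here is a small illustrative sketch (the spectral-shaping construction and the `alpha` parameterization are our simplification) that generates noise whose power spectral density is proportional to 1/f^alpha:

```python
import numpy as np

def colored_noise(n_samples, alpha, seed=0):
    """Noise with power spectral density ~ 1/f^alpha.

    alpha = 0 gives white, 1 pink, 2 brownian noise; negative values
    tilt the power towards high frequencies (blue/violet noise).
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                    # avoid dividing by zero at DC
    # Power ~ 1/f^alpha implies amplitude ~ f^(-alpha / 2).
    shaped = spectrum * freqs ** (-alpha / 2.0)
    return np.fft.irfft(shaped, n=n_samples)
```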

Thus, relying on the presence of high-frequency components for quality assessment cannot help in scenarios where there is considerable overlap between the signal and noise of different natures in the frequency domain.

Can we mark all samples with high-frequency content as low quality?

Another important consideration while grading audio content is the existence of interspersed segments of low quality and high quality in both kinds of posts. For example, in our data, we find it common to have audio clips that are noisy for a very brief duration while the rest of the audio is very high quality and entertaining. At times, this noise is introduced deliberately for impact, e.g., glass shattering. For such audios, it would be unfair to mark them as low-quality, right?

Figure 6: Low Quality audio with interspersed High Quality characteristics (Fan running in the background while the speaker is recording the speech audio. Excluding the segments where the speaker speaks loud enough, the fan sound can be clearly identified)
Figure 7: High Quality audio with interspersed Low Quality characteristics. (The content dialogue is first spoken, and after the punch line, a background music track of poor quality is played)

Figure 6 is a classic example of a low quality waveform with multiple interspersed regions resembling high quality audio waveforms. So is the case with the high quality example in Figure 7 due to the presence of a loud background music track at the end of the audio waveform.

However, comparing the FFT plots of such posts shows that global, audio-level spectral features are unable to capture fine-grained audio quality, as FFTs cannot model temporal variations. One possible way of handling this is to consider shorter segments of the waveform for analysis.

Temporal Modelling of Frequency

In order to perform fine-grained temporal modelling, short-term frequency features are extracted over overlapping or non-overlapping windows of the original audio. Short-term processing is particularly meaningful when dealing with time-varying signals like speech and music, where we can assume fixed properties within a finite short-term temporal block. This results in a sequence of hypotheses, which can be compounded into a single hypothesis for that particular audio sample.

Figure 8: Short-term processing. Process the signals with a sliding window approach.
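A minimal sketch of this sliding-window framing (frame and hop lengths are given in samples; the helper name is ours) could look like the following, after which any per-frame feature can be computed and aggregated:

```python
import numpy as np

def frame_signal(audio, frame_len, hop_len):
    """Slice a 1-D signal into (possibly overlapping) short-term frames."""
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop_len)
    return np.stack([audio[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```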

In the spectral domain, the short-term Fourier transform (STFT) is a popular technique used to analyse non-stationary signals.

STFT involves slicing the signal into smaller time intervals and taking the Fourier transform/FFT of every segment.

Since our use cases deal primarily with spoken audio, we use the popular overlap-and-shift paradigm for analysing speech signals, with a 25 ms window length and a 10 ms shift.

STFTs are commonly visualized using their log-magnitude spectra, 20·log10(|STFT(x)|), popularly known as the spectrogram. Spectrograms are two-dimensional time-frequency representations of signals, with a third dimension representing the intensity of the signal as a heatmap at any chosen time and frequency.
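As an illustrative sketch (using SciPy’s `stft`; the small epsilon floor is our addition to avoid taking the log of zero), the spectrogram under our 25 ms / 10 ms setup can be computed as follows:

```python
import numpy as np
from scipy import signal as sps

def log_spectrogram(audio, sample_rate):
    """Log-magnitude STFT with a 25 ms window and a 10 ms shift."""
    nperseg = int(0.025 * sample_rate)               # 25 ms window
    noverlap = nperseg - int(0.010 * sample_rate)    # 10 ms hop
    freqs, times, stft = sps.stft(audio, fs=sample_rate,
                                  nperseg=nperseg, noverlap=noverlap)
    # 20 * log10(|STFT|); the tiny floor avoids log(0) in silent bins.
    return freqs, times, 20 * np.log10(np.abs(stft) + 1e-10)
```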

Figure 9: Spectrogram of High Quality audio waveforms. 9(a) Dialogue between 3 people speaking one at a time distinctly with gaps in between. 9(b) Single person narrating a joke, spoken clearly
Figure 10: Spectrogram of Low Quality audio waveforms (Recording of a microphone output of a song with a lot of echo and reverberation and recording of a clip played on the television).

Spectrogram: The signal strength is visualized by varying the color/brightness of the spectrogram plots. These color-coded values indicate the amplitude of the spectrogram at a point in time and frequency. Purple falls at one extreme of the scale, indicating zero intensity, while bright yellow falls at the other extreme, indicating progressively stronger intensities.

From these diagrams, we can observe that the spectrograms of the audio samples are able to capture and isolate the regions with high- and low-frequency components. Thus, we choose the STFT as an appropriate representation for analysing the content that gets uploaded on our platform.

We classify each window of the audio as noisy/clean on the basis of its frequency components, and then aggregate these decisions over the complete audio.
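A hedged sketch of this window-level classification (reusing the `log_spectrogram` helper from the sketch above; the cutoff frequency and decision threshold are hypothetical values that would need tuning on labelled examples):

```python
import numpy as np

def fraction_noisy_windows(audio, sample_rate, cutoff_hz=4000, threshold=0.3):
    """Flag each STFT window as noisy/clean, then aggregate over the clip."""
    freqs, _, spec = log_spectrogram(audio, sample_rate)  # sketch from above
    power = 10 ** (spec / 10)               # back from dB to relative power
    high = power[freqs >= cutoff_hz].sum(axis=0)
    total = power.sum(axis=0)
    # A window is "noisy" when too much of its power sits above the cutoff;
    # both cutoff_hz and threshold are hypothetical and would need tuning.
    return float(((high / total) > threshold).mean())
```

The returned fraction of noisy windows can then serve as one signal in the overall quality grade.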

Amplitude Modelling

Now that we have a measure to model the temporal and spectral content of the signal, can we also take into account the amplitude characteristics, if any, of the noise? As discussed previously, some posts have consistent background noise coming from ambient sources, such as running equipment or a poor recording device. While some of these noise patterns can be separated at the frequency level, others merge with the signal. In such cases, we find that analysing the normalized energy of the post can be useful. The energy of a signal is defined as the square of the amplitude of the waveform, summed over a frame.

Figure 11: Energy of High Quality audio waveforms. 11(a) Clean speech with humorous narrative 11(b): Film song used as background for the post
Figure 12: Energy of Low Quality audio waveforms (Recording of songs with lyrics from a television output)

From Figure 12, one common observation is the existence of an additive constant in the amplitude of the signal. While high-quality posts (Figure 11) also have regions of zero amplitude, posts with an overpowering background noise maintain a consistent amplitude across their duration due to the presence of the noise.

Normalizing the energy by the energy of the frame with the least energy gives us column 3, where we can clearly see that noisy samples have much lower normalized energy. Aggregating the normalized energy across time can then tell us whether the signal is clear or has been muffled by the presence of background noise.
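A minimal sketch of this normalization (reusing the `frame_signal` helper from the short-term processing sketch; the epsilon guard is our addition):

```python
import numpy as np

def normalized_frame_energy(audio, sample_rate, frame_ms=25, hop_ms=10):
    """Per-frame energy normalized by the least-energy frame."""
    frame_len = int(frame_ms * sample_rate / 1000)
    hop_len = int(hop_ms * sample_rate / 1000)
    frames = frame_signal(audio, frame_len, hop_len)  # earlier framing sketch
    energy = (frames ** 2).sum(axis=1)   # energy: sum of squared amplitudes
    # Clean clips contain near-silent frames, so the minimum energy is tiny
    # and the normalized values are large; a constant noise floor raises the
    # minimum and pulls the normalized values down towards 1.
    return energy / max(energy.min(), 1e-12)
```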

Signal to Noise Ratio (SNR)

Another measure for expressing the quality of a signal is the signal-to-noise ratio (SNR). SNR is the ratio of the amplitude of the true underlying signal to that of the noise, measured here as the peak-to-peak height. A cut-off frequency, tuned in accordance with the frequency content of the signal at hand, is used to low-pass filter the input signal. The noise is then taken to be the absolute difference between the actual and the filtered signal. The peak-to-peak amplitudes of the filtered signal and the noise are calculated, and their ratio gives the SNR.
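An illustrative implementation of this estimate (using SciPy; the 4th-order Butterworth filter and the 4 kHz default cutoff are our assumptions, and the cutoff must be tuned per signal, as noted below):

```python
import numpy as np
from scipy import signal as sps

def peak_to_peak_snr(audio, sample_rate, cutoff_hz=4000):
    """Estimate SNR as the peak-to-peak ratio of filtered signal to noise."""
    # Low-pass filter the input; filtfilt applies the filter forward and
    # backward so the filtered signal stays phase-aligned with the input.
    b, a = sps.butter(4, cutoff_hz, btype="low", fs=sample_rate)
    filtered = sps.filtfilt(b, a, audio)
    # Noise: absolute difference between the actual and filtered signals.
    noise = np.abs(audio - filtered)
    peak_to_peak = lambda x: float(x.max() - x.min())
    return peak_to_peak(filtered) / max(peak_to_peak(noise), 1e-12)
```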

A higher SNR signifies that the audio is of higher quality, as the signal is stronger than the noise.

However, the biggest challenge with SNR is that the cut-off frequency needs to be tuned, and, as discussed previously, noise may very well lie within the frequency band of the signal.

Conclusion

From these analyses, we can see that noise does not conform to any fixed set of characteristics. There are various patterns in which noise manifests itself within the original signal.

Noise is indeed very noisy!

Thus, we need different mechanisms to detect these noise patterns. High-frequency noise can be detected using frequency analysis, while low-frequency noise that overlaps with the signal can be detected using energy, provided it is consistent across the duration of the signal. The success of a robust system therefore lies in using a combination of all these techniques to arrive at a final conclusion. In the next part of this blog, we will discuss more techniques for audio analysis, such as MFCCs, THD, and PESQ, and their combination, which will further improve the efficacy of audio quality assessment.
