A Deep Dive into the Wav2Vec2 Application on Turkish Broadcasts: Changing the Landscape of Audio Classification

All the details of our research on Classification of Turkish Broadcast News and Advertising Jingles with Wav2Vec2

Ferhat Demirkiran
Kodiks
Nov 8, 2023


We all know that if we hear the familiar sounds of a jingle on TV, we can often tell if it’s for the news or a commercial without even looking. For instance, if you hear a particular tune, you might recognize it as the lead-in to a news broadcast, while a different tune might signal the start of a commercial break.

To illustrate, imagine the following scenario:

You are at home, working in your kitchen with the television playing in the background. You’re focused on your task, not actively watching the screen. Suddenly, a sequence of bold, structured notes cuts through the noise: it’s the jingle that signals the evening news. Without a glance, you know it’s 7 p.m., the time for the daily news round-up. The consistency and timing of the jingle, which you’ve come to recognize over time, inform you of this without any visual cue. To give you a better sense, listen to these two examples of news jingles.

[Audio: two example news jingles]

Later, as you’re chopping vegetables, the room fills with a lively melody that’s different from before; it’s playful and rhythmic. You don’t need to look up to know that the news segment has ended and the commercial break has begun. For a clearer illustration, listen to these two examples of commercial break jingles.

[Audio: two example commercial break jingles]

Drawing from this understanding, it’s reasonable to propose that if humans can distinguish between news and advertisement jingles, so too could artificial intelligence (AI). The key lies in identifying and extracting the distinctive features that characterize each type of jingle.

Now, one might wonder why such differentiation is crucial. From a media perspective, the ability to distinguish these audio signals is essential for efficient media management and the improvement of user experiences. Television broadcasts today offer a wide spectrum of content, from news and commercials to series and movies. The extraction of Electronic Program Guide (EPG) data plays a vital role, as it helps classify programs by genre, title, and airing times. This not only makes it easier for viewers to locate their shows of interest but also enhances the way broadcasters manage, archive, and organize their programming.

AI can distinguish these jingles by analyzing the audio signals themselves. This is done through feature extraction, where the unique characteristics of the audio data are captured and then processed by machine learning algorithms.

Here’s how it works:

  • Feature Extraction: Audio processing techniques can be leveraged to extract features from the broadcast’s audio track.
  • Pattern Recognition: These features are then used to identify patterns. For example, news jingles may have a certain pattern of tones and speech, while commercial break jingles might feature varied pitch and music patterns.
  • Machine Learning Models: The extracted features are fed into machine learning models, which have been trained on labeled data to recognize and categorize different audio signatures.
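To make these three steps concrete, here is a minimal sketch in Python, assuming librosa and scikit-learn are available. The file names, the 13 MFCC coefficients, and the choice of a random forest are illustrative placeholders rather than the exact setup used in our study.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(path, sr=16000):
    """Step 1: extract a fixed-length feature vector from one audio clip."""
    y, _ = librosa.load(path, sr=sr)                 # resample to 16 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)                         # average over time frames

# Steps 2 and 3: the extracted vectors become the patterns a model learns from.
paths = ["news_jingle_01.wav", "ad_jingle_01.wav"]   # placeholder file names
X = np.stack([extract_features(p) for p in paths])
y = np.array([0, 1])                                 # 0 = news, 1 = advertisement
clf = RandomForestClassifier(random_state=42).fit(X, y)
print(clf.predict(X))
```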

Dataset

The study utilizes a dataset comprising samples of commercial break jingles and lead-in news jingles collected from 57 varied television channels in Turkey. The overall dataset includes audio signals of 72 commercial break jingles and 110 news jingles, varying in length. The sample rate of each audio file is set to 16000 Hz.

The distribution of audio lengths for a collection of advertisement and news jingles is given below:

Distribution of the audio file lengths (x-axis: length of audio in seconds; y-axis: total count).

From this histogram, we observe that a majority of the audio samples are under 10 seconds in length, with the frequency gradually decreasing as the length of the audio increases.

Data Augmentation

Data augmentation plays a vital role in enhancing machine learning and deep learning models. It bolsters the dataset used for training by generating new examples through various alterations, which is essential when dealing with a limited quantity of data.

In the field of audio analysis, we leverage augmentation techniques such as Random Gain, Noise Addition, and Pitch Shifting to enrich our dataset, since these techniques have been shown to improve model performance on audio data across various tasks [1][2]. Random Gain randomly adjusts the volume of the audio clips, while Noise Addition incorporates varying levels of background noise. Pitch Shifting, on the other hand, changes the pitch of the audio without altering its duration. These techniques not only introduce variation but also simulate real-world scenarios where audio data may come with different noise levels, quality, and pitch variations.
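As a rough illustration of these three techniques, here is a sketch using librosa and NumPy. The gain range, signal-to-noise ratio, and semitone shift below are illustrative values, not the parameters used in our augmentation pipeline.

```python
import librosa
import numpy as np

def random_gain(y, low_db=-6.0, high_db=6.0):
    """Randomly scale the volume of a clip within a dB range."""
    gain_db = np.random.uniform(low_db, high_db)
    return y * (10.0 ** (gain_db / 20.0))

def add_noise(y, snr_db=20.0):
    """Mix in white noise at a chosen signal-to-noise ratio."""
    noise = np.random.randn(len(y))
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return y + noise * np.sqrt(noise_power / np.mean(noise ** 2))

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing the duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("jingle.wav", sr=16000)   # placeholder file name
augmented = [random_gain(y), add_noise(y), pitch_shift(y, sr)]
```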

For our project, we began with 182 original audio clips and assigned 70% of these to training. With the help of data augmentation, we expanded our initial set of 127 training examples to a collection of 508 samples.

Feature Extraction

In the realm of audio signal processing, there is a rich tapestry of features that can be extracted to delve into the sonic signatures of different sounds.

Here are the features used in our study:

  • MFCCs: Mel-Frequency Cepstral Coefficients represent the spectral envelope of an audio signal by capturing the frequency bands that are most relevant to human auditory perception.
  • Root Mean Square (RMS): RMS is a measure of the average power or energy in an audio signal. It represents the overall amplitude level of the signal.
  • Zero Crossing Rate (ZCR): ZCR measures the rate at which an audio signal changes its sign (from positive to negative or vice versa). It provides information about the frequency of signal changes, which can indicate the signal’s timbre or noisiness.
  • Spectral Bandwidth: Spectral Bandwidth represents the range of frequencies in an audio signal. The average value gives an indication of the spread of frequencies in the signal.
  • Spectral Centroid: The Spectral Centroid is the center of gravity of the frequency distribution in an audio signal. It provides information about the average frequency content or tonal center of the signal.
  • Spectrogram average: The time-averaged spectrogram summarizes the overall frequency distribution of the signal.
  • Spectrogram: The Spectrogram is like a topographical map for sound, showing how frequencies unfold over time. In deep learning, spectrograms often serve as input data, providing a rich, visual-like representation that neural networks can learn from.
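For reference, all of these features can be computed with librosa. The sketch below assumes 16 kHz mono input and summarizes each time-varying feature by its mean, which is one simple aggregation choice rather than the exact one used in our experiments.

```python
import librosa
import numpy as np

y, sr = librosa.load("jingle.wav", sr=16000)   # placeholder file name

mfcc      = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
rms       = librosa.feature.rms(y=y).mean()
zcr       = librosa.feature.zero_crossing_rate(y).mean()
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
centroid  = librosa.feature.spectral_centroid(y=y, sr=sr).mean()

# Mel spectrogram: kept 2-D for CNN input, averaged over time for a per-band summary.
mel_spec     = librosa.feature.melspectrogram(y=y, sr=sr)
mel_spec_db  = librosa.power_to_db(mel_spec)
spec_average = mel_spec_db.mean(axis=1)

feature_vector = np.concatenate([mfcc, [rms, zcr, bandwidth, centroid], spec_average])
```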

In our study, we did not incorporate feature selection methods because the number of features extracted was sufficiently manageable to allow for an extensive exploration of various combinations. This approach enabled us to directly experiment with different sets of features and empirically select the combination that yielded the best performance for our models.

Experiment and Results

In the initial phase of our study, we applied traditional machine learning algorithms, assessing their performance using a variety of feature sets. These features included Mel-Frequency Cepstral Coefficients (MFCC), Root Mean Square (RMS), Zero Crossing Rate (ZCR), Spectral Bandwidth, Spectral Centroid, and averages derived from Spectrograms.

The datasets were divided into three parts: training, validation, and testing. Since the dataset size was limited, 30% of the data was allocated for testing. The data-splitting process was performed in a stratified way to preserve the class distribution. A stratified 5-fold strategy was applied to the training data for each dataset, and within each iteration, 20% of the training data was set aside for validation. This approach ensured that each fold had an equal distribution of news and advertisement jingles, and every sample from the dataset had the opportunity to appear in both the training and validation data.
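A sketch of this splitting scheme with scikit-learn is shown below. The feature matrix and labels are synthetic placeholders, sized to match the 110 news and 72 advertisement jingles, and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(182, 20))              # placeholder feature matrix
labels = np.array([0] * 110 + [1] * 72)     # 0 = news (110), 1 = advertisement (72)

# Stratified 70/30 hold-out split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, stratify=labels, random_state=42
)

# Stratified 5-fold cross-validation on the training portion: each fold holds
# out ~20% of the training data for validation while preserving the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    X_tr, X_val = X_train[tr_idx], X_train[val_idx]
    y_tr, y_val = y_train[tr_idx], y_train[val_idx]
```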

The evaluation metrics used in the study were calculated by taking the mean and standard deviation of the 5 validation results. This provided an overall measure of performance for each algorithm and dataset combination.

As a simple baseline, a dummy classifier was employed using the “most frequent” strategy. Considering the imbalance in the dataset, the micro F1 score is preferred. The comparison results of the base models on each dataset with the best combination of features are presented below:

Base model comparison results for the original dataset.
Base model comparison results for the augmented dataset.

In the provided results, we observe that traditional machine-learning models demonstrate learning capabilities when benchmarked against the dummy model. The models include XGBoost Classifier, Logistic Regression, Decision Tree, K Neighbors Classifier, and Random Forest, each showing varying degrees of accuracy and F1 scores on both validation and test sets. These results indicate that even without complex architectures, these models can discern patterns within the dataset.
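Continuing from the split sketch above, the baseline comparison can be set up along these lines. Only two of the models are shown, and the micro-averaged F1 mirrors the metric named earlier rather than reproducing the exact evaluation script.

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# "Most frequent" dummy baseline versus one of the traditional models.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

for name, model in [("Dummy", dummy), ("Random Forest", rf)]:
    score = f1_score(y_val, model.predict(X_val), average="micro")
    print(f"{name}: micro F1 = {score:.3f}")
```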

Given the substantial learning shown by these traditional models, we anticipated that convolutional neural networks (CNN) could further enhance performance by leveraging spectrogram data. Spectrograms convert audio into a visual format that encapsulates time, frequency, and intensity, presenting a form that CNNs, which excel at extracting patterns from visual inputs, can utilize effectively.
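A small convolutional network over mel-spectrogram inputs could look like the sketch below, assuming PyTorch. The layer sizes and the adaptive pooling are illustrative choices, not the architecture evaluated in the study.

```python
import torch
import torch.nn as nn

class JingleCNN(nn.Module):
    """Toy CNN that maps a (1, n_mels, time) spectrogram to 2 class logits."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)        # handles variable-length spectrograms
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                          # x: (batch, 1, n_mels, time)
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

logits = JingleCNN()(torch.randn(4, 1, 128, 300))  # e.g. 128 mel bands, 300 frames
```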

In the documentary produced to commemorate the 100th anniversary of the Turkish Republic on October 29, 2023, a Mel spectrogram, which is a type of spectrogram where the frequency scale has been converted to the Mel scale, was utilized. This technology translates audio into a detailed visual format, capturing the essence of sound. The model, trained on this spectrogram, analyzed the musical tastes of Mustafa Kemal Atatürk, the revered founder of the Republic. It was designed to predict which modern-day songs Atatürk might favor. This approach not only honored Atatürk’s well-documented love for music but also highlighted a fusion of historical interests with contemporary technology for the centenary celebrations.

On the other hand, in the context of our study, where the goal is to differentiate between jingles on Turkish TV channels, wav2vec 2.0 [3] offers a promising approach. Wav2vec 2.0 is a state-of-the-art self-supervised learning framework developed by Facebook AI. Unlike traditional pipelines that rely on hand-crafted features and labeled data, it learns representations directly from raw audio waveforms: it is pre-trained on large volumes of unlabeled audio and then fine-tuned on smaller amounts of labeled data. Its ability to learn from the raw audio can potentially lead to more accurate classification results.

By incorporating wav2vec 2.0 into our study, we aim to leverage its advanced self-supervised learning capabilities to enhance our audio classification framework. This approach allows us to compare its performance with that of CNNs and traditional machine learning models, providing a comprehensive evaluation of different methodologies in audio signal classification.
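As a rough sketch of what this looks like in practice, the snippet below loads a pre-trained checkpoint with Hugging Face transformers, attaches a two-class head, and takes a single illustrative fine-tuning step. The "facebook/wav2vec2-base" checkpoint, file name, label mapping, and learning rate are assumptions, not necessarily the configuration used in our experiments.

```python
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2   # 0 = news, 1 = advertisement (placeholder mapping)
)

# wav2vec 2.0 works on the raw 16 kHz waveform, not on hand-crafted features.
audio, sr = librosa.load("jingle.wav", sr=16000)
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# One illustrative fine-tuning step; real training loops over the labeled set.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**inputs, labels=torch.tensor([0])).loss
loss.backward()
optimizer.step()

# Inference on the same clip.
model.eval()
with torch.no_grad():
    predicted = model(**inputs).logits.argmax(dim=-1).item()
```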

Confusion matrices offer an illustrative overview of each model’s capability to differentiate between audio clips of news and advertisements. Hence, we utilized these matrices to assess the comparative performance of the CNN and Wav2Vec2 models against the top-performing baseline models, which are the XGB Classifier and the Random Forest.

Confusion matrix results of best baseline models, CNN, and Wav2Vec2.

By comparing these results, we can infer that while traditional machine learning models like the XGB Classifier and Random Forest achieve a reasonable balance between recognizing news and advertisements, the CNN tends to be biased towards classifying samples as news. In contrast, Wav2Vec 2.0 exhibits a superior balance, with minimal misclassifications in both categories. This comparison highlights the potential of self-supervised learning models like Wav2Vec 2.0 in handling complex audio classification tasks.
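For completeness, a confusion matrix for any of the fitted models can be produced along these lines with scikit-learn (matplotlib is needed for the plot); this reuses the random forest and the held-out test split from the earlier sketches.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)   # rows: true news/advertisement, columns: predicted

ConfusionMatrixDisplay(cm, display_labels=["news", "advertisement"]).plot()
```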

Time Interval Analysis of Audio Signals

Inspecting the confusion matrix reveals that the wav2vec 2.0 model surpassed its counterparts in classification capability. Consequently, further assessments of wav2vec 2.0’s performance were undertaken by testing it over various time and positional intervals to determine the optimal time range that best represents a jingle.

Each audio file in the dataset is between 3 and 39 seconds long; the files comprise both news and advertisement jingles, as shown in the Dataset section.

A variety of time and segment combinations were explored rather than employing full-length news or advertisement audio tracks. Interval lengths of 2, 4, 6, 8, 10, and 12 seconds were trialed, with each interval focusing on different sections — beginning, middle, and end — of the audio samples. Take, for example, a 13-second ad jingle which was segmented into its initial, middle, and final 8-second parts to assess their representativeness of the jingle as a whole. If an interval’s duration surpassed that of the original jingle, then the jingle was used in its entirety without any cuts.
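The segmentation itself is straightforward; a sketch is shown below, where the file name is a placeholder and each returned segment would be fed to the wav2vec 2.0 classifier.

```python
import librosa

def segments(y, sr, interval_s):
    """Return the first, middle, and last interval_s-second slices of a clip."""
    n = int(interval_s * sr)
    if len(y) <= n:                       # clip shorter than the interval: use it whole
        return {"first": y, "middle": y, "last": y}
    mid_start = (len(y) - n) // 2
    return {
        "first": y[:n],
        "middle": y[mid_start:mid_start + n],
        "last": y[-n:],
    }

y, sr = librosa.load("ad_jingle.wav", sr=16000)
for length in (2, 4, 6, 8, 10, 12):
    parts = segments(y, sr, length)       # each part is classified separately
```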

Accuracy and F1-score results for the first, middle, and last interval positions, respectively, are given below:

Time Interval Analysis Results.

It can be concluded that the model achieves higher accuracy and F1 scores for longer time intervals than for shorter ones. Additionally, the 10-second time interval consistently demonstrates the best performance across all three sets of results. The first (beginning) 10-second interval yields the highest accuracy (98.18%) and F1-score (97.67%) among all the intervals considered in the analysis. The corresponding confusion matrix is given below:

Confusion Matrix of Best Time Interval

Based on the confusion matrix, there was only one misclassification where a sample of news was mistakenly classified as an advertisement.

Our findings indicate that the Wav2Vec 2.0 model, which leverages pre-training, surpassed its counterparts with an impressive accuracy of 96.36%. Additionally, our study pinpointed specific time frames, particularly the initial 10 seconds of the audio clips, as the most indicative intervals for precise classification, achieving a 98.18% success rate within this time frame.

Please follow the link provided below to access our conference paper.

For an in-depth look at our projects and innovative engineering solutions, please visit Kodiks Bilişim’s LinkedIn page.

References

  1. L. Nanni, G. Maguolo, and M. Paci, “Data augmentation approaches for improving animal audio classification,” Ecological Informatics, vol. 57, p. 101084, 2020.
  2. J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
  3. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
