Accelerating SpeechBrain emotion recognition using OpenVINO™ and NNCF

Author: Pradeep Sakhamoori

Published in OpenVINO-toolkit, Jun 25, 2024

1. Introduction:

1.1. SpeechBrain:

SpeechBrain is an open-source toolkit designed for various speech-processing tasks such as speech recognition, speaker identification, and emotion recognition. It offers pre-trained models and customizable tools, making it ideal for research and development. In this blog, we walk through preparing the SpeechBrain Emotion Recognition (ER) model, a wav2vec2 (base) model fine-tuned on IEMOCAP training data, with the Intel® Distribution of OpenVINO™ Toolkit. We then quantize the resulting OpenVINO IR model with the Neural Network Compression Framework (NNCF) to accelerate the model in classifying “happy,” “neutral,” “angry,” and “sad” emotions.

1.2. Intel® OpenVINO™ Toolkit:

OpenVINO™ toolkit (Open Visual Inference and Neural Network Optimization) is aimed at optimizing and deploying deep learning models on hardware platforms. It converts models into an intermediate representation (IR) for efficient execution on CPUs, GPUs, VPUs, and other accelerators, and supports multiple deep-learning frameworks. See here to learn more about OpenVINO LLM-specific APIs and Enhanced serving capabilities.

1.3. Neural Network Compression Framework (NNCF):

NNCF is an open-source library that provides advanced compression algorithms for neural networks. It includes techniques like quantization, pruning, and sparsity to reduce model size and improve inference speed while maintaining accuracy, which is essential for deploying models on resource-constrained devices.

Quantization reduces the precision of model parameters, decreasing model size and enhancing inference speed with minimal accuracy loss. This optimization is crucial for deploying models on edge devices with limited computational resources, significantly improving performance and efficiency.
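
To make the idea concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is an illustration only, not NNCF's implementation: the function names and the sample weights are made up for this example. Each FP32 value (4 bytes) is replaced by an 8-bit integer code (1 byte) plus one shared scale, and the round trip loses at most half a quantization step per value.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the signed 8-bit range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.05, 0.4, -0.63]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

The rounding error is bounded by scale / 2, which is why a well-chosen scale keeps the accuracy loss small.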

2. SpeechBrain Model Optimization:

In this section, we provide the details on how to perform SpeechBrain wav2vec2 emotion recognition model optimization.

2.1. Installation and setup:

Before installing and setting up, ensure your system meets the minimum requirements.

Set up a Python virtual environment using either Python venv or conda. Here, we use Python venv.

For installation instructions, click here.

Steps to create and activate Python virtual environment:

python3 -m venv sb
source ./sb/bin/activate

2.2. Install SpeechBrain, OpenVINO™ , NNCF, and dependencies:

# Install openvino and nncf
pip install "openvino>=2024.1.0"
pip install nncf

Below are installations for speechbrain, torch (for CPU), and other dependencies:

pip install "speechbrain>=1.0.0" --extra-index-url https://download.pytorch.org/whl/cpu

pip install --upgrade --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cpu

pip install "transformers>=4.30.0" "huggingface_hub>=0.8.0" "SoundFile"

2.3. Convert the model to OpenVINO format:

Below is a code snippet illustrating the conversion of the SpeechBrain PyTorch model to OpenVINO IR format using the openvino.convert_model Python API.

import openvino as ov
ov_fp32_model = ov.convert_model(pt_model, example_input=input_tensor)

pt_model: SpeechBrain emotion recognition wav2vec2 PyTorch model

input_tensor: pre-processed speech sample (the wav2vec2 model was trained on audio sampled at 16 kHz)

Refer to Appendix A for the end-to-end code of running model conversion, quantization, and inference with OpenVINO runtime.

2.4. [Optional] Saving OpenVINO™ (IR) model files:

Below is a sample usage of Python API on how to save the OpenVINO IR model file to disk.

import openvino as ov
ov.save_model(ov_fp32_model, "<dir_path>/sb_emotion_recognition_ov_fp32.xml")
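
An OpenVINO IR model on disk is a pair of files: the .xml file describing the topology and a .bin file with the weights, saved next to each other. A pipeline that reuses saved models can check for both before deciding whether to reconvert. The helper below is a hypothetical sketch of such a check (the function name is ours, not part of OpenVINO):

```python
from pathlib import Path


def ir_files_present(xml_path):
    """Return True if both the .xml topology and the .bin weights file exist."""
    xml_file = Path(xml_path)
    bin_file = xml_file.with_suffix(".bin")
    return xml_file.is_file() and bin_file.is_file()


# Example: decide whether conversion is needed (path taken from the logs in section 5.1)
if not ir_files_present("./openvino_model/fp32/speechbrain_emotion_recog.xml"):
    print("[INFO] OV model file not found")
```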

3. Model quantization with NNCF:

3.1. Dataset:

The RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset is a collection of 7,356 audio recordings with acted emotional content. It contains eight emotional categories: calm, neutral, happy, sad, angry, fearful, surprised, and disgusted, expressed at two intensity levels (regular and strong). The dataset is gender-balanced, with 24 actors vocalizing two lexically matched statements in a neutral North American accent.

We used a publicly available dataset published here.

3.2. Data Pre-processing and Calibration step:

The code snippet below shows the data pre-processing step for input samples: it loads each audio file with torchaudio and normalizes it.

signal, sr = torchaudio.load(str(file_path), channels_first=False)
norm_audio = classifier.audio_normalizer(signal, sr)

We used the input samples with “03-01-01-”, “03-01-03-”, “03-01-04-”, and “03-01-05-” tags from the various actor folders in the dataset for the NNCF calibration step, covering samples from the categories below.

audio-only(03)
speech(01)
neutral(01)/happy(03)/sad(04)/angry(05)
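
The tag-based selection can be sketched as a small filter over file names. RAVDESS file names consist of seven two-digit fields separated by hyphens, the first three being modality, vocal channel, and emotion. The function name and the sample file names below are illustrative assumptions, not taken from the pipeline script.

```python
from pathlib import Path

# Emotion codes used for calibration (third field of the file name)
CALIBRATION_EMOTIONS = {"01": "neutral", "03": "happy", "04": "sad", "05": "angry"}


def is_calibration_sample(file_name):
    """True for audio-only (03) speech (01) clips of the four target emotions."""
    fields = Path(file_name).stem.split("-")
    if len(fields) != 7:
        return False
    modality, vocal_channel, emotion = fields[:3]
    return (modality == "03" and vocal_channel == "01"
            and emotion in CALIBRATION_EMOTIONS)


files = [
    "03-01-01-01-01-01-01.wav",  # neutral speech: kept
    "03-01-05-02-02-01-12.wav",  # angry speech: kept
    "03-01-06-01-01-01-03.wav",  # fearful speech: skipped
    "03-02-03-01-01-01-07.wav",  # happy song: skipped
]
selected = [f for f in files if is_calibration_sample(f)]
print(selected)
```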

Below is the code snippet of the data calibration step with nncf API:

import functools
import nncf
from torch.utils.data import DataLoader

dataset = AudioDataset(wav_files)
collate_fn = functools.partial(transform_fn, dir_path=dataset_dir)
calibration_loader = DataLoader(dataset, batch_size=batch_size,
                                shuffle=False, collate_fn=collate_fn)
calibration_dataset = nncf.Dataset(calibration_loader)

and the transform function:

def transform_fn(data_items, dir_path):
    # Load and normalize every audio file in the batch
    norm_audios = []
    for file_name in data_items:
        file_path = os.path.join(dir_path, file_name)
        signal, sr = torchaudio.load(str(file_path), channels_first=False)
        norm_audio = classifier.audio_normalizer(signal, sr)
        norm_audio = norm_audio.unsqueeze(0)  # add a batch dimension
        norm_audios.append(norm_audio)
    return norm_audios
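
The AudioDataset class used above is not shown in this post. A plausible minimal version, sketched here as an assumption, is a map-style dataset that only returns file names and leaves the audio loading to the collate function. To keep the sketch runnable without torch, a small generator stands in for DataLoader and a placeholder stands in for transform_fn; in real code the class would typically subclass torch.utils.data.Dataset.

```python
import functools


class AudioDataset:
    """Map-style dataset that yields file names; audio is loaded in collate_fn."""

    def __init__(self, wav_files):
        self.wav_files = list(wav_files)

    def __len__(self):
        return len(self.wav_files)

    def __getitem__(self, idx):
        return self.wav_files[idx]


def iter_batches(dataset, batch_size, collate_fn):
    """Stand-in for DataLoader(shuffle=False): group items in order, then collate."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield collate_fn([dataset[i] for i in range(start, stop)])


def fake_transform_fn(data_items, dir_path):
    """Placeholder for transform_fn: builds paths instead of loading audio."""
    return [f"{dir_path}/{name}" for name in data_items]


dataset = AudioDataset(["a.wav", "b.wav", "c.wav"])
collate_fn = functools.partial(fake_transform_fn, dir_path="./samples/nncf")
for batch in iter_batches(dataset, batch_size=2, collate_fn=collate_fn):
    print(batch)
```

Binding dir_path with functools.partial is what lets the collate function keep the single-argument signature that the loader expects.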

The NNCF pipeline allows for efficient model compression and quantization, which can improve inference performance and reduce memory footprint while maintaining accuracy. The RAVDESS samples serve as calibration data: NNCF runs them through the model to collect activation statistics, which it then uses to choose the INT8 quantization parameters for the emotion recognition model.
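
Why calibration data matters can be shown with a deliberately simplified sketch. It tracks the minimum and maximum value seen across calibration batches and derives an 8-bit scale and zero point from that range. NNCF's actual statistics collection and range estimation are more sophisticated; the function names and numbers here are invented for illustration.

```python
def collect_range(calibration_batches):
    """Track the min and max value seen across all calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in calibration_batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def affine_params(lo, hi, num_levels=256):
    """Derive an unsigned 8-bit scale and zero point (assumes hi > lo)."""
    scale = (hi - lo) / (num_levels - 1)
    zero_point = round(-lo / scale)
    return scale, zero_point


# Made-up activation values from three calibration batches
batches = [[-0.5, 0.1, 0.9], [-1.0, 0.3, 2.0], [0.0, 1.5, -0.2]]
lo, hi = collect_range(batches)
scale, zero_point = affine_params(lo, hi)
print(lo, hi, zero_point)
```

If the calibration samples never exercise the values seen at inference time, the estimated range is too narrow and real inputs get clipped, which is one reason representative samples matter.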

3.3. NNCF Quantization step FP32 to INT8:

NNCF provides a suite of post-training and training-time algorithms for neural network inference optimization in OpenVINO™ with minimal accuracy drop.

NNCF is designed to work with models from PyTorch, TensorFlow, ONNX, and OpenVINO™.
NNCF provides samples that demonstrate the usage of compression algorithms for different use cases and models. See the compression results achievable with the NNCF-powered samples on the Model Zoo page.

With the calibration dataset from step 3.2, we run the nncf quantize step on the OpenVINO FP32 model generated in step 2.3:

quantized_model = nncf.quantize(ov_fp32_model, calibration_dataset)

4. Running model inference with OpenVINO Runtime:

OpenVINO Runtime uses a plugin architecture. Its plugins are software components that contain complete implementation for inference on a particular Intel® hardware device: CPU, GPU, NPU, etc. Each plugin implements the unified API and provides additional hardware-specific APIs for configuring devices or API interoperability between OpenVINO Runtime and the underlying plugin backend.

[Figure: typical workflow for deploying a trained deep-learning model]

After converting the model to OpenVINO format, compile the converted model for your target device and run inference. For details on Inference Devices and Modes, see optimize-inference.

Below is a code snippet on how to compile the OpenVINO model for a given target device (default set to CPU) and set of optimization parameters:

core = ov.Core()
# Pass the target device explicitly; the config dict carries runtime properties
opts = {"PERFORMANCE_HINT": "LATENCY"}
compiled_model = core.compile_model(ov_model, device_name=device, config=opts)
output = compiled_model.outputs[0]

and here is how we run the inference step with input audio sample:

output_t = compiled_model(inp_sample_audio)[output]
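
Turning the raw output into an emotion label can be sketched as follows, assuming output_t holds one score per class. The label order below is an assumption for illustration only; the real mapping comes from the label encoder shipped with the SpeechBrain model, and the scores are made up.

```python
import math

# Assumed label order, for illustration only
LABELS = ["neu", "ang", "hap", "sad"]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(scores, labels=LABELS):
    """Return the highest-probability label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


label, prob = top_label([0.2, 3.1, -0.4, 0.7])  # made-up scores
print(label, round(prob, 3))
```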

5. Running model conversion and quantization pipeline script:

We use a script that converts the model and compares the performance of the OpenVINO™ FP32 and INT8 SpeechBrain emotion recognition models.

Refer to Appendix A for speechbrain_ov_pipeline.py.

Usage

$ python speechbrain_ov_pipeline.py -h

usage: speechbrain_ov_pipeline.py [-h] [-i SPEECH_SAMPLES_DIR] 
[-ovfp32 OV_FP32_MODEL] [-ovint8 OV_INT8_MODEL] [-q NNCF_QUANTIZE] 
[-ds DATASET_DIR] [-ov SAVE_OV_MODELS] [-d DEVICE] [-b BATCH_SIZE] 
[-w WARMUP_TIME]

Script to run speechbrain emotion recognition

options:
-h, --help show this help message and exit
-i SPEECH_SAMPLES_DIR, --speech_samples_dir SPEECH_SAMPLES_DIR
            Path to the speech samples directory
-ovfp32 OV_FP32_MODEL, --ov_fp32_model OV_FP32_MODEL
            Path to emotion recognition OpenVINO FP32 model
-ovint8 OV_INT8_MODEL, --ov_int8_model OV_INT8_MODEL
            Path to emotion recognition OpenVINO INT8 model
-q NNCF_QUANTIZE, --nncf_quantize NNCF_QUANTIZE
            Perform nncf post-training quantization
-ds DATASET_DIR, --dataset_dir DATASET_DIR
            Speech samples dataset for nncf calibration
-ov SAVE_OV_MODELS, --save_ov_models SAVE_OV_MODELS
            Flag to save OpenVINO IR models
-d DEVICE, --device DEVICE
            OpenVINO target inference device
-b BATCH_SIZE, --batch_size BATCH_SIZE
            Input batch size
-w WARMUP_TIME, --warmup_time WARMUP_TIME
            Inference warmup time
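
For readers without the script at hand, an argparse parser consistent with this help text might look like the sketch below. The option names and help strings come from the output above; the value types and defaults are assumptions, since the help text does not show them.

```python
import argparse


def build_parser():
    """Parser mirroring the help text above; types and defaults are assumed."""
    p = argparse.ArgumentParser(
        prog="speechbrain_ov_pipeline.py",
        description="Script to run speechbrain emotion recognition",
    )
    p.add_argument("-i", "--speech_samples_dir",
                   help="Path to the speech samples directory")
    p.add_argument("-ovfp32", "--ov_fp32_model",
                   help="Path to emotion recognition OpenVINO FP32 model")
    p.add_argument("-ovint8", "--ov_int8_model",
                   help="Path to emotion recognition OpenVINO INT8 model")
    p.add_argument("-q", "--nncf_quantize", type=int, default=0,
                   help="Perform nncf post-training quantization")
    p.add_argument("-ds", "--dataset_dir",
                   help="Speech samples dataset for nncf calibration")
    p.add_argument("-ov", "--save_ov_models", type=int, default=0,
                   help="Flag to save OpenVINO IR models")
    p.add_argument("-d", "--device", default="CPU",
                   help="OpenVINO target inference device")
    p.add_argument("-b", "--batch_size", type=int, default=1,
                   help="Input batch size")
    p.add_argument("-w", "--warmup_time", type=float, default=0.0,
                   help="Inference warmup time")
    return p


args = build_parser().parse_args(
    ["-i", "./samples/test/", "-q", "1", "-ds", "./samples/nncf/", "-ov", "1"]
)
print(args.device, args.nncf_quantize, args.save_ov_models)
```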

Before running the script, we need to prepare the “dataset” and “test” samples directory.

Step 1: Prepare the “dataset” directory:

As detailed in section 3.2, we used samples that cover labels relating to “neutral,” “happy,” “sad,” and “angry” from RAVDESS emotion speech audio hosted here on Kaggle. These samples are used for the nncf quantization step.

These samples are hosted under the “./samples/nncf” directory path.

Step 2: Prepare the “test” directory:

We used a couple of samples (hap.wav, anger.wav) from here. These samples are used for openvino inference performance analysis.

These samples are hosted under the “./samples/test” directory path.

Step 3: Running the speechbrain_ov_pipeline.py script:

The command below runs speechbrain_ov_pipeline.py to perform model conversion, quantization, and model inference with the device set to CPU (the default), saving both the FP32 and INT8 models to disk. For details on inference devices, see optimize-inference.

python speechbrain_ov_pipeline.py \
-i ./samples/test/ \
-q 1 \
-ds ./samples/nncf/ \
-d CPU \
-b 1 \
-ov 1

The command below runs the model inference pipeline (device set to CPU) on the “test” samples, using the previously saved FP32 and INT8 models.

python speechbrain_ov_pipeline.py \
-i ./samples/test \
-ovfp32 ./<path_ov_fp32_model_dir>/speechbrain_emotion_recog.xml \
-q 0 \
-ovint8 ./<path_ov_int8_model_dir>/speechbrain_emotion_recog.xml \
-d CPU -b 1

5.1. Results summary:

Below are sample output logs from running speechbrain_ov_pipeline.py for model conversion, NNCF quantization, and OpenVINO model inference with the test samples.

$ python speechbrain_ov_pipeline.py -i ./samples/test/ -q 1 -ds ./samples/nncf/ -d CPU -b 1 -ov 1
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, openvino
Some weights of Wav2Vec2Model were not initialized from the model 
checkpoint at facebook/wav2vec2-base and are newly initialized: 
['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0',
 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']

You should probably TRAIN this model on a down-stream task to be able to 
use it for predictions and inference.
speechbrain.lobes.models.huggingface_transformers.huggingface
 - Wav2Vec2Model is frozen.

[INFO] OV model file not found
[INFO] Starting model conversion process
input_tensor.shape = torch.Size([1, 100533])
[SUCCESS] Model conversion process completed
[INFO] OpenVINO IR saved at: ./openvino_model/fp32/speechbrain_emotion_recog.xml
[INFO] Started NNCF dataset calibration step
[INFO] NNCF model quantization (INT8) initiated
Statistics collection ━━━━━━━━━━━━━━━━━━━━ 100% 164/164 • 0:00:36 • 0:00:00
Applying Fast Bias correction ━━━━━━━━━━━━━━━━━ 100% 73/73 • 0:00:03 • 0:00:00
[SUCCESS] NNCF model quantization (INT8) process finished
========================================
[INFO] OpenVINO inference with FP32 model
========================================
[INFO] Inference device selected: CPU
[INFO] Input sample: anger.wav
[INFO] Output label = ['ang']
[INFO] Input sample: hap.wav
[INFO] Output label = ['hap']
[INFO] Total inference time in sec: 0.51
=======================================================
[INFO] OpenVINO inference with NNCF Quantized INT8 model
=======================================================
[INFO] Inference device selected: CPU
[INFO] Input sample: anger.wav
[INFO] Output label = ['ang']
[INFO] Input sample: hap.wav
[INFO] Output label = ['hap']
[INFO] Total inference time in sec: 0.18

The NNCF-quantized OpenVINO model significantly improved the performance of the SpeechBrain wav2vec2 emotion recognition model. From the logs, we can see that the quantized INT8 model reduced the total inference time from 0.51 seconds to 0.18 seconds, a speedup of roughly 2.8x on the two test samples. Both models predicted the same labels for these samples ('ang' and 'hap'), so the acceleration did not change the emotion recognition outputs here.
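
The comparison can be reproduced with a small timing helper like the sketch below. It is an assumption about how such a measurement could be structured, not the script's actual code: infer_fn stands for any callable that runs one inference (for example, a compiled model), and warmup_runs is a hypothetical parameter.

```python
import time


def total_inference_time(infer_fn, samples, warmup_runs=1):
    """Run warm-up passes, then return wall-clock seconds for one pass over samples."""
    for _ in range(warmup_runs):
        for sample in samples:
            infer_fn(sample)
    start = time.perf_counter()
    for sample in samples:
        infer_fn(sample)
    return time.perf_counter() - start


def speedup(baseline_sec, optimized_sec):
    """Ratio of baseline to optimized time; values above 1 mean faster."""
    return baseline_sec / optimized_sec


# Figures taken from the logs above
print(f"{speedup(0.51, 0.18):.2f}x")
```

Warm-up passes keep one-time costs, such as memory allocation on first inference, out of the measured time.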

The accuracy of the quantized model largely depends on the quality and diversity of the dataset samples used for the calibration process.

Conclusion:

Intel® OpenVINO™ Toolkit provides an easy way to optimize deep learning models, improving performance on hardware ranging from edge to data center. OpenVINO™ employs a write-once, deploy-anywhere paradigm, enabling developers to quickly build optimized models ready for deployment. In the sections above, we showcased the model conversion and quantization process on the SpeechBrain Emotion Recognition wav2vec2 model to achieve a boost in inference performance on an Intel® Core™ Ultra 7 155H with the target device set to CPU. Model performance could be accelerated further with the advanced optimization techniques available in OpenVINO Runtime. Model optimization and NNCF quantization are just a couple of the features of the OpenVINO™ Toolkit; others include the OpenVINO™ model server and model zoo. We encourage you to try out OpenVINO™ in your next AI project.

Notices and Disclaimers:

“The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)” by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details.
No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. 

Appendix A:

Software Configuration(s)

see speechbrain_ov_pipeline.py below