ailia Audio: A Library for Audio Pre-processing and Post-processing

David Cochard
Published in axinc-ai
Jan 9, 2024

This is an introduction to ailia Audio, a library for audio pre-processing and post-processing that makes on-device AI audio processing easier.

Overview

In recent years, the use of machine learning in the field of audio has been increasing, specifically in areas such as voice recognition, speech synthesis, pitch estimation, and noise reduction.

Audio models, however, commonly expect a mel-spectrogram computed from the input waveform rather than the raw waveform itself. This transformation is typically carried out by external libraries such as torchaudio or librosa, not inside the AI model.

As a result, even though the AI model itself can be converted to ONNX format, the mel-spectrogram preprocessing is not part of the exported model and has to be reimplemented in order to run the model on iOS or Android. This requires implementing FFT (Fast Fourier Transform) and other complex processing, which demands significant development effort.
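To give a sense of the work involved, here is a minimal NumPy sketch of a mel-spectrogram computed by hand: framing, windowing, FFT, and a triangular mel filter bank. It is not the implementation used by librosa, torchaudio, or ailia Audio. The function names, the HTK-style mel formula, the symmetric Hann window, and the default parameters are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other libraries may use a different variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    edges_hz = mel_to_hz(edges_mel)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(wave, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    # Assumes a 1-D NumPy array with at least n_fft samples (no padding is applied)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # 1. Slice the waveform into overlapping frames and apply a Hann window
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    # 2. FFT of each frame, then power spectrum: shape (n_frames, n_fft // 2 + 1)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Project the power spectrum onto the mel bands: shape (n_mels, n_frames)
    return mel_filterbank(sample_rate, n_fft, n_mels) @ power.T
```

On a mobile platform each of these steps, including the FFT that NumPy provides here, has to be written or sourced separately.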

ailia Audio is a library that solves this problem. By providing various APIs compatible with torchaudio and librosa, including the mel-spectrogram transformation, ailia Audio makes it easy to port audio processing AI models written in Python to iOS and Android.

Usage examples

The voice recognition model Whisper requires the calculation of a log_melspectrum from the input audio waveform. With ailia Audio, it becomes possible to implement Whisper on iOS and Android platforms.

Implementation with librosa

In librosa, log_melspectrum is calculated by calling librosa.stft and librosa.filters.mel, applying the resulting mel filter bank to the power spectrum of the STFT, and then computing the logarithm (base 10) of the result.

Implementation with ailia Audio

When using ailia Audio, you can calculate the log_melspectrum by calling ailia.audio.mel_spectrogram and then computing the logarithm (base 10) of the result.
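The logarithm step is plain array arithmetic and does not depend on which library produced the mel spectrogram. Below is a minimal NumPy sketch of it, in which mel_power stands in for the output of ailia.audio.mel_spectrogram (its arguments are not shown here). Beyond the base-10 logarithm, the sketch also applies the clamping and rescaling that the reference Whisper preprocessing is understood to use. Treat the function name and the constants 1e-10, 8.0 and 4.0 as assumptions to check against the model you are porting.

```python
import numpy as np

def whisper_log_scale(mel_power):
    # Base-10 logarithm, with a floor so that zeros do not produce -inf
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    # Limit the dynamic range to 8 decades below the loudest value
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and scale into a small range around zero
    return (log_spec + 4.0) / 4.0
```

Because this step is only a few array operations, it is straightforward to reproduce in C++, C#, or Dart once the mel spectrogram itself is available.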

ailia Audio bindings

librosa and torchaudio are Python libraries. In contrast, ailia Audio is implemented in C++, so it can be called natively from C++ and Objective-C, and from C# and Dart through FFI (Foreign Function Interface). Python bindings are also provided for verification purposes. Like the ailia SDK, it can therefore be used across a wide range of platforms.

ailia Audio API

ailia Audio includes the following audio processing implementations:

・FFT, IFFT
・Spectrogram, InverseSpectrogram
・MelSpectrogram
・MagPhase, Standardize, ComplexNorm, ConvertToMel, dB
・Resample
・LinearFilter, LinearFilterZiCoef, FilterFilter

The C++ API reference is available at the link below:

API reference in Python:

ailia Audio samples

There are examples of using ailia Audio from C++ and Unity as follows:

For samples using ailia Audio from Python, refer to ailia MODELS, which provides both the librosa/torchaudio implementation and the ailia Audio implementation of each model so that you can switch between them easily. Comparing the two side by side shows how to port from librosa or torchaudio to ailia Audio. Files with _ailia in their names use ailia Audio; those without it are the original implementations using librosa or torchaudio.

CRNN Audio Classification

Pytorch DC TTS (Text-To-Speech)

Whisper (Voice Recognition)

Conclusion

By using ailia Audio, it becomes possible to implement AI models for audio processing on iOS and Android. ax Inc. also offers services to port your models to iOS and Android. If you are interested, feel free to contact us.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.

David Cochard
axinc-ai

Engineer with 10+ years in game engines & multiplayer backend development. Now focused on machine learning, computer vision, graphics and AR