ailia AI Speech : Speech Recognition Library for Unity and C++
Introducing ailia AI Speech, an AI speech recognition library which allows you to easily implement speech recognition in your applications.
ailia AI Speech offers C# bindings for use with Unity, but it can also be used directly through the C API in native applications.
Main features
Supports 99 languages
It uses a multilingual AI model and is capable of recognizing 99 languages, including Japanese, Chinese, and English.
Works offline
Speech recognition can be performed on-device, on the CPU, without the need for cloud computing.
Supports mobile devices
Speech recognition can be performed not only on PCs, but also on iOS and Android mobile devices.
Unity compatibility
In addition to the C API, a Unity Plugin is provided, making it easy to implement speech recognition in Unity-based applications.
Translation to English
Real-time translation into English can be performed simultaneously with speech recognition, including from Japanese and Chinese.
Demo
We provide a demo developed with Unity showing real-time speech recognition and live translation from an audio file or microphone.
A demo application for Windows can be downloaded from the following download URL. To run the demo, the ailia SDK license file must be placed in the same folder as the executable file.
The demo app for macOS can be downloaded from the download URL below. To run the demo, you need to place the ailia SDK license file in ~/Library/SHALO/.
Microphone quality is critical for speech recognition. MacBook microphones perform relatively well; Windows users should use a good-quality microphone for best results.
For iOS and Android, voice recognition can be evaluated in the ailia AI Showcase app, available on the App Store and Google Play.
The challenges ailia AI Speech solves
AI-based speech recognition systems usually require Python. In addition, external libraries are needed for audio preprocessing and text post-processing, which makes it difficult to implement AI-based speech recognition in native apps; as a result, it has been common to perform speech recognition on a server.
ailia AI Speech solves this problem by packaging the entire speech recognition pipeline, including audio preprocessing and text post-processing, into a single library. You can implement AI-based speech recognition offline, without Python and without a server. This makes it possible to transcribe audio that cannot be uploaded to a server for security reasons, such as meeting recordings.
A live mode is also provided as a unique feature, allowing real-time transcription of voice input from a microphone.
The technology behind ailia AI Speech
Architecture
ailia AI Speech uses ailia.audio for audio preprocessing, ailia SDK for fast AI inference, and ailia.tokenizer to convert inference results into text. These modules are bound together behind the ailia.speech C API, on top of which the Unity bindings are built.
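Whisper models operate on 16 kHz mono audio, so preprocessing includes resampling the input; ailia.audio handles this internally. Purely as an illustration (this is not the library's actual code), a naive linear-interpolation resampler looks like:

```cpp
#include <vector>
#include <cstddef>

// Naive linear-interpolation resampler, illustrating the kind of
// preprocessing ailia.audio performs before inference (sketch only).
std::vector<float> resample_linear(const std::vector<float>& in,
                                   int src_rate, int dst_rate) {
    if (in.empty() || src_rate <= 0 || dst_rate <= 0) return {};
    size_t out_len = (size_t)((double)in.size() * dst_rate / src_rate);
    std::vector<float> out(out_len);
    for (size_t i = 0; i < out_len; i++) {
        double pos = (double)i * src_rate / dst_rate;
        size_t i0 = (size_t)pos;
        size_t i1 = (i0 + 1 < in.size()) ? i0 + 1 : i0;
        double frac = pos - (double)i0;
        out[i] = (float)((1.0 - frac) * in[i0] + frac * in[i1]);
    }
    return out;
}
```

Production resamplers use windowed-sinc filtering for quality; the point here is only that raw microphone PCM must be converted to the model's expected rate before inference.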
Supported platforms
ailia AI Speech runs on Windows, macOS, iOS, Android, and Linux.
API
The API reference for ailia AI Speech is available below.
C++
C# (Unity)
Usage
You first create an instance with ailiaSpeechCreate, open the model with ailiaSpeechOpenModelFile, input PCM with ailiaSpeechPushInputData, check whether enough PCM has been buffered with ailiaSpeechBuffered, convert it to text with ailiaSpeechTranscribe, and get the recognition results with ailiaSpeechGetText.
ailiaSpeechPushInputData does not require the entire audio to be supplied at once: audio can be pushed little by little, enabling real-time input from a microphone.
ailiaSpeechSetSilentThreshold can be used to trigger transcription after a certain period of silence.
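The silence threshold operates on signal energy. As a rough, self-contained sketch (not the library's implementation), detecting a run of trailing low-energy frames might look like:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Returns true when the trailing `min_frames` windows of `pcm` all have an
// RMS level below `threshold` -- a simple stand-in for the kind of silence
// detection the silent-threshold setting enables (illustrative only).
bool trailing_silence(const std::vector<float>& pcm, size_t frame_len,
                      size_t min_frames, float threshold) {
    if (pcm.size() < frame_len * min_frames) return false;
    size_t start = pcm.size() - frame_len * min_frames;
    for (size_t f = 0; f < min_frames; f++) {
        double sum = 0.0;
        for (size_t i = 0; i < frame_len; i++) {
            float s = pcm[start + f * frame_len + i];
            sum += (double)s * s;
        }
        if (std::sqrt(sum / frame_len) >= threshold) return false;
    }
    return true;
}
```

In the real API you simply set the threshold and the library decides when to cut a segment; this sketch only shows the underlying idea of energy-based end-of-utterance detection.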
C++ sample
#include <stdio.h>
#include "ailia.h"
#include "ailia_audio.h"
#include "ailia_speech.h"
#include "ailia_speech_util.h"
int main(void){
// Instance Creation
struct AILIASpeech* net;
AILIASpeechApiCallback callback = ailiaSpeechUtilGetCallback();
int memory_mode = AILIA_MEMORY_REDUCE_CONSTANT | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | AILIA_MEMORY_REUSE_INTERSTAGE;
ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_NONE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);
// Model file loading
ailiaSpeechOpenModelFileA(net, "encoder_small.onnx", "decoder_small_fix_kv_cache.onnx", AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL);
// Language Settings
ailiaSpeechSetLanguage(net, "ja");
// PCM bulk input (pPcm, nChannels, nSamples, and sampleRate are assumed
// to have been loaded beforehand, e.g. from a wav file)
ailiaSpeechPushInputData(net, pPcm, nChannels, nSamples, sampleRate);
// Transcription
while(true){
// Has enough PCM been supplied for transcription?
unsigned int buffered = 0;
ailiaSpeechBuffered(net, &buffered);
if (buffered == 1){
// Start Transcription
ailiaSpeechTranscribe(net);
// Get the number of texts that could be retrieved
unsigned int count = 0;
ailiaSpeechGetTextCount(net, &count);
// Get transcription result
for (unsigned int idx = 0; idx < count; idx++){
AILIASpeechText text;
ailiaSpeechGetText(net, &text, AILIA_SPEECH_TEXT_VERSION, idx);
float cur_time = text.time_stamp_begin;
float next_time = text.time_stamp_end;
printf("[%02d:%02d.%03d --> %02d:%02d.%03d] ", (int)cur_time/60%60,(int)cur_time%60, (int)(cur_time*1000)%1000, (int)next_time/60%60,(int)next_time%60, (int)(next_time*1000)%1000);
printf("%s\n", text.text);
}
}
// Check whether all PCMs have been processed or not
unsigned int complete = 0;
ailiaSpeechComplete(net, &complete);
if (complete == 1){
break;
}
}
// Instance release
ailiaSpeechDestroy(net);
}
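The timestamp arithmetic inside the printf above can be factored into a small helper. This is an illustrative refactoring, not part of the ailia API:

```cpp
#include <cstdio>
#include <string>

// Format a timestamp given in seconds as "mm:ss.mmm", matching the
// "[%02d:%02d.%03d]" output of the sample above (illustrative helper).
std::string format_timestamp(float seconds) {
    int minutes = (int)seconds / 60 % 60;
    int secs = (int)seconds % 60;
    int millis = (int)(seconds * 1000) % 1000;
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%02d:%02d.%03d", minutes, secs, millis);
    return std::string(buf);
}
```

With this helper, the result loop reduces to printing format_timestamp(text.time_stamp_begin), an arrow, format_timestamp(text.time_stamp_end), and text.text.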
C# sample
// Instance Creation
IntPtr net = IntPtr.Zero;
AiliaSpeech.AILIASpeechApiCallback callback = AiliaSpeech.GetCallback();
int memory_mode = Ailia.AILIA_MEMORY_REDUCE_CONSTANT | Ailia.AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | Ailia.AILIA_MEMORY_REUSE_INTERSTAGE;
AiliaSpeech.ailiaSpeechCreate(ref net, env_id, Ailia.AILIA_MULTITHREAD_AUTO, memory_mode, AiliaSpeech.AILIA_SPEECH_TASK_TRANSCRIBE, AiliaSpeech.AILIA_SPEECH_FLAG_NONE, callback, AiliaSpeech.AILIA_SPEECH_API_CALLBACK_VERSION);
string base_path = Application.streamingAssetsPath+"/";
AiliaSpeech.ailiaSpeechOpenModelFile(net, base_path + "encoder_small.onnx", base_path + "decoder_small_fix_kv_cache.onnx", AiliaSpeech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL);
// Language Settings
AiliaSpeech.ailiaSpeechSetLanguage(net, "ja");
// PCM bulk input
AiliaSpeech.ailiaSpeechPushInputData(net, samples_buf, threadChannels, (uint)samples_buf.Length / threadChannels, threadFrequency);
// Transcription
while (true){
// Has enough PCM been supplied for transcription?
uint buffered = 0;
AiliaSpeech.ailiaSpeechBuffered(net, ref buffered);
if (buffered == 1){
// Start Transcription
AiliaSpeech.ailiaSpeechTranscribe(net);
// Get the number of texts that could be retrieved
uint count = 0;
AiliaSpeech.ailiaSpeechGetTextCount(net, ref count);
// Get transcription result
for (uint idx = 0; idx < count; idx++){
AiliaSpeech.AILIASpeechText text = new AiliaSpeech.AILIASpeechText();
AiliaSpeech.ailiaSpeechGetText(net, text, AiliaSpeech.AILIA_SPEECH_TEXT_VERSION, idx);
float cur_time = text.time_stamp_begin;
float next_time = text.time_stamp_end;
Debug.Log(Marshal.PtrToStringAnsi(text.text));
}
}
// Check whether all PCMs have been processed or not
uint complete = 0;
AiliaSpeech.ailiaSpeechComplete(net, ref complete);
if (complete == 1){
break;
}
}
// Instance release
AiliaSpeech.ailiaSpeechDestroy(net);
Unity also offers AiliaSpeechModel, an abstraction over the AiliaSpeech API that enables multi-threaded transcription.
void OnEnable(){
// Instance Creation
AiliaSpeechModel ailia_speech = new AiliaSpeechModel();
ailia_speech.Open(asset_path + "/" + encoder_path, asset_path + "/" + decoder_path, env_id, api_model_type, task, flag, language);
}
void Update(){
// Retrieve microphone input
float [] waveData = GetMicInput();
// Queue data if processing is ongoing
if (ailia_speech.IsProcessing()){
waveQueue.Add(waveData); // queuing
return;
}
// Retrieve multi-threaded processing results
List<string> results = ailia_speech.GetResults();
for (uint idx = 0; idx < results.Count; idx++){
string text = results[(int)idx];
string display_text = text + "\n";
content_text = content_text + display_text;
}
// Request a new inference once the previous one has completed
ailia_speech.Transcribe(waveQueue, frequency, channels, complete);
waveQueue = new List<float[]>(); // Initialize queue
}
void OnDisable(){
// Instance release
ailia_speech.Close();
}
Notes regarding the building process
The Unity Plugin sample uses the StandaloneFileBrowser asset for the file dialog. Due to a limitation of StandaloneFileBrowser, an error occurs when building with il2cpp in a Windows environment, so please build with mono instead. This is not a limitation of ailia AI Speech, so il2cpp can be used if the file dialog is not needed.
If you want to run on iOS, please enable the Increased Memory Limit capability in Xcode, because the Small model requires about 1.82 GB of memory.
Underlying model
The AI model used by ailia AI Speech is Whisper, developed by OpenAI, which supports 99 languages.
Model optimizations
The underlying Whisper model has been adjusted with a smaller beam size to speed up inference and enable real-time recognition. In addition, the shape of kv_cache is fixed during ONNX conversion, so that memory is not re-allocated between inferences.
In the ONNX exported from PyTorch, the ReduceMean -> Sub -> Pow -> ReduceMean -> Add -> Sqrt -> Div sequence is fused into a single MeanVarianceNormalization node by ailia's ONNX optimizer.
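The fused node computes mean-variance normalization over each feature vector: y = (x - mean) / sqrt(variance). A minimal reference implementation looks like the following (the small eps added for numerical stability is our own safeguard, not necessarily part of the fused operator):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Mean-variance normalization over one feature vector:
// y = (x - mean) / sqrt(var + eps). The eps term is added here for
// numerical stability (illustrative sketch of the fused node).
std::vector<float> mean_variance_normalize(const std::vector<float>& x,
                                           float eps = 1e-9f) {
    size_t n = x.size();
    if (n == 0) return {};
    double mean = 0.0;
    for (float v : x) mean += v;
    mean /= n;
    double var = 0.0;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;
    std::vector<float> y(n);
    for (size_t i = 0; i < n; i++)
        y[i] = (float)((x[i] - mean) / std::sqrt(var + eps));
    return y;
}
```

Fusing the seven-node subgraph into one operator reduces kernel launches and intermediate buffers, which is why the optimizer pass helps inference speed.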
Real-time recognition (live mode)
A feature absent from the official Whisper is support for real-time recognition: the ability to perform speculative inference and preview results from the current buffer contents without waiting for the full 30 seconds of audio (the window size Whisper processes).
To enable real-time recognition, pass the flag AILIA_SPEECH_FLAG_LIVE to ailiaSpeechCreate.
ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_LIVE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);
The transcribed text preview is delivered to IntermediateCallback, which can also abort speech recognition by returning 1.
int intermediate_callback(void *handle, const char *text){
printf("%s\n", text);
return 0; // return 1 to abort
}
ailiaSpeechSetIntermediateCallback(net, &intermediate_callback, NULL);
When using real-time recognition, repetition detection is also enabled: when the same word is repeated N times in a row, the system waits until a new word arrives.
Normal mode is recommended when inputting from an audio file, and live mode is recommended when inputting from a microphone.
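The repetition check in live mode can be sketched as counting identical trailing tokens. This is a simplified illustration; the real heuristic is internal to the library:

```cpp
#include <vector>
#include <string>
#include <cstddef>

// Returns true if the last `n` decoded tokens are all identical -- a
// simplified stand-in for the repetition detection used in live mode
// (illustrative only).
bool is_repeating(const std::vector<std::string>& tokens, size_t n) {
    if (n == 0 || tokens.size() < n) return false;
    const std::string& last = tokens.back();
    for (size_t i = tokens.size() - n; i < tokens.size(); i++) {
        if (tokens[i] != last) return false;
    }
    return true;
}
```

When such a run is detected, the decoder output for the current buffer is held back until new audio produces a different token, which prevents stuttering previews on unchanged buffer contents.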
Language detection
If the ailiaSpeechSetLanguage API is not called, language detection is performed for each segment; otherwise the specified language is assumed for all segments. Calling ailiaSpeechSetLanguage is recommended whenever possible, since language detection can be wrong, especially on short audio.
Translation
You can translate into English by giving the argument AILIA_SPEECH_TASK_TRANSLATE to ailiaSpeechCreate.
Speed up
ailia AI Speech can run on CPU or GPU. On CPU, Intel MKL is used on Windows and Accelerate.framework on macOS to speed up processing. On GPU, Windows uses cuDNN and macOS uses Metal for acceleration.
Download of ailia AI Speech trial version
A 1-month free evaluation version of ailia AI Speech is available for download at the URL below. The evaluation version includes the library, sample programs, and the Unity Package.
On Windows, the license file sent with the evaluation license application should be placed in the same folder as ailia_speech.dll. On macOS, place it in ~/Library/SHALO/; on Linux, place it in ~/.shalo/.
Build of ailia AI Speech trial version
For Windows, build using the x64 Native Tools Command Prompt for VS 2019. Download the SDK, go to the cpp folder, and run the following command.
cl ailia_speech_sample.cpp wave_reader.cpp ailia.lib ailia_audio.lib ailia_tokenizer.lib ailia_speech.lib
When the build completes, ailia_speech_sample.exe is generated; it can be executed after placing the license file in the cpp folder.
The setup instructions for macOS and Linux are available at the page below in Japanese only. Please contact us for support in English.
Applications for ailia AI Speech
Taking notes during meetings
It can be used to take minutes of important meetings, even offline, by recognizing speech in real-time without length limitations.
Hands-free voice memo
It can be used for voice memos and daily reports without manual text input.
Search from audio in a video
Text can be extracted from the audio of a video file and used to search for specific scenes by word.
Split audio line by line
It can be used to extract text from an audio file and split the file into separate audio files for each line.
Call center voice analysis
It can be used as an input method for call centers to search for the best answer plan based on the current call content.
Conversations with avatars
It can be used as a voice input method for conversations with avatars.
ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.
ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.