Addition of Python API to ailia AI Voice and ailia AI Speech
About ailia AI Voice and ailia AI Speech
ailia AI Voice is a library that performs speech synthesis using GPT-SoVITS, while ailia AI Speech is a library that performs speech recognition using Whisper.
Previously, these libraries provided bindings for C++, C#, and Flutter; Python bindings have now been added as well.
ailia AI Voice and ailia AI Speech have very few dependencies and run on ONNX without using PyTorch, enabling stable operation without relying on framework versions. Additionally, after prototyping in Python, you can seamlessly deploy to mobile devices like iOS or Android using bindings for Unity or Flutter.
Installation
Both modules can be installed via pip:
pip3 install ailia_voice
pip3 install ailia_speech
Usage
Using the Python bindings for ailia AI Voice and ailia AI Speech, speech synthesis and speech recognition can be achieved in just a few lines of code. The models are also downloaded automatically.
Speech synthesis with ailia AI Voice
As shown in the sample below, we download the reference_audio_girl.wav file, perform speech synthesis based on the voice in that file, and save the result.
import ailia_voice
import librosa
import soundfile
import os
import urllib.request

# Load reference audio
ref_text = "水をマレーシアから買わなくてはならない。"
ref_file_path = "reference_audio_girl.wav"
if not os.path.exists(ref_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/gpt-sovits/reference_audio_captured_by_ax.wav",
        ref_file_path
    )
audio_waveform, sampling_rate = librosa.load(ref_file_path, mono=True)

# Infer
voice = ailia_voice.GPTSoVITS()
voice.initialize_model(model_path = "./models/")
voice.set_reference_audio(ref_text, ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA, audio_waveform, sampling_rate)
buf, sampling_rate = voice.synthesize_voice("こんにちは。今日はいい天気ですね。", ailia_voice.AILIA_VOICE_G2P_TYPE_GPT_SOVITS_JA)

# Save result
soundfile.write("output.wav", buf, sampling_rate)
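The sample above uses soundfile to save the result. If soundfile is not available, a 16-bit PCM WAV can also be written with Python's standard-library wave module. The sketch below is only an illustration: the sine-wave buffer stands in for the synthesized audio, and it assumes mono float samples in the -1.0 to 1.0 range, which is what librosa-style loaders produce.

```python
import math
import wave
import array

def write_wav_16bit(path, samples, sampling_rate):
    """Write mono float samples (-1.0..1.0) as a 16-bit PCM WAV file."""
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sampling_rate)
        wf.writeframes(pcm.tobytes())

# Stand-in buffer: 0.1 s of a 440 Hz tone in place of synthesized speech
sr = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 10)]
write_wav_16bit("output_tone.wav", tone, sr)
```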
Speech recognition with ailia AI Speech
As shown below, we download the demo.wav file and perform speech recognition on it. Since the return value is a generator, you can sequentially obtain the recognition results even for long audio files.
import ailia_speech
import librosa
import os
import urllib.request

# Load target audio
input_file_path = "demo.wav"
if not os.path.exists(input_file_path):
    urllib.request.urlretrieve(
        "https://github.com/axinc-ai/ailia-models/raw/refs/heads/master/audio_processing/whisper/demo.wav",
        input_file_path
    )
audio_waveform, sampling_rate = librosa.load(input_file_path, mono=True)

# Infer
speech = ailia_speech.Whisper()
speech.initialize_model(model_path = "./models/", model_type = ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL)
recognized_text = speech.transcribe(audio_waveform, sampling_rate)
for text in recognized_text:
    print(text)
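Because the return value is a generator, segments can be processed as soon as they arrive and accumulated into a full transcript. The consumption pattern is sketched below with a stand-in generator (fake_segments, a hypothetical placeholder) in place of the actual recognizer, so it runs without the library installed.

```python
def fake_segments():
    """Stand-in for speech.transcribe(): yields recognized segments one by one."""
    yield "Hello, "
    yield "this is "
    yield "a test."

transcript = ""
for text in fake_segments():
    print(text)         # show each segment as soon as it is available
    transcript += text  # accumulate the full transcript

print(transcript)  # → "Hello, this is a test."
```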
Parameters
Various parameters can be passed to the ailia SDK constructors. For example, if you want to use the GPU, you can configure it as shown below.
import ailia
import ailia_voice
import ailia_speech
env_id = ailia.get_gpu_environment_id()
voice = ailia_voice.GPTSoVITS(env_id = env_id)
speech = ailia_speech.Whisper(env_id = env_id)
Works offline
If the AI model files already exist in the model_path directory, both speech synthesis and speech recognition operate completely offline.
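A simple pre-flight check can tell whether models are already cached before calling initialize_model. The helper below is a generic sketch: it only checks that the directory exists and is non-empty, and deliberately does not assume the actual model file names, which are not documented here.

```python
import os

def models_are_cached(model_path):
    """Return True if model_path exists and contains at least one file.

    Generic pre-flight check; the specific model file names used by
    ailia AI Voice / ailia AI Speech are not assumed here.
    """
    return os.path.isdir(model_path) and any(
        entry.is_file() for entry in os.scandir(model_path)
    )

if models_are_cached("./models/"):
    print("Models found; inference can run fully offline.")
else:
    print("Models missing; the first run will download them.")
```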
Real-time feedback during speech recognition
By providing a function to Whisper’s callback, it is possible to obtain intermediate results during speech recognition.
import ailia_speech

def f_callback(text):
    print(text)

speech = ailia_speech.Whisper(callback = f_callback)
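The callback mechanism can be illustrated with a self-contained sketch: a processing loop that reports intermediate results through a user-supplied function, mimicking how partial text surfaces during recognition. The recognizer here is simulated (recognize_with_feedback is a hypothetical helper); only the callback pattern itself matches the API above.

```python
def recognize_with_feedback(chunks, callback):
    """Simulated recognizer: processes chunks and reports partial text."""
    partial = ""
    for chunk in chunks:
        partial += chunk   # pretend each chunk yields some text
        callback(partial)  # report the intermediate result
    return partial         # final transcript

received = []
final = recognize_with_feedback(["Good ", "morning ", "everyone."], received.append)
print(final)  # → "Good morning everyone."
```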
Documentation
ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.
ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Please feel free to contact us with any inquiries.