Tacotron2 Thai TTS

Published in NECTEC · 4 min read · Sep 15, 2021

In this article we will build a speech synthesis model with Tacotron2. It is really quite easy: you can do it even if you have no background in AI or linguistics.

For an explanation of what Tacotron2 is, see the link below; covering it here would make this article far too long.

PyTorch

The Tacotron 2 and WaveGlow model form a text-to-speech system that enables user to synthesise a natural sounding…

pytorch.org

Let's start by preparing Docker. Whether you are on Windows or on a server, I recommend a machine with a GPU; laptops with a GPU are no longer as expensive as they used to be. I use the nvidia/cuda Docker image from the link below.

Docker Hub

hub.docker.com

In this article I will also install Anaconda3.

wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
bash ./Anaconda3-2020.07-Linux-x86_64.sh

Once the installation is done, restart the terminal and run the command below.

conda update --all --yes

Next I create an environment, which I will name nemo, using Python 3.8.

conda create -n nemo python=3.8

Run the list command and you will see the env we just created.

conda env list

Then activate the env we created with

conda activate nemo

To leave the env, use

conda deactivate

To delete the env, use

conda remove -n nemo --all

The next step is to install JupyterLab.

conda install -c conda-forge jupyterlab
jupyter notebook password

After setting the password and finishing the installation, start it with

jupyter-lab --port 8888 --allow-root --no-browser --ip=0.0.0.0

Then open it in a browser. Click the plus sign in the top-left corner; a Launcher tab opens on the right, where you select Notebook Python 3 (ipykernel).

Next I install the NeMo toolkit, version 1.0.2. Type the commands below into a Jupyter notebook cell and run it.

!pip install wget
!apt-get update && apt-get install -y sox libsndfile1 ffmpeg
!pip install unidecode
!pip install pytorch-lightning==1.4.4
BRANCH = 'v1.0.2'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]

Alternatively, you can install from the source code on GitHub.

git clone https://github.com/NVIDIA/NeMo.git
# install tool
cd NeMo/requirements
find . -name "requirement*" -type f -exec pip install -r '{}' ';'

Note: with this approach you must add the lines below to tacotron2.py so that it can find the NeMo path cloned from GitHub.

import sys
sys.path.insert(0, "/opt/NeMo")

After the installation finishes, check the versions.

!python --version
!pip --version
!pip show torch
!pip show pytorch_lightning
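
The same check can also be done from Python, which is handy inside a notebook. This is only a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is my own and not part of any library.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Distribution names as used by pip
for name in ("torch", "pytorch-lightning", "nemo_toolkit"):
    print(f"{name}: {installed_version(name)}")
```

If pytorch-lightning does not report 1.4.4, the pinned install above probably did not take effect in the active env.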

Next I prepare the data for training. First, go to

https://www.aiforthai.in.th/corpus.php

Register for an account, then download the TSynC2_Nun file. Upload it to your machine and extract it. The extracted files are already word-segmented and also include phonemes.

สาทิตย์|แฉ|ผลาญ|งบ|ซื้อ|รถ|หรู|เพียบ|
สาทิตย์|แฉ|ผลาญ|งบ|ซื้อ|รถ|หรู|เพียบ
s-aa-z^-4|th-i-t^-3|*ch-xx-z^-4|*phl-aa-n^-4|*ng-o-p^-3|*s-vv-z^-3|*r-o-t^-3|*r-uu-z^-4|*ph-iia-p^-2|*
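
To get a feel for the phoneme line, here is a rough sketch that groups its syllables into words. It rests on my own reading of the sample, namely that '|' separates syllables and a leading '*' marks the start of a new word; the corpus documentation may define this differently. The function name group_syllables is hypothetical.

```python
def group_syllables(phoneme_line: str) -> list:
    """Group '|'-separated syllables into words, assuming '*' starts a new word."""
    words, current = [], []
    for token in phoneme_line.strip().split("|"):
        if token.startswith("*"):
            # Close the previous word before starting a new one
            if current:
                words.append(current)
            current = []
            token = token[1:]
        if token:
            current.append(token)
    if current:
        words.append(current)
    return words


line = "s-aa-z^-4|th-i-t^-3|*ch-xx-z^-4|*phl-aa-n^-4|*ng-o-p^-3|*s-vv-z^-3|*r-o-t^-3|*r-uu-z^-4|*ph-iia-p^-2|*"
words = group_syllables(line)
print(len(words), "words,", sum(len(w) for w in words), "syllables")
```

For the sample line this gives 8 words and 9 syllables, which matches the 8 entries in the word-segmented line.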

However, we will not use wordseg; we will use pm instead, and prepare the file in the format below.

{
  "audio_filepath": "TSync2/wavTrim/tsync2_noon_1_4650_trim.wav",
  "text": "สา ทิตย์ แฉ ผลาญ งบ ซื้อ รถ หรู เพียบ",
  "duration": 3.0650340136054424
}
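
The entry above is spread over several lines for readability; in the manifest file itself each entry is written as one JSON object per line. A minimal sketch of producing such a line with the standard json module (the helper name manifest_line is my own):

```python
import json


def manifest_line(audio_filepath: str, text: str, duration: float) -> str:
    """Serialize one manifest entry as a single line of JSON."""
    entry = {"audio_filepath": audio_filepath, "text": text, "duration": duration}
    # ensure_ascii=False keeps Thai characters readable instead of \uXXXX escapes
    return json.dumps(entry, ensure_ascii=False) + "\n"


line = manifest_line(
    "TSync2/wavTrim/tsync2_noon_1_4650_trim.wav",
    "สา ทิตย์ แฉ ผลาญ งบ ซื้อ รถ หรู เพียบ",
    3.0650340136054424,
)
print(line, end="")
```

Letting json.dumps do the quoting also protects against stray quotes or backslashes in the text.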

If you want to know what pm is, see the journal article "Efficient Two-stage Processing for Joint Sequence Model-based Thai Grapheme-to-Phoneme Conversion" and the paper "A Great Reduction of WER by Syllable Toneme Prediction for Thai Grapheme to Phoneme Conversion". Follow the links below to read more.

https://www.researchgate.net/publication/329593496_Efficient_Two-stage_Processing_for_Joint_Sequence_Model-based_Thai_Grapheme-to-Phoneme_Conversion
https://www.researchgate.net/publication/340053934_A_Great_Reduction_of_WER_by_Syllable_Toneme_Prediction_for_Thai_Grapheme_to_Phoneme_Conversion

I use a library that supports pm segmentation, available at the link below.

https://colab.research.google.com/drive/1G7OUNsCC-B5XHNd8V5Et1ZKpJp4R66hg

I also trim the audio files, then compute the duration.

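import json
import librosa
import numpy as np
from scipy.io import wavfile

# getPm() is not defined here; it is expected to come from the pm segmentation notebook linked above
def trimWav(fileName_wav):
    outputFilename = fileName_wav.replace('/wav/', '/wavTrim/').replace('.wav', '_trim.wav').replace('..', '.')
    y, sr = librosa.load(fileName_wav)  # resamples to 22050 Hz by default
    yt, _ = librosa.effects.trim(y, top_db=30)  # strip leading and trailing silence
    duration = float(librosa.get_duration(y=yt, sr=sr))
    # Clip before converting to 16-bit PCM so full-scale samples do not overflow
    wavfile.write(outputFilename, sr, (np.clip(yt, -1.0, 1.0) * 32767).astype('int16'))
    entry = {"audio_filepath": outputFilename, "text": getPm(fileName_wav.replace('..', '.')), "duration": duration}
    return json.dumps(entry, ensure_ascii=False) + '\n'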
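
The duration is simply the number of samples divided by the sample rate, so it can be cross-checked without librosa. A small sketch using only the standard wave module; the helper name wav_duration and the generated demo file are my own, for illustration only.

```python
import os
import tempfile
import wave


def wav_duration(path: str) -> float:
    """Duration in seconds = number of frames / sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Demo on a generated file: 1 second of 16-bit mono silence at 22050 Hz
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wf.setframerate(22050)
    wf.writeframes(b"\x00\x00" * 22050)

demo_duration = wav_duration(demo_path)
print(demo_duration)
```

As a sanity check, the duration 3.0650340136054424 in the sample entry corresponds to 67584 samples at 22050 Hz.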

I use multiprocessing to speed up processing and tqdm to show progress.

import os
import glob
import tqdm
import multiprocessing
from sklearn.model_selection import train_test_split

def prepareData(inputPath):
    fileList = glob.glob(inputPath+'/wav/*.wav')
    # Create the output folder for trimmed audio if it does not exist yet
    os.makedirs(os.path.join(inputPath, 'wavTrim'), exist_ok=True)
    resultList = []
    # Trim files in 3 worker processes; imap yields results in input order
    with multiprocessing.Pool(3) as process:
        resProcess = process.imap(trimWav, fileList)
        for resTqdm in tqdm.tqdm(resProcess, total=len(fileList)):
            resultList.append(resTqdm)
    return resultList

dataSet = prepareData('TSync2')

Once that has run, we split the data set into a train set and a test set. Because we have little data, I set aside only 5% for the test set.

trainSet, testSet = train_test_split(dataSet, test_size=0.05, random_state=42)
print("Count trainSet: ", len(trainSet))
print("Count testSet: ", len(testSet))
with open(os.path.join('TSync2', 'tsync2_train.json'), 'w') as fw_train:
    for dataTrain in trainSet:
        if dataTrain!='' and dataTrain is not None:
            fw_train.write(dataTrain)
with open(os.path.join('TSync2', 'tsync2_test.json'), 'w') as fw_test:
    for dataTest in testSet:
        if dataTest!='' and dataTest is not None:
            fw_test.write(dataTest)
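
Before training, it can be worth reading the manifests back to confirm every line parses and to see how much audio there is. A minimal sketch; summarize_manifest is a hypothetical helper, and the demo file below is made up rather than taken from TSync2.

```python
import json
import os
import tempfile


def summarize_manifest(path: str):
    """Return (entry_count, total_seconds); raise on malformed entries."""
    count, total = 0, 0.0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            entry = json.loads(line)
            missing = {"audio_filepath", "text", "duration"} - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
            total += float(entry["duration"])
    return count, total


# Demo with a tiny made-up manifest
demo_manifest = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_manifest, "w", encoding="utf-8") as f:
    f.write('{"audio_filepath": "a.wav", "text": "ทด สอบ", "duration": 1.5}\n')
    f.write('{"audio_filepath": "b.wav", "text": "เสียง พูด", "duration": 2.0}\n')

count, total = summarize_manifest(demo_manifest)
print(count, total)
```

On the real data you would call it with the tsync2_train.json and tsync2_test.json paths written above.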

Next comes training. First we download the script and the config.

!wget https://raw.githubusercontent.com/NVIDIA/NeMo/v1.0.2/examples/tts/tacotron2.py
!mkdir conf && cd conf && wget https://raw.githubusercontent.com/NVIDIA/NeMo/v1.0.2/examples/tts/conf/tacotron2.yaml && cd ..

Because our data is Thai, we need to edit labels in conf/tacotron2.yaml.

labels: ['ก', 'ข', 'ค', 'ฆ', 'ง', 'จ', 'ฉ', 'ช', 'ซ', 'ฌ', 'ญ', 'ฎ', 'ฏ', 'ฐ', 'ฑ', 'ฒ', 'ณ', 'ด', 'ต', 'ถ', 'ท', 'ธ', 'น', 'บ', 'ป', 'ผ', 'ฝ', 'พ', 'ฟ', 'ภ', 'ม', 'ย', 'ร', 'ฤ', 'ล', 'ว', 'ศ', 'ษ', 'ส', 'ห', 'ฬ', 'อ', 'ฮ', 'ะ', 'ั', 'า', 'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู', 'เ', 'แ', 'โ', 'ใ', 'ไ', 'ๅ', '็', '่', '้', '๊', '๋', '์', '#', ' ']
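
Any character in the manifest text that is not in labels cannot be represented by the model, so a quick coverage check can save a wasted training run. A minimal sketch; uncovered_chars is a hypothetical helper, and the list is copied from the labels line above.

```python
labels = ['ก', 'ข', 'ค', 'ฆ', 'ง', 'จ', 'ฉ', 'ช', 'ซ', 'ฌ', 'ญ', 'ฎ', 'ฏ', 'ฐ', 'ฑ', 'ฒ', 'ณ', 'ด', 'ต', 'ถ', 'ท', 'ธ', 'น', 'บ', 'ป', 'ผ', 'ฝ', 'พ', 'ฟ', 'ภ', 'ม', 'ย', 'ร', 'ฤ', 'ล', 'ว', 'ศ', 'ษ', 'ส', 'ห', 'ฬ', 'อ', 'ฮ', 'ะ', 'ั', 'า', 'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู', 'เ', 'แ', 'โ', 'ใ', 'ไ', 'ๅ', '็', '่', '้', '๊', '๋', '์', '#', ' ']


def uncovered_chars(text: str, labels: list) -> list:
    """Return the sorted characters of text that are missing from labels."""
    return sorted(set(text) - set(labels))


print(uncovered_chars("สา ทิตย์ แฉ ผลาญ งบ ซื้อ รถ หรู เพียบ", labels))
print(uncovered_chars("ทด สอบ 123", labels))
```

The first call finds nothing missing, while the second reports the digits, which would have to be spelled out in Thai or added to labels.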

Then open a terminal and run the commands below.

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
export LC_ALL=C.UTF-8
python tacotron2.py sample_rate=22050 \
train_dataset=TSync2/tsync2_train.json \
validation_datasets=TSync2/tsync2_test.json \
trainer.max_epochs=5000 trainer.check_val_every_n_epoch=5 \
model.train_ds.dataloader_params.batch_size=32 \
model.validation_ds.dataloader_params.batch_size=32

In the command above, the machine I use is a bit short on RAM, so I use batch_size=32. I set max_epochs=5000, and check_val_every_n_epoch=5 so that validation runs and a checkpoint is saved every 5 epochs.

We can follow training progress and watch the loss with TensorBoard. For installation, see the link below.

How to use TensorBoard with PyTorch - PyTorch Tutorials 1.9.0+cu102 documentation

TensorBoard is a visualization toolkit for machine learning experimentation. TensorBoard allows tracking and…

pytorch.org

Once it is installed, run

tensorboard --port=3333 --bind_all --logdir nemo_experiments/Tacotron2/

and view it in a web browser.

If training feels too long and you want to stop and continue later, you can resume from these checkpoints. Add the line below to the trainer: section of tacotron2.yaml.

resume_from_checkpoint: '/opt/NeMo/training_scripts/nemo_experiments/Tacotron2/2022-01-19_21-31-01/checkpoints/Tacotron2--val_loss=14.2026-epoch=14-last.ckpt'
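
The run folder is timestamped, so the checkpoint path changes with every run. Here is a small sketch for locating the newest -last.ckpt file, assuming the nemo_experiments/Tacotron2/<timestamp>/checkpoints/ layout seen in the path above; latest_last_checkpoint is a hypothetical helper of mine.

```python
from pathlib import Path
from typing import Optional


def latest_last_checkpoint(exp_dir: str) -> Optional[Path]:
    """Return the most recently modified '*-last.ckpt' under exp_dir, or None."""
    root = Path(exp_dir)
    if not root.is_dir():
        return None
    candidates = list(root.glob("*/checkpoints/*-last.ckpt"))
    if not candidates:
        return None
    # Pick by modification time, so the newest run wins
    return max(candidates, key=lambda p: p.stat().st_mtime)


print(latest_last_checkpoint("nemo_experiments/Tacotron2"))
```

It prints None when no checkpoint exists yet; otherwise the printed path is what goes into resume_from_checkpoint.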

When training finishes, you get a .nemo model file in the checkpoints folder, for example nemo_experiments/Tacotron2/2021-09-15_10-26-54/checkpoints/Tacotron2.nemo

Testing speech synthesis

Start by loading the model. There are three ways to load it: from a pretrained model, from a checkpoint, or from the .nemo model we trained.

from nemo.collections.tts.models import Tacotron2Model
#model = Tacotron2Model.from_pretrained("tts_en_tacotron2").eval().cuda()
#model = Tacotron2Model.load_from_checkpoint("nemo_experiments/Tacotron2/2022-01-19_21-31-01/checkpoints/Tacotron2--val_loss=14.2026-epoch=14-last.ckpt").eval().cuda()
model = Tacotron2Model.restore_from("nemo_experiments/Tacotron2/2021-09-15_10-26-54/checkpoints/Tacotron2.nemo").eval().cuda()

Load the vocoder. I use HiFi-GAN, which gives quite good audio quality.

from nemo.collections.tts.models import HifiGanModel
vocoder = HifiGanModel.from_pretrained("tts_hifigan").eval().cuda()

Next, synthesize the speech.

token_input = model.parse('ทด สอบ สัง เคราะห์ เสียง พูด')
spec_gen = model.generate_spectrogram(tokens=token_input.to('cuda:0'))
audio = vocoder.convert_spectrogram_to_audio(spec=spec_gen).to('cuda:0')

Play the synthesized audio.

import IPython.display as ipd
ipd.Audio(audio.to('cpu').detach().numpy()[0], rate=22050)

Or, if you want to save it as an audio file, you can.

import soundfile as sf
sf.write("test.wav", audio.to('cpu').detach().numpy()[0], 22050)

Show it as a spectrogram.

from matplotlib.pyplot import imshow
from matplotlib import pyplot
imshow(spec_gen.to('cpu').detach().numpy()[0], origin="lower")
pyplot.show()

That's it for the Tacotron2 speech model. Using pm instead of wordseg helps fix the hit-or-miss misreadings that used to occur. Moreover, the number of unique words drops from 103,265 to just 19,601, which should also reduce the number of network nodes.
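
To put those two counts in perspective, the arithmetic is quick to check. The variable names are my own; the numbers are the ones quoted above.

```python
unique_wordseg = 103_265  # unique units with wordseg
unique_pm = 19_601        # unique units with pm

reduction_pct = (1 - unique_pm / unique_wordseg) * 100
shrink_factor = unique_wordseg / unique_pm
print(f"{reduction_pct:.1f}% fewer unique units, about {shrink_factor:.1f}x smaller")
```

That works out to roughly 81% fewer unique units, or a vocabulary about 5.3 times smaller.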

Reference

Anaconda and Jupyter Notebook Install Instructions - Ubuntu

Instructions tested with Ubuntu 20.04 64-bit and Continuum's Anaconda3 2021.05 1. Open the Terminal program by going to…

mas-dse.github.io

Index of /

repo.anaconda.com

Written by Sittipong Saychum