Thai G2P seq-to-seq

Pawit Dev

Apr 14, 2022

Let's get to know G2P sequence-to-sequence. I was inspired by the article below to try building a G2P model for Thai!

G2P Sequence-to-Sequence

Today we'll build a G2P Sequence-to-Sequence model. This article draws on the journal paper “Efficient Two-stage Processing for Joint…

medium.com

That article also recommends some further reading:

https://www.researchgate.net/publication/329593496_Efficient_Two-stage_Processing_for_Joint_Sequence_Model-based_Thai_Grapheme-to-Phoneme_Conversion
https://www.researchgate.net/publication/340053934_A_Great_Reduction_of_WER_by_Syllable_Toneme_Prediction_for_Thai_Grapheme_to_Phoneme_Conversion

G2P (Grapheme to Phoneme)

G2P converts written text, e.g. “ฉัน รัก เธอ”, into phonemes, that is, the symbols used to represent sounds. The cover image shows the output from the Thai G2P model trained in this article. If you follow along, you should be able to convert any Thai word into phonemes (as long as you have some data for the model to train on).

If you try it out, come back and convert this sentence into phonemes, haha.
Sentence : แต่ต่อให้พยายามเปลี่ยนแค่ไหน เขาก็คงไม่รักเรากลับอยู่ดี
Phoneme : ……..

The tool I'll suggest is g2p-seq2seq from cmusphinx, because its pipeline for data preparation and training is very simple.

GitHub - cmusphinx/g2p-seq2seq: G2P with Tensorflow

The tool does Grapheme-to-Phoneme (G2P) conversion using transformer model from tensor2tensor toolkit [1]. A lot of…

github.com

Setting environment, Preprocessing, Training and Testing

The data used here is the TSync2 corpus, licensed by NECTEC under CC-BY-NC-SA.

Thai_G2P_seq-to-seq

Using cmusphinx

GitHub - pawito236/Thai_G2P_seq-to-seg: Using cmusphinx / g2p-seq2seq

github.com

Install the libraries in the order below. I ran into a lot of version problems, and finding versions that work together was quite a struggle, haha.

!git clone https://github.com/cmusphinx/g2p-seq2seq.git
%cd g2p-seq2seq/
!pip install tensor2tensor==1.7.0
!python setup.py install
!pip install tensorflow==1.13.1
!pip install tensorflow-gpu==1.13.1

Next we download the TSync2 data (CC-BY-NC-SA) to use as the training set, because it gives us Thai sentences that are already word-segmented and come with their phonemes.

import os
import shutil

def download():
    url = "https://github.com/korakot/corpus/releases/download/v1.0/AIFORTHAI-TSync2Corpus.zip"
    print("NECTEC licensed TSync2 under CC-BY-NC-SA")
    print("Start downloading: .. ")
    os.system(f"wget {url}")
    os.system("unzip AIFORTHAI-TSync2Corpus.zip")
    os.system("rm AIFORTHAI-TSync2Corpus.zip")
    print("Finished")

download()

Now let's get to know the data.
In the data we downloaded, the number of words and the number of syllables do not always match. Checking every file by hand to see whether words and syllables line up, just so that we could use all of the data, would take a very long time.

For this article I'll take a quick shortcut: keep only the files where the word line and the phoneme line contain the same number of “|” separators.

import re

my_string = "ก็|ต่าง|ปรับ|สี|สัน|ให้|อินเทรนด์|บู๊ทส์|รีเทล|จำกัด|"
pm = 'k-@@-z^-2|*t-aa-ng^-1|*pr-a-p^-1|*s-ii-z^-4|*s-a-n^-4|*h-a-j^-2|*z-i-n^-0|thr-ee-n^-0|*b-uu-t^-3|*r-ii-z^-0|th-ee-l-0|*c-a-m^-0|k-a-t^-1|*'

print(len(re.findall(r"\|", my_string)))
print(len(re.findall(r"\|", pm)))

>>> 10
>>> 13
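The 10 vs 13 mismatch above seems to come from multi-syllable words: in this sample, a `*` appears to mark the start of each new word, while `|` closes every syllable. The sketch below is not part of the original pipeline. It rests on that assumption about the TSync2 notation (which I have not confirmed against the corpus documentation), and `group_syllables_by_word` is a name made up for illustration. It groups syllables by word so that words and phoneme groups can be compared one to one.

```python
def group_syllables_by_word(pm):
    """Split a phoneme line into one list of syllables per word.

    Assumption (from the sample only): "|*" is a word boundary and a
    bare "|" is a syllable boundary inside a word.
    """
    groups = [g for g in pm.lstrip("*").split("|*") if g]
    return [[s for s in g.split("|") if s] for g in groups]


my_string = "ก็|ต่าง|ปรับ|สี|สัน|ให้|อินเทรนด์|บู๊ทส์|รีเทล|จำกัด|"
pm = ("k-@@-z^-2|*t-aa-ng^-1|*pr-a-p^-1|*s-ii-z^-4|*s-a-n^-4|*h-a-j^-2|"
      "*z-i-n^-0|thr-ee-n^-0|*b-uu-t^-3|*r-ii-z^-0|th-ee-l-0|*c-a-m^-0|k-a-t^-1|*")

words = [w for w in my_string.split("|") if w]
groups = group_syllables_by_word(pm)

print(len(words), len(groups))  # 10 10
print(words[6], groups[6])      # อินเทรนด์ ['z-i-n^-0', 'thr-ee-n^-0']
```

If the assumption holds across the corpus, a check like `len(words) == len(groups)` could keep many of the files that the simple “|” count throws away.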

Here is the function that builds the dataset. Because of the condition we set, only a fairly small number of files make it through, but that's fine.

import glob
import re

# Get word/phoneme pairs from the TSync2 txt files
def linePm(inputPath):
    c = 0
    data = []
    fileList = glob.glob(inputPath + '/wrd_ph/*.txt')
    for fileName in fileList:
        with open(fileName, 'r') as file1:
            Lines = file1.readlines()
        sent = Lines[0]        # first line: words separated by "|"
        pm = Lines[-1][:-2]    # last line: phonemes, minus the final two characters
        word_len = len(re.findall(r"\|", sent))
        pm_len = len(re.findall(r"\|", pm))
        # Skip files whose word and phoneme lines have different "|" counts
        if word_len != pm_len:
            continue
        c = c + 1
        wrdArr = sent.split("|")
        pmArr = pm.replace('*', '').split('|')
        for n in range(word_len):
            pmStr = ''
            # Turn "k-@@-z^-2" into space-separated symbols "k @@ z^ 2"
            for ph in pmArr[n].split("-"):
                if ph != ' ':
                    pmStr = pmStr + ph + ' '
            data.append("{} {}".format(wrdArr[n], pmStr.strip()))
    # Report and return only after every file has been processed
    print("Total readable file :", c)
    return data

dataSet = linePm('/content/g2p-seq2seq/TSync2')
>>> Total readable file : 280

len(dataSet)
>>> 2860
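To make the target format concrete, here is a minimal sketch of the conversion for a single word and its phoneme string. `to_dict_line` is a hypothetical helper written for this illustration, not something from g2p-seq2seq. It produces one line in the "word, then space-separated phoneme symbols" layout that the function above appends to the dataset.

```python
def to_dict_line(word, syllable):
    """Build one dictionary line: the word, then its phoneme symbols.

    The "*" marker is dropped and "-" separated symbols become
    space-separated tokens.
    """
    phones = [p for p in syllable.replace("*", "").split("-") if p.strip()]
    return "{} {}".format(word, " ".join(phones))


print(to_dict_line("ก็", "k-@@-z^-2"))    # ก็ k @@ z^ 2
print(to_dict_line("ต่าง", "*t-aa-ng^-1"))  # ต่าง t aa ng^ 1
```

Note that the tone digit at the end (2, 1, ...) simply becomes the last token, so the model learns to predict it like any other symbol.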

Next we write the .dict files that will be used as training input.

import os
from sklearn.model_selection import train_test_split

trainSet, testSet = train_test_split(dataSet, test_size=0.05, random_state=42)
print("Count trainSet: ", len(trainSet))
print("Count testSet: ", len(testSet))

with open(os.path.join('TSync2', 'g2p_train.dict'), 'w') as fw_train:
    for dataTrain in trainSet:
        if dataTrain != '' and dataTrain is not None:
            fw_train.write(dataTrain + '\n')

with open(os.path.join('TSync2', 'g2p_test.dict'), 'w') as fw_test:
    for dataTest in testSet:
        if dataTest != '' and dataTest is not None:
            fw_test.write(dataTest + '\n')

>>> Count trainSet:  2717
>>> Count testSet:  143
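Before training, it can be worth checking that every line really has a word followed by at least one phoneme symbol, since an empty word or an empty phoneme string would slip past the `!= ''` test above. This is a small sketch of such a check. `check_dict_lines` is a hypothetical helper, and the "at least two whitespace-separated tokens" rule is my own assumption about what a usable entry looks like.

```python
def check_dict_lines(lines):
    """Return (index, line) for every entry that lacks a word or phonemes."""
    bad = []
    for i, line in enumerate(lines):
        parts = line.split()
        # A usable entry needs the word plus at least one phoneme symbol
        if len(parts) < 2:
            bad.append((i, line))
    return bad


sample = ["ก็ k @@ z^ 2", "", "คำ", "สี s ii z^ 4"]
print(check_dict_lines(sample))  # [(1, ''), (2, 'คำ')]
```

Running it over the training and test lists before writing the files shows whether any entries need to be dropped.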

Training and Testing

Open params.py and set self.gpu_order = "0"

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

!mkdir model
!mkdir output

# The .dict files above were written under TSync2/; adjust the data/ paths below if yours live elsewhere
!g2p-seq2seq --train data/g2p_train.dict --model_dir model/
!g2p-seq2seq --decode data/g2p_test.dict  --model_dir model/ --output output/result_seq2seq.txt

%%timeit
!g2p-seq2seq --decode data/g2p_test.dict  --model_dir model/ --output output/result_seq2seq.txt

>>> 1 loop, best of 5: 8.84 s per loop
>>> 143 words for testing
>>> around 62 ms per word
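The timing tells us how fast decoding is, but not how accurate it is. A common measure for G2P is word error rate: the share of words whose predicted phoneme sequence is not exactly the reference. Below is a minimal sketch of that calculation. It assumes both the reference and the predictions are available as "word phoneme phoneme ..." lines. I have not verified that the decode output file uses exactly this layout, and both function names are made up for illustration.

```python
def parse_dict_lines(lines):
    """Map each word to its list of phoneme symbols."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            entries[parts[0]] = parts[1:]
    return entries


def word_error_rate(reference, predicted):
    """Fraction of reference words whose prediction is missing or different."""
    if not reference:
        return 0.0
    errors = sum(1 for word, phones in reference.items()
                 if predicted.get(word) != phones)
    return errors / len(reference)


ref = parse_dict_lines(["ก็ k @@ z^ 2", "สี s ii z^ 4"])
hyp = parse_dict_lines(["ก็ k @@ z^ 2", "สี s i z^ 4"])
print(word_error_rate(ref, hyp))  # 0.5
```

Because a word counts as wrong when any single symbol differs, including the tone digit, this is a strict measure.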

The training config can be changed in g2p-seq2seq/app.py

Since this is a demo, I only trained for 4000 steps, but the results turned out better than expected!
Download Model

GitHub - pawito236/Thai_G2P_seq-to-seg: Using cmusphinx / g2p-seq2seq

github.com

Resources for the Thai language are still fairly scarce compared with other languages. With more data, I believe we could do a much wider range of things.

For example, in my TTS (Text-to-Speech) work I also used phonemes to train Tacotron2 on this same dataset. Feel free to check it out.

Tacotron 2 Thai TTS with phoneme

In this article, we present how to build an AI that can synthesize speech with Tacotron2

medium.com

GitHub - pawito236/TTS_Tacotron2: Using Tacotron2 and phoneme as input

github.com

Reference :

GitHub - cmusphinx/g2p-seq2seq: G2P with Tensorflow

The tool does Grapheme-to-Phoneme (G2P) conversion using transformer model from tensor2tensor toolkit [1]. A lot of…

github.com

G2P Sequence-to-Sequence

Today we'll build a G2P Sequence-to-Sequence model. This article draws on the journal paper “Efficient Two-stage Processing for Joint…

medium.com

Written by Pawit Dev