Thai Named Entity Recognition with BiLSTM-CRF using Word/Character Embedding (Keras)

9 min read · Oct 27, 2019

Hello!

I'm writing a thesis on Named Entity Recognition, so I'd like to share my experience with Thai NLP. The code is a Jupyter (Python) notebook, available on GitHub at the link below.

SuphanutN/Thai-NER-BiLSTM-WordCharEmbedding

Thai Named Entity Recognition with BiLSTM-CRF using Word/Character Embedding …

github.com

It shouldn't be hard to run; you probably just need to install a few dependencies, haha. Give it a try. (Honestly, I'm also writing this so that I can understand it when I come back to it later. My memory is short, haha.)

If you'd rather read the research paper, here is the link:

https://www.researchgate.net/publication/336798652_Thai_Named_Entity_Recognition_Using_Bi-LSTM-CRF_with_Word_and_Character_Representation

What is Named Entity Recognition?

If you want the full story, go read Lukkid's summary:

Summary: Survey of Named Entity Recognition and Classification (NERC)

Hello everyone. First of all, many thanks to Guangming C. Sangkeettrakarn for giving me the survey to read.

lukkiddd.com

(A little plug: he's a friend of mine, haha.)

TL;DR

NER is an NLP technique for feature extraction that picks named entities out of a sentence. For example, given the sentence

นายธนาธรเจอนางสาวยิ่งลักษณ์ที่มหาวิทยาลัยจุฬา เช้าวันนี้

(roughly: "Mr. Thanathorn met Ms. Yingluck at Chula University this morning"), the tool identifies which words in the sentence are named entities, namely:

นายธนาธรเจอนางสาวยิ่งลักษณ์ที่มหาวิทยาลัยจุฬา เช้าวันนี้

นายธนาธร is a person name (Person)
นางสาวยิ่งลักษณ์ is a person name (Person)
มหาวิทยาลัยจุฬา is an organization name (Organization)
เช้า is a time-of-day expression (Time)
วันนี้ is a date expression (Date)

OK, so what if we want to write a program that does this? How do we go about it?

Review Model

Let me start by reviewing the architecture. The main idea comes from the English state-of-the-art models that many people have shared. The model I use is a bit dated, because it is adapted from the English-language code in

Enhancing LSTMs with character embeddings for Named entity recognition

This is the fifth in my series about named entity recognition with python. If you haven't seen the last four, have a…

www.depends-on-the-definition.com

Anyway, let's get started.

I'll talk about two parts of the deep learning model:

1. Embedding, which I split into two sub-parts:

Character Embedding: I use an LSTM to train character-level embeddings from scratch, then build a word vector from those characters.

Word Embedding: many thanks to Charin for building Thai2Fit. I use it as-is, straight from v0.32, haha. It is very easy to use; you can just load the weights.

2. Main Model (Bi-LSTM-CRF)

This part uses a bi-directional LSTM that takes two inputs, from the character and word embeddings, and passes its output on to a CRF.

OK, that should give you a rough picture. Let's begin!

Dataset

For this experiment I use the open-source ThaiNER dataset (linked below); my thanks to its maintainers. I also want to thank Nutcha, who built the first NER dataset from InterBEST2009 and did research on Thai NER using CRF. Thank you very much!

wannaphongcom/thai-ner

Thai Named Entity Recognition. Contribute to wannaphongcom/thai-ner development by creating an account on GitHub.

github.com

If you want to contribute to the project, you can help label data at

Data entry: Thai NER corpus project

thainlp-203815.appspot.com

The code for loading the dataset looks roughly like this:

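import dill
from sklearn.model_selection import train_test_split

# RAW_PATH is the folder that holds the dataset file
with open(RAW_PATH + 'datatrain.data', 'rb') as file:
    datatofile = dill.load(file)

# Keep only (word, NER tag) for each token: j[0] is the word, j[2] is the tag
tagged_sents = []
for i in datatofile:
    text_inside = []
    for j in i:
        text_inside.append((j[0], j[2]))
    tagged_sents.append(text_inside)

# 80:20 split with a fixed seed so the split is reproducible
train_sents, test_sents = train_test_split(tagged_sents, test_size=0.2, random_state=112)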

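The dataset contains 6,148 sentences in total (197,704 words, 50,593 NER tokens), split 80:20 into train and test.

The dataset is tagged in the BIO scheme: B- marks the first token of an entity, I- marks a token inside it, and O marks a token outside any entity.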
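To make the BIO scheme concrete, here is a minimal sketch that tags the example sentence from the TL;DR section. The word segmentation, the span positions, and the helper name `to_bio` are my own assumptions for illustration; the real dataset already ships with its tags.

```python
def to_bio(tokens, spans):
    """Turn entity spans into BIO tags.

    spans: list of (start, end, label) over token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for k in range(start + 1, end):
            tags[k] = "I-" + label
    return tags


# Hypothetical segmentation of the example sentence.
tokens = ["นาย", "ธนาธร", "เจอ", "นางสาว", "ยิ่งลักษณ์", "ที่",
          "มหาวิทยาลัย", "จุฬา", " ", "เช้า", "วันนี้"]
spans = [(0, 2, "PERSON"), (3, 5, "PERSON"), (6, 8, "ORGANIZATION"),
         (9, 10, "TIME"), (10, 11, "DATE")]

bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(repr(token), tag)
```

Note that a one-token entity such as เช้า only gets a B- tag, which is why the B- rows in the result table later on have smaller support than most I- rows.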

Preprocess and Prepare word / character vector

In this experiment, the inputs to the deep learning model are:

Word Embedding: Thai2Fit by Charin

cstorm125/thai2fit

ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. Created as part of…

github.com

It is loaded from the binary (.bin) file with the following code:

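from gensim.models import KeyedVectors

# Load the pretrained Thai2Fit vectors (word2vec binary format)
thai2fit_model = KeyedVectors.load_word2vec_format(W_MODEL_PATH+'thai2vecNoSym.bin',binary=True)
# Embedding matrix: one row per word
thai2fit_weight = thai2fit_model.vectors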

This gives word vectors for 55,677 words, each with 400 dimensions.

Next, we build the word dictionaries (lookup tables) for the dataset, which consist of:

Word Dictionary
NER Dictionary
Thai2Fit Dictionary

Since we are using Thai2Fit here, the Word Dictionary is not actually used. If we were not using Thai2Fit, we would use the Word Dictionary as a one-hot encoder instead. We also add PAD and Unknown entries: PAD is the filler token that brings every sentence up to the same length, and any word that is not in the dictionary is replaced with Unknown. I'll come back to this when we prepare the input word vectors.

word_list=[]
ner_list=[]
thai2dict = {}

for sent in train_sents:
    for word in sent:
        word_list.append(word[0])
        ner_list.append(word[1])
        
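# Map every Thai2Fit word to its vector
# (index2word is the gensim < 4.0 attribute name; newer versions call it index_to_key)
for word in thai2fit_model.index2word:
    thai2dict[word] = thai2fit_model[word]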

word_list.append("pad")
word_list.append("unknown") #Special Token for Unknown words ("UNK")
ner_list.append("pad")

all_words = sorted(set(word_list))
all_ner = sorted(set(ner_list))
all_thai2dict = sorted(set(thai2dict))

word_to_ix = dict((c, i) for i, c in enumerate(all_words)) #convert word to index 
ner_to_ix = dict((c, i) for i, c in enumerate(all_ner)) #convert ner to index
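# Enumerates thai2dict in insertion order (Python 3.7+), i.e. Thai2Fit's own order,
# so each index lines up with the matching row of thai2fit_weight
thai2dict_to_ix = dict((c, i) for i, c in enumerate(thai2dict)) #convert thai2fit to index 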

ix_to_word = dict((v,k) for k,v in word_to_ix.items()) #convert index to word
ix_to_ner = dict((v,k) for k,v in ner_to_ix.items())  #convert index to ner
ix_to_thai2dict = dict((v,k) for k,v in thai2dict_to_ix.items())  #convert index to thai2fit

n_word = len(word_to_ix)
n_tag = len(ner_to_ix)
n_thai2dict = len(thai2dict_to_ix)

Next we build the dictionary for characters. It uses every character that appears in Thai2Fit. Because Charin stripped special symbols out of the version I use, I added entries 3–33 myself just in case.

Strictly speaking, char2idx should be written as char2idx = {c: i + 34 for i, c in enumerate(chars)}, so that the generated indices do not collide with the hand-assigned entries 3–33. But in my trials the version below gave higher accuracy, so I went with it. I have not run a proper experiment on which character indexing works best; if you think another scheme is better, try it and let's compare, haha. Let's help each other out.

chars = set([w_i for w in thai2dict for w_i in w])
char2idx = {c: i + 5 for i, c in enumerate(chars)}

char2idx["pad"] = 0
char2idx["unknown"] = 1
char2idx[" "] = 2

char2idx["$"] = 3
char2idx["#"] = 4
char2idx["!"] = 5
char2idx["%"] = 6
char2idx["&"] = 7
char2idx["*"] = 8
char2idx["+"] = 9
char2idx[","] = 10
char2idx["-"] = 11
char2idx["."] = 12
char2idx["/"] = 13
char2idx[":"] = 14
char2idx[";"] = 15
char2idx["?"] = 16
char2idx["@"] = 17
char2idx["^"] = 18
char2idx["_"] = 19
char2idx["`"] = 20
char2idx["="] = 21
char2idx["|"] = 22
char2idx["~"] = 23
char2idx["'"] = 24
char2idx['"'] = 25

char2idx["("] = 26
char2idx[")"] = 27
char2idx["{"] = 28
char2idx["}"] = 29
char2idx["<"] = 30
char2idx[">"] = 31
char2idx["["] = 32
char2idx["]"] = 33
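For anyone who wants to try the collision-free variant mentioned above, here is a minimal sketch. It is not the indexing used for the reported results; the helper name `build_char_index` and the shortened reserved table are assumptions for illustration. Sorting the characters also makes the indices identical on every run, which a plain `set` does not guarantee.

```python
def build_char_index(chars, reserved):
    """Give every character a unique index, starting after the reserved ones."""
    char2idx = dict(reserved)
    next_idx = max(reserved.values()) + 1
    for c in sorted(chars):          # sorted: same indices on every run
        if c not in char2idx:
            char2idx[c] = next_idx
            next_idx += 1
    return char2idx


# Toy inputs: a shortened reserved table and a tiny character set.
reserved = {"pad": 0, "unknown": 1, " ": 2, "$": 3, "#": 4, "!": 5}
chars = set("กขค!ab")

char2idx = build_char_index(chars, reserved)
print(char2idx)
```

With the full reserved table from the post, the first generated index would be 34, which matches the `i + 34` offset described above.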

Yay, done. Now save it. The character dictionary really does need to be saved: if you rerun the notebook it generates a new dict, and the character indexing will no longer be the same. When you need it later, just load the pickle.

import pickle

with open(Dict_MODEL_PATH+'chardict.pickle', 'wb') as chardict:
    pickle.dump(char2idx, chardict)
with open(Dict_MODEL_PATH+'nerdict.pickle', 'wb') as nerdict:
    pickle.dump(ner_to_ix, nerdict)
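Loading the dictionary back is the mirror image of saving it. The sketch below does a full round trip; the temporary folder and the toy dictionary are stand-ins I made up so that the snippet runs on its own, where the notebook would use Dict_MODEL_PATH and the real char2idx.

```python
import os
import pickle
import tempfile

char2idx = {"pad": 0, "unknown": 1, " ": 2, "ก": 34}   # toy dictionary

# A temporary folder stands in for Dict_MODEL_PATH here.
with tempfile.TemporaryDirectory() as dict_dir:
    path = os.path.join(dict_dir, "chardict.pickle")

    with open(path, "wb") as f:
        pickle.dump(char2idx, f)

    with open(path, "rb") as f:
        loaded_char2idx = pickle.load(f)

print(loaded_char2idx == char2idx)
```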

Next we set the size parameters. Each sentence is fixed at 250 words. (That is very generous; feel free to lower it. If I remember correctly, the longest sentence has a little over 100 words, because the first dataset that Nutcha built came from news in the BEST corpus, and those documents are fairly long.)

Each word is allowed up to 30 characters.

If a sentence has fewer words, or a word has fewer characters, than the limit, we fill the rest with PADDING.

max_len = 250
max_len_char = 30
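To see what fixing the length does to one sequence, here is a small pure-Python stand-in. It only mimics the padding='post', truncating='post' behaviour that the post asks of Keras pad_sequences; the function name `pad_post` is hypothetical and this is not the Keras implementation.

```python
def pad_post(seq, maxlen, value):
    """Cut seq to maxlen items, then fill the tail with value."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))


PAD = 0
print(pad_post([7, 3, 9], maxlen=5, value=PAD))           # short: padded at the end
print(pad_post([7, 3, 9, 4, 2, 8], maxlen=5, value=PAD))  # long: cut at the end
```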

OK, let's prepare the training and testing datasets. I only show the code for the training side, because testing uses exactly the same procedure on a different dataset.

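import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Assumed definitions (not shown in the original post):
# the words and the NER tags of each training sentence
input_sent = [[w[0] for w in s] for s in train_sents]
train_targets = [[w[1] for w in s] for s in train_sents]

## Word Training (prepare_sequence_word is defined further down)
X_word_tr = [prepare_sequence_word(s) for s in input_sent]
X_word_tr = pad_sequences(maxlen=max_len, sequences=X_word_tr, value=thai2dict_to_ix["pad"], padding='post', truncating='post')

## Character Training
X_char_tr = []
for sentence in train_sents:
    sent_seq = []
    for i in range(max_len):
        word_seq = []
        for j in range(max_len_char):
            try:
                if(sentence[i][0][j] in char2idx):
                    word_seq.append(char2idx.get(sentence[i][0][j]))
                else:
                    word_seq.append(char2idx.get("unknown"))
            except IndexError:  # past the end of the sentence or the word
                word_seq.append(char2idx.get("pad"))
        sent_seq.append(word_seq)
    X_char_tr.append(np.array(sent_seq))

## Sequence Label Training
y_tr = [prepare_sequence_target(s) for s in train_targets]
y_tr = pad_sequences(maxlen=max_len, sequences=y_tr, value=ner_to_ix["pad"], padding='post', truncating='post')
y_tr = [to_categorical(i, num_classes=n_tag) for i in y_tr]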

For the word input, we map each word straight to its Thai2Fit index. Any word that is not in the dictionary is sent to the unknown word instead, because we only have vectors for the 55,677 words noted above. The labels work much like the words: each target is mapped through the NER dictionary we built.

def prepare_sequence_word(input_text):
    idxs = list()
    for word in input_text:
        if word in thai2dict:
            idxs.append(thai2dict_to_ix[word])
        else:
            idxs.append(thai2dict_to_ix["unknown"]) #Use UNK tag for unknown word
    return idxs

def prepare_sequence_target(input_label):
    idxs = [ner_to_ix[w] for w in input_label]
    return idxs

The character input works in a similar way: any character that is not in the dictionary, such as Chinese characters or emoji, becomes unknown.
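The per-word part of that character encoding can be sketched without try/except, which makes the three cases (known, unknown, padding) easier to see. The helper name `encode_word` and the toy dictionary are assumptions for illustration, not code from the notebook.

```python
def encode_word(word, char2idx, max_len_char):
    """Map a word to exactly max_len_char character indices."""
    ids = [char2idx.get(ch, char2idx["unknown"]) for ch in word[:max_len_char]]
    ids += [char2idx["pad"]] * (max_len_char - len(ids))
    return ids


# Toy dictionary for illustration only.
char2idx = {"pad": 0, "unknown": 1, "ก": 34, "า": 35}

print(encode_word("กา", char2idx, max_len_char=5))     # known characters, then padding
print(encode_word("กา😀", char2idx, max_len_char=5))   # the emoji maps to "unknown"
```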

OK, that's the end of dataset preparation. On to the model!

Bi-LSTM CRF (Word / Character Embedding) Architecture

This section explains the components of the deep learning model. I use Keras 2.1.6: I installed the latest version, ran into some problem and couldn't run anything, so I downgraded. I can't remember exactly what the problem was, haha.

Word-level Representation

The word input has size max_len (250 words).
It goes into an Embedding layer with input_dim = n_thai2dict (55,677 words) and output_dim = 400, matching the size of the Thai2Fit word embeddings. The weights are loaded from thai2fit_weight, and since I do not train them any further, the layer is set to trainable=False.

Character-level Representation

The character input is a character sequence, an array of size 250 × 30, which is turned into character embeddings. Each character starts out as an index (in effect a one-hot input), with input_dim = n_chars (399), and is mapped to a vector of size 32.

char_embedding_dim = 32
character_LSTM_unit = 32
lstm_recurrent_dropout = 0.5

With these settings, I turn each word's sequence of character embeddings into a single vector by feeding it into a Bi-LSTM. Reimers and Gurevych recommend dropout of 0.5 (variational dropout) over 0.1 (naive dropout) (Reference: https://arxiv.org/pdf/1707.06799.pdf). In my own experiments it really did work better, by about 0.3–0.5 in every experiment, in line with the hyperparameter tuning section of the paper. The end result is the character-level representation from the Bi-LSTM.

Once we have the word-level and character-level representations, we can use them as input to the main model by simply concatenating them. The result is a tensor of size (None, 250, 464): 400 from the word representation plus 64 from the Bi-LSTM character representation (32 + 32).
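The 464 is just bookkeeping on vector lengths. The sketch below walks through it for a single token using plain Python lists as placeholders; it only illustrates the sizes and does not call Keras.

```python
WORD_DIM = 400        # Thai2Fit vector size
CHAR_LSTM_UNITS = 32  # units per direction of the character LSTM

word_vec = [0.0] * WORD_DIM              # placeholder word embedding
char_forward = [0.0] * CHAR_LSTM_UNITS   # placeholder forward LSTM state
char_backward = [0.0] * CHAR_LSTM_UNITS  # placeholder backward LSTM state

# The bidirectional layer joins both directions;
# concatenation then appends the result to the word vector.
char_vec = char_forward + char_backward
token_vec = word_vec + char_vec

print(len(char_vec), len(token_vec))
```

Repeating this for each of the 250 positions in a sentence gives the (None, 250, 464) shape, where None is the batch size.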

Main Model: Bi-LSTM + CRF

This is a fairly standard deep learning model. First comes a Bi-LSTM, where I use variational dropout = 0.5, the same as in the character embedding. The dense layer uses activation = relu and the optimizer is Adam. The paper mentioned earlier recommends Nadam, but when I tried it there was not much difference.

After the final dense layer, the output goes into the CRF as the last step, and that's it. If you want to read about how Bi-LSTM and CRF work, you can search for it; I'll skip that for now or this will get too long, haha. Plenty of people have written about it already.

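from keras.models import Model
from keras.layers import (Input, Embedding, LSTM, Dense, TimeDistributed,
                          Bidirectional, SpatialDropout1D, concatenate)
from keras_contrib.layers import CRF  # assumed source of the CRF layer (keras-contrib)

# n_chars and main_lstm_unit are set elsewhere in the notebook (not shown in this post)

# Word Input
word_in = Input(shape=(max_len,), name='word_input_')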

# Word Embedding Using Thai2Fit
word_embeddings = Embedding(input_dim=n_thai2dict,
                            output_dim=400,
                            weights = [thai2fit_weight],input_length=max_len,
                            mask_zero=False,
                            name='word_embedding', trainable=False)(word_in)

# Character Input
char_in = Input(shape=(max_len, max_len_char,), name='char_input')

# Character Embedding
emb_char = TimeDistributed(Embedding(input_dim=n_chars, output_dim=char_embedding_dim, 
                           input_length=max_len_char, mask_zero=False))(char_in)

# Character Sequence to Vector via BiLSTM
char_enc = TimeDistributed(Bidirectional(LSTM(units=character_LSTM_unit, return_sequences=False, recurrent_dropout=lstm_recurrent_dropout)))(emb_char)


# Concatenate All Embedding
all_word_embeddings = concatenate([word_embeddings, char_enc])
all_word_embeddings = SpatialDropout1D(0.3)(all_word_embeddings)

# Main Model BiLSTM
main_lstm = Bidirectional(LSTM(units=main_lstm_unit, return_sequences=True, recurrent_dropout=lstm_recurrent_dropout))(all_word_embeddings)
main_lstm = TimeDistributed(Dense(50, activation="relu"))(main_lstm)

# CRF
crf = CRF(n_tag)  # CRF layer
out = crf(main_lstm)  # output

# Model
model = Model([word_in, char_in], out)

model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])

model.summary()
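For a rough feel of what the CRF layer adds at prediction time, here is a generic Viterbi decoding sketch in plain Python. It is a textbook illustration, not the code inside the CRF layer used above, and all scores are made up. The point is that the transition scores let the model pick the best whole tag sequence instead of the best tag at each position independently.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []                 # best previous tag for each tag, one list per step

    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_tags):
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backptr.append(pointers)

    # Walk back from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with tags 0 = O, 1 = B-PERSON, 2 = I-PERSON.
emissions = [
    [0.1, 2.0, 0.3],
    [0.2, 0.1, 0.4],   # only weakly prefers I-PERSON
    [2.0, 0.1, 0.2],
]
transitions = [
    [0.0, 0.0, -5.0],  # O -> I-PERSON is heavily penalised
    [0.0, -1.0, 1.0],  # B-PERSON -> I-PERSON is encouraged
    [0.0, -1.0, 0.5],
]

print(viterbi(emissions, transitions))
```

In the real model the emission scores come from the dense layer and the transition scores are learned by the CRF layer during training.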

Result

I ran 50 epochs in total and got the accuracy curves shown here. Red is training and blue is validation. I did not set aside a separate test set; the same split serves as both.

You can see that the curve actually plateaus from around epoch 15, so we could have stopped right about there, since the accuracy is unlikely to improve much beyond that (±1%).
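Stopping once the curve flattens can be automated with a patience rule. The sketch below shows the idea in plain Python; the function name and the accuracy values are made up for illustration and are not the numbers from this run. Keras offers the same idea as the EarlyStopping callback in keras.callbacks, configured with monitor and patience.

```python
def early_stop_epoch(val_scores, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_scores)


# Made-up validation accuracies, not the numbers from this experiment.
val_acc = [0.80, 0.88, 0.91, 0.915, 0.914, 0.915, 0.913, 0.915]
print(early_stop_epoch(val_acc, patience=3))
```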

OK, let's see how the results look. And the results are in!

                precision    recall  f1-score   support

        B-DATE     0.9558    0.9177    0.9364       401
       B-EMAIL     1.0000    1.0000    1.0000         1
         B-LAW     0.6667    0.7200    0.6923        25
         B-LEN     0.8077    0.9545    0.8750        22
    B-LOCATION     0.8891    0.8971    0.8931       894
       B-MONEY     0.9771    0.9624    0.9697       133
B-ORGANIZATION     0.8958    0.8966    0.8962      1064
     B-PERCENT     0.9429    0.9167    0.9296        36
      B-PERSON     0.9479    0.9353    0.9416       603
       B-PHONE     0.8333    0.9375    0.8824        16
        B-TIME     0.8316    0.8956    0.8624       182
         B-URL     1.0000    0.9545    0.9767        22
         B-ZIP     1.0000    1.0000    1.0000         6
        I-DATE     0.9811    0.9682    0.9746       913
       I-EMAIL     0.8889    1.0000    0.9412         8
         I-LAW     0.6125    0.7424    0.6712        66
         I-LEN     0.8545    0.9592    0.9038        49
    I-LOCATION     0.8940    0.8299    0.8608       935
       I-MONEY     0.9209    0.9831    0.9510       296
I-ORGANIZATION     0.8407    0.8071    0.8235      1327
     I-PERCENT     0.9245    0.9423    0.9333        52
      I-PERSON     0.9653    0.9761    0.9707      2220
       I-PHONE     1.0000    1.0000    1.0000        38
        I-TIME     0.9005    0.9515    0.9253       371
         I-URL     0.9839    0.9946    0.9892       368

     micro avg     0.9183    0.9149    0.9166     10048
     macro avg     0.9006    0.9257    0.9120     10048
  weighted avg     0.9185    0.9149    0.9164     10048
   samples avg     0.0299    0.0299    0.0299     10048

The micro-average F1 is 0.9166. Not bad! From what I have tried, a plain BiLSTM without the word/character embeddings usually gets around 80–85%. After tuning hyperparameters, this model lands at about 90–91.5%.
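As a quick sanity check, the micro-average F1 can be recomputed from the micro-average precision and recall in the report, since F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


micro_f1 = f1_score(0.9183, 0.9149)   # micro avg row of the report
print(round(micro_f1, 4))             # 0.9166
```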

Predict Sentence

To use the model, start with a piece of text. Tokenize it with newmm, as in the training pipeline, then build the word vector and character vector to feed into the model.

from pythainlp.tokenize import word_tokenize

text = "นายธนาธรเจอนางสาวยิ่งลักษ์ที่มหาวิทยาลัยจุฬา เช้าวันนี้"

predict_sent = word_tokenize(text,engine='newmm')
len_word = len(predict_sent)

predict_word = []
predict_word = [prepare_sequence_word(predict_sent)]
predict_word = pad_sequences(maxlen=max_len, sequences=predict_word, value=thai2dict_to_ix["pad"], padding='post', truncating='post')

predict_char = convert_word_to_char(predict_sent)

prepare_sequence_word is the same as above, but convert_word_to_char is slightly different because it takes a single sentence, so the format is not the same.

def convert_word_to_char(predict_word):
    predict_char = []
    sent_seq = []
    for i in range(max_len):
        word_seq = []
        for j in range(max_len_char):    
            try:
                if(predict_word[i][j] in char2idx):
                    word_seq.append(char2idx.get(predict_word[i][j]))
                else:
                    word_seq.append(char2idx.get("unknown"))
            except IndexError:  # past the end of the sentence or the word
                word_seq.append(char2idx.get("pad"))
        sent_seq.append(word_seq)
    predict_char.append(np.array(sent_seq))
    
    return predict_char

OK, with that in hand, let's feed it into the model.

result_tag = model.predict([predict_word,np.array(predict_char).reshape((len(predict_char),max_len, max_len_char))])
p = np.argmax(result_tag, axis=-1)
pred=[i for i in p[0]]
revert_pred=[ix_to_ner[i] for i in p[0]]
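Because the input was padded to max_len, the predicted tag list is max_len long as well, so only the first len_word tags belong to real tokens. A minimal sketch of lining them up follows; the helper name and the toy values are assumptions that stand in for predict_sent and revert_pred.

```python
def pair_tokens_with_tags(tokens, padded_tags):
    """Keep one tag per real token and drop the tags of padded positions."""
    return list(zip(tokens, padded_tags[:len(tokens)]))


# Hypothetical values standing in for predict_sent and revert_pred.
tokens = ["นาย", "ธนาธร", "เจอ"]
padded_tags = ["B-PERSON", "I-PERSON", "O", "pad", "pad"]

for token, tag in pair_tokens_with_tags(tokens, padded_tags):
    print(token, tag)
```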

The final result looks roughly like this:

Prediction Result

Yep, that works. It looks good, and we're done. Yay, it's pretty much like the example!

What to do NEXT?

You can take this further, for example with ELMo. I have actually done it, but unfortunately I cannot release the dataset used to train ELMo. If you want to know the method, you can read this:

State-of-the-art named entity recognition with residual LSTM and ELMo

This is the sixth post in my series about named entity recognition. If you haven't seen the last five, have a look now…

www.depends-on-the-definition.com

However, the ELMo model used there is English. You would need to build a Thai ELMo, generate the ELMo sentence vectors, and then concatenate them on top, just like in the example. In my case accuracy improved a fair amount, to 92–93%. If you have questions, feel free to message me privately and ask how.

As for BERT, I am not so sure it will work, because it uses word-piece tokens. If we want to use it, there is a multilingual sentence-piece version. If anyone has time, please give it a try for me, haha.

Another thing is helping to build datasets. This matters a lot for real-world use: you will probably need to build a custom dataset for each specific domain. You can help with tagging at the link above, and if you have a good NER dataset, please share it. Every bit helps, because there really ought to be a proper shared task for evaluation.

Reference

The biggest source of ideas for my NER work: https://www.depends-on-the-definition.com/lstm-with-char-embeddings-for-ner/
ThaiNER dataset and usage examples by Wannaphong Phatthiyaphaibun: https://github.com/wannaphongcom/thai-ner
Tisaroj et al., for the dataset and research results: http://www.arts.chula.ac.th/~ling/thesis/2553MA-Ling-Nutcha.pdf
Word Embedding by Charin: https://github.com/cstorm125/thai2fit
Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks by Nils Reimers and Iryna Gurevych: https://arxiv.org/pdf/1707.06799.pdf
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF by Xuezhe Ma and Eduard Hovy: https://arxiv.org/pdf/1603.01354v5.pdf
Neural Architectures for Named Entity Recognition by Guillaume Lample et al.: https://arxiv.org/pdf/1603.01360.pdf

Closing chat

Finally, thank you for reading to the end. I hope you learned something. This is my first post, so if anything is missing or wrong, or you have suggestions, leave a comment and let's talk. And if you build on this and get better results, please share them so I can use them too, haha.

I hope to find time to write on Medium in English as well, but for now I need to get back to finishing my thesis, or I'll never graduate Y_Y. See you again if I get the chance.


Written by Suphanut Thattinaphanich