Bilingual (Thai-English) Clinical Assertion and Negation Classification

Nov 4, 2023

Project Background
Dataset
Methodology
- Method 1: Rule-based
- Method 2: Machine Learning
  - Method 2.1: Naive Bayes Classifier
  - Method 2.2: Bidirectional Long Short-Term Memory (BiLSTM)
- Method 3: Prompt Engineering with LLM
Evaluation
Discussion
Conclusion and Future Directions
References/Related Links
- Libraries used in Jupyter Notebook

Project Background

Patient medical records are valuable for research and for Clinical Decision Support systems. However, most records are written as free text, which is difficult to process further.

Using NLP (Natural Language Processing) to convert free-text data into structured data (such as tabular data) makes downstream processing easier.
NER (Named-Entity Recognition) is an NLP task that can be used to detect medical concepts such as symptoms, diagnoses, procedures, treatments, and medications.

Medical records usually document both positive and negative findings. Running Assertion/Negation Classification after NER makes it possible to tell whether each detected concept is positive or negative.

Example: note that direct term lookup alone cannot distinguish positive findings from negative ones.
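The limitation of plain term lookup can be seen in a few lines of Python. This is only an illustrative sketch: the function name `keyword_hits` and the short note are made up for the example and are not part of the project's notebook. The search finds every term, but nothing in its output says whether the finding is affirmed or negated.

```python
def keyword_hits(text, terms):
    """Return every (term, start offset) found by plain substring search."""
    hits = []
    for term in terms:
        pos = text.find(term)
        while pos != -1:
            hits.append((term, pos))
            pos = text.find(term, pos + 1)
    # Sort by position so the hits follow reading order
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical mini-note: a rash is present, fever and jaundice are absent
note = "มีผื่นแดงที่แขน ไม่มีไข้ no jaundice"
hits = keyword_hits(note, ["ผื่นแดง", "ไข้", "jaundice"])
print(hits)  # all three terms are reported, with no positive/negative information
```

All three concepts come back the same way, even though only the first one is actually present in the patient.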

Example pipeline: free text is processed with NER to detect medical concepts, followed by Assertion/Negation Classification to determine whether each detected concept is positive or negative.

Systems already exist for English and other languages (e.g. Clinical Assertion / Negation Classification BERT, and Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods). For Thai, as far as we could find, no model or system addresses this task directly. That gap motivated this project, which explores several approaches to building such a system.

Dataset

This project uses simulated patient data that mimics the style of medical records. All of the data was created from scratch; no real medical records were used. Any names of people or organizations that appear in the data do not come from real sources.

The data is divided into 3 parts:

Human-generated and annotated data for training: Annotated data (272 concepts)
ChatGPT-generated data (with manual edits): Generated data (770 concepts)
Human-generated and annotated data for testing: Test data (50 concepts)

The data written by physicians is split into 2 parts: one for training, called Annotated data, and one for testing, called Test data. In both parts, the medical concepts appearing in the text were labeled, along with whether each concept is positive (present) or negative (absent).

Because the available data was fairly small, an additional simulated set was created. ChatGPT was used to generate simulated patient notes in English together with concept and positive/negative labels, and Google Translate was then used to translate them into Thai, yielding 770 concepts in total.

All data was converted to CSV files with 5 columns:

Text: the simulated patient note as free text
Concept: a medical term found in the Text above
Start and End: the character positions where that Concept starts and ends
Label: Positive or Negative, i.e. whether the concept is present or absent

Below is an example of the data for 1 concept: “ผื่นคัน” (itchy rash) occupies character positions 53 to 60 in the text (0-based, with the end position exclusive), and its label is Positive.

text,concept,start,end,label
"ชาย 24 ปี ปฏิเสธโรคประจำตัว ปฏิเสธประวัติแพ้ยา
มาด้วยผื่นคัน 2 สัปดาห์
2 สัปดาห์ก่อน มีผื่นแดงที่แขนข้างซ้ายบริเวณข้อพับ ไม่คัน ไม่มีไข้ ไปร้านยา ซื้อยากินเอง หลังจากนั้นอาการดีขึ้นบ้าง แต่ยังไม่หาย
1 สัปดาห์ก่อน ผื่นแดงลามมากขึ้น จากแขนซ้าย เป็นแขนทั้งสองข้าง หลังจากนั้นผื่นแดงขึ้นตามตัว และขา 2 ข้าง อาการคันเป็นมากขึ้น
อาการอื่นปกติ ไม่มีปวดท้อง ไม่มีคลื่นไส้อาเจียน
3 วันก่อน ผื่นคันมากขึ้น กินยาแล้วไม่ดีขึ้น วันนี้จึงมารพ.
ปฏิเสธประวัติแพ้อาหารหรือแพ้สารเคมีใดๆ

PE
V/S : T 37.2 C, P 82/min, RR 14/min, BP 110/80 mmHg
GA : Alert, not pale, no jaundice
RS : clear both lungs
Abd : soft, not tender
Skin : Generalized erythematous papule
NS: WNL

Imp: Rash
Plan supportive
advice ถ้าอาการยังไม่ดีขึ้นให้มาตรวจซ้ำ",ผื่นคัน,53,60,Positive
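To make the offset convention concrete, here is a minimal sketch that parses one row with Python's standard `csv` module and checks that slicing the text with Start and End gives back the Concept. The sample is a shortened copy of the row above (only the first two lines of the note are kept, which does not move the concept), and the helper names `load_rows` and `span_matches` are hypothetical.

```python
import csv
import io

# Shortened copy of the sample row: the quoted text field spans two lines
sample_csv = (
    "text,concept,start,end,label\n"
    '"ชาย 24 ปี ปฏิเสธโรคประจำตัว ปฏิเสธประวัติแพ้ยา\n'
    'มาด้วยผื่นคัน 2 สัปดาห์",ผื่นคัน,53,60,Positive\n'
)


def load_rows(csv_text):
    """Parse the CSV text and convert the offset columns to integers."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        rows.append(row)
    return rows


def span_matches(row):
    """True when text[start:end] is exactly the annotated concept."""
    return row["text"][row["start"]:row["end"]] == row["concept"]


rows = load_rows(sample_csv)
print(rows[0]["concept"], rows[0]["label"], span_matches(rows[0]))
```

A check like this is cheap to run over a whole file and catches offsets that drifted after manual edits or translation.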

Chart: proportion and count of Positive and Negative labels in each dataset

Methodology

Three approaches were tried:

Rule-based
Chosen first as a baseline, because it requires no training data.
Machine Learning
Saves effort, because the classification rules do not all have to be written by hand.
Prompt Engineering with Large Language Model (LLM)
Requires neither hand-written rules nor training data, because the model has already been trained for language understanding.

The code used in the Methodology section is collected in this Jupyter Notebook:

Thai-English_Clinical_Assertion_Negation_Classification

colab.research.google.com

Method 1: Rule-based

Tokenize the text with a word-based tokenizer.
Check conditions on the words near the concept to classify it as positive or negative.

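# Assumption: word_tokenize is PyThaiNLP's tokenizer, which keeps whitespace as tokens
from pythainlp.tokenize import word_tokenize

def rule_based_classifier_1(inputs):
  # inputs: (text, start offset of the concept, end offset of the concept)
  text, pos_start, pos_end = inputs
  # Only the tokens to the left of the concept are inspected
  left_tokens = word_tokenize(text[:pos_start])
  n_left_tokens = len(left_tokens)
  # English: 'no'/'not' sits two tokens back, because the space before the concept is its own token
  if n_left_tokens >= 2 and left_tokens[-2] in ['no', 'not'] :
    return 'Negative'
  # Thai: 'ไม่' directly precedes the concept (checked independently, not only when there is a single token)
  if n_left_tokens >= 1 and left_tokens[-1] in ['ไม่'] :
    return 'Negative'
  return 'Positive'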

An example rule-based classifier that uses the words preceding the concept: if ‘ไม่’, ‘no’, or ‘not’ appears just before the concept, it is classified as Negative; otherwise it is classified as Positive.

Left: the conditions defined. Right: examples of misclassified cases.

Testing on the Annotated data showed that some cases were still misclassified. For example, when another word sits between ‘ไม่’ (not) and the concept, or when a different negation word such as ‘ปฏิเสธ’ (denies) is used, the condition does not match and the concept is misclassified.

More conditions were then added to improve coverage.

Some cases are fairly hard to express as rules, such as the conjunctions หรือ (or) and และ (and), or a space after which the meaning still carries over from the preceding words.

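def rule_based_classifier_3(inputs):
  # inputs: (text, start offset of the concept, end offset of the concept)
  text, pos_start, pos_end = inputs
  # Only the tokens to the left of the concept are inspected
  left_tokens = word_tokenize(text[:pos_start])
  n_left_tokens = len(left_tokens)
  if n_left_tokens >= 2 :
    # Thai cue right before the concept, or any cue two tokens back
    if left_tokens[-1] in ['ไม่', 'ปฏิเสธ'] or left_tokens[-2] in ['ไม่', 'ปฏิเสธ', 'no', 'not'] :
      return 'Negative'
  elif n_left_tokens >= 1 :
    # Only one token to the left: check it against the Thai cues
    if left_tokens[-1] in ['ไม่', 'ปฏิเสธ'] :
      return 'Negative'
  return 'Positive'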

An example rule-based classifier that uses the words preceding the concept: if ‘ไม่’ or ‘ปฏิเสธ’ appears 1 or 2 words before the concept, or ‘no’ or ‘not’ appears before it, the concept is classified as Negative; otherwise it is classified as Positive.
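As the rule set grows, one option is to keep the cue words in a table instead of in nested conditions. The sketch below is not the notebook's code: `CUES_BY_OFFSET` and `classify_left_tokens` are hypothetical names, and the token lists are hand-written stand-ins for tokenizer output so that the example runs without a Thai tokenizer installed.

```python
# Negation cues keyed by how many tokens back from the concept they may appear.
# The values mirror the conditions described above.
CUES_BY_OFFSET = {
    1: {"ไม่", "ปฏิเสธ"},
    2: {"ไม่", "ปฏิเสธ", "no", "not"},
}


def classify_left_tokens(left_tokens, cues_by_offset=CUES_BY_OFFSET):
    """Classify a concept from the tokens that precede it."""
    for offset, cues in cues_by_offset.items():
        if len(left_tokens) >= offset and left_tokens[-offset] in cues:
            return "Negative"
    return "Positive"


# Hand-tokenized left contexts (assumed tokenizer output, whitespace kept as tokens)
print(classify_left_tokens(["ปฏิเสธ", "ประวัติ"]))   # context before แพ้ยา
print(classify_left_tokens(["Alert", ",", " ", "not", " "]))  # context before pale
print(classify_left_tokens(["มา", "ด้วย"]))          # context before ผื่นคัน
```

Adding a new cue or a wider window then only means editing the table, which keeps each experiment easy to compare.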

Testing rule_based_classifier_3 on the Test data gives:
Accuracy: 0.94
Precision: 0.945
Recall: 0.94
F1_score: 0.939
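In the results reported in this post, Recall equals Accuracy, which is what support-weighted averaging over the classes produces (comparable to the `average='weighted'` option in scikit-learn). The post does not state how the scores were computed, so the following is a sketch under that assumption, written in plain Python with a hypothetical function name and made-up labels.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_label = tp / predicted if predicted else 0.0
        r_label = tp / count
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = count / n  # each class counts in proportion to its support
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, only to exercise the function
y_true = ["Positive", "Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Positive", "Negative", "Negative", "Positive"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With this averaging, weighted recall is always identical to accuracy, so precision and F1 are the figures that carry extra information.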

Method 2: Machine Learning

Two sub-methods were tried:
2.1 Pure statistical model → Naive Bayes Classifier
2.2 Neural model → Bidirectional Long Short-Term Memory (BiLSTM),
using MetaCAT from the MedCAT library (https://github.com/CogStack/MedCAT)

Method 2.1: Naive Bayes Classifier

Tokenize the text with a word-based tokenizer.
Extract features from the resulting token lists.
Train a Naive Bayes Classifier model.

def concept_features(item):
  # item: (text, start, end, tokens left of the concept, tokens right of the concept)
  (text, start, end, left_tokens, right_tokens) = item

  features = {}

  # Up to 3 tokens on each side, nearest first; '' when the context is shorter
  features["first_left_context"] = left_tokens[-1] if len(left_tokens) > 0 else ''
  features["2nd_left_context"] = left_tokens[-2] if len(left_tokens) > 1 else ''
  features["3rd_left_context"] = left_tokens[-3] if len(left_tokens) > 2 else ''
  features["first_right_context"] = right_tokens[0] if len(right_tokens) > 0 else ''
  features["2nd_right_context"] = right_tokens[1] if len(right_tokens) > 1 else ''
  features["3rd_right_context"] = right_tokens[2] if len(right_tokens) > 2 else ''

  return features

Example feature extraction. The input has already been tokenized, with the tokens before and after the concept in separate lists.
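The step that prepares this input is not shown in the post, so here is a rough sketch of how it could look. NLTK's Naive Bayes trainer takes a list of (feature dict, label) pairs; everything else here is an assumption for illustration: the helper names, the shortened feature names, and the use of `str.split` as a stand-in tokenizer (a real run on Thai text would need a Thai word tokenizer).

```python
def make_item(text, start, end, tokenize=str.split):
    """Split the context around a concept into left and right token lists."""
    left_tokens = tokenize(text[:start])
    right_tokens = tokenize(text[end:])
    return (text, start, end, left_tokens, right_tokens)


def context_features(item, window=3):
    """Nearest `window` tokens on each side, '' when the context is shorter."""
    _, _, _, left_tokens, right_tokens = item
    features = {}
    for i in range(window):
        features[f"left_{i + 1}"] = left_tokens[-(i + 1)] if len(left_tokens) > i else ""
        features[f"right_{i + 1}"] = right_tokens[i] if len(right_tokens) > i else ""
    return features


text = "GA : Alert, not pale, no jaundice"
item = make_item(text, 25, 33)  # span of "jaundice"
feats = context_features(item)

# One labeled example in the (features, label) shape a trainer expects
train_set = [(feats, "Negative")]
print(train_set)
```

Repeating this for every annotated row gives the full training list.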

Models were trained on 2 datasets: Annotated data and Generated data.
The features are the 3 words before and the 3 words after the concept.

Train the classifier: classifier = nltk.NaiveBayesClassifier.train(train_set)

Show the most informative features: classifier.show_most_informative_features(5)

NaiveBayesClassifier model trained on Annotated data:

Most Informative Features
        2nd_left_context = 'ไม่'          Negati : Positi =     22.7 : 1.0
       3rd_right_context = 'ไม่'          Negati : Positi =     11.3 : 1.0
        3rd_left_context = 'ไม่'          Negati : Positi =      7.7 : 1.0
      first_left_context = 'มี'           Negati : Positi =      5.7 : 1.0
       3rd_right_context = 'มี'           Negati : Positi =      4.4 : 1.0

Note that the model has learned that the important feature is the presence of ‘ไม่’ (not) or ‘มี’ (have) near the concept.

For the NaiveBayesClassifier model trained on Generated data:

Most Informative Features
       3rd_right_context = 'Upon'        Negati : Positi =     35.2 : 1.0
       3rd_right_context = 'ประวัติศาสตร์'   Negati : Positi =     32.1 : 1.0
        2nd_left_context = 'ไข้'          Negati : Positi =     17.9 : 1.0
     first_right_context = 'และ'         Positi : Negati =      8.8 : 1.0
        3rd_left_context = 'ที่'           Negati : Positi =      7.7 : 1.0

Note that the learned features contain noise, presumably because the ChatGPT-generated data follows repetitive patterns.

Testing the model trained on Annotated data against the Test data gives:
Accuracy: 0.7
Precision: 0.739
Recall: 0.7
F1_score: 0.706

Method 2.2: Bidirectional Long Short-Term Memory (BiLSTM)

This method uses the model implemented in the MetaCAT class of the MedCAT library, which is built around torch.nn.LSTM (see https://github.com/CogStack/MedCAT/blob/e52bda3547dfa61c671727746058f67a21da3576/medcat/utils/meta_cat/models.py#L11C23-L11C23 for details).

The rough steps are:

Train a tokenizer on the Annotated data (using ByteLevelBPETokenizer).
Build word embeddings with the word2vec library (using vector_size=300).
Train the MetaCAT model with the default config (using nepochs=50).

For details on using the MedCAT library, see https://github.com/CogStack/MedCATtutorials/ (parts 4.1–4.3 cover MetaCAT).

Testing the resulting MetaCAT model on the Test data gives:
Accuracy: 0.78
Precision: 0.776
Recall: 0.78
F1_score: 0.775

Method 3: Prompt Engineering with LLM

Three LLMs that are free to use were tried:

ChatGPT GPT-3.5: https://chat.openai.com/
Google Bard: https://bard.google.com/
Claude: https://claude.ai/

Because LLMs operate on subword tokens, specifying character positions directly in the input did not work well. The input was therefore changed to text in which each concept is wrapped in <entity> tags, as in this example:

"ชาย 24 ปี ปฏิเสธโรคประจำตัว ปฏิเสธประวัติ<entity>แพ้ยา</entity>
มาด้วย<entity>ผื่นคัน</entity> 2 สัปดาห์
2 สัปดาห์ก่อน มี<entity>ผื่นแดง</entity>ที่แขนข้างซ้ายบริเวณข้อพับ ไม่<entity>คัน</entity> ไม่<entity>มีไข้</entity> ไปร้านยา ซื้อยากินเอง หลังจากนั้นอาการดีขึ้นบ้าง แต่ยังไม่หาย
1 สัปดาห์ก่อน <entity>ผื่นแดง</entity>ลามมากขึ้น จากแขนซ้าย เป็นแขนทั้งสองข้าง หลังจากนั้น<entity>ผื่นแดง</entity>ขึ้นตามตัว และขา 2 ข้าง <entity>อาการคัน</entity>เป็นมากขึ้น
อาการอื่นปกติ ไม่มี<entity>ปวดท้อง</entity> ไม่มี<entity>คลื่นไส้</entity><entity>อาเจียน</entity>
3 วันก่อน <entity>ผื่นคัน</entity>มากขึ้น กินยาแล้วไม่ดีขึ้น วันนี้จึงมารพ.
ปฏิเสธประวัติ<entity>แพ้อาหาร</entity>หรือ<entity>แพ้สารเคมีใดๆ</entity>

PE
V/S : T 37.2 C, P 82/min, RR 14/min, BP 110/80 mmHg
GA : Alert, not <entity>pale</entity>, no <entity>jaundice</entity>
RS : <entity>clear</entity> <entity>both lungs</entity>
Abd : <entity>soft</entity>, not <entity>tender</entity>
Skin : <entity>Generalized</entity> erythematous <entity>papule</entity>
NS: WNL

Imp: <entity>Rash</entity>
Plan supportive
advice ถ้าอาการยังไม่ดีขึ้นให้มาตรวจซ้ำ"
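One way to produce this tagged input from the Start and End offsets in the CSV is sketched below. The function name `tag_entities` is hypothetical, and the sketch assumes the spans do not overlap. Tags are inserted from the last span to the first so that earlier offsets are not shifted by text already inserted.

```python
def tag_entities(text, spans, tag="entity"):
    """Wrap each (start, end) span of `text` in <tag>...</tag>.

    Assumes spans are non-overlapping, 0-based, with `end` exclusive.
    """
    tagged = text
    # Process right to left so offsets of the remaining spans stay valid
    for start, end in sorted(spans, reverse=True):
        tagged = f"{tagged[:start]}<{tag}>{tagged[start:end]}</{tag}>{tagged[end:]}"
    return tagged


line = "GA : Alert, not pale, no jaundice"
tagged_line = tag_entities(line, [(16, 20), (25, 33)])
print(tagged_line)
# GA : Alert, not <entity>pale</entity>, no <entity>jaundice</entity>
```

The same call works on a whole note: collect all spans that belong to one text and tag them in a single pass.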

The prompt was tuned using ChatGPT as the primary model, resulting in the prompt below, which was then given as the task to all 3 LLMs.
The input text goes in place of [Input Text Here].

Your task is to classify the status of each concept as "Positive" or "Negative"; "Positive" if that concept is present or affirmed, and classify as "Negative" if that concept is absent or negated. The input is formatted in text with each concept enclosing with <entity> and ended with </entity> and output is the same text with changed of <entity>concept</entity> to <positive>positive concept</positive> or <negative>negative concept</negative> respectively to the class of concept Example: Input: "Example sentence has <entity>hypertension</entity>, but not <entity>diabetes</entity>" Should give Output: "Example sentence has <positive>hypertension</positive>, but not <negative>diabetes</negative>"
Please give the output of this task: Input: "[Input Text Here]"

The expected output (for the example input above) is:

"ชาย 24 ปี ปฏิเสธโรคประจำตัว ปฏิเสธประวัติ<negative>แพ้ยา</negative>
มาด้วย<positive>ผื่นคัน</positive> 2 สัปดาห์
2 สัปดาห์ก่อน มี<positive>ผื่นแดง</positive>ที่แขนข้างซ้ายบริเวณข้อพับ ไม่<negative>คัน</negative> 
ไม่<negative>มีไข้</negative> ไปร้านยา ซื้อยากินเอง หลังจากนั้นอาการดีขึ้นบ้าง แต่ยังไม่หาย
1 สัปดาห์ก่อน <positive>ผื่นแดง</positive>ลามมากขึ้น จากแขนซ้าย เป็นแขนทั้งสองข้าง หลังจากนั้น<positive>ผื่นแดง</positive>ขึ้นตามตัว และขา 2 ข้าง <positive>อาการคัน</positive>เป็นมากขึ้น
อาการอื่นปกติ ไม่มี<negative>ปวดท้อง</negative> ไม่มี<negative>คลื่นไส้</negative><negative>อาเจียน</negative>
3 วันก่อน <positive>ผื่นคัน</positive>มากขึ้น กินยาแล้วไม่ดีขึ้น วันนี้จึงมารพ.
ปฏิเสธประวัติ<negative>แพ้อาหาร</negative>หรือ<negative>แพ้สารเคมีใดๆ</negative>
PE
V/S : T 37.2 C, P 82/min, RR 14/min, BP 110/80 mmHg
GA : Alert, not <negative>pale</negative>, no <negative>jaundice</negative>
RS : <positive>clear</positive> <positive>both lungs</positive>
Abd : <positive>soft</positive>, not <negative>tender</negative>
Skin : <positive>Generalized</positive> erythematous <positive>papule</positive>
NS: WNL
Imp: <positive>Rash</positive>
Plan supportive
advice ถ้าอาการยังไม่ดีขึ้นให้มาตรวจซ้ำ"
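To score an LLM's answer, the tagged output has to be turned back into labels. The post does not show how this was done, so the following is only a sketch with hypothetical helper names: a regular expression collects each `<positive>` or `<negative>` span in order, and a count check flags answers where the model dropped or added a tag.

```python
import re

# Matches <positive>...</positive> or <negative>...</negative>, non-greedy
TAG_PATTERN = re.compile(r"<(positive|negative)>(.*?)</\1>", re.DOTALL)


def parse_llm_output(output):
    """Return [(concept, 'Positive' | 'Negative'), ...] in reading order."""
    return [(m.group(2), m.group(1).capitalize())
            for m in TAG_PATTERN.finditer(output)]


def same_entity_count(tagged_input, output):
    """True when every <entity> in the input received exactly one label."""
    return tagged_input.count("<entity>") == len(parse_llm_output(output))


prompt_input = ("Example sentence has <entity>hypertension</entity>, "
                "but not <entity>diabetes</entity>")
llm_output = ("Example sentence has <positive>hypertension</positive>, "
              "but not <negative>diabetes</negative>")
labels = parse_llm_output(llm_output)
print(labels, same_entity_count(prompt_input, llm_output))
```

Because the labels come back in reading order, they can be compared position by position with the gold labels for the same text.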

Results on the Test data

Evaluation

Table: comparison of the metrics of each method on the Test data. Underlined values are the highest for that metric.

Discussion

The training and test data are simulated data created for this project, so results on real medical records may differ.
The training data is fairly small, which may keep the Machine Learning models from performing as well as they could.

Conclusion and Future Directions

Result from testing on the simulated data: Prompt Engineering with LLM, using the Claude model.
The plan is to test on real medical records next.
The task may be extended to 3 classes: present, possible, absent.
A hybrid of Rule-based and Machine Learning may be used:
first use the Rule-based classifier to screen for Negative patterns,
then use a Machine Learning model or an LLM to classify the remaining cases.
An on-premise LLM may be tried to avoid data privacy issues.
Sentence segmentation may be tried to split the text into sentences, which could make classification easier.
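The hybrid idea mentioned above (rule-based screening for Negative first, then a learned model for the rest) could be wired together roughly as follows. This is a sketch of the control flow only: `make_hybrid` is a hypothetical helper, and `toy_rule` and `toy_model` are placeholders standing in for the real rule-based classifier and for a trained model or LLM call.

```python
import re


def make_hybrid(rule_fn, fallback_fn):
    """Build a classifier that trusts the rule only when it says 'Negative'."""
    def classify(inputs):
        if rule_fn(inputs) == "Negative":
            return "Negative"
        # Everything the rule did not flag goes to the second stage
        return fallback_fn(inputs)
    return classify


def toy_rule(inputs):
    # Placeholder rule stage: English cues only, right before the concept
    text, start, _end = inputs
    return "Negative" if re.search(r"\b(no|not)\s+$", text[:start]) else "Positive"


def toy_model(inputs):
    # Placeholder for a trained model or an LLM; always answers Positive here
    return "Positive"


hybrid = make_hybrid(toy_rule, toy_model)
first = hybrid(("GA : Alert, not pale, no jaundice", 25, 33))
second = hybrid(("Imp: Rash", 5, 9))
print(first, second)  # Negative Positive
```

Both stages take the same (text, start, end) input as the rule-based classifiers earlier in the post, so either stage can be swapped out without changing the other.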