WangChanGLM🐘 — The Multilingual Instruction-Following Model

We set out to improve instruction-following capabilities in Thai, our native language, and ended up discovering one of today’s most urgent and critical AI inequality issues: the shape of subword tokens.

IWCDWAFI
AIResearch.in.th
Apr 29, 2023

It is impossible not to get caught up in the ChatGPT-inspired instruction-following model bandwagon. The wheel was set into motion by Facebook’s release of LLaMA, a set of 7B, 13B, 33B, and 65B models pretrained on one trillion, mostly English, tokens, and supercharged by the fact that you can turn it into “ChatGPT” by simply finetuning on 50k instruction pairs, aka the Alpaca movement. This, together with the low-resource finetuning techniques we know and love such as LoRA, has given everyone with at least a few high-end, consumer-grade GPUs the chance to homebrew their own ChatGPTs. At PyThaiNLP and AIResearch.in.th, we were excited to get our piece of the action, but were abruptly reminded of the harsh reality that non-Latin, scriptio-continua languages (a fancy way of saying languages with no word and sentence boundaries) are not at all compatible with instruction followers finetuned from the LLaMA family and other large causal language models (>5B parameters; LLaMA, Pythia, GPT-J).

This is not an unusual problem for us per se. For example, back in 2021, we pretrained our own RoBERTa variant, WangchanBERTa, with special space tokens <_> to account for the fact that SentencePiece sometimes omits spaces. For languages that use spaces simply as word boundaries this would be fine, but not for Thai, which uses spaces as punctuation. This minor modification allowed us to outperform XLM-RoBERTa, a very strong multilingual baseline, on token classification tasks.
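As a rough illustration (not the exact WangchanBERTa preprocessing pipeline), the idea is to make every space explicit before the subword model sees the text, so the boundary signal cannot be silently dropped:

SPACE_TOKEN = "<_>"  # registered as a special token in the tokenizer vocab

def mark_spaces(text: str) -> str:
    # Thai uses spaces as phrase/sentence punctuation, so we keep them explicit
    return text.replace(" ", SPACE_TOKEN)

print(mark_spaces("ไปกินข้าวกัน เจอกันพรุ่งนี้"))
# -> ไปกินข้าวกัน<_>เจอกันพรุ่งนี้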

However, this time we cannot pretrain our own LLaMA given our current resource limitations. We scoured all the multilingual pretrained models and literature for any hint that could help us build an instruction-following model that works for Thai. We ended up finding a not-yet-ChatGPT-but-promising solution that works not only for Thai but potentially for 30 languages, using an old multilingual language model (first released at the end of 2021, ancient by today’s language-model research standards) and the concept of cross-lingual knowledge transfer. We demonstrate our models’ current capabilities and discuss what we could do to improve them. Down this rabbit hole, we also started questioning whether the capabilities to understand non-English languages already exist in these English-dominant models and we simply cannot retrieve them with our current methodology. We end by proposing some possible solutions going forward, not only for Thai but for all non-Latin, scriptio-continua languages.

Introducing WangChanGLM 🐘

WangChanGLM is a multilingual instruction-following model based on Facebook’s XGLM-7.5B, finetuned on open-source, commercially permissible datasets (LAION OIG chip2 and infill_dbpedia, Databricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3; about 400k examples) and released under CC-BY-SA 4.0. The models are trained to perform a subset of instruction-following tasks we found most relevant, namely reading comprehension, brainstorming, and creative writing. We provide the weights for a model finetuned on an English-only dataset (wangchanglm-7.5B-sft-en) and another checkpoint further finetuned on a Google-Translated Thai dataset (wangchanglm-7.5B-sft-enth). We perform Vicuna-style evaluation using both humans and ChatGPT (in our case, gpt-3.5-turbo since we are still on the waitlist for gpt-4) and observe some discrepancies between the two types of annotators. All training and evaluation code is shared under Apache 2.0 in our Github, with datasets and model weights on HuggingFace. In a similar manner to Dolly v2, we only use open-source, commercially permissive pretrained models and datasets, so our models are restricted neither by a non-commercial clause like models that use LLaMA as a base nor by a non-compete clause like models that use self-instruct datasets from ChatGPT. See our demo here.
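If you want to try the weights outside the demo, below is a minimal generation sketch with transformers; the hub id is an assumption written as pythainlp/wangchanglm-7.5B-sft-en, so please substitute the exact id from our HuggingFace model cards.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pythainlp/wangchanglm-7.5B-sft-en"  # assumed hub id; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The same <human>/<bot> template used throughout the showcase below
prompt = "<human>: อยากลดความอ้วนทำไง <bot>: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=5, no_repeat_ngram_size=2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))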

🌸 Cherry-picked Showcase

Here we present a cherry-picked set of examples where wangchanglm-7.5B-sft-en performs well (and not so well) in reading comprehension, brainstorming, and creative writing. The languages are Thai and Japanese, since these are the only two languages the authors can evaluate qualitatively. You can try replicating these answers in the Text Generation section of the demo (not the Chatbot).

Reading Comprehension

<human>: “สถาบันวิจัยปัญญาประดิษฐ์ประเทศไทย ได้ทำการเปิดชุดข้อมูลคู่ประโยคในภาษาอังกฤษ-ไทย จำนวนกว่า 1 ล้านคู่ประโยคสู่สาธารณะ โดยได้รับการสนับสนุนจาก SCB ภายใต้ชื่อ scb-mt-en-th-2020 ชุดข้อมูลคู่ประโยคนี้ ได้รวบรวมจากหลายข้อมูลแหล่งอาทิเช่น ประโยคจากบทสนทนา ข้อมูลจากเว็บไซต์ข่าวหรือองค์กรที่มีเนื้อหาในสองภาษา บทความวิกิพีเดีย และ เอกสารราชการ”
สรุปข้อความนี้ให้หน่อย
<bot>: สถาบันฯ ได้เปิดข้อมูลชุดนี้สู่สาธารณชนภายใต้การสนับสนุนจากธนาคารไทยพาณิชย์ โดยใช้ชื่อชุดว่า scbmt-2020

<context>: เอริค เทน ฮาก กุนซือ แมนเชสเตอร์ ยูไนเต็ด คิดว่าทีมของเขาไม่ได้มีฟอร์มที่ดีมากนัก ทั้งเสียประตูง่ายและขาดความเฉียบคม แต่โดยรวมยังรู้สึกว่าการบุกเสมอ ท็อตแน่ม ฮอตสเปอร์ คือผลสกอร์ที่ดี ภายหลังผ่านเข้าชิง เอฟเอ คัพ สำเร็จ “ปีศาจแดง” กลับมาลงเล่นบอลลีคเมื่อวันพฤหัสที่ผ่านมา บุกนำห่าง “ไก่เดือยทอง” ตั้งแต่หมดครึ่งแรก สุดท้ายโดนเจ้าบ้านไล่เจ๊า 2–2 ผลเสมอทำให้ ยูไนเต็ด ยังนำท็อปโฟร์ด้วยระยะห่าง 6 แต้มเท่าเดิม แถมลงเล่นน้อยกว่าคู่แข่งทีมอื่น และหลังจบเกม เทน ฮาก ให้สัมภาษณ์เผยความรู้สึก “แน่นอนว่าเมื่อนำ 2–0 แล้วจบด้วยผลเสมอทำให้รู้สึกผิดหวังเล็กน้อย แต่สำหรับสัปดาห์นี้ ผมคิดว่าสกอร์ 2–2 เป็นผลเสมอที่ดี หลังจากโดนตีไข่แตก เรามีโอกาสหลายครั้งเพื่อทิ้งห่าง 3–1 แต่เราทำไม่ได้ สุดท้ายเราเสียประตูเพิ่ม พูดตามตรงคือ ช่วงที่เราพยายามยิงประตูที่สอง สเปอร์ส ก็มีโอกาสตีเสมอเราได้เช่นกัน”
<human>: พาดหัวข่าวนี้ให้หน่อย
<bot>: แมนฯ ยูไนเต็ด บ่นฟอร์มการเล่น แต่ยังพอใจผลการแข่งขัน

(To generate Japanese, don’t forget to allow the demo to generate non-Thai characters in the Colab notebook.)

<context>: 欧州連合(EU)の行政府である欧州委員会のベステアー上級副委員長が28日夜、朝日新聞のインタビューに応じた。「ChatGPT(チャットGPT)」など生成AI(人工知能)が作り出した文章や画像に表示を義務づける考えを明らかにした。EUの理事会と議会で審議中のAI規制法案に盛り込む方針で、年内の合意をめざす。生成AIは、インターネット上の膨大な文章や画像イメージを「学習」したAIが、人間の要求に応じて画像や文章を作り出す技術。昨年11月に米新興企業オープンAIが公開したチャットGPTの利用が世界で急速に広がるなか、偽情報の拡散やプライバシー侵害への懸念が高まっている。
<human>: この記事の見出しを書いてください。
<bot>: 人工AIの規模が拡大するなかで生じる問題を解決するため、欧州市議会の議員が規則を制定する考えだ。

<context>: 山深くに棲み、黒くて長い毛をした猿のような、日本に広く伝わる妖怪。和歌山県ではヒヒのようだともいう。覚には人の心を読み取る能力があるらしく、覚を捕らえてやろうと考えると、それを察して逃げ出してしまう。熊野では山で亡くなった死者の霊魂がさまよえば、大人であればダルになり、子どもであればコダマになるといわれ、コダマが少年少女の姿で現れたものをサトリと呼ぶ。やはりこちらのサトリも人の心を読み、心の中で思っていることを言ったり、人が言おうとしていることを先に言ったりして惑わすらしい。
<human>: この妖怪は何の能力を持っている?
<bot>: 覚は人の気持ちを読んで、人の言動を予測する能力を持つ。

Brainstorming

<human>: อยากลดความอ้วนทำไง
<bot>: อันดับแรกคุณต้องตั้งเป้าหมายในการลดน้ําหนักให้ได้ และต้องมีความมุ่งมั่นในการลดไขมันในร่างกายให้ได้ตามเป้าหมายที่ตั้งไว้ การออกกําลังกายอย่างสม่ำเสมอจะช่วยเพิ่มอัตราการเผาผลาญพลังงานในร่างกาย และช่วยลดปริมาณไขมันที่สะสมในร่างกายได้ นอกจากนี้การรับประทานอาหารที่มีประโยชน์ต่อร่างกายอย่างเพียงพอ จะช่วยเสริมสร้างกล้ามเนื้อให้แข็งแรง และช่วยให้การลดน้ําหนักมีประสิทธิภาพมากขึ้น

<human>: ทำยังไงถึงจะเล่นหุ้นให้ได้กำไร
<bot>: การเล่นหุ้นให้ประสบความสําเร็จนั้น คุณต้องศึกษาข้อมูลให้ดีก่อนการลงทุน วางแผนการลงทุนอย่างรอบคอบ ติดตามข่าวสารการลงทุน และใช้เครื่องมือวิเคราะห์หุ้นอย่างเหมาะสม ที่สําคัญคือต้องรู้จักอดทนและมีความมุ่งมั่นในการลงทุน

(To generate Japanese, don’t forget to allow the demo to generate non-Thai characters in the Colab notebook.)

<human>: リア充になるにはどうすればいいのですか?
<bot>: それはあなた次第です. もしリアル友達が欲しいなら、積極的にコミュニケーションを取って、一緒に遊びましょう.

<human>: ネットでお買い物するならどのサイトが一番いいの?
<bot>: ネットショッピングはAmazonがおすすめです。Amazonは日本最大級のネットショップで、商品数も豊富で価格も安く、また送料も無料なのでお得です。

Creative Writing

<human>: จงแต่งเรียงความเรื่องความฝันของคนรุ่นใหม่ต่อประเทศไทย
<bot>: ประเทศไทยในยุคปัจจุบัน เต็มไปด้วยความเปลี่ยนแปลงอย่างรวดเร็ว เทคโนโลยีที่ก้าวหน้าอย่างรวดเร็ว ทําให้คนรุ่นใหม่มีความคิดที่เปิดกว้างและกล้าที่จะคิดต่างจากคนรุ่นก่อนๆ พวกเขามีมุมมองที่กว้างไกลและมองโลกในแง่ดี กล้าคิด พูด และแสดงออกในสิ่งที่แตกต่างจากเดิม หลายคนมีความฝันที่จะสร้างสิ่งดีๆ ให้กับประเทศชาติ บางคนมีความมุ่งมั่นที่จะเปลี่ยนแปลงสังคมให้ดีขึ้น ในขณะที่บางคนก็มุ่งหวังที่จะประสบความสําเร็จในหน้าที่การงาน ทุกคนต่างมีเป้าหมายเดียวกันคือ ต้องการเห็นประเทศไทยเจริญก้าวไกล พัฒนาไปอย่างมั่นคงและยั่งยืน

<human>: แต่งกลอนแปดวันแม่
<bot>: กล่อมลูกนอนหลับ ยามค่ำคืน ให้ลูกนอนฝันดี นอนหนุนหมอนใบโต อุ่นไอรักจากแม่ คอยโอบกอดลูกไว้ รักลูกเสมอไม่เปลี่ยนไป สุขสันต์วันเกิดแม่

(To generate Japanese, don’t forget to allow the demo to generate non-Thai characters in the Colab notebook.)

<human>: 鳥山明への感謝の手紙を書いて
<bot>: 僕は鳥さんを尊敬しています。鳥さんは僕の人生を変えました。僕が鳥さんの漫画を読んでから、僕はこの世界に興味を持つようになりました。そして僕も鳥さんと共に、この世界をより良くする為に努力していきたいです。

<human>: 仕事で日々疲れ果てたサラリーマンに励ましの言葉を長文で書いてください。
<bot>: 僕は、君が仕事に疲れていて、心身ともに疲れていることを知っている。君は僕の友達で、僕も君の友人だ。僕が君に伝えたいことは、仕事は疲れるけど、疲れたからといって、仕事を辞めるべきではないということだ。「仕事」という言葉は、「仕事=仕事の内容」ではなく、「仕事をする=自分の人生を生きる」ということを意味する。仕事が辛い、辛すぎる、というときは、その辛さを「仕事という仕事から解放される」ために、自分を変えよう。そうすれば、きっと、少しずつ、自分の心が楽になってくるはずだ!

Limitations

We noticed limitations similar to other finetuned instruction followers, such as math problems, reasoning, and factual accuracy. Even though the models do not perform at a level where we expect them to be abused, they do contain undesirable biases and toxicity and should be further vetted and optimized for your particular use cases.

🐐 Evaluating Language Models, the Vicuna Style

We attempted to quantitatively evaluate model performance at scale using the methodology proposed by Vicuna. At first, we explored using the human-translated Vicuna test set; however, we found that a substantial portion of it contains examples that do not represent our intended use cases, such as math and coding problems. Therefore, we created our benchmark from OpenAssistant OASST1, a human-generated, human-annotated, assistant-style conversation corpus. From this dataset, we selected 245 prompts in Thai (166), Japanese (42), and Vietnamese (37). For each prompt, we pick the answer with the highest net +1 votes as our human answer. We intentionally picked mid-to-low-resource languages with non-Latin scripts that have a sizeable number of examples in OASST1; this criterion is driven by our objective to close the gap left by the LLaMA family’s difficulty in decoding subwords in these languages. We report pairwise evaluation results for all models: human, gpt-3.5-turbo, wangchanglm-7.5B-sft-en, wangchanglm-7.5B-sft-enth, and another open-source model, openthaigpt.
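The selection logic looks roughly like the sketch below, assuming the OASST1 message tree has already been flattened into a dataframe; the column names (message_id, parent_id, role, lang, text, net_votes) are illustrative, not the exact released schema.

import pandas as pd

df = pd.read_parquet("oasst1_flat.parquet")  # hypothetical pre-processed dump of OASST1

prompts = df[(df.role == "prompter") & (df.lang.isin(["th", "ja", "vi"]))]
answers = df[df.role == "assistant"]

# For each prompt, keep the reply with the highest net +1 votes as the human answer
best_answers = (answers.sort_values("net_votes", ascending=False)
                       .drop_duplicates("parent_id"))
eval_set = prompts.merge(best_answers, left_on="message_id", right_on="parent_id",
                         suffixes=("_prompt", "_answer"))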

Head-to-head comparison using `gpt-3.5-turbo` as an evaluator

To say that our performance is not comparable to ChatGPT is perhaps the Understatement of The Year 2023™️. Our best performance against gpt-3.5-turbo is a 26% win rate in Japanese by wangchanglm-7.5B-sft-en. This incidentally highlights the fact that comparing average scores directly, such as the “90% of ChatGPT Quality” claim in the Vicuna blog, might not be a robust approach, since by that standard our models are also performing at around 80–85% of gpt-3.5-turbo. We as humans know this is not the case, although one could argue that GPT-4, used by Vicuna, might be a more nuanced scorer.
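A toy calculation (with made-up score pairs, not our actual results) shows how the two summaries can diverge: the average-score ratio looks flattering while the head-to-head win rate tells a different story.

# Hypothetical (ours, gpt-3.5-turbo) score pairs on the 0-1 scale used by the evaluator
pairs = [(0.7, 0.9), (0.6, 0.8), (0.8, 0.7), (0.5, 0.9), (0.7, 0.8)]

ours = sum(a for a, _ in pairs) / len(pairs)
theirs = sum(b for _, b in pairs) / len(pairs)
win_rate = sum(a > b for a, b in pairs) / len(pairs)

print(f"relative average score: {ours / theirs:.0%}")  # ~80% "of ChatGPT quality"
print(f"head-to-head win rate:  {win_rate:.0%}")       # but only 20% of wins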

More encouraging is the fact that our models score much better when compared to human answers, especially in Thai, where our win rates are around 60%. This suggests two important points. First, the fact that wangchanglm-7.5B-sft-en performs well even though it is trained only on an English dataset could indicate cross-lingual knowledge transfer for instruction following. Second, after further finetuning on the same dataset automatically translated to Thai, the relative performance in Thai against both human answers and gpt-3.5-turbo remains practically the same, whereas performance in the other two languages deteriorates. This is especially pronounced when comparing wangchanglm-7.5B-sft-enth with gpt-3.5-turbo. It could be that finetuning on the dominant language improves performance across all languages, but further finetuning on a minor language has diminishing returns, both for that language and for the other languages in the model.

🧍⚔️👾 Model Preference Does Not Necessarily Translate to Human Preference

We noticed that gpt-3.5-turbo very rarely gives equal scores to a pair of answers. This does not seem like an intuitive way a human would rate these answer pairs. Therefore, we investigated whether models and humans score these answer pairs in the same manner. We chose to compare wangchanglm-7.5B-sft-en and openthaigpt, since Thai is the language where we can source human annotators most easily. We modified and translated the Vicuna evaluation prompt to Thai. To save effort, and since we learned from automatic evaluation that win rates are more important than relative scores, the annotators were instructed to answer whether they 1) prefer A, 2) prefer B, or 3) are neutral about both answers.

We performed whole-set evaluation, where we gave the entire Thai portion (166 prompts and answers) to each annotator to rate. We gave 18 randomly selected annotators a spreadsheet with prompts and two answers labeled as A and B to score. The results across annotators are as follows.

On average, wangchanglm-7.5B-sft-en is preferred 57% of the time, openthaigpt is preferred 10% of the time, and 33% of the time annotators are neutral about the answers. This confirms our suspicion that gpt-3.5-turbo is “too decisive” and does not give neutral verdicts as often as it should (0.6% vs 32% by human annotators). That is, gpt-3.5-turbo as an evaluator tends to overstate the win rate of the less performant model; in this case, it gave openthaigpt 43% vs 10% by human annotators. This is bad news for us, since it means that our less-than-stellar performance against gpt-3.5-turbo is in fact even worse. At present, it seems that human annotators are still an essential part of instruction-follower evaluation.

Update 2023–05–22: openthaigpt released a finetuned LLaMA shortly after this publication. We also added a crowdsourced comparison for that version.

Additionally, we employed a crowdsourcing platform to evaluate the same 166 examples, to check whether a different style of human annotation affects the win rates. The platform randomly assigns the examples to its freelance annotators. Each example has 2 to 35 annotations done by any of the 41 annotators (5 annotators per question on average). We select the final annotation for each example as the choice with the most votes, defaulting to neutral if there is an equal number of preferences for A and B. The result aligns relatively well with the whole-set evaluation: 68% for wangchanglm-7.5B-sft-en, 23% neutral, and 9% for openthaigpt (vs 57%/33%/10%). You can also check qualitatively how we stack up against this baseline, if you understand Thai, using example prompts and answers from their demo notebook. All evaluation results are also published in our repository under data.
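The per-example aggregation rule for the crowdsourced annotations can be sketched as follows (a sketch, not the platform’s exact implementation):

from collections import Counter

def aggregate(votes):
    # votes is a list of "A", "B", or "neutral" from 2-35 annotators
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return "neutral"                  # equal preference for A and B defaults to neutral
    return counts.most_common(1)[0][0]    # otherwise, the choice with the most votes wins

print(aggregate(["A", "A", "B", "neutral", "A"]))  # -> A
print(aggregate(["A", "B", "neutral"]))            # -> neutral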

📚What We Learned and Where It Will Take Us

Through these sets of experiments, we have a few leads to pursue toward our ultimate goal of achieving instruction-following performance parity for non-Latin-script, scriptio-continua languages.

Observation #1: Instruction-Following Capabilities Are Stored within the Pretrained Models

As noted in Appendix B of the GPT-4 technical report, the capabilities of the pretrained model play a major role in determining downstream finetuning performance, whether self-supervised or RLHF. Coupled with the recent success of the Alpaca movement, one could speculate that most instruction-following capabilities already exist within these pretrained models, and instruction-finetuning on far fewer examples (50–500k) simply “unlocks” them. Although we do not yet understand how instruction-following capabilities emerge from causal language modeling or denoising (“filling in the blank” in its many flavors), it seems we can achieve instruction following by finetuning a sufficiently large architecture, on the order of 7B parameters and above, pretrained on a suitably large-scale dataset.

For mid-to-low-resource languages, this poses a daunting challenge: the volume of available training data is simply not large enough to accommodate the number of parameters required for an instruction-following model; we experimented with several rounds of finetuning XGLM-1.7B but could not achieve the same capabilities as our XGLM-7.5B version. According to the Chinchilla scaling law and similar heuristics, we need roughly 20–25 subword tokens per parameter. OSCAR (Open Super-large Crawled Aggregated coRpus), one of the largest multilingual pretraining datasets, contains 3.4 TB of English data compared to 181.2 GB in Japanese (19x smaller), 140.8 GB in Vietnamese (24x smaller), and 92.0 GB in Thai (37x smaller).

To give a concrete example, we trained an in-house Thai SentencePiece tokenizer with a 16k vocab and ran it on the Thai portion of OSCAR. This results in about 1B subword tokens; in other words, even with most of the data available on the Internet for Thai, we can only train a 50M-parameter data-optimal model. We would need to make the subword tokens 20x more granular for even a 1B-parameter model to be data-optimal, at which point the tokens would be close to character level and would not perform well at the decoding step. At present, there is not enough digitized text in mid-to-low-resource languages such as Thai to train the kind of capable instruction-following models we see in resource-rich languages.
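The back-of-the-envelope arithmetic behind these numbers, using the ~20 tokens-per-parameter rule of thumb:

TOKENS_PER_PARAM = 20        # Chinchilla-style rule of thumb cited above
thai_tokens = 1e9            # ~1B subwords from our 16k-vocab Thai tokenizer on OSCAR

data_optimal_params = thai_tokens / TOKENS_PER_PARAM
print(f"data-optimal size: {data_optimal_params / 1e6:.0f}M parameters")  # ~50M

# Conversely, a 1B-parameter model would want ~20B tokens,
# i.e. ~20x more (or 20x more granular) Thai subword tokens than we have.
print(f"tokens needed for 1B params: {1e9 * TOKENS_PER_PARAM / 1e9:.0f}B")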

Observation #2: Too-granular Subword Tokens Prevent Models with Good Language Understanding from Generating Good Results

When you try giving Thai instructions to high-performing instruction followers such as the LLaMA family, you might be very disappointed by the random gibberish they seem to return (try it yourself here).

Asking vicuna-13b, a generally very good instruction follower, to suggest ways to save on monthly expenses. It answers with gibberish about calculating Wi-Fi usage, online shopping, communication functions, and other services.

However, this does not mean that these models are not capable of understanding Thai. In fact, if you add Answer in English at the end of your instruction in Thai (or any non-latin-script language), most models will give a relevant answer in English.

Asking the same question but adding `Answer in English`. The model now answers perfectly.

While we do not know the definitive root cause of this behavior, we can hypothesize based on the shape of Thai subword tokens in these models. We compare XGLM, a model we believe has well-shaped subwords, with mGPT, which has more granular Thai subwords and does not exhibit cross-lingual knowledge transfer when finetuned with English datasets. We can see that, for the latter, most Thai subwords are quite short compared to the overall average. These more granular subwords could present difficulties during the decoding step, since the model needs to predict many more tokens, which generally means compounding error rates. Perhaps there exist decoding strategies that mitigate this so well that existing LLaMA variants can achieve high performance in non-Latin-script languages, but the solution does not seem trivial.
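One quick way to see the difference is to compare the average characters per token each tokenizer produces on the same Thai text; the snippet below is a rough probe, assuming the public facebook/xglm-7.5B and ai-forever/mGPT checkpoints on HuggingFace.

from transformers import AutoTokenizer

thai_text = "สถาบันวิจัยปัญญาประดิษฐ์ประเทศไทยเปิดชุดข้อมูลคู่ประโยคสู่สาธารณะ"

for name in ["facebook/xglm-7.5B", "ai-forever/mGPT"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(thai_text)
    print(f"{name}: {len(pieces)} tokens, "
          f"{len(thai_text) / len(pieces):.1f} characters per token on average")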

We cannot ascertain why it is a common choice among trainers of large causal language models to select very granular subwords for some languages. It could have been done to keep the vocab size manageable so that the embedding parameters (and overall model size) do not get out of hand. However, our work demonstrates that a well-crafted set of subword tokens like XGLM’s, albeit with the tradeoff of a larger vocab size (250k), can lead to more equitable performance across languages. At the time of writing, we imagine many well-resourced organizations are pretraining their own foundation models. We ask that they keep in mind that such a small step as tokenization can cause a wide, long-lasting rift between people whose native languages have word and sentence boundaries and those whose languages do not.
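To put the tradeoff in numbers, here is a rough calculation assuming XGLM-7.5B’s 250k vocab and a hidden size of 4096 (as reported for that model): the embedding table alone is sizeable but hardly prohibitive.

vocab_size = 250_000
hidden_size = 4_096            # XGLM-7.5B's reported model dimension
embedding_params = vocab_size * hidden_size
print(f"embedding table: {embedding_params / 1e9:.2f}B parameters "
      f"(~{embedding_params / 7.5e9:.0%} of a 7.5B-parameter model)")  # ~1.02B, ~14%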

Observation #3: Cross-lingual Knowledge Transfer is Extremely Useful for Mid-to-low-resource Languages

To overcome the resource limitation and the tokenization issue, we relied on a concept called cross-lingual knowledge transfer. We see the concept in action in various multilingual works, such as few-shot learning by XGLM and the GPT and T5 families. In our particular case, we were interested in how finetuning on the dominant language, English, can transfer some learned capabilities (reading comprehension, brainstorming, and creative writing) to the non-dominant languages. To test this idea, we selected a data-optimal pretrained model with a sizeable number of parameters, XGLM-7.5B. In accordance with Observation #1, we do not expect the model to perform as well as the LLaMA or Pythia family, which are trained on over 2x more data. Nevertheless, the XGLM family has an acceptable number of reasonably-shaped Thai subword tokens (4k out of a 250k vocab; 5 characters per token) as opposed to the mostly character-level subwords in other models. This allows us to circumvent the decoding problems in Observation #2. XGLM also has excellent embedding alignment across its 30 languages, thanks to a careful pretraining process that includes upsampling non-dominant languages during training.

We first performed supervised finetuning (SFT) with LoRA on an English-only dataset derived from LAION OIG-chip2, Databricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3. The datasets were selected with reference to the various Alpaca-LoRA models out there, including but not limited to Alpaca, GPT4All, OpenAssistant, Koala, Vicuna, and Dolly. The training scripts were adapted from lvwerra/trl and HuggingFace. With finetuning on just the dominant language (English), the model already shows promising instruction-following capabilities in Thai and other languages.
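For orientation, a condensed sketch of this kind of LoRA SFT setup with transformers and peft is shown below; the hyperparameters and dataset plumbing are illustrative, so refer to the training scripts in our Github for the exact configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "facebook/xglm-7.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"],   # XGLM attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the low-rank adapters are updated

def format_example(example):
    # Instruction pairs are rendered into the same <human>/<bot> template used at inference
    return f"<human>: {example['prompt']} <bot>: {example['response']}"

# train_dataset: tokenized, formatted instruction pairs built from the datasets above
# trainer = Trainer(model=model, train_dataset=train_dataset,
#                   args=TrainingArguments(output_dir="sft-en", per_device_train_batch_size=4))
# trainer.train()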

Examples from the evaluation set data/oasst1_wangchang_sft_en_answer.csv:

<human>: ไม่มีเพื่อนเลย เหงามาก ทำอย่างไรดี
<bot>: หากคุณรู้สึกเหงา ลองหากิจกรรมที่ชอบทํา เช่น อ่านหนังสือ, ฟังเพลง, เล่นกีฬา, ทําอาหาร, หรือออกไปท่องเที่ยวกับครอบครัวหรือเพื่อนฝูง จะช่วยให้คุณรู้สึกผ่อนคลายและพบปะผู้คนใหม่ๆ ได้

<human>: 雨が降る仕組みを教えて
<bot>: 降り注ぐ雨は、空気中の水蒸気(水滴)が気化して地上に落ちる現象です。このとき、雨粒の大きさは気温や湿度、風向きなどの条件によって変化します。

Examples where wangchanglm-7.5B-sft-en switches to English halfway:

<human>: โลกเราหมุน 23 ชม 56 นาที 4 วินาที แต่ทำไมเวลาไม่เคลื่อน
<bot>: จริงๆแล้วเวลาเคลื่อนที่เร็วกว่านั้นนิดหน่อย ประมาณ 1/100th of a second per day, but that’s still a lot faster than the speed of light, which is about 300,000 miles per second.

<human>: 私は猫ちゃんが好きなんですけど貴方も猫ちゃんが好きですか?
<bot>: はい, I’m a robot. I don’t know if you’re a human or a cat. But I can tell you that I love cats. They are so cute and cuddly.

We still find instances where the model starts off generating Thai tokens but reverts to English midway while conveying a relevant answer. We hypothesize that this originates from misalignment in the cross-lingual embeddings (Thai tokens vs English tokens), so we tried to alleviate the issue by further finetuning the model on the dataset translated with the Google Translate API. However, as seen in the evaluation above, this did not substantially improve performance for Thai and also deteriorated performance for other languages. This could be because finetuning on a minor language does not have enough impact, or because the quality of the translated dataset is not high enough to be useful.
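For completeness, the translation step can be sketched with the Google Cloud Translation client library as below (authentication setup omitted; this is an illustration, not our exact script).

from google.cloud import translate_v2 as translate

client = translate.Client()

def translate_pair(example):
    # Translate both sides of an instruction pair to Thai
    for key in ("prompt", "response"):
        result = client.translate(example[key], target_language="th")
        example[f"{key}_th"] = result["translatedText"]
    return example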

🤔 Decisions, Decisions

So far, we have learned that large language models pretrained with enough data contain the capabilities to become instruction followers. Granular subword tokens do not hinder language understanding, but they may limit a model’s ability to generate text in languages whose subwords are too granular. On the other hand, if the subword tokens are well-shaped, we can observe cross-lingual knowledge transfer in a similar manner to other zero-shot tasks. Since our goal is to create an instruction follower that is on par with the LLaMA/Alpaca family, our choices are:

  1. Pretrain our own language model with careful subword tokenization. This is the most straightforward choice and also the most resource-intensive one.
  2. Further pretrain an existing language model with well-shaped subwords, hoping that it eventually reaches the same level of capabilities as the LLaMA family; in our case, this would likely mean further pretraining XGLM. This is a relatively less resource-intensive choice, but it comes with the complication of how to continue pretraining without disturbing the existing weights too much. It also hinges on the strong assumption that XGLM can eventually be on par with LLaMA with further pretraining.
  3. Revamp the LLaMA subword tokens. This is the least resource-intensive choice but the riskiest one in terms of performance. There has been some reported success in replacing the subword tokens of a pretrained language model, but we are mostly still in the dark on how well it would work for non-Latin scripts (see the sketch after this list).
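A minimal sketch of what option 3 involves mechanically, using the standard HuggingFace APIs for extending a vocabulary; the hub id and token list are placeholders, and the hard part, re-learning the new embeddings without disturbing the rest of the model, is not shown.

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "a-llama-7b-checkpoint"                 # placeholder hub id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# In practice: thousands of well-shaped Thai subwords from a Thai SentencePiece model
new_thai_tokens = ["ประเทศไทย", "ข้อมูล", "ประโยค"]
num_added = tokenizer.add_tokens(new_thai_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
print(f"added {num_added} tokens; their embeddings must be trained before they are useful")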

PyThaiNLP and AIResearch.in.th will further investigate these directions, and we hope to deliver a model on par with the Alpaca family by the end of 2023.

Environmental Impact

Experiments were conducted using a private infrastructure with a carbon efficiency of 0.432 kgCO2eq/kWh. A cumulative 500 hours of computation was performed on Tesla V100-SXM2-32GB hardware (TDP of 300W). Total emissions are estimated at 64.8 kgCO2eq, of which 0 percent was directly offset. Estimations were conducted using the Machine Learning Impact calculator.
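The estimate follows the calculator’s formula: GPU hours x TDP (kW) x carbon efficiency (kgCO2eq/kWh).

gpu_hours = 500
tdp_kw = 0.300               # Tesla V100-SXM2-32GB
carbon_efficiency = 0.432    # kgCO2eq/kWh for our private infrastructure
print(f"{gpu_hours * tdp_kw * carbon_efficiency:.1f} kgCO2eq")  # 64.8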

Acknowledgements

We would like to thank Hugging Face for the open-source infrastructure and ecosystem they have built, especially lvwerra of the trl repository. We give our appreciation to the open-source finetuning pioneers that came before us, including but not limited to Alpaca, Alpaca-LoRA, GPT4All, OpenAssistant, Koala, Vicuna, and Dolly.

Citation

@software{charin_polpanumas_2023_7878101,
  author    = {Charin Polpanumas and
               Wannaphong Phatthiyaphaibun and
               Patomporn Payoungkhamdee and
               Peerat Limkonchotiwat and
               Lalita Lowphansirikul and
               Can Udomcharoenchaikit and
               Titipat Achakulwisut and
               Ekapol Chuangsuwanich and
               Sarana Nutanong},
  title     = {{WangChanGLM🐘 — The Multilingual Instruction-Following Model}},
  month     = apr,
  year      = 2023,
  publisher = {Zenodo},
  version   = {v0.1},
  doi       = {10.5281/zenodo.7878101},
  url       = {https://doi.org/10.5281/zenodo.7878101}
}

Appendix

1. Generation Hyperparameters for Our Models

See script/generate_huggingface_answer.py.

def infer_answer(prompt):
    # model and tokenizer are assumed to be loaded earlier in the script
    input_text = f"<human>: {prompt} <bot>: "
    batch = tokenizer(input_text, return_tensors='pt')
    output_tokens = model.generate(
        **batch,
        no_repeat_ngram_size=2,
        num_beams=5,
        min_length=batch['input_ids'].shape[1] + 64,  # at least 64 tokens beyond the prompt
        max_new_tokens=512,
    )
    return tokenizer.decode(output_tokens[0], skip_special_tokens=True)

2. Generation Hyperparameters for gpt-3.5-turbo

See script/generate_openai_answer.py.

def get_openai_answer(prompt, model_name=script_args.model_name):
    # script_args and the OpenAI API key are set up earlier in the script
    response = openai.ChatCompletion.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response['choices'][0]['message']['content']

3. Generation Hyperparameters for openthaigpt shared by the owner @kobkrit

See script/generate_opengpt_answer.py.

answer = openthaigpt.generate(
    instruction=row['prompt'],
    input="",
    model_name="kobkrit/openthaigpt-0.1.0-alpha",
    min_length=50,
    max_length=768,
    top_k=20,
    num_beams=5,
    no_repeat_ngram_size=20,
    temperature=1,
    early_stopping=True,
)

4. Instruction to gpt-3.5-turbo for Evaluation

See script/eval_vicuna_style.py. The prompt is adapted from Vicuna.

{
    "prompt_id": 1,
    "system_prompt": "You are a helpful and precise assistant for checking the quality of the answer.",
    "prompt_template": "[Question]\n{question}\n\n[The Start of Assistant 1's Answer]\n{answer_1}\n\n[The End of Assistant 1's Answer]\n\n[The Start of Assistant 2's Answer]\n{answer_2}\n\n[The End of Assistant 2's Answer]\n\n[System]\n{prompt}\n\n",
    "defaults": {
        "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 0 to 1, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."
    },
    "description": "Prompt for general questions",
    "category": "general"
}

5. Instruction to Annotators for Whole-Set Evaluation

ในฐานะผู้ใช้ภาษาไทย เราอยากให้คุณประเมินคำตอบของผู้ช่วยปัญญาประดิษฐ์ 2 อันที่ตอบคำสั่ง 166 ตัวอย่างว่า A ทำได้ดีกว่า, B ทำได้ดีกว่า, หรือทั้งสองอันทำได้ดี/แย่พอๆกัน เกณฑ์ในการประเมินคือ 1) ความมีประโยชน์ 2) ความเกี่ยวข้องกับคำสั่ง 3) ความถูกต้อง และ 4) รายละเอียดของคำตอบ

(English translation: As a Thai speaker, we would like you to evaluate the answers of two AI assistants to 166 example instructions: whether A does better, B does better, or both do equally well/poorly. The evaluation criteria are 1) helpfulness, 2) relevance to the instruction, 3) accuracy, and 4) level of detail of the answer.)

6. XGLM Pretraining Dataset

Thai is heavily upsampled in the pretraining set of XGLM. This could be one of the reasons our models perform better in Thai compared to Japanese (which was downsampled) and Vietnamese (which has a substantially smaller volume).

7. Pronunciation

WangChanGLM is pronounced either as Wang-Chan-G-L-M (วังจันทร์จีแอลเอ็ม; meaning the GLM of Wangchan, the location of VISTEC, which is also where our GPU facility resides) or Wang-Chang-L-M (วังช้างแอลเอ็ม; meaning Elephant Palace LM, as Wangchan is famous for its elephants).
