Thai Spelling Correction: a Gentle Intro and a Quick Recipe! 🥘

Anuruth Lertpiya (BobbyL2k)
Published in KBTG Life · 7 min read · Oct 19, 2023

Fresh from the oven of our latest research, “How to Progressively Build Thai Spelling Correction Systems?” [1], we’re here to give you a glimpse into the world of Thai spelling correction. Let’s dive into a simple approach that’s straightforward, yet remarkably effective.

The Kitchen: Gentle Introduction 🍽️

First off, let’s demystify what Thai automatic spelling correction is all about. Picture it as a safety net for your writing — it spots those tiny blunders in spelling and gives them a little nudge in the right direction. We aren’t aiming to rewrite your novel here; we just want to polish those rough edges. And doing so automatically, hands free.

Example of Thai text “This restaurant is very delicious” being corrected

From our research [2], we gathered that Thai spelling correction can be a tricky business. The essence of spelling correction is generating sequences, so the space of possible answers is massive (the technical among us would call it exponential). And here comes the catch: mispredictions can make things go awry quickly. Since most text contains only a few errors to begin with, any error the system introduces is a step backward.

Enter dictionaries, our trusty old tools for dictionary-based spelling correction. Reliable as they are, they can be a tad outdated. They might not recognize the latest slang, buzzwords, or specific names, leading to two main hiccups:

  1. They might wrongly label some words as errors because they’re not in the repertoire.
  2. If they spot an error, but the correct word isn’t in their database, they replace the mistake with another.

And if the dictionary tokenizer isn’t adept at catching mistakes? Well, it just compounds the problem.

But here’s the silver lining: we’ve found a neat workaround for these challenges. And the best part? It’s simpler than you’d imagine! 🌟

The Secret Recipe: Minimal XNCC 🍲

You know when you’re whipping up a delicious dish and realize that a simple, overlooked ingredient can elevate it to the next level? In the world of Thai spelling correction, it’s like we’ve been overlooking a fabulous ingredient that’s been on our kitchen counter all along!

Think about it: we’re in the business of fixing misspellings, right? Why not learn directly from those misspellings? It’s like adding a dash of that secret ingredient to your favorite dish. It’s been there all along, just waiting to be noticed!

So, this is our special little twist. Instead of leaning on the usual heuristics (edit distances, phonetic representations, etc.), we propose a fresh approach: gleaning insights directly from the data. Think of it as the minimal version of our Extendable Neural Contextual Corrector (XNCC). And if you’re curious about how it stacks up against the full version of XNCC [1], take a peek at Figure 1.

Figure 1: The Full Version of XNCC vs. the Minimal Version in This Article

Hold onto your apron strings because there’s more! We can’t just present a method without testing it, so let’s look at the data: the Thai UGWC public split [3]. Got a thing for details? We’ve got you covered.

Table 1: Overview of the Thai UGWC Public Split [3]

First up on our cooking adventure, we sift through our training data to pick up on the trends: spotting the tokens that hit the mark and those that veer off-course.

Getting Our Hands Dirty: The Code 🥣

Time to roll up our sleeves and delve into the actual recipe! Our first order of business is mixing together our ingredients, ensuring our code captures the essence of that overlooked “secret ingredient” — the misspellings.

```Python
# Time to gather insights from our treasure trove: the data!
from collections import defaultdict, Counter

cor_map = defaultdict(Counter)

for _, src_tokens, cors in data_split:
    # Sneak peek of our data from `data_split`
    # src_tokens = ['ช่วย', 'ดู', 'ที', 'ครับ', 'จ่าย', 'ไม่', 'ใด้', 'มา', 'จะ', 'สอง', 'อาทิต', ... ]
    # cors = [[[6, 7], ['ได้'], 'misspelled'], [[10, 11], ['อาทิตย์'], 'misspelled']]
    last = 0
    for (begin, end), cor_toks, _ in cors:
        for tok in src_tokens[last:begin]:
            # Tallying up the correct tokens
            cor_map[tok][tok] += 1
        last = end

        if begin + 1 == end and len(cor_toks) == 1:
            # Counting those sneaky tokens that need correcting
            cor_map[src_tokens[begin]][cor_toks[0]] += 1

# We now pinpoint the most frequent correction for each token
high_freq_cor_map = {}
for src, cors in cor_map.items():
    max_count = 0
    best_cor = None
    for cor, cor_count in cors.items():
        if cor_count > max_count:
            max_count = cor_count
            best_cor = cor
    high_freq_cor_map[src] = best_cor
```
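To make the shape of the data concrete, here is a toy run of the same counting logic on a synthetic example (English tokens used as stand-ins for readability; the values are made up and only illustrate the structure of `cor_map` and `high_freq_cor_map`):

```Python
from collections import defaultdict, Counter

# Synthetic stand-in for one annotated sentence: the token at position 2
# ("recieve") is marked as a misspelling of "receive".
data_split = [
    (0, ['we', 'will', 'recieve', 'it'], [[[2, 3], ['receive'], 'misspelled']]),
]

cor_map = defaultdict(Counter)
for _, src_tokens, cors in data_split:
    last = 0
    for (begin, end), cor_toks, _ in cors:
        for tok in src_tokens[last:begin]:
            cor_map[tok][tok] += 1  # correct tokens map to themselves
        last = end
        if begin + 1 == end and len(cor_toks) == 1:
            cor_map[src_tokens[begin]][cor_toks[0]] += 1

# Equivalent to the frequency loop above, written with Counter.most_common
high_freq_cor_map = {src: cnt.most_common(1)[0][0] for src, cnt in cor_map.items()}
# high_freq_cor_map == {'we': 'we', 'will': 'will', 'recieve': 'receive'}
```

Note that correct tokens vote for themselves, so a token is only "corrected" when its misspelled readings outnumber its legitimate uses in the training data.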

With our blend of tokens ready, let’s prep our trusty Trie tokenizer, or our handy mixer in this analogy. The tokenizer code is released at the end of the article, but you can use your favorite flavor of dictionary-based maximal-matching tokenizer. Here’s the key: we also add the mischievous misspelled tokens 🌶️ into the tokenizer. Spicy 🔥

```Python
from klabs_nlp.trie import Trie

# Filling it up with our tokens, both the nifty ones and the mischievous misspelled ones!
trie = Trie(high_freq_cor_map.keys())
```

Last but not least, let’s put the icing on the cake. When it's go-time, we'll sift through the text, seasoning it with our learned corrections.

```Python
# `src_line` - our yet-to-be-perfected text
src_toks = trie.attempt_maximal_tokenization_str(src_line)
# Sprinkle in the corrections, or stick with the original if we're already golden
cor_toks = [high_freq_cor_map.get(tok, tok) for tok in src_toks]
```

Now that we’ve got our code dough prepped, we’re all set for the oven (or in our case, the results).

Fresh Out of the Oven: The Results 🥖

Baking is always filled with anticipation, and the moment has come to pull our creation out of the oven and see how it’s turned out.

In our recipe of the minimal XNCC, we’ve tested two variations: one that uses only the training split and another that adds a dash of the development split. Just to put some zest into our comparisons, we’ve lined it up against some stalwarts like the full version of XNCC [1] and 2-Stage Ctx-Att [2, 3]. Because every chef loves their metrics, we’re also presenting word error rate (WER) and character error rate (CER) results.

Table 2: End-to-end evaluation of automatic spelling correction systems on the public UGWC dataset.

* The scores for XNCC come from a different recipe book [1], yet our minimalist version is much in alignment with their findings.
** The scores for 2-Stage Ctx-Att are from [3], which used a slightly different tokenizer for evaluation on the public UGWC dataset.
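For readers who want to compute these metrics themselves, WER and CER are both normalized Levenshtein edit distances (over tokens and characters, respectively). Here is a minimal sketch; this is not the evaluation script used in the paper, just the standard formulation:

```Python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution (or match)
        prev = curr
    return prev[-1]

def wer(ref_tokens, hyp_tokens):
    # Word error rate: token-level edit distance over reference length.
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def cer(ref_text, hyp_text):
    # Character error rate: the same idea over raw characters,
    # which sidesteps tokenizer disagreements (see note ** above).
    return edit_distance(ref_text, hyp_text) / len(ref_text)
```

CER is worth reporting alongside WER precisely because different tokenizers can change WER without changing the underlying text.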

The Aftertaste: Concluding Notes 🍷

With the aroma of our results filling the room, it’s time to reflect. Thai spelling correction, just like mastering a complex dish, has its challenges. But with the right ingredients and techniques, simplicity can bring out great flavors even with little effort.

Our minimalist rendition of the XNCC [1] is a testament to this belief. A small twist in the method, by utilizing misspellings as a hidden ingredient, can make a considerable difference. And while our method may be simple, its potential implications for Thai spelling correction are substantial. It offers both implementers and researchers a delightful entry into the culinary world of Thai spelling correction, an essential tool for language processing.

Now, for those who’ve tasted this dish and are craving more, the more intricate recipes and detailed methodologies await in our publication [1]. Cheers to the pursuit of linguistic (and culinary) perfection!

The Spice Rack: References 🌿📚

[1] — A. Lertpiya, T. Chalothorn and P. Buabthong, “How to Progressively Build Thai Spelling Correction Systems?,” in IEEE Access, vol. 11, pp. 72704–72716, 2023, doi: 10.1109/ACCESS.2023.3295004.

[2] — A. Lertpiya, T. Chalothorn and E. Chuangsuwanich, “Thai Spelling Correction and Word Normalization on Social Text Using a Two-Stage Pipeline With Neural Contextual Attention,” in IEEE Access, vol. 8, pp. 133403–133419, 2020, doi: 10.1109/ACCESS.2020.3010828.

[3] — A. Lertpiya, “Thai Spelling Correction and Word Normalization on Social Text Using a Two-Stage Pipeline With Neural Contextual Attention,” in Thesis (M.Eng.) — Chulalongkorn University, 2019, uri: https://cuir.car.chula.ac.th/handle/123456789/70327.

Compliments to the Chef: Citations 👨‍🍳

Every chef values the appreciation received for their culinary creations. If you’ve savored our offering and found it enriching, it would warm our hearts if you acknowledged our work in your academic feasts.

A. Lertpiya, T. Chalothorn and P. Buabthong, “How to Progressively Build Thai Spelling Correction Systems?,” in IEEE Access, vol. 11, pp. 72704–72716, 2023, doi: 10.1109/ACCESS.2023.3295004.

@ARTICLE{10181311,
author={Lertpiya, Anuruth and Chalothorn, Tawunrat and Buabthong, Pakpoom},
journal={IEEE Access},
title={How to Progressively Build Thai Spelling Correction Systems?},
year={2023},
volume={11},
number={},
pages={72704-72716},
doi={10.1109/ACCESS.2023.3295004}
}

With this, our short but delightful culinary adventure through the world of Thai spelling correction comes to an end. We hope the flavors have inspired you, and that you’ll continue to explore, innovate, and share in this realm. Until our next gourmet rendezvous! 🥂📚

Trie Tokenizer

Follow KBTG Life for more stories like this. We have great articles both in Thai and English that are carefully crafted by KBTG people, so don’t miss out!
