Transliteration in JavaScript

Daniel W. Hieber
Digital Linguistics
5 min read · Feb 1, 2018

Introducing Transliteration

A common chore for linguists is transliteration, i.e. converting from one orthography (a set of conventions for writing a particular language) to another.

For instance, linguists tend to use a simplified or practical orthography in their notes and databases rather than the International Phonetic Alphabet (IPA), and then convert their data to IPA or a common Americanist notation for publication. Moreover, linguists often work with communities that have their own preferred orthography, which may or may not be the same as the linguist’s practical orthography. Producing materials for the community requires transliterating from the practical orthography to the community orthography. And it is common for linguists to work with data from archival sources and prior researchers, each of whom may have used a different orthography to write the language.

My own Chitimacha database (isolate, Louisiana; ISO 639–3: ctm, Glottolog: chit1248) uses no fewer than 8 orthographies, many of which come from previous researchers:

Modern Orthographies

  1. Modern community orthography
  2. International Phonetic Alphabet
  3. Americanist Notation

Historical Orthographies

  1. Morris Swadesh’s phonetic notation (1930–1934)
  2. Morris Swadesh’s phonemic notation (1930–1952)
  3. John R. Swanton’s notation (1907–1920)
  4. Albert S. Gatschet’s notation (1881–1882)
  5. Martin Duralde’s notation (1802)

In the past, this meant that I was constantly transliterating one or the other historical orthography to one or the other modern orthography — by hand. Often the same example phrase would need to be transliterated multiple times, depending on the particular publication outlet (e.g. Americanist notation for the International Journal of American Linguistics but IPA for more general journals). This was obviously error-prone and time-consuming.

Basic Transliteration

At its most basic, transliteration is a simple procedure that makes a series of substitutions on a string. For example, to transliterate the written word cat to IPA, we need to make two substitutions: <c> → <k> and <a> → <æ>.

Note that this method of doing transliteration is unidirectional. If we wanted to transliterate back from IPA to English, we would need to make a different set of substitutions: <k> → <c> and <æ> → <a>. Sometimes it is possible to do bidirectional or multidirectional transliteration, when every grapheme (written symbol) in an orthography maps to one and only one grapheme in another orthography. However, bidirectional transliteration is frequently not possible given the complexities of different writing systems: transliteration can erase distinctions, making it impossible to reverse the process without losing information. Thus the DLx transliterate() algorithm is unidirectional.

With this in mind, let’s set up our function to accept two arguments: string (the String to transliterate) and substitutions (a list of substitutions to make on the String). We can call the set of substitutions a transliteration scheme, that is, the set of substitution rules for transliterating from one orthography to another. The substitutions argument should be an Object whose keys are the graphemes to replace, and whose values are the graphemes to replace them with. For example, the transliteration scheme for our cat example above would look like this:

const substitutions = {
  a: 'æ',
  c: 'k',
};

Here’s how we’ll want to use our transliterate() method when it’s complete:

const ipa = transliterate('cat', substitutions);
console.log(ipa); // --> "kæt"

Inside the transliterate() function, we’ll need to perform each of the substitutions in the substitutions argument using regular expressions. Here’s how we might do that:
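A minimal sketch of this first version might look like the following (an illustrative reconstruction rather than the published DLx source, and it assumes the graphemes contain no regular-expression special characters):

```javascript
// Basic version: apply each substitution globally across the string.
// Assumes the inputs contain no RegExp special characters.
function transliterate(string, substitutions) {
  let result = string;
  Object.entries(substitutions).forEach(([input, replacement]) => {
    // 'g' = replace every occurrence; 'u' = treat the string as Unicode
    result = result.replace(new RegExp(input, 'gu'), replacement);
  });
  return result;
}
```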

Running this function on our cat example gives us the correct output, /kæt/ — we’ve successfully transliterated a string! Now let’s look at some edge cases that might break our function, and how to address them.

Edge Case 1: Substrings

Let’s say we have the following transliteration scheme:

{
  s: 'x',
  ts: 'c',
}

If we run our current version of the transliterate() function on the string "tsaste" (‘touch’ in Chitimacha), we get "txaxte" as the output, which is wrong. According to our transliteration scheme, the output should be "caxte".

What happened? The problem is that the input "s" is a substring of the input "ts". When our function makes the first substitution of "s" → "x", the string "ts" is changed to "tx". When our function attempts to make the second substitution ("ts" → "c"), it fails because no sequence of "ts" can be found in the string. Every instance of "ts" has already been changed to "tx".

To fix this problem, we simply need to order the substitutions from longest input to shortest input before making the substitutions. You can verify this for yourself by running transliterate() on "tsaste" with the transliteration scheme below. The substitution rules are the same as before, but their order is different. This time we get the result we want.

{
  ts: 'c',
  s: 'x',
}

To sort the substitution rules programmatically, we simply run the .sort() method on our list of substitutions before calling the .forEach() method, like so (note the added .sort() method):
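A sketch of that change (again an illustrative reconstruction, not necessarily the published DLx code) adds a longest-first sort before the substitution loop:

```javascript
// Sorted version: replace longer inputs first, so that "ts" is
// substituted before its substring "s" gets a chance to match.
function transliterate(string, substitutions) {
  let result = string;
  Object.entries(substitutions)
    .sort(([a], [b]) => b.length - a.length) // longest input first
    .forEach(([input, replacement]) => {
      result = result.replace(new RegExp(input, 'gu'), replacement);
    });
  return result;
}
```

With this version, transliterate('tsaste', { s: 'x', ts: 'c' }) produces "caxte" regardless of the order in which the rules were written.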

Edge Case 2: Feeding

For the second edge case, consider what happens if we run our transliterate() function using the following transliteration scheme on the string "chaca".

{
  ch: 'c',
  c: 'x',
}

Using our current version of the transliterate() method, we get the wrong result: "xaxa" instead of the expected "caxa".

The issue here is what’s known as a feeding problem in linguistics (particularly the field of phonology), when the output of one rule/substitution is also the input to another rule/substitution. In the example above, our function first changes all instances of "ch" to "c", and then changes all instances of "c" to "x", leaving no instances of either "ch" or "c" in the final output.

To fix this, we first check the list of substitutions for feeding problems, and then break the feeding chain by temporarily swapping the output of the offending rule for some other random character. Then, once the substitutions are made using the random character, we swap the correct character back in. For the temporary character, we’ll use a Unicode character from the Geometric Shapes block, since it is unlikely that anyone would use characters from that block in an orthography (and doing so would be extremely bad practice besides).

Here is the final code for the transliterate() function, which both sorts the substitutions and handles feeding problems correctly. Notice that we now have a temps Object for storing our temporary feeding-problem substitutions, a getRandomCodePoint() method for selecting a random Unicode code point in the Geometric Shapes block, and several new steps in the code where we make the temporary substitution and then undo it later.
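A sketch of that final version might look like this (the helper names temps and getRandomCodePoint follow the description above, but the details are assumptions rather than the published DLx source):

```javascript
// Pick a random code point from the Geometric Shapes block (U+25A0–U+25FF),
// used as a temporary stand-in character to break feeding chains.
function getRandomCodePoint() {
  const start = 0x25A0;
  const end   = 0x25FF;
  return String.fromCodePoint(start + Math.floor(Math.random() * (end - start + 1)));
}

function transliterate(string, substitutions) {

  let result = string;
  const temps = {}; // temporary character → real replacement

  // Edge Case 1 (substrings): sort longest input first
  const entries = Object.entries(substitutions)
    .sort(([a], [b]) => b.length - a.length);

  entries.forEach(([input, replacement], i) => {

    // Edge Case 2 (feeding): this rule feeds a later rule if its
    // output contains that later rule's input
    const feeds = entries
      .slice(i + 1)
      .some(([laterInput]) => replacement.includes(laterInput));

    let output = replacement;

    if (feeds) {
      // Break the feeding chain with a temporary character
      const temp = getRandomCodePoint();
      temps[temp] = replacement;
      output = temp;
    }

    result = result.replace(new RegExp(input, 'gu'), output);

  });

  // Swap the real replacements back in for the temporary characters
  Object.entries(temps).forEach(([temp, replacement]) => {
    result = result.replace(new RegExp(temp, 'gu'), replacement);
  });

  return result;

}
```

With this version, transliterate('chaca', { ch: 'c', c: 'x' }) yields "caxa": the "ch" rule’s output is first written as a temporary geometric-shape character, so the "c" rule cannot touch it, and the real "c" is restored at the end.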

And that’s it! The transliterate() function should now give the correct result for the feeding problem above.

Find any other edge cases that this algorithm doesn’t handle? Please let me know in the comments!

Conclusion

Transliteration turns out to be slightly more complicated than it seems at first glance. However, dealing with edge cases like substrings and feeding is actually quite simple. The result is a function that is both compact and versatile, capable of transliterating from any one orthography to another.
