Minglish: Devanagari Transliteration App - Part 1
Around 4 years back, one weekend night, I felt that the chatting experience on my Android device sucked. The main reason was that I always chatted with friends and family in Marathi (my native language, which uses the Devanagari script), but since there was no Marathi keyboard, I had to type everything in English. There were multiple Marathi keyboard options available on the Play Store, but those were incredibly difficult to use. Marathi keyboards, or Devanagari keyboards in general, have too many keys. Unlike a QWERTY keyboard, they take a while to get used to. Also, we often use English words in Marathi sentences. No one calls the “Internet” “महाजाल”. So one needed to keep switching between the English and Devanagari keyboards. I started wondering: what would it take to build a Devanagari transliteration keyboard?
So in the middle of the night, I felt this adrenaline rush. I opened up my laptop and started googling. There were a lot of things I had no knowledge about, e.g. How do Unicode characters work? How do digital fonts come into the picture? What are UTF-8 and UTF-16? How did existing Devanagari support work on computers? Did Android support Devanagari at that time? Is this even feasible?
Within a few hours, I got a really basic “Hello World” kind of Devanagari transliteration working. Over the next 2 days, my roommate Kedar and I hacked a sample Android keyboard app, then improved the algorithm and integrated it. Voilà, the Minglish Android app was born.
In this blog series, I will talk about how Devanagari transliteration works. Let's start with a few basic concepts.
How Does Devanagari Work on Digital Devices?
The Unicode Standard has included Devanagari since its first version (Unicode 1.0, 1991). If you inspect a Devanagari character in a free online Unicode inspector, you can see information like the following. Such nice tools did not exist a few years back, but one just needed to gather this kind of information.
The Devanagari character “अ” is represented by the code point 0x0905 (written U+0905). Simply converting that hex value to a character and printing it should show “अ” on the console.
But depending on your console's font, you may see “?” characters instead of “अ”. The font needs to support the visual representation of “अ”, and the console's output encoding must be able to carry the character. As long as you generate the correct character value and have a font that supports Devanagari, you are all set.
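Here is a minimal sketch of that idea in Java. The class name `PrintDevanagariA` and the helper `fromCodePoint` are made up for illustration, and I am assuming Java 10 or newer (for the `PrintStream` constructor that takes a `Charset`) and a terminal that accepts UTF-8.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

class PrintDevanagariA {
    // Hypothetical helper: turns a Unicode code point into a String.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Write UTF-8 explicitly so the JVM's default encoding
        // does not silently replace the character with "?".
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println(fromCodePoint(0x0905)); // DEVANAGARI LETTER A
    }
}
```

If you still see a “?” or an empty box after this, the code point is most likely fine and the console font (or the terminal's own encoding setting) is the thing to check.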
How do “जोडाक्षरे” work?
“जोडाक्षरे” in Devanagari means characters that are formed by joining multiple Consonants and/or Vowels.
How do you form, say, a “पो” (pronounced and written in English as Po)? Is there a separate character for that? No.
“पो” = “प” + “ ो”
“प” in Unicode is 0x092A
“ ो” in Unicode is 0x094B. It is called a dependent vowel sign in the Unicode spec.
So if we put these 2 characters next to each other, we get a “पो”. The font takes care of making “प” + “ ो” look like a “पो”.
System.out.print(((char)(0x092A) + "" + (char)(0x094B)));
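To convince yourself that “पो” really is two separate characters under the hood, you can list its code points. This is just a small sketch; the class `InspectPo` and the method `describe` are names I picked for this example.

```java
class InspectPo {
    // Hypothetical helper: lists the code points of a string as "U+XXXX" labels.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // PA followed by the dependent vowel sign O
        String po = "\u092A\u094B";
        System.out.println(po.length());  // 2
        System.out.println(describe(po)); // U+092A U+094B
    }
}
```

The string has length 2 even though it is drawn as a single shape. The merging happens purely in the font and text rendering, not in the data.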
How do complex “जोडाक्षरे” work?
The above approach works for consonant + vowel cases. But when you are joining one consonant with another, you need a little magic.
For getting a “नम”:
“न” in Unicode is 0x0928
“म” in Unicode is 0x092E
So you can do something like below -
System.out.print(((char)(0x0928) + "" + (char)(0x092E)));
But then how do you get “न्म” ?
There is a nice little magical character, “ ्”. In the spec, it's 0x094D and is called VIRAMA (halant is the preferred Hindi name).
System.out.print(((char)(0x0928) + "" + (char)(0x094D) + "" + (char)(0x092E)));
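The same trick can be wrapped in a tiny helper that puts a VIRAMA between any number of consonants. This is only a sketch to illustrate the idea, not the actual Minglish code; `Conjunct` and `join` are hypothetical names.

```java
class Conjunct {
    static final int VIRAMA = 0x094D;

    // Hypothetical helper: joins consonant code points into one conjunct
    // by inserting a VIRAMA between each neighbouring pair.
    static String join(int... consonants) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < consonants.length; i++) {
            if (i > 0) {
                sb.appendCodePoint(VIRAMA);
            }
            sb.appendCodePoint(consonants[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // NA + VIRAMA + MA, the same sequence as the print statement above
        String nma = join(0x0928, 0x092E);
        System.out.println(nma.length()); // 3
    }
}
```

Note that the helper does no validation. It assumes every value passed in is a consonant, so passing a vowel sign would produce a sequence the font may not render sensibly.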
So far we have learned how to display a Devanagari word correctly. If we generate the correct sequence of Unicode character values, the font takes care of displaying it correctly.
Now, in the next blog, we can talk about the bigger problem.
How can typing “nilesh” be correctly converted into निलेश? Or basically:
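One small aside on the UTF-8 and UTF-16 question from earlier: the character values stay the same, but the actual bytes depend on the encoding you pick. A rough sketch (the class `EncodingBytes` and the helper `hex` are names I made up for this example):

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u0905"; // DEVANAGARI LETTER A
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // E0 A4 85
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 09 05
    }
}
```

So 0x0905 is the code point, while `E0 A4 85` is what would typically end up in a UTF-8 file or on the wire.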
nilesh => 0x