This is an adaptation of my talk to PeninsulaJS in Mountain View last month. Their next meeting is coming up on February 24th!
မင်္ဂလာပါ (mingalaba) and welcome. I’m a web developer and mapmaker at The Asia Foundation. We’re a non-profit working in countries across Asia on international development and humanitarian issues. This sent me to Myanmar in summer 2015 and led me to learn more about the Myanmar (Burmese) language.
We were in Myanmar to collect candidate information and post it online. We helped local developers make several election apps.
Looking at the above photo, I can understand what’s happening, but I cannot make head or tail of the signs… I don’t even know when one letter ends and another begins. I start out completely illiterate.
Then at the end I am going to talk about: is this a good idea? This is one of the best things about side projects — instead of focusing on useful and productive things, you can hack something together and see what happens.
Before this talk, I figured I should research a fun fact on crosswords. Little did I know..! Only 10 days after Pearl Harbor, an internal memo at the New York Times recommended that given the dark times and blackout hours ahead, the paper should publish a fun crossword puzzle in the Sunday edition. They’ve been printing them since 1942.
Anyway, let’s revisit how crosswords work with our “programmer’s mind”. There is a two-dimensional grid. There are words going across and down. It’s okay to start an across and a down word in the same square, or to start a word in the middle of another word. You can’t put two words next to each other unless the adjacent squares also form valid words. And crosswords are case-insensitive… we’ll talk about that more later.
I’m going to draw the puzzle using the HTML5 <canvas> element. Fortunately this is one of the easiest things to draw… it should be in all the tutorials.
Essentially I paint the canvas black, and then have a two-dimensional array representing the grid. On the array, if there’s a letter, I paint a white square, 36x36 pixels. Then if it’s the start of a word, I print the number label over the white square.
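Here is a minimal sketch of that drawing loop, not the code from the talk. The names drawGrid, CELL and labels are made up for illustration, and ctx is assumed to be a 2D canvas context (or anything with the same fillRect and fillText methods).

```javascript
const CELL = 36; // square size in pixels, as described above

// grid: 2D array, null for an empty square, a letter otherwise.
// labels: hypothetical lookup of "row,col" -> number label for word starts.
function drawGrid(ctx, grid, labels) {
  const rows = grid.length;
  const cols = grid[0].length;

  // Paint the whole canvas black first.
  ctx.fillStyle = "#000";
  ctx.fillRect(0, 0, cols * CELL, rows * CELL);

  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (!grid[r][c]) continue; // empty squares stay black

      // A square that holds a letter becomes white.
      ctx.fillStyle = "#fff";
      ctx.fillRect(c * CELL, r * CELL, CELL, CELL);

      // If a word starts here, print its number in the top-left corner.
      const label = labels[r + "," + c];
      if (label !== undefined) {
        ctx.fillStyle = "#000";
        ctx.font = "10px sans-serif";
        ctx.fillText(String(label), c * CELL + 2, r * CELL + 10);
      }
    }
  }
}
```

In the browser, ctx would come from canvas.getContext("2d"). Because the function only needs those two methods, it can also be exercised with a stub object.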
My first plan was: add words sequentially from the top left, alternating across and down. As long as I avoid invalid combinations, I’ll be fine!
But none of the words cross! It turns out, it’s easy to find space for words, until you can’t. As an example, I could put several words across but then it’d be extremely hard to find words that can be written down through all of them.
I also found it difficult to avoid putting words next to each other in my first algorithm… 2 and 4 are illegally placed.
My second idea was to attempt randomly placing words until finding a valid position… these puzzles were not even worth a screenshot. Making something random for the sake of random is a bad idea. Trying to solve a problem with randomization is saying “I understand this problem less than random chance” and it’s almost always wrong.
This is 95% of how I do the puzzle now… Like Scrabble, I look at previous words and actively attempt to find overlap. I randomize the array of previous words so that they don’t all overlap off of the first or last one.
I recommend starting with the longest words first, before the smaller, more flexible ones.
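The overlap search can be sketched roughly like this. It is an illustration under my own assumptions, not the published code: the function names and the shape of a placed word ({ word, row, col, direction }) are hypothetical.

```javascript
// Longest words first, before the smaller, more flexible ones.
function longestFirst(words) {
  return [...words].sort((a, b) => b.length - a.length);
}

// Fisher-Yates shuffle on a copy, so crossings are not always tried
// against the first or last placed word.
function shuffled(items) {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// List every position where newWord could cross an already placed word.
// Words can be strings or arrays of blocks; both support indexing.
function findCrossings(placed, newWord) {
  const options = [];
  for (let i = 0; i < placed.word.length; i++) {
    for (let j = 0; j < newWord.length; j++) {
      if (placed.word[i] !== newWord[j]) continue;
      if (placed.direction === "across") {
        // Shared square is (row, col + i); the new word runs down through it.
        options.push({ row: placed.row - j, col: placed.col + i, direction: "down" });
      } else {
        // Shared square is (row + i, col); the new word runs across through it.
        options.push({ row: placed.row + i, col: placed.col - j, direction: "across" });
      }
    }
  }
  return options;
}
```

Each option is only a candidate. It still has to pass the validity checks on the ends and sides of the word before it is accepted.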
Checking the ends: when I place a word, I don’t just need to make sure there are empty squares. I also need to check the squares before and after the word, to make sure that the new word doesn’t appear as a prefix or suffix to an existing word.
Checking the sides: for each letter that I place, I need to check that it’s not causing problems for existing words. If a square already exists, I know that the square is valid, I just need to make sure that the letter matches. When I’m placing a new square, as in this diagram going down, I check left and right to make sure that I’m not adding adjacent to an existing word.
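Both checks fit in one function. This is a sketch under assumptions: the grid is a 2D array with null for empty squares, and the names canPlace and cellAt are hypothetical. It also does not track which word owns a square, so it would let a word be laid on top of another word running in the same direction; a fuller version would need to rule that out.

```javascript
// Read a square, treating anything outside the grid as empty.
function cellAt(grid, r, c) {
  if (r < 0 || c < 0 || r >= grid.length || c >= grid[0].length) return null;
  return grid[r][c];
}

function canPlace(grid, word, row, col, direction) {
  const dr = direction === "down" ? 1 : 0; // row step along the word
  const dc = direction === "down" ? 0 : 1; // column step along the word
  const endR = row + dr * (word.length - 1);
  const endC = col + dc * (word.length - 1);

  // The whole word has to fit inside the grid.
  if (row < 0 || col < 0 || endR >= grid.length || endC >= grid[0].length) return false;

  // Ends: the squares just before and just after the word must be empty,
  // otherwise the word would become a prefix or suffix of another one.
  if (cellAt(grid, row - dr, col - dc) !== null) return false;
  if (cellAt(grid, endR + dr, endC + dc) !== null) return false;

  for (let i = 0; i < word.length; i++) {
    const r = row + dr * i;
    const c = col + dc * i;
    const existing = grid[r][c];
    if (existing !== null) {
      // Existing square: it is already valid, the letter just has to match.
      if (existing !== word[i]) return false;
      continue;
    }
    // New square: both neighbours to the side of the word must be empty.
    if (cellAt(grid, r - dc, c - dr) !== null) return false;
    if (cellAt(grid, r + dc, c + dr) !== null) return false;
  }
  return true;
}
```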
Let’s make this crossword Unicode-friendly. When I started, I assigned a number n based on words.length — how many words existed. Then I wrote a function (represented here as i18n_function) to replace the number with Burmese and other numerals.
But I forgot about the case where (for example) “2 across” and “2 down” start in the same square. So I check for that, and set a shared_start variable. If that’s been set, I reuse it. Otherwise I run i18n_function and increment the variable n, which is independent of the word count.
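A small sketch of that numbering logic follows. toBurmeseNumerals is a hypothetical stand-in for i18n_function, and I keep the shared starts in a Map rather than a single shared_start variable, which is my own simplification.

```javascript
// Hypothetical stand-in for i18n_function: 12 becomes "၁၂".
// Burmese digits occupy U+1040 (zero) through U+1049 (nine).
function toBurmeseNumerals(n) {
  return String(n).replace(/[0-9]/g, function (d) {
    return String.fromCharCode(0x1040 + Number(d));
  });
}

function makeLabeler(formatNumber) {
  let n = 0;                 // label counter, independent of the word count
  const byStart = new Map(); // "row,col" -> label already given to that square

  return function labelFor(row, col) {
    const key = row + "," + col;
    // Shared start: an across and a down word begin in the same square,
    // so the second one reuses the existing label.
    if (byStart.has(key)) return byStart.get(key);
    n += 1;
    const label = formatNumber(n);
    byStart.set(key, label);
    return label;
  };
}

const labelFor = makeLabeler(toBurmeseNumerals);
console.log(labelFor(0, 0)); // "၁" for the across word
console.log(labelFor(0, 0)); // "၁" again for the down word in the same square
console.log(labelFor(0, 3)); // "၂"
```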
OK, so far I’ve been making this for the browser. Time to make the jump into NodeJS.
The first step is making the code more modular. I use a constructor function and its prototype to make a Crossword object, and put all of my setup code inside there. The function needs to receive a <canvas> element and the dimensions of the grid to paint it.
I also need to separate out my browser-specific code (jQuery and UI interactions) into a separate file. Adding a word is interesting — instead of reading the web form directly, I use a new addWord() function. The game logic decides on its own where this word fits (e.g. 3 Down), so I need a callback function to receive that number and direction.
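To show the shape of that split, here is a hypothetical sketch. It is not the published module's API: the argument order, the error-first callback, and the placeholder placement rule are all assumptions made for illustration.

```javascript
// Hypothetical shape for illustration; not the published module's API.
function Crossword(canvas, width, height) {
  this.canvas = canvas; // a <canvas> element in the browser, or a stand-in
  this.width = width;
  this.height = height;
  this.words = [];
}

// The game logic picks the slot, then reports it back through the callback.
Crossword.prototype.addWord = function (word, callback) {
  if (!word || word.length === 0) {
    return callback(new Error("empty word"));
  }
  // Placeholder decision: a real version would search the grid for a crossing
  // and number the word by its starting square.
  const direction = this.words.length % 2 === 0 ? "across" : "down";
  const number = this.words.length + 1;
  this.words.push({ word: word, number: number, direction: direction });
  callback(null, number, direction);
};

// The UI file only deals with the result, never with the placement logic.
const game = new Crossword(null, 15, 15); // null stands in for a canvas here
game.addWord("CAT", function (err, number, direction) {
  if (err) throw err;
  console.log(number + " " + direction); // prints "1 across"
});
```

The point of the callback is that the web form (or a command-line tool) never needs to know how the slot was chosen. It only receives the number and direction to show next to the clue.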
But how do I make <canvas> work in NodeJS? There’s a module that does this really well. You need to have cairo installed, and this requires a special buildpack on PaaS systems such as Heroku. But if you can get it installed, then you can create a new Canvas object and treat it like a client-side <canvas>.
Now it’s time to get to the core of this project by making the jump into Unicode.
This section of the presentation could be called, “What We Talk About When We Talk About Unicode.” You’re always hearing “did you know Chinese has thousands of characters?” or “aren’t Emojis cute?” I’m going to talk about the dark side of Unicode that we don’t like to think about.
For some reason, “the dark side of Unicode” seems to include Brahmic scripts, represented in this map as the yellow and orange areas. Over 1.5 billion people live in these areas and their scripts are different, but they all follow a similar pattern influenced by Sanskrit. Even as a kid I heard a little how Chinese symbols work, but I knew nothing about these scripts until I went to Nepal for the first time. It’s easier for me to explain these rules in Burmese, so I’ll attempt to do that now.
Burmese is a syllabic alphabet where your goal is to make combos.
The first character on the bottom မ is the “m” sound and by itself it gets the default vowel “mah”. Then you add a vowel sign which I call left bracket (guaranteed programmer laugh line) and combined they make မြ or “myah”. That’s one syllable and in my crossword it would be one block.
The S-shaped character န is “na”. There is a little c you can put on top ် which is actually called a “vowel killer” and takes away the default vowel to make it just “n”. Together that’s another block.
Then we reuse the “mah” base character မ but add a different curly sign ာ which makes an “aw” sound. In English we end Myanmar with an “r” but it’s a little more “w”-ish.
If you’re lost, don’t worry. The most important thing is you can see မြ and မာ share the base character မ but then get different vowels.
That’s a heavily simplified explanation and there are things that I don’t know… The syllable “min” is written as two letters မင် and I’m not sure why. Then in some words, especially those borrowed from Pali, you can stack letters on top of each other. You can stack any two letters and their vowel signs by adding a special invisible character which you see here.
I call it plus box.
There are a lot of potential combinations which can make up one “block” for my crossword. We need to find a way to programmatically divide all of these strings.
At first I had a crazy regular expression combining Burmese and RegEx characters. Then I found a Node module called regenerate, and you can feed several different characters into it. That helped me compile a monster RegEx which I can’t read, but it works really well.
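For a feel of what such a pattern does, here is a much smaller hand-written version. It is a simplified sketch, not the generated RegEx: I assume only the main Burmese ranges of the Unicode Myanmar block, one base character per block, and that the invisible stacking character is U+1039. The names BLOCK and splitBlocks are made up.

```javascript
// One crossword block:
//   a base character (letters and independent vowels U+1000-U+102A,
//   great sa U+103F, digits U+1040-U+1049)
//   followed by any number of
//   - the stacking sign U+1039 plus the consonant stacked underneath, or
//   - a combining sign (vowel signs, tone marks, vowel killer, medials).
const BLOCK = /[\u1000-\u102A\u103F\u1040-\u1049](?:\u1039[\u1000-\u1021]|[\u102B-\u1038\u103A-\u103E])*/g;

// Characters outside these ranges are simply skipped.
function splitBlocks(text) {
  return text.match(BLOCK) || [];
}

// "Myanmar" spelled with escapes: ma + left bracket, na + vowel killer, ma + aa
const myanmar = "\u1019\u103C\u1014\u103A\u1019\u102C";
console.log(splitBlocks(myanmar)); // three blocks: မြ, န်, မာ
```

Other Brahmic scripts and the minority-language characters in the same Unicode block would need more ranges, which is where a tool that builds the pattern from lists of code points starts to pay off.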
Nice! I made a crossword puzzle and sent it to some friends.
Now it’s time to ask if this is a good idea. Earlier, I mentioned that crosswords are case-insensitive. It makes sense because if we have uppercase and lowercase letters, you go from 26 to 52 possible letters. Logically this makes the puzzle more difficult, but it doesn’t make it twice as difficult. Capital letters don’t happen 50% of the time; they are only used in proper nouns and acronyms.
So how many combinations are there in Burmese?
In the Burmese-language Wikipedia article about Myanmar, there are 54 combinations for just the first letter က (“ka”). Holy crap!
This could be as difficult as a Chinese crossword. But even with my limited vocabulary, I was able to make a puzzle. What’s going on?
It turns out that 1/3 of the time that you see က, it’s with the vowel killer (just a “k”). Another 1/3 of the time it’s one of these three: (“ka”, “ko”, or “kya”). The others are so rare… the final 12 combined are only half-a-percent of ကs in the article.
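A count like this can be reproduced with a short tally. This is a sketch with assumed names (BLOCK_PATTERN, combinationShares) and a simplified block pattern, so the exact numbers will depend on how blocks are split.

```javascript
// Simplified block pattern: a base character plus its stacked letters and signs.
const BLOCK_PATTERN = /[\u1000-\u102A\u103F\u1040-\u1049](?:\u1039[\u1000-\u1021]|[\u102B-\u1038\u103A-\u103E])*/g;

// For every block that starts with the given base letter, count how often
// each combination appears and what share of the total it makes up.
function combinationShares(text, base) {
  const counts = new Map();
  let total = 0;
  for (const block of text.match(BLOCK_PATTERN) || []) {
    if (!block.startsWith(base)) continue;
    counts.set(block, (counts.get(block) || 0) + 1);
    total += 1;
  }
  return [...counts]
    .map(function ([block, count]) {
      return { block: block, count: count, share: count / total };
    })
    .sort(function (a, b) { return b.count - a.count; }); // most common first
}
```

Calling combinationShares(articleText, "\u1000") on the article text would list every က combination with its share, so a long tail of rare combinations shows up as many entries with tiny shares.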
If there’s time I’ll talk about one more thing. There’s a “Myanmar Unicode Area” group on Facebook where all of the cool kids hang out. You see a lot of posts like this… it says language, it says help, it says “bro”..?
You can see that there’s something wrong. Each of these perforated circles is a vowel sign without a base character. There’s a plus box visible by itself. Also, there are some combinations which don’t make sense.
The font in many phones and computers is Zawgyi. Despite the idealism of Unicode, there are tons of languages which prefer to use pre-existing fonts and non-Unicode-compliant encodings. The unique thing about Zawgyi font is that it inhabits the same codepoints as Unicode’s Burmese area. So you can’t tell the difference byte-by-byte between Zawgyi and Unicode-compliant text. It’s as if the weird Dvorak-typer in your office used a Dvorak font instead of a keyboard.
There are RegExs which look for invalid combinations, but their accuracy depends on the length of the text. Facebook is putting together a dictionary, Unicode people are talking about standardizing and marking Zawgyi as an encoding… it’s hard.
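As a rough illustration of the idea, here are two checks that mirror the symptoms described above: a sign with no base letter in front of it, and the stacking character with nothing to stack. This is a toy heuristic under my own assumptions (the names and code point ranges are mine), far simpler than a real detector.

```javascript
// A combining sign (U+102B-U+103E) at the start of the text or right after
// whitespace has no base letter to attach to in Unicode order.
const ORPHAN_SIGN = /(?:^|\s)[\u102B-\u103E]/;

// The stacking sign U+1039 should always be followed by a consonant.
const LONE_STACKER = /\u1039(?![\u1000-\u1021])/;

function looksLikeZawgyi(text) {
  return ORPHAN_SIGN.test(text) || LONE_STACKER.test(text);
}

console.log(looksLikeZawgyi("\u1019\u103C\u1014\u103A\u1019\u102C")); // false: valid Unicode order
console.log(looksLikeZawgyi("\u103B\u1019\u1014\u1039\u1019\u102C")); // true: starts with a sign
console.log(looksLikeZawgyi("\u1019\u102C"));                         // false: too short to tell
```

The last call shows why length matters: a short string may not contain any tell-tale combination, so a false result only means that nothing suspicious was found.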
Now there’s something about Zawgyi which I didn’t believe was possible, but this is a real problem. If you remember left bracket, you can see in the second-to-last row of that char chart there are five different shapes depending on what letters you wrap around. In the row above that there are another two. In Unicode fonts, you have just one left-bracket character, and typographers are responsible for supporting combinations. So how does Zawgyi make room for seven left bracket codepoints, and all this other duplication?
Incredibly, Zawgyi font overwrites and reappropriates characters used by the minority languages of Myanmar to make room for the majority Burmese. There are hundreds of ethnic groups in Myanmar, as depicted on this map. Their languages and loan words include some additional letters, vowel signs, and numerals. In my cultural context this map seems to promote stereotypes, but it’s a common way to depict diversity in Myanmar.
This causes two problems. The obvious one is that minority languages cannot be displayed in Zawgyi font. The other is that Unicode converters assume that these characters are Zawgyi text and “fix” them, harming minority-language texts in Unicode, too.
There are several Facebook groups and GitHub orgs devoted to detecting and converting between the formats in different programming languages, and even accessories like this keyboard sticker set. In any case, I’ve allowed Zawgyi input (on the web side) and send only Unicode characters into the crossword puzzle generator.
Once that was all sorted out, all that was left to do was npm publish.
Once you run npm install crossword you can use the module or a command-line tool. The documentation is on GitHub: https://github.com/mapmeld/crossword-unicode