DNS emojis using Punycode

In this post we’re going to create a DNS name containing emojis.
Why? Well because it’s cool and we’re going to learn a lot! But mainly because it’s fun — and totally useless.

This post will cover:

  1. Code points
  2. Encoding UTF-8 code points to binary
  3. Internationalizing Domain Names in Applications (IDNA)
  4. Punycode

Prerequisite:

  • You will need libidn, libidn2 or python3 installed on your computer

Emojis — like the ones on your favorite smartphone — use Unicode code points, which are encoded in UTF-8 and transmitted in binary from device to device.

In order to encode a character, we need to map it to a numeric value. These values are called code points, because each value points to a character. For example, the famous smiley face emoji ‘😀’ is U+1F600, which has the decimal code point value 128512.
Another example, which you may be more familiar with, is ASCII. The American Standard Code for Information Interchange encodes 128 characters in 7 bits, where A is code point 0x41, or 65 in decimal. The obvious limitations of ASCII led to other designs like UTF-8.
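
You can check these mappings in a python3 shell:

>>> ord('A')
65
>>> hex(ord('😀')), ord('😀')
('0x1f600', 128512)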

The encoding process of UTF-8 is completely different from ASCII. The difference resides in the way the code point value is converted to binary. In ASCII the code point value is the binary value. For instance: A is 65, which is 1000001 in binary (01000001 when padded to a full byte).
In UTF-8 we have to do more work. A single character can take between 1 and 4 bytes. The tricky part is that the code point value is not the binary value.
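
A quick python3 comparison makes the difference visible: the ASCII character stays a single byte, while the emoji takes four:

>>> 'A'.encode('utf-8')
b'A'
>>> '😀'.encode('utf-8')
b'\xf0\x9f\x98\x80'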

To understand what the encoding process is, let’s go back to our smiley face example.

If we want to encode our smiley face to binary, here’s how we would do it:

  1. Look up the table in [1] and find the range your Unicode character falls in.
  2. The table says our Unicode character needs to be 4 bytes long and its code point value needs to be 21 bits long.
  3. So we start by converting the code point value. Here 1F600 becomes: 0001 1111 0110 0000 0000.
  4. Add as many leading 0s as needed to match the required number of bits. The bit string becomes 000011111011000000000: we had 20 bits, so we add one 0.
  5. Each UTF-8 byte has a “template”. Let’s take the first byte’s template in the table: 11110xxx. You have to replace the x’s with bits from your bit string.
  6. The 1st byte has to start with 11110, so we grab the first 3 bits to make up a byte: 11110000
  7. The 2nd byte has to start with 10, so we grab the following 6 bits to make up a byte: 10011111
  8. The 3rd byte has to start with 10, so we grab the following 6 bits to make up a byte: 10011000
  9. The 4th and last byte has to start with 10, so we grab the last 6 bits to make up a byte: 10000000
  10. Final step — optional — is to convert everything back to hexadecimal: 0xF0 0x9F 0x98 0x80 for pretty output.
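
To convince ourselves the steps are right, here is a minimal python3 sketch of this 4-byte case (the variable names are mine) that checks the result against Python’s built-in encoder:

cp = 0x1F600                             # code point of 😀
bits = format(cp, '021b')                # pad the code point to 21 bits
# split the bit string to fill the templates 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
chunks = [bits[:3], bits[3:9], bits[9:15], bits[15:]]
prefixes = ['11110', '10', '10', '10']
encoded = bytes(int(p + c, 2) for p, c in zip(prefixes, chunks))
print(encoded.hex())                     # f09f9880
assert encoded == '😀'.encode('utf-8')   # matches the built-in encoder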

That wasn’t so hard, was it? Wait a second! We’re not quite done. We now have 4 bytes for only one character. If we send 0xF09F9880 over the wire to a DNS server, it is going to decode it using ASCII and think it is 4 characters long. Not good.
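
You can see the problem in a python3 shell: none of our 4 bytes even fit in ASCII’s 7 bits:

>>> '😀'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)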

So, we have 2 options:

  1. Change the default character decoding to UTF-8.
  2. Convert UTF-8 to ASCII

Sometimes, you can’t

Option 1 seems good because more and more programs use UTF-8 by default in order to stay compatible with each other. But doing so would require upgrading every DNS server.. at the same.. time. Along with every DNS client too. You could also add a flag to specify what the encoding is going to be, but what if your awesome emoji name gets answered by an old DNS server?

You probably guessed that was a bad idea, so let’s look at option 2. Converting UTF-8 to ASCII is the answer, since no infrastructure change would be required! The only other thing we need is the capability to go back from ASCII to UTF-8 on the client, to display our emojis 👍.

Fortunately, some super smart people already answered the question. RFC 3490 talks about this:

IDNA works by allowing applications to use certain ASCII
 string labels (beginning with a special prefix) to represent
 non-ASCII name labels. Lower-layer protocols need not be aware of
 this; therefore, IDNA does not change any infrastructure. In
 particular, IDNA does not depend on any changes to DNS servers,
 resolvers, or DNS protocol elements, because the ASCII name service
 provided by the existing DNS can be used for IDNA.

That seems like what we want!

Enter Punycode

Punycode is the way of converting back and forth between ASCII and Unicode. It works by applying the ToASCII and ToUnicode functions as described in RFC 3490 Section 4.

Let’s go over one example.

The goal is to translate the UTF-8 domain name 💻.local into ASCII, in order to access my laptop.

This requires 2 steps:

1) Translate to Punycode

To translate back and forth between ASCII and UTF-8, we need either libidn, libidn2 or python3 installed.

In python3:

>>> u'💻'.encode('idna') + b'.' + b'local'
b'xn--3s8h.local'

In libidn:

cat \
<(echo -n 'xn--') \
<(idn -e -- 💻 | tr -d '\n') \
<(echo -n '.') \
<(echo -n 'local')
xn--3s8h.local

As you can see we don’t encode 💻.local directly; we encode each label separately.

A label is an individual part of a domain name. Labels are usually
 shown separated by dots; for example, the domain name
 “www.example.com” is composed of three labels: “www”, “example”, and
 “com”.
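
Python’s built-in idna codec actually does this label splitting for you, but we can spell it out:

>>> '.'.join(label.encode('idna').decode('ascii') for label in '💻.local'.split('.'))
'xn--3s8h.local'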

I see a hardcoded xn-- in our example! What the hell is that about?

Remember I told you about a special flag you could have used to switch between encodings? Well, xn-- acts a little bit like that. Whenever an IDNA-ready client parses an ASCII string and sees the ACE prefix, as defined in RFC 3490 Section 5, it knows the label is Punycode. So it knows it has to display a Unicode string instead of pure ASCII.

The ACE prefix for IDNA is “xn--” or any capitalization thereof.
This means that an ACE label might be “xn--de-jg4avhby1noc0d”, where
 “de-jg4avhby1noc0d” is the part of the ACE label that is generated by
 the encoding steps in [PUNYCODE].
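
We can see this prefix detection at work in python3, whose idna codec round-trips our string back to Unicode:

>>> b'xn--3s8h.local'.decode('idna')
'💻.local'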

Gotcha, now what?

Now that we have our Punycode string xn--3s8h.local, we can add it to /etc/hosts in order to query it.
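
For example, assuming the laptop answers on the loopback address, the entry could look like this:

127.0.0.1    xn--3s8h.local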

Put your emoji DNS name in Chrome and… it works! I had a local server listening on :8080, and Chrome did translate the Unicode into Punycode!

And voilà! We successfully created our own emoji DNS name.


Summary

Today we went over how to encode Unicode code points to UTF-8 binary, learned about the challenges IDNA addresses, and saw the Punycode solution.

I hope you had as much fun reading this article as I had writing it.

Next time we’ll use the emoji-based DNS name to generate an SSL certificate with Let’s Encrypt!


[1] The table is split in 4 parts:

Code point range     Bytes   Code point bits   Byte templates
U+0000 to U+007F     1       7                 0xxxxxxx
U+0080 to U+07FF     2       11                110xxxxx 10xxxxxx
U+0800 to U+FFFF     3       16                1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF  4       21                11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
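
As a closing sanity check, here is a small python3 sketch of the whole table (in real code, chr(cp).encode('utf-8') does all of this for you):

def utf8_encode(cp):
    # pick the row of the table based on the code point value
    if cp <= 0x7F:
        nbits, prefixes = 7, ['0']
    elif cp <= 0x7FF:
        nbits, prefixes = 11, ['110', '10']
    elif cp <= 0xFFFF:
        nbits, prefixes = 16, ['1110', '10', '10']
    else:
        nbits, prefixes = 21, ['11110', '10', '10', '10']
    bits = format(cp, '0%db' % nbits)    # pad the code point to the required bits
    out, i = [], 0
    for p in prefixes:                   # fill each byte template with payload bits
        take = 8 - len(p)
        out.append(int(p + bits[i:i+take], 2))
        i += take
    return bytes(out)

assert utf8_encode(ord('A')) == b'A'
assert utf8_encode(0x1F600) == '😀'.encode('utf-8')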