DNS emojis using punycode

In this post we’re going to create a DNS name containing emojis.
Why? Well, because it’s cool and we’re going to learn a lot! But mainly because it’s fun — and totally useless.

This post will cover:

  1. Code points
  2. Encoding UTF-8 code points to binary
  3. Internationalizing Domain Names in Applications (IDNA)
  4. Punycode


  • You will need libidn, libidn2 or python3 installed on your computer

Emojis — like the ones on your favorite smartphone — use Unicode code points, which are encoded in UTF-8 and transmitted in binary from device to device.

In order to encode a character, we need to map it to a numeric value. These values are called code points, because they point to a given character. For example, the famous smiley face emoji ‘😀’ is U+1F600, which has a decimal code point value of 128512.
Another example you may be more familiar with is ASCII. The American Standard Code for Information Interchange encodes 128 characters in 7 bits, where A is code point 0x41, or 65 in decimal. The obvious limitations of ASCII led to other designs like UTF-8.
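We can poke at code points directly in Python, where the built-in ord() converts a character to its code point value and chr() goes the other way:

```python
# ord() gives a character's code point value; chr() converts back.
print(ord('A'))        # 65 (0x41, as in ASCII)
print(ord('😀'))       # 128512
print(hex(ord('😀')))  # 0x1f600
print(chr(0x1F600))    # 😀
```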

The encoding process of UTF-8 is completely different from ASCII’s. The difference resides in the way the code point value is converted to binary. In ASCII the code point value is the binary value. For instance: A is 65, which is 01000001.
In UTF-8 we have to do more work. A single character can take between 1 and 4 bytes. The tricky part is the code point value: it is not the binary value.

To understand what the encoding process is, let’s go back to our smiley face example.

If we want to encode our smiley face to binary, here’s how we would do it:

  1. Look up the table in [1] and find the range your Unicode character falls in.
  2. The table says our Unicode character needs to be 4 bytes long and its code point value needs to be 21 bits long.
  3. So we start by converting the code point value to binary. Here 1F600 becomes: 0001 1111 0110 0000 0000.
  4. Add as many leading 0s as needed to match the required number of bits. We had 20 bits, so we add one 0 and the bit string becomes 000011111011000000000.
  5. Each UTF-8 byte has a “template”. Let’s take the first byte’s template in the table: 11110xxx. You have to replace the x’s with bits from your bit string.
  6. The 1st byte has to start with 11110, so we grab the first 3 bits to make up a byte: 11110000
  7. The 2nd byte has to start with 10, so we grab the following 6 bits to make up a byte: 10011111
  8. The 3rd byte has to start with 10, so we grab the following 6 bits to make up a byte: 10011000
  9. The 4th and last byte has to start with 10, so we grab the last 6 bits to make up a byte: 10000000
  10. The final (optional) step is to convert everything back to hexadecimal for pretty output: 0xF0 0x9F 0x98 0x80.
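The steps above can be sketched in Python. This is a minimal illustration hardcoded for the 4-byte case (encode_4byte is just a name picked for this sketch); in practice the built-in str.encode('utf-8') does all of this for you:

```python
def encode_4byte(code_point):
    # Steps 3-4: the code point value as a 21-bit string.
    bits = format(code_point, '021b')
    # Steps 5-9: fill each byte template with slices of the bit string.
    byte1 = 0b11110000 | int(bits[0:3], 2)    # 11110xxx
    byte2 = 0b10000000 | int(bits[3:9], 2)    # 10xxxxxx
    byte3 = 0b10000000 | int(bits[9:15], 2)   # 10xxxxxx
    byte4 = 0b10000000 | int(bits[15:21], 2)  # 10xxxxxx
    return bytes([byte1, byte2, byte3, byte4])

smiley = encode_4byte(0x1F600)
print(smiley.hex())                    # f09f9880 (step 10)
print(smiley == '😀'.encode('utf-8'))  # True
```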

That wasn’t so hard, was it? Wait a second! It’s not quite done. We now have 4 bytes for only one character. If we send 0xF09F9880 over the wire to a DNS server, it is going to decode it using ASCII and think it is 4 characters long. Not good.

So, we have 2 options:

  1. Change the default character decoding to UTF-8.
  2. Convert UTF-8 to ASCII.

Option 1 seems good because more and more programs are using UTF-8 by default in order to stay compatible with each other. But doing so would require upgrading every DNS server.. at the same.. time. Along with every DNS client too. You could also add a flag to specify what the encoding is going to be, but what if your awesome emoji name gets answered by an old DNS server?

Well, you probably guessed that was a bad idea. So we’re going to look at option 2. Converting UTF-8 to ASCII would be the answer, since no infrastructure change would be required! The only thing that would also be nice is the capability to go back from ASCII to UTF-8 on the client, to display our emojis 👍.

Fortunately, some super smart people have already answered this question. RFC5891 talks about this:

IDNA works by allowing applications to use certain ASCII
 string labels (beginning with a special prefix) to represent
 non-ASCII name labels. Lower-layer protocols need not be aware of
 this; therefore, IDNA does not change any infrastructure. In
 particular, IDNA does not depend on any changes to DNS servers,
 resolvers, or DNS protocol elements, because the ASCII name service
 provided by the existing DNS can be used for IDNA.

That seems like what we want!

Enter Punycode

Punycode is the way of converting back and forth between ASCII and UTF-8. It works by applying the ToASCII or ToUnicode functions as described in RFC3490 Section 4.

Let’s go over one example.

The goal is to translate the UTF-8 domain name 💻.local to ASCII, in order to access my laptop.

This requires 2 steps:

1) Translate to Punycode

To translate back and forth between ASCII and UTF-8 we need either libidn, libidn2 or python3 installed.

In python3:

>>> u'💻'.encode('idna') + b'.' + b'local'
b'xn--3s8h.local'

In libidn:

cat \
<(echo -n 'xn--') \
<(idn -e -- 💻 | tr -d '\n') \
<(echo -n '.') \
<(echo -n 'local')

As you can see, we don’t encode 💻.local directly: we encode each label separately.

A label is an individual part of a domain name. Labels are usually
 shown separated by dots; for example, the domain name
 “www.example.com” is composed of three labels: “www”, “example”, and “com”.
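Putting the two ideas together, here is a small sketch of label-by-label encoding, reusing the built-in idna codec from the python3 example above (to_ascii is just an illustrative helper name):

```python
def to_ascii(domain):
    # Split the domain into labels, encode each label separately,
    # then join the ASCII results back together with dots.
    labels = domain.split('.')
    return b'.'.join(label.encode('idna') for label in labels)

print(to_ascii('💻.local'))  # b'xn--3s8h.local'
```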

I see a hardcoded xn-- in our example! What the hell is that about?

Remember I told you about a special flag you could have used to switch between encoding methods? Well, xn-- acts a little bit like it. Whenever an IDNA-ready client wants to parse an ASCII string, if it sees the ACE prefix as defined in RFC3490 Section 5, it knows it’s Punycode. So it knows it has to display a UTF-8 string instead of pure ASCII.

The ACE prefix for IDNA is “xn--” or any capitalization thereof.
This means that an ACE label might be “xn--de-jg4avhby1noc0d”, where
 “de-jg4avhby1noc0d” is the part of the ACE label that is generated by
 the encoding steps in [PUNYCODE].
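To make the client side concrete, here is a sketch of how an IDNA-aware client could spot the ACE prefix and recover the Unicode label for display, using Python’s built-in punycode codec (display_label is a made-up name for this illustration):

```python
ACE_PREFIX = 'xn--'

def display_label(label):
    # If the label carries the ACE prefix, the rest is Punycode:
    # strip the prefix and decode the remainder back to Unicode.
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode('ascii').decode('punycode')
    # Plain ASCII labels are displayed as-is.
    return label

print(display_label('xn--3s8h'))  # 💻
print(display_label('local'))     # local
```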

Gotcha, now what?

Now that we have our Punycode string xn--3s8h.local, we can write it in /etc/hosts in order to query it.
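For reference, assuming the laptop answers on 127.0.0.1 (swap in your machine’s actual address), the /etc/hosts entry would look something like this:

```
127.0.0.1    xn--3s8h.local
```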

Put your emoji DNS name in Chrome and…:

I had a local server listening on :8080.
Chrome did translate the Unicode into Punycode!

And voilà! We successfully created our own emoji DNS name.


Today we went over how to encode UTF-8 to binary, learnt about IDNA’s challenges, and saw the Punycode solution.

I hope you had as much fun reading this article as I did writing it.

Next time we’ll use the emoji-based DNS name to generate an SSL certificate with Let’s Encrypt!

[1] The table is split into 4 parts:

Code point between U+0000 and U+007F:

Number of bytes: 1
Bits for code point: 7
1st byte: 0xxxxxxx

Code point between U+0080 and U+07FF:

Number of bytes: 2
Bits for code point: 11
1st byte: 110xxxxx
2nd byte: 10xxxxxx

Code point between U+0800 and U+FFFF:

Number of bytes: 3
Bits for code point: 16
1st byte: 1110xxxx
2nd byte: 10xxxxxx
3rd byte: 10xxxxxx

Code point between U+10000 and U+10FFFF:

Number of bytes: 4
Bits for code point: 21
1st byte: 11110xxx
2nd byte: 10xxxxxx
3rd byte: 10xxxxxx
4th byte: 10xxxxxx
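The range checks in this table are easy to turn into code. Here is a small Python sketch (utf8_length is a made-up helper name) that returns the number of bytes a code point needs, matching what the built-in encoder produces:

```python
def utf8_length(code_point):
    # Thresholds taken straight from the table above.
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError('beyond the Unicode range')

print(utf8_length(ord('A')))      # 1
print(utf8_length(0x1F600))       # 4
print(len('😀'.encode('utf-8')))  # 4, matching the table
```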