Emojis — From a programmer’s eye

Rishabh Khemka
Published in Bobble Engineering
7 min read · Oct 6, 2020

Emojis have become an inherent part of our conversations. More than 700M emojis are shared daily across all chat applications. The most fascinating thing about emojis is that the same emoji can be used on different platforms and applications to express the same emotion, but in a different way. Take, for example, WhatsApp and Facebook — the same emoji has a different graphical representation on each platform, though the expression remains the same. As users, we get these emojis served to us on a platter of 3304 (and counting) globally recognised emojis under 9 different categories. If you are wondering about the recipe behind this, you have landed on the right page.

How does a computer handle an emoji?

Emojis are the same as symbols of other languages, just with a different script. To understand how computers handle emojis, one must know about Unicode and the different encoding formats. Let’s break this question into sections. Feel free to skip any section you are already familiar with.

  1. Unicode
  2. Unicode encoding formats
  3. Emojis

What is Unicode?

Unicode is a globally accepted encoding standard for text. Wait, what is encoding? Any text that you see on a computer is stored in bits. Each character is mapped to an integer whose binary form can be stored in computer memory. These bits are reverse-mapped by the computer and a corresponding glyph (a graphical representation, the fundamental unit of a font) is drawn on the screen. Encoding is the process of converting the mapped numbers into bits which can be stored, transmitted and understood by the computer. You may be reminded of ASCII. Yes, it’s the same idea. Unicode is just an extension of the existing ASCII system.
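You can see this character-to-number mapping directly from Java (a minimal sketch; the characters chosen here are just examples):

// Every character is ultimately stored as a number (its code point)
char c = 'A';
System.out.println((int) c);  // 65, the same value ASCII assigns to 'A'
System.out.println("\u20AC"); // prints €, the character mapped to U+20AC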

Unicode and UTF-16 encoding

Why is Unicode needed?

ASCII is a very old system and can map a maximum of 128 characters (which includes the English alphabet and digits). This is far too few once you take into consideration the number of characters and symbols in other languages. Therefore, Unicode came into existence to cover the existing as well as the foreseen list of characters and symbols and to establish an encoding standard. This is the reason why text sent from one part of the world using one device is perceived the same in another part of the world on a different device.

As of March 2020, the Unicode repository consists of 143,859 characters, which include almost every language’s alphabet, special characters, numbers and emojis too. The current encoding formats can represent up to 2,097,152 values (the Unicode standard itself caps the code space at 1,114,112 code points, U+0000 to U+10FFFF). You may be wondering, “I still use ASCII for my programs and yet they work fine!” Yes, because Unicode is backwards compatible with ASCII.

Unicode encoding formats

How is Unicode able to store so many characters? The upper bound of 2,097,152 values would require 21 bits (or 3 bytes, rounded up) to store one character. And since the count keeps growing with new additions, we may need even more bytes in the future.

Imagine this blog of 9,230 characters (including spaces and newlines): encoded with ASCII it would take about 9.2 KB, while at 3 bytes/character it would take about 27.6 KB, even though ASCII alone was sufficient to encode it. Such a fixed-width scheme is inefficient for storing, transmitting and parsing data.

To overcome this problem, various encoding techniques exist, the most popular being UTF-8, UTF-16 and UTF-32. ASCII, as we know, is a fixed-width encoding, which means that each character occupies exactly one byte, whereas UTF-8 and UTF-16 support variable-width encoding.

Fixed-length and Variable-length encodings

UTF-8 and UTF-16 are variable-length encodings, which means that different numbers can be encoded with different numbers of bytes depending on the size of the number. UTF-8 and UTF-16 can encode a single character in a minimum of 1 and 2 bytes respectively, and a maximum of 4 bytes each. UTF-32, on the other hand, is a fixed-length encoding: each encoded number takes a fixed length of 4 bytes.
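You can observe these widths from Java by encoding the same string three ways (a minimal sketch; note that “UTF-32” is an extended charset, so its availability can vary across JVMs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingWidths {
    public static void main(String[] args) {
        String s = "Hi \uD83D\uDE03"; // "Hi 😃": three ASCII-range characters plus one emoji
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 7  = 1+1+1+4
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 10 = 2+2+2+4
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 16 = 4+4+4+4
    }
}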

Fixed-length encodings are very straightforward and no transformations are required. The binary of the character to be encoded is stored in a fixed number of bytes (say x), and the decoder reads x bytes of data in each iteration to extract one character. In a variable-length encoding this would cause ambiguity, as the decoder would not know whether a given byte is an independent number or should be combined with the following bytes.

Hence, we need to stuff extra information into the bit sequence so that consecutive bytes of the same number can be glued together and interpreted as a single number. This is done by adding glue bits to the existing binary sequence. A single unit of the encoded output, glue bits included, is called a code unit (1 byte in the case of UTF-8 and 2 bytes in the case of UTF-16). Confusing, right? Let’s see an example.

Variable-length encodings at play

The 1s and 0s in the chart below are the glue bits, which help the decoder know that the character is made up of multiple code units instead of a single one, and the ‘x’s represent the bits of the number to be encoded.

Let’s encode the emoji ‘😃’, whose Unicode code point is U+1F603 (this means the character is mapped to the number 0x1F603). The binary of this number is 11111011000000011 (17 bits).

UTF-8: The 1st, 2nd and 3rd formats can’t fit this many bits, so we choose the 4th format.

UTF-8 encoding chart (xxxx… = binary of the number to be encoded)
UTF-8 encoding of 😃
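Working it out by hand: the 4-byte UTF-8 template is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so we pad the 17 bits to 21 and slot them in:

Padded to 21 bits:  000 011111 011000 000011
Into the template:  11110000 10011111 10011000 10000011
Resulting bytes:    F0 9F 98 83

A one-line Java check (assuming a UTF-8-capable console) confirms it:

for (byte b : "\uD83D\uDE03".getBytes(java.nio.charset.StandardCharsets.UTF_8))
    System.out.printf("%02X ", b); // prints F0 9F 98 83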

UTF-16: UTF-16 works similarly but with some additional processing. For the laughing emoji, the 1st and 2nd formats can’t fit the data, so the 3rd format (a surrogate pair) is chosen.

UTF-16 encoding chart (xxxx… = binary of the number to be encoded)

Here in the table, yy..yy = xx..xx − 0x10000 (subtract 0x10000 from the code point)

UTF-16 encoding of 😃
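Again by hand: the surrogate-pair template is 110110yyyyyyyyyy 110111yyyyyyyyyy, where the yy bits are the code point minus 0x10000, padded to 20 bits:

0x1F603 − 0x10000 = 0xF603 → 0000111101 1000000011 (20 bits)
High code unit: 110110 0000111101 = 0xD83D
Low code unit:  110111 1000000011 = 0xDE03

Java’s standard library exposes this arithmetic directly, if you want to verify:

System.out.printf("%X %X%n",
        (int) Character.highSurrogate(0x1F603),  // D83D
        (int) Character.lowSurrogate(0x1F603));  // DE03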

Emojis

Now that you know Unicode, you can encode all the emojis. Check out here for the list of all the latest emojis and their code points. Whether an emoji is supported or not depends on the platform where you are trying to render it. This is because Unicode keeps updating itself, and old devices don’t get the updates so frequently. “This device doesn’t support this emoji” simply means that the device has no font mapping for that emoji, and hence most devices display the fallback glyph, informally called the ‘tofu’ (□).

But which encoding should I use? That depends on the programming language you are using. Let’s consider Java. Java’s char data type stores data in UTF-16 encoded form, so 1 char represents a single UTF-16 code unit and hence takes 2 bytes. So if you want to store and print ‘😃’, you need to store its two UTF-16 code units (D83D and DE03) in a variable. This can be done using the escape sequence “\u”.

String smile = "\uD83D\uDE03";
// Prints 😃 (if the console has a glyph for this emoji)
System.out.println(smile);
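If you would rather not hand-compute the surrogates, the standard library can derive them from the code point (a small sketch):

// Build the same string directly from the code point U+1F603
String smile2 = new String(Character.toChars(0x1F603));
System.out.println(smile2);                      // 😃
System.out.println(smile2.length());             // 2, two UTF-16 code units
System.out.println(smile2.codePointCount(0, 2)); // 1, but a single code point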

P.S.: When a String is saved to a file using the standard file output streams, the data is automatically converted to the device’s default encoding scheme (which is UTF-8 for most devices nowadays), unless the encoding of the stream is changed explicitly (constructors and functions are available for this purpose).
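For example, to pin the encoding explicitly rather than rely on the platform default (a minimal sketch; the file name is just an illustration):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class SaveEmoji {
    public static void main(String[] args) throws IOException {
        // Force UTF-8 regardless of the platform's default charset
        try (OutputStreamWriter out = new OutputStreamWriter(
                new FileOutputStream("emoji.txt"), StandardCharsets.UTF_8)) {
            out.write("\uD83D\uDE03"); // stored on disk as the 4 bytes F0 9F 98 83
        }
    }
}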

Emoji variants

You must have seen those popups on emojis showing different variants of the same emoji based on skin colour, gender, hair colour and more. The mapping of these emojis is not done at random but in an organised manner. The variants are made by appending one to three extra characters: a ZWJ, a variation selector and/or a modifier character.

ZWJ is the Zero Width Joiner (U+200D), a non-printable character. Its only purpose is to let the decoder know that the characters around it should be joined and rendered as a single glyph.

The variation selector (U+FE0F), similarly, is a non-printable character, used in certain emoji sequences to request the emoji (pictorial) presentation of the character it follows.

Modifiers are specific characters which add certain properties to an emoji. Some of the modifiers are:

  1. Gender: (U+2642 — male sign, U+2640 — female sign)
  2. Hair: (U+1F9B0 — red hair, U+1F9B1 — curly hair, U+1F9B2 — bald, U+1F9B3 — white hair)
  3. Skin tones: (U+1F3FB — light, U+1F3FC — medium-light, U+1F3FD — medium, U+1F3FE — medium-dark, U+1F3FF — dark)

E.g.: The emoji 👌 (U+1F44C) can be converted to 👌🏻 (1F44C 1F3FB), 👌🏼 (1F44C 1F3FC), 👌🏽 (1F44C 1F3FD), 👌🏾 (1F44C 1F3FE) and 👌🏿 (1F44C 1F3FF)
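In code, a skin-tone variant is just the base emoji followed by the modifier’s code point (a minimal Java sketch):

// OK hand (U+1F44C) + medium skin tone modifier (U+1F3FD) = 👌🏽
String okHandMedium = new StringBuilder()
        .appendCodePoint(0x1F44C)
        .appendCodePoint(0x1F3FD)
        .toString();
System.out.println(okHandMedium);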

A lot of emojis can be cooked up by combining emojis and modifiers. And not only can an emoji be mixed with modifiers; one emoji can also be mixed with another emoji.

E.g., the black flag (U+1F3F4) and the skull and crossbones (U+2620) can be merged to get a pirate flag! But there is a limited number of officially recognised sequences like this, which can be found here.
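The recognised pirate-flag sequence joins the two with a ZWJ and ends with the variation selector; sketching it in Java:

// 🏴 (black flag) + ZWJ + ☠ (skull and crossbones) + variation selector = 🏴‍☠️
String pirateFlag = new StringBuilder()
        .appendCodePoint(0x1F3F4) // black flag
        .appendCodePoint(0x200D)  // zero width joiner
        .appendCodePoint(0x2620)  // skull and crossbones
        .appendCodePoint(0xFE0F)  // variation selector: emoji presentation
        .toString();
System.out.println(pirateFlag); // renders as one glyph where the sequence is supported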

Conclusion

Emojis have been a part of our conversations for a very long time and have kept evolving. Technology has always played its part in enriching day-to-day conversations, making them more expressive and enjoyable, and it will continue to do so.

Evolution of emojis from Text-drawings → Emoticons → Emojis → Bigmojis

Rishabh Khemka
Bobble Engineering

I like to write about things that I was searching for some time back but couldn’t find then.