How to Easily Handle Emoji Unicode in Java
Emoji Challenge
Building Telegram bots is fun. Choosing Java to write a bot that uses emojis, however, put me in a tricky situation.
Java strings are encoded as UTF-16, so an emoji code point above U+FFFF has to be represented as a surrogate pair (and a multi-code-point emoji as a sequence of them) for Java code to process it correctly; otherwise the character will not render properly.
Working out the surrogate pair for each code point is a bit daunting to start with, and it is even harder to keep the result in sync with the list of emojis, which keeps growing. This is best automated, so that the code can be adapted easily when things change.
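To make the surrogate pair idea concrete, here is a minimal sketch that uses only the standard `Character.toChars` method. The class and method names (`SurrogatePairDemo`, `toUtf16Units`) are made up for illustration and are not part of the project's code.

```java
// Illustrative only: shows how one emoji code point maps to two UTF-16 code units.
class SurrogatePairDemo {

    // Returns the UTF-16 code units of a code point as space-separated hex values.
    static String toUtf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600;
        String emoji = new String(Character.toChars(grinningFace));

        System.out.println(toUtf16Units(grinningFace));              // D83D DE00
        System.out.println(emoji.length());                          // 2 (UTF-16 code units)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (code point)
    }
}
```

The point to notice is that `length()` counts UTF-16 code units, not emoji, which is why a single emoji above U+FFFF reports a length of 2.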
As of this writing (Sep 27, 2020), there are 1816 code points. New ones are created and tracked here.
This article demonstrates a solution that applies the ETL (extract-transform-load) design pattern to generate partial Java code from an HTML page!
1. Analysis
Understand document structure
The full emoji listing page on unicode.org is rendered as shown below. I’m illustrating the elements required to create a predictable connection between the human-readable name and the emoji’s Unicode code point.
The page consists of an HTML table. The table is split into multiple sections (with table header elements). The table rows represent the relevant Unicode code point information, and the columns contain the specific values of interest.
Model a class to store row entries
Let’s create a Java POJO, UnicodePointEntry, to extract the web page content into a structured format. This class provides a method, toEmoji(), to convert Unicode surrogate pairs into a visually representable emoji.
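As a rough idea of what such a POJO could look like, here is a minimal sketch. Only the class name `UnicodePointEntry` and the method name `toEmoji()` come from this article; the fields, constructor and getters are my assumptions for illustration, and the real class in the repository may differ.

```java
import java.util.Arrays;

// Sketch of a row entry: one human-readable name plus one or more code points.
final class UnicodePointEntry {
    private final String name;       // assumed field: human-readable name from the page
    private final int[] codePoints;  // assumed field: code points in display order

    UnicodePointEntry(String name, int... codePoints) {
        this.name = name;
        this.codePoints = Arrays.copyOf(codePoints, codePoints.length);
    }

    String getName() {
        return name;
    }

    int[] getCodePoints() {
        return Arrays.copyOf(codePoints, codePoints.length);
    }

    // Builds the displayable string; the String constructor emits
    // surrogate pairs for code points above U+FFFF.
    String toEmoji() {
        return new String(codePoints, 0, codePoints.length);
    }
}
```

With this shape, `new UnicodePointEntry("grinning face", 0x1F600).toEmoji()` yields a two-`char` string holding a single emoji.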
2. Extraction
Let’s create an HTML downloader that processes the Unicode page listing and converts the codes and human-readable names into valid UnicodePointEntry objects.
Note: The EmojiUnicodePointAndValueMaker class uses the JSoup library to process HTML content.
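The HTML download and table traversal are JSoup's job and are not reproduced here. What follows is only a sketch of the conversion that comes after the cell text has been extracted. It assumes the code column holds text such as `U+1F636 U+200D U+1F32B U+FE0F` and the name column holds text such as `flag: England`; the class `EmojiRowTransformer` and its methods are hypothetical names, not the project's API.

```java
import java.util.Locale;

// Hypothetical helper: turns extracted cell text into enum-friendly pieces.
class EmojiRowTransformer {

    // "U+1F636 U+200D" -> {0x1F636, 0x200D}
    static int[] parseCodePoints(String codeCell) {
        String[] tokens = codeCell.trim().split("\\s+");
        int[] codePoints = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            codePoints[i] = Integer.parseInt(tokens[i].replace("U+", ""), 16);
        }
        return codePoints;
    }

    // "flag: England" -> "flag_england"
    // Note: names that start with a digit would still need extra handling
    // to become valid Java identifiers.
    static String toEnumName(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "_")
                .replaceAll("^_+|_+$", "");
    }

    // Produces one line in the enum-compatible syntax, e.g. "grinning_face(0x1F600)".
    static String toEnumConstant(String name, String codeCell) {
        StringBuilder sb = new StringBuilder(toEnumName(name)).append('(');
        int[] codePoints = parseCodePoints(codeCell);
        for (int i = 0; i < codePoints.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(String.format("0x%X", codePoints[i]));
        }
        return sb.append(')').toString();
    }
}
```

Keeping the parsing separate from the HTML handling makes it easy to test the name and code conversion without a network call.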
3. Loading
Create a placeholder Emoji enum class that will help us represent the extracted Unicode values as enum values.
Note: An emoji may have a sequence of code points.
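Because an emoji may span several code points, a varargs constructor is one convenient way to model the enum. This is a minimal sketch under that assumption; the accessor name `asString()` is hypothetical, and the two constants are just placeholders in the generated syntax.

```java
// Sketch of a placeholder enum whose constants take one or more code points.
enum Emoji {
    grinning_face(0x1F600),
    face_in_clouds(0x1F636, 0x200D, 0x1F32B, 0xFE0F);

    private final String text;

    // Varargs lets single-code-point and multi-code-point emoji share one constructor.
    Emoji(int... codePoints) {
        this.text = new String(codePoints, 0, codePoints.length);
    }

    // Hypothetical accessor returning the displayable emoji string.
    String asString() {
        return text;
    }
}
```

A bot message could then be built with something like `"Hello " + Emoji.grinning_face.asString()`.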
4. Bringing the extract-load-process together!
Now that the HTML page extraction code is ready, it can be put to use in a test class.
Note: These tests are based on JUnit.
5. Code
Code is available in GitHub.
To print emojis onto display, run
mvn -DenumCompatibleSyntax=false
To generate enum values onto display, run
mvn -DenumCompatibleSyntax=true
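For readers wondering how such a switch might be wired, here is a small sketch that reads the flag with `Boolean.getBoolean`. It assumes the `enumCompatibleSyntax` value reaches the JVM as a system property; the class `OutputModeDemo` and its `render` method are hypothetical and not taken from the repository.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical sketch of toggling between emoji output and enum-syntax output.
class OutputModeDemo {

    static String render(String enumName, int[] codePoints, boolean enumCompatibleSyntax) {
        if (!enumCompatibleSyntax) {
            // Plain mode: return the emoji itself.
            return new String(codePoints, 0, codePoints.length);
        }
        // Enum mode: return a line such as "grinning_face(0x1F600)".
        return Arrays.stream(codePoints)
                .mapToObj(cp -> String.format("0x%X", cp))
                .collect(Collectors.joining(", ", enumName + "(", ")"));
    }

    public static void main(String[] args) {
        // True only when the system property is set to "true"; false when absent.
        boolean enumSyntax = Boolean.getBoolean("enumCompatibleSyntax");
        System.out.println(render("grinning_face", new int[] {0x1F600}, enumSyntax));
    }
}
```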
6. References
How unicode works?
List of emoji charts
Surrogate Pair Calculator
7. Loading generated values into Emoji class
Here is a sample of the generated enum values, truncated for brevity.
grinning_face(0x1F600),
grinning_face_with_big_eyes(0x1F603),
grinning_face_with_smiling_eyes(0x1F604),
beaming_face_with_smiling_eyes(0x1F601),
grinning_squinting_face(0x1F606),
grinning_face_with_sweat(0x1F605),
face_in_clouds(0x1F636, 0x200D, 0x1F32B, 0xFE0F),
face_with_spiral_eyes(0x1F635, 0x200D, 0x1F4AB),
flag_england(0x1F3F4, 0xE0067, 0xE0062, 0xE0065, 0xE006E, 0xE0067, 0xE007F),
flag_scotland(0x1F3F4, 0xE0067, 0xE0062, 0xE0073, 0xE0063, 0xE0074, 0xE007F),
flag_wales(0x1F3F4, 0xE0067, 0xE0062, 0xE0077, 0xE006C, 0xE0073, 0xE007F)
You can copy and paste these enum constants into the Emoji class that was created earlier.