Dan Fyfe
3 min read · Apr 16, 2019


“Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.” *

History

The Unicode Consortium was established in 1991, after several years of work on a new character encoding called ‘Unicode’ by Xerox’s Joe Becker and Apple’s Lee Collins and Mark Davis.

Unicode Consortium

The Unicode Consortium’s goal is to develop and promote the use of the Unicode Standard and other related standards throughout the world.

Unicode Standard

Simply put, the Unicode Standard uses numbers to represent a character set and leaves the conversion up to whatever software is being used. There are plenty of other standards that do similar things, but the Unicode Standard is compatible with the most widely used international standards, which makes multilingual computing much easier.
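To make that concrete, here is a minimal sketch in Python (my own example, not part of the standard itself) showing the character-to-number mapping using the built-in ord() and chr() functions:

```python
# Each character has a unique Unicode code point: ord() gives the number,
# chr() goes back the other way.
for ch in ["A", "é", "中", "🙂"]:
    code_point = ord(ch)              # character -> number
    assert chr(code_point) == ch      # number -> character
    print(f"{ch!r} is U+{code_point:04X} (decimal {code_point})")
```

The software in use (a browser, a text editor, a font renderer) decides how those numbers get displayed; the standard only fixes the mapping.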

A few other Standards

ASCII: (ASS-kee) American Standard Code for Information Interchange

  • encodes 128 characters as 7-bit integers
  • contains the English alphabet (uppercase and lowercase), the digits 0–9, common punctuation, and 33 non-printable control codes originally designed to send instructions to devices such as printers, most of which are rarely used today (a quick check appears below)
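As a quick sanity check, here is a small sketch of my own using Python’s built-in ascii codec:

```python
# Every ASCII byte fits in 7 bits (values 0 through 127)...
text = "Hello, ASCII!"
encoded = text.encode("ascii")
assert all(byte < 128 for byte in encoded)    # 7-bit range
print(list(encoded)[:5])                      # [72, 101, 108, 108, 111]

# ...and anything outside the 128-character set simply cannot be encoded.
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err)
```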

ISO/IEC 8859: International Organization for Standardization / International Electrotechnical Commission

  • a series of standards encoding the character sets, most of them Latin-based, needed to express many European languages

GB 18030: Official Character Set of The People’s Republic of China

  • supports both simplified and traditional Chinese characters (compared with the other standards in the sketch below)
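To see how these standards differ in practice, here is a rough comparison of my own using Python’s standard codecs (latin-1 is Python’s name for ISO/IEC 8859-1); it is a sketch, not part of any of the standards themselves:

```python
# How the same characters fare under three different encodings.
samples = ["é", "中"]
for codec in ("latin-1", "gb18030", "utf-8"):
    for ch in samples:
        try:
            print(f"{codec:8} {ch!r} -> {ch.encode(codec).hex()}")
        except UnicodeEncodeError:
            print(f"{codec:8} {ch!r} -> not representable")
```

Note how ISO 8859-1 simply has no slot for ‘中’, and how the encodings that can represent a character may produce different bytes for it.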

Why is Unicode important?

Fundamentally, computers deal with numbers, not characters or letters. The Unicode Standard assigns each useful character a number so that software can convert data easily and without a high risk of corruption. Before there was a general, unified standard, computers used proprietary systems to convert text, which could easily produce corrupt or incorrect data when that information was passed between different computers. The need for a standard became undeniable once the Internet allowed enormous amounts of data to be exchanged between countless computers, whose users preferred uncorrupted data.
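Here is a tiny sketch (my own example) of the kind of corruption described above, often called “mojibake”: bytes written under one encoding and read back under another.

```python
original = "café"
as_bytes = original.encode("utf-8")      # what one program writes
misread = as_bytes.decode("latin-1")     # what another program assumes
print(misread)                           # 'cafÃ©', the text is now garbled
```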

How does it work?

  • The Unicode Standard defines a few character encodings; the smallest and most widely used is UTF-8 (Unicode Transformation Format, 8-bit). Other encodings include UTF-16 and UTF-32, but they are not backwards compatible with ASCII, while UTF-8 is. HTML5 supports both UTF-8 and UTF-16, and major operating systems, like Microsoft Windows, use UTF-16 internally.
  • UTF-8 encodes each character using 1 to 4 bytes. The first 128 characters are encoded in exactly the same way as ASCII, which makes UTF-8 completely backwards compatible with it. Character code points can be written as decimal or hexadecimal values (hexadecimal, as in U+0041 for ‘A’, is the conventional form).
  • Unicode is the character set that maps characters to numbers (code points), and UTF-8 is the encoding that turns those numbers into binary; the short example after this list shows both halves in action.
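Here is a short example of my own showing both points: the variable width of UTF-8 and its byte-for-byte compatibility with ASCII.

```python
# UTF-8 is variable width: 1 to 4 bytes per character.
for ch in ["A", "é", "€", "😀"]:
    utf8 = ch.encode("utf-8")
    print(f"{ch!r} U+{ord(ch):04X} -> {len(utf8)} byte(s): {utf8.hex()}")

# Backwards compatibility: the first 128 code points encode to the same
# single byte as ASCII.
assert "A".encode("utf-8") == "A".encode("ascii") == b"\x41"

# UTF-16, by contrast, never uses fewer than 2 bytes per character.
print("A".encode("utf-16-be").hex())   # '0041' (2 bytes even for plain ASCII)
```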

Below are visualizations of ASCII and UTF-16 that help illustrate the massive difference in capabilities.

[Image: ASCII character table] **
[Image: UTF-16 code chart] ***

There is a lot more to the Unicode Standard. Explore the sources below for more in-depth reading!

Now for the fun stuff: Emojis!

Unicode also deals with the wonderful world of emojis! The Unicode Consortium accepts submissions to alter or add emojis. The submission process is somewhat lengthy and surprisingly in-depth considering the subject matter. However, one could argue there is significant cultural weight behind the emojis themselves. Below is a link to a submission that a few of my classmates might find very exciting!
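For a sense of how emoji fit into the same machinery, here is a small example of my own (not tied to any particular submission):

```python
# An emoji is just another code point.
emoji = "🎉"
print(f"U+{ord(emoji):04X}")            # U+1F389
print(emoji.encode("utf-8").hex())      # 'f09f8e89' (4 bytes in UTF-8)

# Some emoji are actually sequences of code points joined by U+200D,
# the zero-width joiner.
family = "👨\u200d👩\u200d👧"
print([f"U+{ord(c):04X}" for c in family])
```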

Conclusion

Unicode is an important part of computing in today’s world. One can argue that the Internet/World Wide Web would not work as well as it does without it. This post only scratches the surface of a very large system. It would be worth checking out the sources below if you are interested in a more technical look at Unicode. Thank you to all of my sources!

Thanks for reading!

Sources
