What is a surrogate?

Igor Komolov
1 min read · Nov 25, 2023

In Unicode, a surrogate is a code point in the range U+D800 to U+DFFF that is reserved for use in pairs by UTF-16 to represent characters outside the Basic Multilingual Plane (BMP). The BMP holds the most commonly used characters, at code points U+0000 to U+FFFF.

However, Unicode's code space extends well beyond the BMP. To represent the supplementary code points (U+10000 to U+10FFFF), UTF-16 uses surrogate pairs. A surrogate pair consists of two 16-bit code units:

  • A high surrogate (in the range U+D800 to U+DBFF)
  • A low surrogate (in the range U+DC00 to U+DFFF)
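
To make the two ranges concrete, here is a minimal Python sketch that splits a string into its UTF-16 code units and checks which range each unit falls in. The helper names (`utf16_code_units`, `is_high_surrogate`, `is_low_surrogate`) are made up for this illustration and are not part of any library.

```python
import struct

def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units that encode `text`."""
    raw = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return list(struct.unpack(f">{len(raw) // 2}H", raw))

def is_high_surrogate(unit: int) -> bool:
    # High (leading) surrogates: U+D800 to U+DBFF
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    # Low (trailing) surrogates: U+DC00 to U+DFFF
    return 0xDC00 <= unit <= 0xDFFF

# "A" is in the BMP: one code unit. U+1F600 is not: two code units.
print([hex(u) for u in utf16_code_units("A")])
units = utf16_code_units("\U0001F600")
print([hex(u) for u in units])
print(is_high_surrogate(units[0]), is_low_surrogate(units[1]))
```

A BMP character yields a single code unit, while a character outside the BMP yields a high surrogate followed by a low surrogate.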

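Together, the two code units encode a single code point: the high surrogate carries the upper 10 bits and the low surrogate the lower 10 bits of the code point's offset from U+10000. For example, U+1F600 (the grinning face emoji) is encoded in UTF-16 as the high surrogate 0xD83D followed by the low surrogate 0xDE00. Because each surrogate range holds 1,024 values, pairs cover 1,024 × 1,024 = 1,048,576 code points, which lets UTF-16 reach every code point up to U+10FFFF while BMP characters remain a single 16-bit code unit.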
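
The arithmetic behind a surrogate pair can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, chosen only for this example; in practice a language's built-in UTF-16 codec does this work for you.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

Subtracting 0x10000 first is what makes the scheme fit: the remaining offset is at most 0xFFFFF, exactly 20 bits, so it splits cleanly into two 10-bit halves.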

Igor Komolov

Founder & CTO who writes code, day trades, cycles, golfs, takes pictures, makes art and reads a lot.