Understanding UTF in JavaScript

Kumar Abhishek
webisbeautiful

--

Overview

  • JavaScript uses UTF-16 encoding for strings.
  • The characters forming a string can come from the BMP (Basic Multilingual Plane, code points U+0000 to U+FFFF) or from outside it (non-BMP, code points U+10000 to U+10FFFF).
  • The length of a string is simply the number of 16-bit units (code units) it occupies. For example, a string of length 4 is made of 4 units, each 16 bits wide.
  • Every BMP character is represented by one 16-bit unit.
  • Every non-BMP character is represented by 2 units of 16 bits (i.e. 32 bits or 4 octets). For example, emoji and some rarer CJK ideographs take 2 units, while most everyday Japanese characters are in the BMP and take 1.
  • Mathematically speaking, the size or length of a string in JavaScript is of the form x + y, where x is the total number of characters (both BMP and non-BMP) and y is the total number of non-BMP characters (y is 0 if the string has only BMP characters). That is, every non-BMP character contributes one extra 16-bit unit to the total length of the string.
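The x + y relation can be checked with a few lines of code. The following is only a minimal sketch: lengthBreakdown is a hypothetical helper name (not a built-in), and it assumes the string is well formed, i.e. it contains no unpaired surrogate halves.

```javascript
// Hypothetical helper: splits a string's length into the x and y terms.
function lengthBreakdown(str) {
  let characters = 0; // x: total number of characters (code points)
  let nonBmp = 0;     // y: characters above U+FFFF
  for (const ch of str) { // for...of walks a string by code point
    characters += 1;
    if (ch.codePointAt(0) > 0xFFFF) {
      nonBmp += 1;
    }
  }
  return { characters, nonBmp, length: str.length };
}

const result = lengthBreakdown('Hello😛');
console.log(result); // { characters: 6, nonBmp: 1, length: 7 }
```

For any well-formed string, characters + nonBmp should equal length.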

Illustration

Let’s consider the string ‘Hello😛’. Its length is 7:

'Hello😛'.length; // 7

Here 'H', 'e', 'l', 'l', 'o' and '😛' account for 6 units of the size (or length) of the string. Since '😛' is a non-BMP character, it contributes 1 extra unit. In other words, the string has 1 non-BMP character, which brings its total size to 6 + 1 = 7. Relating this to the formula x + y, we have x = 6 (total number of characters) and y = 1 (number of non-BMP characters, here '😛').

Implications

We have seen that the length of a string in JavaScript is of the form x + y. A string in JavaScript is a sequence of 16-bit units, not a sequence of characters. In other words, s[i] retrieves the unit at index i of a string s. A common misconception is that s[i] means the character at position i in s. This is only true for strings that contain only BMP characters. The moment a non-BMP character occurs in a string, the positions of all subsequent characters shift, because a non-BMP character spans 2 units. The following example makes this clear.

Consider the string s = ‘Hello😛World’. When we access s unit by unit up to the character ‘o’, we get:

s[0] // 'H'
s[1] // 'e'
s[2] // 'l'
s[3] // 'l'
s[4] // 'o'

What do you think the value of s[5] would be? We get '�', because the unit at s[5] is only one half of the two-unit encoding of '😛' and on its own does not map to any printable character. '�' is just a placeholder the browser picks to show that it cannot render the value. The same is true for s[6], which also shows '�'. We can get the decimal value stored in units s[5] and s[6] as below:

s[5] // '�'
s[6] // '�'
s[5].charCodeAt(0) // 55357
s[6].charCodeAt(0) // 56859

Writing the values of the units at s[5] and s[6] in hexadecimal, we get:

s[5] // 0xD83D
s[6] // 0xDE1B

So s[5] and s[6] together represent ‘😛’. That is, the values 0xD83D and 0xDE1B together encode one non-BMP character, ‘😛’. Such a pair of 16-bit values is called a surrogate pair: the first value is the high surrogate and the second is the low surrogate.
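To see how the two units combine into a single character, here is a small sketch. codePointAt and String.fromCodePoint are standard string methods; decodeSurrogatePair is a hypothetical helper written only for illustration, applying the standard UTF-16 decoding arithmetic and assuming it is handed a valid high/low pair.

```javascript
const s = 'Hello😛World';

// Built-in: codePointAt reads the whole code point when the index
// points at the first (high) half of a surrogate pair.
const viaBuiltin = s.codePointAt(5);

// Hypothetical helper: manual UTF-16 decoding of a surrogate pair.
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}
const viaFormula = decodeSurrogatePair(s.charCodeAt(5), s.charCodeAt(6));

console.log(viaBuiltin.toString(16));          // "1f61b"
console.log(viaFormula === viaBuiltin);        // true
console.log(String.fromCodePoint(viaFormula)); // "😛"
```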

Challenges

We saw above that each non-BMP character needs 2 units (a surrogate pair). This breaks normal index-based iteration over the characters of a string in JavaScript. It prevents us from treating a string as an array of characters for purposes such as counting character frequencies, replacing a character, or inserting characters at a given position. In short, any string calculation expressed in terms of characters can go wrong once a non-BMP character is present. Luckily there is a workaround for this problem, which we will look at next.
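Before moving on, here is a small sketch of how such failures look in practice. It only illustrates the problem; the variable names are chosen for this example.

```javascript
const s = 'Hello😛World';

// Naive frequency count that treats s[i] as a character.
const naiveCounts = {};
for (let i = 0; i < s.length; i++) {
  naiveCounts[s[i]] = (naiveCounts[s[i]] || 0) + 1;
}

// The emoji is never counted; its two halves are counted separately.
console.log('😛' in naiveCounts);     // false
console.log('\uD83D' in naiveCounts); // true
console.log('\uDE1B' in naiveCounts); // true

// Naive reversal by unit swaps the two halves and destroys the emoji.
const naiveReversed = s.split('').reverse().join('');
console.log(naiveReversed.includes('😛')); // false
```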

Workaround

To treat a string as an array for any practical purpose, we first need to break it into an array of characters (of both types, BMP and non-BMP). This array will have exactly as many items as the string has characters (be aware that this is not the length or size of the string!). Taking the same example string 'Hello😛World' once again, we can convert it into an array of characters as below:

const [...ca] = 'Hello😛World';
ca // (11) ["H", "e", "l", "l", "o", "😛", "W", "o", "r", "l", "d"]
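This works because destructuring with a rest element uses the string iterator, which walks the string character by character instead of unit by unit. As a sketch, spread syntax and Array.from (both standard) rely on the same iterator and give the same result:

```javascript
const s = 'Hello😛World';

const viaSpread = [...s];
const viaArrayFrom = Array.from(s);

console.log(viaSpread.length); // 11 (number of characters)
console.log(s.length);         // 12 (number of 16-bit units)
console.log(viaArrayFrom[5]);  // "😛"
console.log(viaArrayFrom[6]);  // "W"
console.log(s[6] === 'W');     // false: s[6] is half of the surrogate pair
```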

In JavaScript, array items need not be homogeneous: each item can be of any data type and any size. So the character array ca above consists of 11 items, each representing one character, whether BMP or non-BMP. This is exactly what we want for character-based string processing. For example, to count the frequencies of characters in the above string we can write:

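const [...ca] = 'Hello😛World';
// Build an object that maps each character to its number of occurrences.
const frequencies = ca.reduce((obj, v) => {
  obj[v] = (obj[v] || 0) + 1;
  return obj;
}, {});
frequencies
// H: 1
// e: 1
// l: 3
// o: 2
// 😛: 1
// W: 1
// r: 1
// d: 1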
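The same count can also be done without building the intermediate array, by iterating the string directly with for...of. The sketch below is one possible variant, not the only way to do it: countCharacters is a hypothetical helper name, and a Map is used instead of a plain object simply so that any character can serve as a key.

```javascript
// Hypothetical helper: counts characters, not 16-bit units.
function countCharacters(str) {
  const counts = new Map();
  for (const ch of str) { // for...of yields whole characters
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return counts;
}

const counts = countCharacters('Hello😛World');
console.log(counts.get('l'));  // 3
console.log(counts.get('😛')); // 1
console.log(counts.size);      // 8 distinct characters
```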

Conclusion

  • A string in JavaScript is encoded in UTF-16, with each character taking 1 unit (BMP characters) or 2 units (non-BMP characters) of 16 bits.
  • A string in JavaScript is not an array of characters but a sequence of units, each 16 bits wide.
  • Each non-BMP character spans 2 units of 16 bits, thereby contributing one extra unit per character to the length of the string.
  • The length or size of a JavaScript string can be represented mathematically as x + y, where x is the total number of characters and y is the total number of non-BMP characters.
  • ES6 rest/spread syntax (for example const [...ca] = s, as used above) can be used to convert a string into an array of characters.
