UTF-16 is a standard method of encoding text. But what does encoding actually mean?
Every day we transmit tremendous amounts of data from one place to another over communication channels, and those channels only understand binary data, packed into bits and grouped into packets. That is what encoding is: the conversion of readable data into equivalent bits according to some standard, such as UTF-8 or UTF-16.
Again, what is UTF-16? Before diving into the UTF-16 standard, let’s understand ASCII and UTF-8 first. But why ASCII and UTF-8? Because UTF-8 is backward-compatible with ASCII, and UTF-8 and UTF-16 are two encodings of the same Unicode character set.
Let’s talk about ASCII first.
What is ASCII?
ASCII was one of the earliest encoding schemes, and it is limited to 128 characters (extended variants allow 256). ASCII can only encode the most common English letters, digits, punctuation, and a few control codes. It uses 7 bits to represent a character, and with 7 bits we can have a maximum of 2⁷ = 128 distinct combinations, which means we can represent at most 128 characters.
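The 7-bit mapping is easy to see in practice. Here is a small Python sketch (Python is an assumption here; the article’s own code is not shown) printing a few characters with their ASCII values and 7-bit binary forms:

```python
# Each ASCII character maps to a number in 0-127 that fits in 7 bits.
for ch in ["A", "a", "0", "?"]:
    code = ord(ch)                        # numeric ASCII value
    print(ch, code, format(code, "07b"))  # character, value, 7-bit binary

# Encoding with the "ascii" codec fails for anything outside 0-127;
# with errors="replace" the non-ASCII 'é' becomes a '?' placeholder.
print("é".encode("ascii", errors="replace"))  # b'?'
```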
You can consult any standard ASCII table to look up the values of characters, digits, and punctuation.
Now, let’s dive into the UTF standards. But first, the problem with ASCII: being limited to 128 characters, it can encode only a small repertoire, while nowadays we deal with huge amounts of data that can be of any type and in any language.
Across the world, thousands of natural languages are in use by people of different regions and countries. The question we should ask ourselves is: how can data be encoded and transferred so that the same text, in any of those languages, arrives intact at the receiver?
This is where the UTF standards come into the picture.
UTF (Unicode Transformation Format) is a family of standards for representing a great variety of characters from any language. To overcome ASCII’s 128-character limit, the UTF encodings were developed to encode the characters of every language.
UTF-8 is a variable-size (variable-width) encoding. It works by manipulating each character’s number (its code point) at the binary level.
Here, variable-size encoding means:
- 1-byte encoding is used for code points in the range 0–127.
- 2-byte encoding is used for code points in the range 128–2,047.
- 3-byte encoding is used for code points in the range 2,048–65,535.
- 4-byte encoding is used for code points in the range 65,536–1,114,111.
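The four length classes above can be checked directly. This Python sketch (Python is an assumption; the article shows no code of its own) encodes one sample character from each range and prints its byte length:

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": 1,   # U+0041 (65)       -> 1 byte
    "é": 2,   # U+00E9 (233)      -> 2 bytes
    "€": 3,   # U+20AC (8,364)    -> 3 bytes
    "😀": 4,  # U+1F600 (128,512) -> 4 bytes
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {encoded.hex().upper()} ({len(encoded)} bytes)")
```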
In UTF-8, the high-order bits of each byte tell how many bytes are used to encode the value. Let me explain this more deeply.
- For a 1-byte sequence, the high-order bit is 0; the remaining 7 bits encode the actual character.
- For a 2-byte sequence, the high-order bits of the 1st byte are 110, and of the 2nd byte 10.
- For a 3-byte sequence, the high-order bits of the 1st byte are 1110, of the 2nd byte 10, and of the 3rd byte 10.
- For a 4-byte sequence, the high-order bits of the 1st byte are 11110, and of the 2nd, 3rd, and 4th bytes 10.
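These prefix bits are visible if we print an encoded character in binary. A quick Python sketch for the 3-byte case, using '€' (U+20AC):

```python
# '€' needs 3 bytes in UTF-8; print each byte in binary to see the prefixes.
encoded = "€".encode("utf-8")
print(" ".join(format(b, "08b") for b in encoded))
# -> 11100010 10000010 10101100
# The three bytes start with 1110, 10, and 10, exactly as described above.
```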
Let’s take an example: how will UTF-8 encode the hexadecimal number 1FACBD? (Note that 0x1FACBD lies above Unicode’s actual maximum code point, U+10FFFF, so this example only illustrates the bit mechanics of the 4-byte pattern.)
First we convert our hexadecimal number to binary, 4 bits per hex digit: 0x1FACBD → 0001 1111 1010 1100 1011 1101.
We know our hex number is greater than 0xFFFF, which means we have to use 4-byte encoding. In 4-byte encoding, the high-order bits are 11110 for the 1st byte and 10 for the 2nd, 3rd, and 4th bytes, leaving room for 3 + 6 + 6 + 6 = 21 payload bits. So we split the low 21 bits of our number into blocks of 3, 6, 6, and 6: 111 | 111010 | 110010 | 111101.
After prefixing each block with its high-order bits, we read the four resulting bytes back as a new hexadecimal number, which is the encoded data.
The final encoded value will be: F7BAB2BD
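We can verify this by hand in Python. Built-in `chr()` refuses code points above U+10FFFF, so this sketch packs the bits manually (the helper `utf8_4byte` is my own illustrative name, not a standard function):

```python
def utf8_4byte(cp: int) -> bytes:
    """Pack the low 21 bits of cp into the 4-byte UTF-8 pattern
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (no validity checks)."""
    return bytes([
        0b11110000 | (cp >> 18),          # 11110 + top 3 bits
        0b10000000 | ((cp >> 12) & 0x3F), # 10 + next 6 bits
        0b10000000 | ((cp >> 6) & 0x3F),  # 10 + next 6 bits
        0b10000000 | (cp & 0x3F),         # 10 + low 6 bits
    ])

print(utf8_4byte(0x1FACBD).hex().upper())  # F7BAB2BD, matching the worked example
```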
This is how the UTF-8 standard works.
Now, what is UTF-16?
UTF-16 is another variable-width encoding of the same Unicode character set; unlike UTF-8, though, it is not backward-compatible with ASCII.
UTF-8 and UTF-16 are two of the established standards for encoding. They differ in the size of their code units and in how many bytes they use per character. UTF-8’s code unit is 1 byte (8 bits), and it uses 1 to 4 bytes per character, as described above; UTF-16’s code unit is 2 bytes (16 bits), and it uses either 2 or 4 bytes per character.
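The size difference is easy to measure. A Python sketch (the `utf-16-be` codec is used to avoid the 2-byte byte-order mark that plain `utf-16` prepends, so the counts reflect the characters alone):

```python
# Compare how many bytes each encoding needs per character.
for ch in ["A", "é", "€", "😀"]:
    u8, u16 = ch.encode("utf-8"), ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}: UTF-8 {len(u8)} bytes, UTF-16 {len(u16)} bytes")
```

Note that ASCII text is twice as large in UTF-16 as in UTF-8, while a character like '😀' needs 4 bytes in both (UTF-16 uses a surrogate pair).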
Let's look at this example:
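The original snippet is not shown in the article; here is a minimal reconstruction in Python (an assumption; Python’s `\uXXXX` string escapes follow the same four-hex-digit convention described below):

```python
# '¿' is U+00BF, 'ó' is U+00F3, 'á' is U+00E1 -- written as \u escapes.
greeting = "\u00bfC\u00f3mo est\u00e1s?"
print(greeting)       # ¿Cómo estás?
print(len(greeting))  # 12
```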
In the above code, the Unicode escapes begin with the characters \u and are followed by exactly four hexadecimal digits.
In this example, ‘¿Cómo estás?’ is a Spanish string. We know that UTF-16 uses 2-byte (16-bit) code units to represent characters, and if we run this code we get a length of 12.
UTF-16 uses 2 bytes per code unit, so why are we getting 12? Because 12 is the number of characters (UTF-16 code units), not bytes. Every character in this string lies in the Basic Multilingual Plane, so each one fits in a single 2-byte code unit, and the encoded string occupies 12 × 2 = 24 bytes.
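A quick Python check of the count-versus-bytes distinction (again using `utf-16-be` so no byte-order mark is included):

```python
greeting = "\u00bfC\u00f3mo est\u00e1s?"   # ¿Cómo estás?
print(len(greeting))                        # 12 -- characters (code units)
print(len(greeting.encode("utf-16-be")))    # 24 -- bytes: 12 x 2
```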
Finally, I hope this explanation makes sense to you. Feel free to Clap and Comment. Have fun!