ASCII, ANSI, Unicode, and UTF-8

Mar 28, 2020

ASCII

Uses 1 byte per character. Only 7 bits are used, so the range is 0x00 to 0x7F.

Characters are encoded according to the ASCII table.

For example, J is encoded as 0x4A in the ASCII table.
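
As a rough illustration of the 7-bit range, the sketch below checks whether a byte string is pure ASCII and returns the code of a single character. The helper names `is_ascii` and `ascii_code` are made up for this example and are not standard functions.

```cpp
#include <cassert>
#include <string>

// Returns true if every byte is in the ASCII range 0x00..0x7F.
// (`is_ascii` is a hypothetical helper, not a standard function.)
bool is_ascii(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b > 0x7F) return false;
    }
    return true;
}

// Returns the ASCII code of `c`. The assert documents the 7-bit precondition.
int ascii_code(char c) {
    const unsigned char b = static_cast<unsigned char>(c);
    assert(b <= 0x7F && "not an ASCII character");
    return b;
}
```

With these helpers, `ascii_code('J')` yields 0x4A, and `is_ascii` rejects any string that contains a byte above 0x7F, such as text in Big5 or UTF-8.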

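ANSI

Uses 1 or 2 bytes per character, depending on the code page. Single-byte code pages such as Windows-1252 (CP1252, often loosely called Latin-1) use 1 byte for every character. Double-byte code pages such as Big5 (CP950) and Shift_JIS (CP932) use 1 byte for ASCII characters and 2 bytes for most other characters.

On Windows, "ANSI" is the conventional name for these Windows code pages.

For example, 人 is encoded as 0xA448 in CP950.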
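
In a double-byte code page, the number of bytes is not the number of characters. The sketch below counts characters in a CP950 byte string. It assumes well-formed input and that CP950 lead bytes fall in 0x81..0xFE. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumption: in CP950 (Big5), a byte in 0x81..0xFE starts a 2-byte character.
bool is_cp950_lead_byte(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

// Counts characters in a CP950 byte string (no validation of trail bytes).
std::size_t count_cp950_chars(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        i += is_cp950_lead_byte(b) ? 2 : 1;  // skip the trail byte, if any
        ++count;
    }
    return count;
}
```

Note that the trail byte of 人 is 0x48, which is H in ASCII. A trail byte only has meaning together with its lead byte, so the string has to be scanned from the start.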

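Unicode

Unicode assigns every character a number called a code point, in the range U+0000 to U+10FFFF. On Windows, "Unicode" usually means UTF-16, which uses 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF.

For example, 人 is defined as 0x4EBA (U+4EBA) in the Unicode table.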
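
On Windows, wide strings hold UTF-16 code units rather than raw code points. The sketch below shows how one code point maps to one or two 16-bit units. `encode_utf16` is a hypothetical helper, and it assumes the input is a valid code point.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes `cp` is valid: at most 0x10FFFF and not itself a surrogate.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: the code unit equals the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split 20 bits into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For 人 (U+4EBA) the result is the single unit 0x4EBA, which is why the stored value and the code point look identical. A code point such as U+1F600 needs two units, 0xD83D and 0xDE00.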

UTF-8

A variable-length character encoding that can encode every Unicode code point.

Uses 1 to 4 bytes per character.

For example, 人 is encoded as 0xE4BABA (the bytes 0xE4, 0xBA, 0xBA) in UTF-8.
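
To see where the 1 to 4 bytes come from, here is a minimal sketch of a UTF-8 encoder for a single code point. `encode_utf8` is a hypothetical helper. It rejects values above U+10FFFF but, for brevity, does not reject surrogates.

```cpp
#include <stdexcept>
#include <string>

// Encodes one Unicode code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {            // 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {    // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {   // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) { // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

Calling `encode_utf8(0x4EBA)` takes the 3-byte branch and produces 0xE4, 0xBA, 0xBA. An ASCII character such as J (0x4A) stays a single byte, so ASCII text is already valid UTF-8.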

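Unicode code points can be encoded to ANSI or UTF-8, and ANSI or UTF-8 bytes can be decoded back to Unicode code points. Encoding to an ANSI code page loses any character that the code page does not contain, whereas UTF-8 can represent every code point.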
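
The decoding direction can be sketched in the same spirit. The hypothetical `decode_utf8` below turns UTF-8 bytes back into code points. It assumes well-formed input and only checks for a truncated final sequence.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into Unicode code points (minimal validation only).
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how long the sequence is.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        if (i + len > bytes.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Decoding the bytes 0xE4, 0xBA, 0xBA gives back the single code point 0x4EBA, so the conversion round-trips.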

On Windows, use WideCharToMultiByte with the target code page (for example 950 for Big5, or CP_UTF8 for UTF-8) to convert Unicode to ANSI or UTF-8.

Before conversion, 人 is stored as the single 16-bit value 0x4EBA in std::wstring, which equals its Unicode code point.

After converting Unicode to ANSI, 人 is stored as the bytes 0xA4, 0x48 in std::string, which is its Big5 encoding.

After converting Unicode to UTF-8, 人 is stored as the bytes 0xE4, 0xBA, 0xBA in std::string, which is its UTF-8 encoding.
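
A minimal sketch of this conversion follows. It is Windows-only, so it is guarded by `_WIN32`. The wrapper name `wide_to_code_page` is hypothetical, and the error handling is kept deliberately simple. WideCharToMultiByte is called twice: once to get the required size, and once to write the bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-16 text to the given code page,
// e.g. 950 for Big5 or CP_UTF8 for UTF-8.
std::string wide_to_code_page(const std::wstring& wide, UINT code_page) {
    if (wide.empty()) return std::string();
    const int wide_len = static_cast<int>(wide.size());

    // First call: a zero-sized output buffer makes the API return the byte count.
    const int byte_len = WideCharToMultiByte(code_page, 0, wide.data(), wide_len,
                                             nullptr, 0, nullptr, nullptr);
    if (byte_len <= 0) throw std::runtime_error("WideCharToMultiByte failed");

    // Second call: write the converted bytes into the buffer.
    std::string out(static_cast<std::size_t>(byte_len), '\0');
    const int written = WideCharToMultiByte(code_page, 0, wide.data(), wide_len,
                                            &out[0], byte_len, nullptr, nullptr);
    if (written != byte_len) throw std::runtime_error("WideCharToMultiByte failed");
    return out;
}
#endif  // _WIN32
```

Under these assumptions, `wide_to_code_page(L"\u4EBA", 950)` should give the Big5 bytes 0xA4, 0x48, and `wide_to_code_page(L"\u4EBA", CP_UTF8)` the UTF-8 bytes 0xE4, 0xBA, 0xBA. The last two arguments stay null because CP_UTF8 does not accept a default character.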

In Qt, use the QString member function toLocal8Bit() to convert Unicode to ANSI (the system's local 8-bit encoding), and use toUtf8() to convert Unicode to UTF-8. Both return a QByteArray.

Before conversion, 人 is stored as the single 16-bit value 0x4EBA in QString, which equals its Unicode code point.

After converting Unicode to ANSI with toLocal8Bit(), 人 is stored as the bytes 0xA4, 0x48, which is its Big5 encoding. This assumes the system locale uses CP950.

After converting Unicode to UTF-8 with toUtf8(), 人 is stored as the bytes 0xE4, 0xBA, 0xBA, which is its UTF-8 encoding.