ASCII, ANSI, Unicode, and UTF-8

JimmyBear
Mar 28, 2020

ASCII

ASCII is a 7-bit encoding, usually stored as 1 byte per character. Valid values range from 0x00 to 0x7F.

Characters are encoded according to the ASCII table.

For example, J is encoded as 0x4A in ASCII.

Source: http://www.asciitable.com/
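
The range rule above can be checked in a few lines of C++. This is only a minimal sketch; the helper name is_ascii is made up for illustration and does not come from any library.

```cpp
#include <string>

// Hypothetical helper: returns true if every byte of `text`
// lies in the ASCII range 0x00 to 0x7F.
bool is_ascii(const std::string& text) {
    for (unsigned char byte : text) {
        if (byte > 0x7F) {
            return false;  // high bit set, so this byte is not ASCII
        }
    }
    return true;
}
```

Any byte above 0x7F means the text uses some other encoding, such as an ANSI code page or UTF-8.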

ANSI

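On Windows, "ANSI" refers to the system's legacy code page. A character takes 1 byte in a single-byte code page, and 1 or 2 bytes in a double-byte code page.

Characters are encoded by Windows code pages, such as Big5 (CP950), Shift_JIS (CP932), or Windows-1252 (CP1252, often loosely called Latin-1).

For example, 人 is encoded as 0xA448 in CP950.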

Source: http://ash.jp/code/cn/big5tbl.htm
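
To illustrate why the byte count and the character count differ in a double-byte code page, here is a rough sketch that counts characters in a CP950 byte string. It assumes that any byte from 0x81 to 0xFE is a lead byte that starts a two-byte character and does not validate trail bytes. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumption: in CP950, a byte in 0x81..0xFE starts a two-byte character.
bool is_cp950_lead_byte(unsigned char byte) {
    return byte >= 0x81 && byte <= 0xFE;
}

// Counts characters (not bytes) in a CP950-encoded string.
// Simplified: trail bytes are skipped without being validated.
std::size_t count_cp950_chars(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(bytes[i]);
        if (is_cp950_lead_byte(byte) && i + 1 < bytes.size()) {
            ++i;  // skip the trail byte of a two-byte character
        }
        ++count;
    }
    return count;
}
```

With this sketch, the two bytes 0xA4, 0x48 count as one character, while the single byte for J also counts as one.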

Unicode

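Unicode is a character set, not a fixed 2-byte encoding. It assigns each character a number called a code point, from U+0000 to U+10FFFF. In Windows terminology, "Unicode" usually means UTF-16, which uses 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF.

The characters are defined as code points in the Unicode table.

For example, 人 is defined as U+4EBA (0x4EBA) in the Unicode table.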

Source: https://unicode-table.com/en/
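
As a sketch of how a code point maps to 2-byte units in UTF-16, the function below encodes one code point into one or two 16-bit code units. The name encode_utf16 is made up for illustration, and the function assumes its input is a valid code point (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumes `code_point` is valid (<= 0x10FFFF and not in 0xD800..0xDFFF).
std::u16string encode_utf16(char32_t code_point) {
    std::u16string units;
    if (code_point < 0x10000) {
        // Basic Multilingual Plane: the code unit equals the code point.
        units.push_back(static_cast<char16_t>(code_point));
    } else {
        // Above U+FFFF: split the 20-bit offset into a surrogate pair.
        char32_t offset = code_point - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return units;
}
```

For 人 (0x4EBA) the result is the single unit 0x4EBA, which is why the code point and the stored 2-byte value look identical for this character.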

UTF-8

A variable-length character encoding that can encode every Unicode code point.

Each character takes 1 to 4 bytes.

Source: https://en.wikipedia.org/wiki/UTF-8

For example, 人 is encoded as 0xE4BABA in UTF-8.

Source: https://www.branah.com/unicode-converter
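
The 1 to 4 byte layout can be sketched as a small encoder. This is an illustrative implementation, not a library function. The name encode_utf8 is hypothetical, and the code assumes the input is a valid code point.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes `cp` is valid (<= 0x10FFFF and not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Passing 0x4EBA takes the 3-byte branch and produces the bytes 0xE4, 0xBA, 0xBA.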

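Converting between Unicode, ANSI, and UTF-8

Unicode code points can be encoded to ANSI or UTF-8, and ANSI or UTF-8 bytes can be decoded back to Unicode code points. A code page can only encode the characters it contains, so converting to ANSI can lose characters that the code page lacks.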

ANSI <-> Unicode <-> UTF-8
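
The decoding direction of the diagram above can be sketched for UTF-8 as follows. The function name decode_utf8_first is hypothetical. It decodes only the first character, returns U+FFFD for malformed input, and deliberately skips checks for overlong forms and surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the first UTF-8 sequence in `bytes`
// to a Unicode code point. Returns 0xFFFD on malformed input.
// Sketch only: overlong encodings and surrogates are not rejected.
char32_t decode_utf8_first(const std::string& bytes) {
    if (bytes.empty()) {
        return 0xFFFD;
    }
    unsigned char lead = static_cast<unsigned char>(bytes[0]);
    std::size_t length = 0;
    char32_t cp = 0;
    if (lead < 0x80) {
        length = 1; cp = lead;         // 0xxxxxxx
    } else if ((lead & 0xE0) == 0xC0) {
        length = 2; cp = lead & 0x1F;  // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
        length = 3; cp = lead & 0x0F;  // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
        length = 4; cp = lead & 0x07;  // 11110xxx
    } else {
        return 0xFFFD;                 // stray continuation or invalid byte
    }
    if (bytes.size() < length) {
        return 0xFFFD;                 // truncated sequence
    }
    for (std::size_t i = 1; i < length; ++i) {
        unsigned char trail = static_cast<unsigned char>(bytes[i]);
        if ((trail & 0xC0) != 0x80) {
            return 0xFFFD;             // continuation bytes must be 10xxxxxx
        }
        cp = (cp << 6) | (trail & 0x3F);
    }
    return cp;
}
```

Decoding the bytes 0xE4, 0xBA, 0xBA gives back 0x4EBA, the code point of 人.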

Unicode to ANSI and UTF-8 (Windows API)

Use WideCharToMultiByte with the target Windows code page to convert Unicode to ANSI, or with CP_UTF8 to convert Unicode to UTF-8.

人 is stored as 0x4EBA in std::wstring. On Windows, std::wstring holds UTF-16 code units, and for 人 the single code unit equals the Unicode code point.

After converting Unicode to ANSI, 人 is stored as 0xA4, 0x48 in std::string, which is the Big5 encoding.

After converting Unicode to UTF-8, 人 is stored as 0xE4, 0xBA, 0xBA in std::string, which is the UTF-8 encoding.
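
A conversion along these lines might look like the sketch below. It is not the article's original code: the wrapper name wide_to_multibyte is hypothetical, error handling is reduced to returning an empty string, and the whole block is guarded so that it compiles only on Windows.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical wrapper around WideCharToMultiByte.
// `code_page` is a Windows code page ID, e.g. 950 (Big5) or CP_UTF8.
// Returns an empty string if the conversion fails.
std::string wide_to_multibyte(const std::wstring& wide, UINT code_page) {
    if (wide.empty()) {
        return std::string();
    }
    const int wide_len = static_cast<int>(wide.size());
    // First call: ask how many bytes the converted text needs.
    int size = WideCharToMultiByte(code_page, 0, wide.data(), wide_len,
                                   nullptr, 0, nullptr, nullptr);
    if (size <= 0) {
        return std::string();
    }
    std::string out(static_cast<std::size_t>(size), '\0');
    // Second call: write the converted bytes into the buffer.
    int written = WideCharToMultiByte(code_page, 0, wide.data(), wide_len,
                                      &out[0], size, nullptr, nullptr);
    if (written <= 0) {
        return std::string();
    }
    return out;
}
#endif
```

Under these assumptions, wide_to_multibyte(L"\u4EBA", 950) should give the Big5 bytes described above, and wide_to_multibyte(L"\u4EBA", CP_UTF8) the UTF-8 bytes. The CP950 case depends on that code page being available on the system.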

Unicode to ANSI and UTF-8 (Qt)

Use the QString member function toLocal8Bit() to convert Unicode to the local 8-bit encoding (the ANSI code page on Windows), and use toUtf8() to convert Unicode to UTF-8. Both return a QByteArray.

人 is stored as 0x4EBA in QString. QString holds UTF-16 code units, and for 人 the single code unit equals the Unicode code point.

After converting Unicode to ANSI, 人 is stored as 0xA4, 0x48 in std::string, which is the Big5 encoding.

After converting Unicode to UTF-8, 人 is stored as 0xE4, 0xBA, 0xBA in std::string, which is the UTF-8 encoding.