While reviewing a recent pull request, I realized I didn’t fully understand how MySQL stores UTF-8 characters. It is especially confusing due to inconsistencies with the family of text column types (TINYTEXT, TEXT…) that we use for aggregate storage (such as JSON blobs) versus typical string (VARCHAR) fields for most of our Rails models.
After some extensive research, below is an overview of how UTF-8 is stored in MySQL across the various data types.
UTF-8 is a multi-byte encoding of the Unicode code points. The Wikipedia article on UTF-8 is excellent. Originally, when Unicode was a 16-bit standard (code points U+0000 to U+FFFF), UTF-8 was variable from 1 to 3 bytes.
Later, Unicode was extended beyond 16 bits (code point U+10000 and beyond) to make room for some ancient languages. UTF-8 was extended to 4 bytes max. This is when MySQL added support for UTF-8, but to optimize storage they only supported the 3-byte form. Later, more emoji were added to Unicode beyond the ancient languages, and MySQL added a new character enocding
utf8mb4 to support these.
In MySQL 5.x, sticking with the 3-byte form improved performance, but with the downside of limited emoji support. MySQL 8.0 apparently has major speed improvements for
utf8mb4 and actually deprecates
For these MySQL types, the count given in parentheses is interpreted as characters, not bytes. To match the SQL spec, MySQL doesn’t allow extra characters even if there are enough bytes. For example, with
utf8(mb3) encoding, MySQL will reserve 30 bytes to hold a VARCHAR(10) column. Even though an 11 character string of ASCII only needs 11 bytes and would therefore fit, MySQL will reject the 11-character string as too long.
TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
For these MySQL types, the documented size limit (2⁸, 2¹⁶, 2²⁴, 2³²) are given in bytes, not characters. MySQL will store any text that fits. So this is the opposite of CHAR and VARCHAR! For example, for a TINYTEXT field with a maximum size of 255 bytes, a 255-character ASCII string can be stored. But consider a string of 3-byte UTF-8 characters like ☃. Only 85 of them will fit. Or newer 4-byte emoji like 😹 will work even with a database default of
utf8mb3...but only 63 of them will fit.
Hobo Fields Migrations for Text
We use Hobo Fields to declare fields in our models and manage our Rails migrations. The Hobo Fields schema generator allows any arbitrary limit to be set for a
text field, but MySQL only supports the 4 powers of 2^8 given above. And Hobo Fields interprets the limit as worst-case characters so it applies a 3X conversion between characters and bytes. So the only valid limits are (2^8 - 1)/3, (2^16 - 1)/3, (2^24 - 1)/3, (2^32 - 1)/3. We encapsulated those as constants to use in our models:
MYSQL_BYTES_PER_UTF8_CHARACTER = 3MYSQL_TINY_TEXT_UTF8_LIMIT = 0x0000_00FF / MYSQL_BYTES_PER_UTF8_CHARACTER # 85 characters
MYSQL_TEXT_UTF8_LIMIT = 0x0000_FFFF / MYSQL_BYTES_PER_UTF8_CHARACTER # 21,845 characters
MYSQL_MEDIUM_TEXT_UTF8_LIMIT = 0x00FF_FFFF / MYSQL_BYTES_PER_UTF8_CHARACTER # 5,592,405 characters
MYSQL_LONG_TEXT_UTF8_LIMIT = 0xFFFF_FFFF / MYSQL_BYTES_PER_UTF8_CHARACTER # 1,431,655,765 charactersFootnote: MySQL vs Ruby MethodsNote that Ruby and MySQL took opposite approaches to mapping the concept of "length" to characters vs bytes. See length vs. LENGTH in the table below:RubyMySQLcharacter.size