Breaking out from the MySQL character-set hell

The MySQL charset hell. Read to know how you fall in it, what to do if you find yourself in it and how to migrate from latin1 to utf8mb4.

Update: Before you begin reading this article, this article has a lot of information which would be useful for you to understand how character-encoding works, but the main problem this article addresses exists in versions of MySQL prior to MySQL 5.8/MySQL 8. Rather than skipping through the article, I would suggest you read through it.

If you ever encountered databases and SQL (or tried to know what they are), you must’ve have stumbled upon MySQL (An open source relational database management system (RDBMS) based on Structured Query Language (SQL)). In MySQL, have you ever wondered what Character-Set and Collation mean?

Has it ever occurred, that while querying for data in your database (using any MySQL GUI or CLI) you see gobbledygook(garbled data)? For eg. You enter the following text in your application:

English to Arabic….الإنجليزية إلى العربية

but on querying for the same data in your database table you see:

English to Arabic….الإنجليزية إلى العربية

Note: Although this article focuses on the fact on how to move your database from latin1 to utf8mb4, you can follow this article to move between any two character-sets and collation.


Emergence and Rise of UTF-8

Those who know how computer encoding works would know that for a long period of time, ASCII (American Standard Code for Information Interchange) was the default encoding standard. The 8-bit/1 byte character encoding covered all the characters in the English Language, Numbers and the most commonly used special characters (!,.* and so on…). But with time, many Non-English speakers also started using computers and eventually computers started supporting these foreign languages.

So as to support these foreign languages, a good byte stream encoding system was required, which would support a wide-range of characters (English and foreign languages), since the ASCII encoding was just 1-byte and was already filled up from 0–255 (To see a list of supporting characters), a new multi-byte character set system was required. The search for a new system started in 1992 and in that year, UTF-1 was introduced, but due to its incompetence led to the development of UTF-8 in January 1993.

When I set out to create MOS 865, do you think it just happened overnight? No. There was MOS 1, that burnt down my Dad’s garage. There was MOS 2 that would only schedule appointments in January. And 862 others that I learned from…: Pete Becker (Friends)

UTF-8 (a.k.a. Unicode) quickly rose to prominence and major computer systems started adopting it mainly due to its backward compatibility with the existing ASCII system. In 2008, Google reported that Unicode became the most common encoding for HTML files.

Shows the usage of the main encodings on the web from 2001 to 2012 as recorded by Google.

If you want to know more about what Character-Sets and encoding mean and how they work and help in Computers understand languages and Software Developers develop international software, then read what Joel Spolsky has to say in his blog The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


MySQL — Collation and Character Set

MySQL uses two basic terminologies:
1. Collation: A collation is a set of rules for comparing characters in a character set.
2. Character-Set: A character set is a set of symbols and encoding, mostly this information is derived from the type of collation.

Collation and Character-Set in MySQL are meant for strings.

MySQL defines a basic collation and a character-set for each of its databases; Furthermore, each table created can have its own collation and character-set which can be same as that of the database or different from it. Furthermore, to provide even more flexibility, a column in a table has a collation and character-set of its own, which can be same as the table or different from it. Although this gives a lot of flexibility but also increases the complexity of handling data.

With Great Power Comes Great Responsibility. Ben Parker (Spider-Man)

Latin1 and MySQL

When working on MySQL you would have seen that when a database is created or a new table is created, the default collation of the database or the table is always latin1_swedish_ci with the default character-set being Latin1. Latin1 being an extension of ASCII defines some characters used by non-english languages & also shares commonly used symbols & glyphs.

Since Latin1 is only 8-bits/1-byte, it supports a total of 256 characters, therefore, has no way of effectively representing characters in languages whose alphabets differ significantly from English (this includes languages like Russian, Greek, Hebrew, Arabic, almost all Asian languages and just about every language ever invented by humans).

Every software programmer who does not know what character-sets and encoding are, would be ignorant of the fact that MySQL uses Latin1 encoding for its database by default and would bang their heads against the wall, when the time would come for them to enable their application to support foreign languages (which do not bear similarities to English) along with proper database backup and management.

There is a very simple reason for why the default collation and character set for MySQL is latin1_swedish_ci and Latin1 respectively. Latin1 was the default character set in pre-multibyte times and it looks like that’s been continued, probably for reasons of downward compatibility, As to why Swedish, it’s because MySQL AB (the company that created MySQL) is a Swedish company. I just hope, they change the default collation to utf8mb4 in their upcoming version.


MySQL Character-Set Hell — Reproducing The Problem

To find if you’re part of “character-set” hell, checkout the following things:

  1. Decide whether you want to allow non-english speakers to use your platform, as Gridle does.
  2. From your application interface, Insert the following text, from a point where it is possible “English to Arabic….الإنجليزية إلى العربية
  3. Using the mysql-client CLI tool, connect to your application database. (Don’t use a GUI tool like phpMyAdmin to query, always stick to the CLI)
  4. Set the Database Connection variables as follows:
/*The commands below will allow your mysql client to act as a misconfigured application*/  
SET SESSION CHARACTER_SET_CLIENT = latin1;
SET SESSION CHARACTER_SET_RESULTS = latin1;

5. Query the table in which the above statement was saved using your application. Check if the data is being rendered correctly

6. Set the Database Connection variables as follows:

SET SESSION CHARACTER_SET_CLIENT = utf8mb4;
SET SESSION CHARACTER_SET_RESULTS = utf8mb4;

7. Query the table in which the above statement was saved using your application. The data would appear as gobbledygook(garbled) as follows:

English to Arabic….الإنجليزية إلى العربية

If your queried data is same as the statement above, well, your database is suffering from MySQL character-set hell.

Note: This website shows you how your data would look if it is a UTF-8 data saved in a Latin1 database(https://www.charset.org/utf8-to-latin-converter)

Fixing The Problem

When following the above-mentioned steps, the result turned out to be the one mentioned above, then you can try out the following steps, to fix your character-set encoding problem.

Note: Before trying out the steps in your production database, run the mentioned steps on your Local database, then on a replica/backup of the production database and then finally your production database but after taking a backup of the production database first.

At the end of the article, I would cite the sources that I went through to fix the database character-set issue at Gridle.

  1. Disconnect all the applications that connect to your database, while you’re performing a fix for you surely don’t want any new inserts or updates in the database during that phase since it would simply result in wrong data being inserted. Also, take a backup of the database, in case if things go wrong.
  2. Make everything UTF-8 everywhere. Update mysqld, mysql and client settings in the /etc/mysql/*.cnf file as follows:[1]
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci

Note: If you are using mysql version 5.6 or lower, also add the following settings in the mysqld section [2]

#The following should be set if you are using mysql version 5.6 or lower
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_large_prefix=1

Note: If on an AWS RDS, the following settings have to be made in a parameter group, which would then be applied on an RDS instance [3]

[mysqld]
character_set_client: utf8mb4
character_set_database: utf8mb4
character_set_results: utf8mb4
character_set_connection: utf8mb4
character_set_server: utf8mb4
collation_connection: utf8mb4_unicode_ci
collation_server: utf8mb4_unicode_ci
innodb_file_format=barracuda
innodb_file_per_table=1
innodb_large_prefix=1

3. Restart mysql service.

4. Connect to mysql using the mysql-client CLI:

mysql -h host -u username -p

In the mysql command:

  • The -h option is used to specify the host where the database resides. It defaults to localhost (127.0.0.1) if it is not specified.
  • The -u and -p options specify the username and password of the database user respectively.
  • For more information: Checkout the Mysql official doc for mysql.

5. Execute the following query: [1]

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

The output should be as follows:

+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.05 sec)

6. Convert your Latin1 collated Database and associated tables to UTF-8.

For this part, actually there are two possible ways:

  • Method 1: Run individual queries on each table you have, altering the collation and character-set of the database, followed by the table, followed by each column in the table, where collation is latin1_swedish_ci and/or character-set is latin1. This method is useful, if you are sure, that your database table fields with type varchar, char, text etc. do not contain many non-english characters. This method is fast, but on a personal note, I am not sure how intact the data would turn out to be after completing the steps when performed on a large database containing millions of entries as Gridle does. Go for this method, if your database is small and any abnormalities could be easily detected. The detailed explanation of this method is out of the scope of this article. To know more about it, read what Wordpress Codex has to say. [4]
  • Method 2: I find this method more reliable, although it is 100% more time consuming than Method 1. This method involves taking a dump of the current database (database, character-set, table structure and table data) using the mysqldump CLI tool, updating the character-set and the collation in the dump file and finally re-inserting the fixed SQL dump back, replacing the original database and its content(associated tables and data). When following Method 2, make sure that any active applications connecting to the applications are DISCONNECTED.[5]

7. Create a database dump

mysqldump --verbose -h host -u username -p --add-drop-database --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert --databases databasename > dump_file.sql

In the mysqldump command:

  • The --verbose option is used to enable the mysqldump command to print out every action that it performs.
  • The --opt command specifies a list of other default arguments for mysqldump. Read here for more info.
  • The --skip-set-charset and -- default-char-set=latin1 options prevent MySQL from taking the already Latin1 collated table and “helpfully” converting it to any other character set for you. This should ensure that your mysqldump is really in the Latin1 character encoding scheme.
  • The --skip-extended-insert option forces mysqldump to put each INSERT command in the dump on its own line. This will make the dump take much longer to re-import, however, in my experimentation, adding this option was enough to prevent the dump from having syntax errors in anywhere. And in any case, should the re-import fail for any reason, having each row’s data on its own line really helps to be able to zero-in on which rows are causing you problems (and gives you easier options to workaround the problem rows).
  • The --databases allow dumping multiple databases. Furthermore, it also adds a CREATE and USE database command in the dump, which would allow us to change the collation and character-set of the Database as well.
  • The dump_file.sql is the file to which the mysqldump command would write the contents of the database dump. Not specifying an output file would simply lead the database dump to be printed in the terminal.
  • For more information: Checkout the Mysql official doc for mysqldump.

8. Replace all latin1 instances with utf8mb4
On the dump file, execute the following command

perl -i -pe 's/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci/' dump_file.sql

This command replaces all instances of DEFAULT CHARACTER SET latin1 with DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci. This is used to fix up the database’s default charset and collation.[6]

perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC/' dump_file.sql

This command replaces all instances of DEFAULT CHARSET=latin1 with DEFAULT CHARSET=utf8mb4. This converts all tables from latin1 to utf8mb4. [6]

9. Restore the database dump

mysql --verbose -h host -u username -p < dump_file.sql

10. Repair the tables for any problems that you have faced or would face.[7]

mysqlcheck --verbose -h host -u root -p --auto-repair --optimize --all-databases

In the mysqlcheck command:

  • The --all-databases option informs the mysqlcheck command to check all the databases stored in MySQL
  • The --auto-repair option informs mysqlcheck to automatically fix a table, if the table, when checked is found to be corrupted or has errors.
  • The --optimize option informs mysqlcheck to optimize the checked tables
  • The above options enable mysqlcheck to run the CHECK TABLE, REPAIR TABLE, ANALYZE TABLE and OPTIMIZE TABLE queries in a convenient way for the user.
  • For more information: Checkout the Mysql official doc for mysqlcheck.

11. From within your running application, Execute the following query: [1]

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

The output should be as follows:

+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.05 sec)

If the results are different, i.e showing latin1 any VARIABLE_NAME, then in the database configuration file of your application, set the encoding type to utf8mb4 or during application boot, execute the following query:

SET NAMES 'utf8mb4';

Nth Encoded Characters

After following the above steps, in an ideal world with a relatively small dataset, everything would be fine and your application would run smoothly, properly displaying the data. Let’s be realistic, in a real world user facing application, with a huge dataset, the possibility of saving double, triple, nth encoded characters exists.

To check for nth Encoded characters, simply check the tables in your database for rows with non-english characters and check whether they are perfectly displayed. If not, then you need to fix the nth encoded character data before proceeding ahead.

The explanation of how to recover from extra encoded is outside the scope of this article. To understand how to fix this problem, read Step 4 of Stephen Balukoff’s blog on MySQL character-set hell. [5]


Application Precautions

After breaking out of the MySQL character-set hell, there are a few precautions you should take in your applications. Many programmers and developers, while learning how to code get to know about the default string comparison functions provided by the Language (i.e. In C++/PHP: strlen, strcmp etc.). But many are not aware of the fact, that these functions work on the string provided, with an assumption that 1 byte = 1 character, which was true for ASCII being 1 byte but proven false for UTF-8 which is a variable length character encoding system. A fact to be kept in mind would be to use UTF-8 safe string comparison functions (in PHP — mb* utilities).

Conclusion

I hope that this article provides you with a way to come out of the MySQL character-set hell problem. As people say, hindsight is better than foresight, it definitely applies to the problems associated with MySQL character-set encoding. Also, the above article is not just limited to converting your MySQL database from latin1 to utf8mb4. You can follow the article to move between any two charsets and collation.

Found this post useful? Kindly tap the ❤ button below! Feedback and Comments are appreciated.

Regards,
Manish M. Demblani,
gridle.io

Sources

[1]: https://mathiasbynens.be/notes/mysql-utf8mb4#character-sets
[2]: http://mechanics.flite.com/blog/2014/07/29/using-innodb-large-prefix-to-avoid-error-1071/
[3]: http://aprogrammers.blogspot.in/2014/12/utf8mb4-character-set-in-amazon-rds.html
[4]: http://codex.wordpress.org/Converting_Database_Character_Sets#Converting_columns_to_blob.2C_then_back_to_original_format_with_new_charset
[5]: https://www.bluebox.net/insight/blog-article/getting-out-of-mysql-character-set-hell
[6]: https://docs.moodle.org/24/en/Converting_your_MySQL_database_to_UTF8#Linux_.26_Mac
[7]: https://mathiasbynens.be/notes/mysql-utf8mb4#repair-optimize
[8]: https://gist.github.com/mdemblani/34d30015f24e1927736f189b88be143a