Character Sets and SymmetricDS

Mark Michalek
Data Weekly by Jumpmind
8 min read · Nov 15, 2018

Ever received a text that read like “□□□□□□”? If, like me, you have, you probably just experienced a character encoding failure. And if you’ve worked in IT very long, you’ve probably run into character encoding problems in the applications you work on. Sometimes the failure shows up as missing characters, sometimes as extra characters, and maybe most often as some catch-all “punt” character: the unwanted question mark (� or just plain ?), or those bricks “□□□” that signal the application has “no clue” what those characters should be.

Since SymmetricDS replicates between heterogeneous platforms, we get a lot of questions about character encoding and related issues. This article will very briefly sketch out the idea of character sets, review some parameters within SymmetricDS to deal with encoding issues, and wrap up with a few real-world cases we’ve seen of character set problems.

A Simplistic Overview of Character Sets

Character sets (and encoding) are a pretty deep and sometimes painful topic. This will be a simplistic overview, just to get us going and thinking about good ways to debug these kinds of problems. Here we go!

A character set is basically a table which maps a number to the visual character glyph that we recognize as a letter in a language, a punctuation mark, or an emoji, for example. So a “text” file isn’t stored as “text” as non-geeks think of it, but as bytes (i.e., numbers) which are interpreted to be text, and that interpretation step is put through the filter of the character set. On both the writing and reading end, software should specify and be aware of the proper character set and encoding to use. (And our code has to be aware of the character encoding, specifically. A character encoding defines the rules for how exactly the numbers of the character set get written out as bytes. For example, a character set may say that “A” is 65. But should that 65 be written as one byte or two? Should the most significant byte come first or second? These are details of the character encoding rules.)

When it comes down to it, all computer storage, processing, and transfer is done in bits and bytes, which are numbers. The character set maps a given character to a numeric value, which is represented by one or more bytes when it gets encoded. For example, in ASCII (decimal): 65 is capital A, 97 is lowercase a, 66 is capital B, and so on.
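The mapping-versus-encoding distinction above is easy to verify in code. Here is a small Python sketch (purely illustrative; any language with explicit encoding support shows the same thing):

```python
# A character set maps characters to numbers ("code points").
assert ord("A") == 65   # capital A
assert ord("a") == 97   # lowercase a
assert chr(66) == "B"   # and back again

# The encoding decides how those numbers become bytes on disk or the wire.
assert "A".encode("ascii") == b"\x41"          # one byte
assert "A".encode("utf-16-be") == b"\x00\x41"  # two bytes, big-endian
assert "A".encode("utf-16-le") == b"\x41\x00"  # two bytes, little-endian
```

Same character, same code point, three different byte sequences: that last detail is exactly what a character encoding pins down.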

Character sets define things like:

  • Which characters are available for display
  • How those characters are displayed graphically
  • What exact code maps to each character

Where things go wrong for developers and their character sets

The ASCII standard has its roots in the US, going back to 1960. It covers simple Latin characters such as A through Z and the digits 0–9. Most modern character sets, such as Unicode and the ISO-8859 variants, are backward compatible with ASCII, so many character sets agree byte-for-byte across the printable ASCII range (decimal 32 through 126). This explains, in practice, why data that uses only digits and the Latin letters A–Z seems to come through OK when other characters do not. It also explains how the character set itself can be announced early in a transmission using ASCII characters, such as “UTF-8” for a Unicode encoding. For example, in an XML document:

<?xml version="1.0" encoding="UTF-8"?>
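That ASCII overlap is easy to demonstrate. In the Python sketch below (illustrative only), the same plain-ASCII string produces identical bytes under several common encodings, while a non-ASCII character does not:

```python
text = "Hello, World!"  # plain printable ASCII

# ASCII-compatible encodings agree on the printable ASCII range...
ascii_bytes = text.encode("ascii")
assert text.encode("utf-8") == ascii_bytes
assert text.encode("latin-1") == ascii_bytes   # ISO-8859-1
assert text.encode("cp1252") == ascii_bytes    # Windows-1252

# ...but diverge as soon as you leave it.
assert "é".encode("utf-8") == b"\xc3\xa9"   # two bytes in UTF-8
assert "é".encode("latin-1") == b"\xe9"     # one byte in ISO-8859-1
```

This is why an XML prolog written in plain ASCII can safely name the encoding of the rest of the document.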

Developers have to worry about a lot of details, and it’s easy not to think too much about character sets when developing new code. And things often will work fine for a while, because we’re testing with data that’s either ASCII compatible or compatible with our operating system’s default locale.

But then things go wrong when people start using our software: they start inputting crazy things like emoji, or typing letters with tildes or accents. Of course, that’s not crazy at all, but if those characters aren’t common in our native language, then we have to do the extra work of thinking about them.

A simple rule for developers is at least to be aware of the character encoding in play when reading or writing textual data. Don’t rely on defaults which might work in your environment and test data but will break down in the real world. And when at all possible: use UTF-8.

A Tip: Get the Hex

Before diving into the details of character encoding in SymmetricDS, I want to offer one tip for debugging character encoding issues: GET THE HEX. Whether it’s a file or a field in the database, you can always get the hex value of the data, which lets you view it as raw bytes instead of as interpreted characters. That cuts out a lot of confusion. In MySQL, for example, you can select HEX(column_name) to see the raw bytes of a column.
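Outside the database, getting the hex is a one-liner in most languages. Here is a small Python sketch (illustrative; the helper name is mine) of dumping text as hex under two different encodings:

```python
def to_hex(data: bytes) -> str:
    """Render raw bytes as space-separated hex, like a debugger would."""
    return " ".join(f"{b:02X}" for b in data)

# "Lãzaro" encoded as UTF-8: the ã becomes the two bytes C3 A3.
assert to_hex("Lãzaro".encode("utf-8")) == "4C C3 A3 7A 61 72 6F"

# The same string in Windows-1252 uses a single E3 byte for ã.
assert to_hex("Lãzaro".encode("cp1252")) == "4C E3 7A 61 72 6F"
```

Comparing hex dumps like these on the source and target sides usually makes the mismatched encoding obvious.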

Dealing with Character Encoding in SymmetricDS

SymmetricDS’ core functionality is to replicate data across different databases and operating systems, often crossing locale boundaries. This naturally leads to character set issues when it comes to the data.

The data makes a few hops during replication, and the two most problematic of those hops are covered in the sections that follow.

sym_data

The write to sym_data is sensitive to the character encoding of the sym_data.row_data column (as well as the pk_data and old_data fields, etc.). On some platforms you will need to use the “N” variant of varchar to support characters wider than one byte. For example, you’ll need this setting for full Unicode support on Microsoft SQL Server:

mssql.use.ntypes.for.sync=true

In SQL Server, a VARCHAR column only supports an 8-bit code page, so if you need Unicode you also need to change the types of your columns to NVARCHAR.
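The limitation is easy to reproduce outside the database. Here is an illustrative Python sketch of what an 8-bit code page can and cannot hold (SQL Server itself is not involved; Windows-1252 stands in for a typical VARCHAR code page):

```python
# Windows-1252 is an 8-bit code page: only 256 slots total.
# Western European characters fit in a single byte...
assert "ã".encode("cp1252") == b"\xe3"

# ...but anything outside the page simply cannot be represented.
raised = False
try:
    "Ω".encode("cp1252")
except UnicodeEncodeError:
    raised = True
assert raised  # this is why full Unicode needs NVARCHAR (or UTF-8 storage)
```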

Get JDBC Encoding out of the Way

Many JDBC drivers also allow the character set to be specified as part of the URL. But there are some cases where the JDBC driver itself can get in the way of encoding data as it comes out of the database. The following parameter causes SymmetricDS to read the raw bytes from the database and then construct a Java string using the platform default charset (which is normally overridden to be UTF-8):

db.read.strings.as.bytes=true
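To see why reading raw bytes helps, consider what happens when a driver decodes UTF-8 bytes with the wrong charset. Here is a hedged Python sketch of the failure mode (the variable names are mine, not SymmetricDS internals; SymmetricDS itself does this in Java):

```python
# The database actually stores these UTF-8 bytes for "ã".
raw_bytes = "ã".encode("utf-8")          # b"\xc3\xa3"

# A driver layer that assumes Latin-1 mangles them into two characters.
mangled = raw_bytes.decode("latin-1")
assert mangled == "Ã£"

# Fetching the raw bytes and decoding them yourself avoids the mangling.
assert raw_bytes.decode("utf-8") == "ã"
```

Bypassing the driver’s string conversion moves the decoding decision into one place you control.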

Staging File Encoding

The reads and writes to the staging files are controlled with this parameter in sym_service.conf. We recommend always running with this set to utf-8:

wrapper.java.additional.11=-Dfile.encoding=utf-8

HTTP Communication Always UTF-8

Finally, the HTTP communication in SymmetricDS is set to always use UTF-8, and so far that has always been successful (as far as we know; see HttpTransportManager.java).

Examples of Character Encoding Failures

Troubles with Tildes

Here was a case where the character should have been Ã but came through as the replacement character “�”:

Wrong: “N�O DECLARADA”
Right: “NÃO DECLARADA”

The user gave some details that mentioned a server upgrade and showed configuration with a setting like “-Dfile.encoding=windows-1252”. So my guess is that the older server had “-Dfile.encoding=windows-1252” and the newer server had “-Dfile.encoding=utf-8”. The file.encoding system property in Java defines the default encoding for the Java runtime environment. For all intents and purposes, this should always be set to -Dfile.encoding=utf-8 in SymmetricDS.
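The reported symptom reproduces exactly when bytes written under Windows-1252 are later read as UTF-8. Here is an illustrative Python sketch of that mismatch (the real environment was SymmetricDS on Java, but the byte-level behavior is the same):

```python
# The old server writes "NÃO" using its default windows-1252 encoding:
# Ã is stored as the single byte C3.
stored = "NÃO".encode("cp1252")          # b"N\xc3O"

# The upgraded server reads those bytes as UTF-8. A lone C3 byte is not
# valid UTF-8 (it expects a continuation byte to follow), so the decoder
# substitutes the replacement character U+FFFD.
recovered = stored.decode("utf-8", errors="replace")
assert recovered == "N\ufffdO"           # i.e. "N�O", as the user saw
```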

Patriotic Pride

A user contacted us with this kind of error message where the target database would not accept the incoming data:

2018-06-26 17:22:30,562 ERROR [SERVER] [AcknowledgeService] [server-push-default-5] The outgoing batch client-2844 failed: [HY000,1366] Incorrect string value: '\xF1\xBE\xA1\...' for column 'city' at row 1

Looking at the data in VS Code showed the problem — there was a flag emoji in the city column. This data went into the source database just fine, but the target database was not prepared to handle this character.
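Flag emoji are a classic trap here: each flag is two “regional indicator” code points, and each of those takes four bytes in UTF-8. MySQL’s legacy utf8 charset (utf8mb3) stores at most three bytes per character, which produces exactly this kind of “Incorrect string value” error unless the column uses utf8mb4. An illustrative sketch:

```python
flag = "\U0001F1FA\U0001F1F8"   # the US flag emoji, as two code points

# Each code point needs 4 bytes in UTF-8 -- beyond utf8mb3's 3-byte limit.
assert len(flag) == 2                       # two code points
assert len(flag.encode("utf-8")) == 8       # eight bytes on the wire
assert all(len(ch.encode("utf-8")) == 4 for ch in flag)
```

The source database accepted the data, so its column charset could handle 4-byte characters; the target’s could not.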

Troubles with Tildes, Again

“I have a table that is out of sync and all of the records that are out of sync contain special characters.” Examples:

Right: Lãzaro
Wrong: Lã?zaro

Here was the response from a JumpMind support engineer:

“It appears that these characters may have in fact been encoded using the Windows-1252 encoding, rather than utf-8. However, if you take a look at the hex values from your source column: 4172726F796F204CE3817A61726F202020202020

Notice that the start of Lãzaro corresponds to 4C E3 81.

4C -> L
E3 -> ã

However, 81 is not a valid encoding for any character in Windows-1252 (https://en.wikipedia.org/wiki/Windows-1252), and can sometimes be displayed as an empty string (like we see in SQL Server Management Studio) or the ? character that we’ve been seeing in Symmetric. If we remove the byte 81 from the string, we get the decoded value of “Lãzaro”, which appears to be your desired string.”

The point here is that different applications and runtime environments may handle encoding anomalies differently (such as the extra 81 byte in this case). Understanding the hex, the character encoding, and the character set were all necessary to get to the bottom of this one.
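The engineer’s analysis can be replayed from the hex alone. Here is an illustrative Python sketch of decoding those bytes as Windows-1252:

```python
# The bytes for the name from the source column, with the stray 81.
data = bytes.fromhex("4CE3817A61726F")

# 0x81 is undefined in Windows-1252, so a strict decode refuses it...
raised = False
try:
    data.decode("cp1252")
except UnicodeDecodeError:
    raised = True
assert raised

# ...some tools show it as nothing, others as "?"; dropping it recovers
# the intended name.
assert data.decode("cp1252", errors="ignore") == "Lãzaro"
assert data.decode("cp1252", errors="replace") == "Lã\ufffdzaro"
```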

Data Lost in Transit

Another user writes: "The data arrives on the China server as '??'."

mysql> select * from product where id=32;
| 32 | cases | 冒牌Apple | Iphone_X | 1 | 好 | {"name": "PROD_apple_iphonex2_v.jpg", "width": "1080", "height": "1440", "inner_width": "0", "inner_height": "0"} | NULL | 0 | 0 | 2018-03-13 10:00:41 | 2018-03-13 10:03:13 |

In sender’s sym_data, the Chinese character has also been replaced by “??”:

| 20118 | product | U | "32","cases","??Apple","Iphone_X","1","??","{\"name\": \"PROD_apple_iphonex2_v.jpg\", \"width\": \"1080\", \"height\": \"1440\", \"inner_width\": \"0\", \"inner_height\": \"0\"}",,"0","0","2018-03-13 10:00:41","2018-03-13 10:03:13" | "32" | "32","cases","??Apple","Iphone_X","1","?","{\"name\": \"PROD_apple_iphonex2_v.jpg\", \"width\": \"1080\", \"height\": \"1440\", \"inner_width\": \"0\", \"inner_height\": \"0\"}",,"0","0","2018-03-13 10:00:41","2018-03-13 10:03:13" | 30 | products | 3.858432 | NULL | NULL | NULL | 2018-03-17 23:20:30

Here the user figured out what was going on. When SymmetricDS captures data for replication, that data is written to the sym_data table, and at that point the Chinese characters were converted to question marks. In this case the database is MySQL, which supports specifying the character set at the database, table, or even individual column level. The problem was probably caused by the sym_data.row_data column using a more limited character set, such as an ISO-8859 variant, or by the need to switch sym_data to nvarchar fields.
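The “??” substitution is the classic signature of forcing characters through a charset that cannot represent them: the converter replaces each unrepresentable character with a literal “?”. An illustrative Python sketch of the effect (the real conversion here happened inside MySQL, not Python, but the behavior is analogous):

```python
# Two Chinese characters forced through an 8-bit Western charset:
# each one that can't be represented becomes a literal "?".
assert "冒牌".encode("latin-1", errors="replace") == b"??"

# Round-tripped through UTF-8 instead, nothing is lost.
assert "冒牌".encode("utf-8").decode("utf-8") == "冒牌"
```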

Conclusion

Character sets can be a pain, but if you at least consider them when reading and writing data, you’re doing better than the majority of developers. I’ve shared some basics here, along with examples we’ve seen in the wild of character sets and encodings going wrong. The two main tips I can give are:

  • Use UTF-8 when possible
  • Check the HEX when things go wrong

What kind of examples have you seen?

For a classic article on character sets and encoding see:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
