The string type is broken
My previous article, “We don’t need a string type”, caused a bit of a stir. The feedback was mixed, but a common theme was that a string type is a useful feature. After doing a bit more research I can determine only one thing: most current string types are broken!
Many of us believe our strings are capable of more than what they actually do. We rely on their functionality without actually checking that it’s valid. This can easily lead to programs which do not work correctly, particularly with respect to internationalization. In most cases it seems we would be better off without a string type.
Evaluate
I looked at how strings behave in a few basic situations. I’ll go over each situation, giving the expected result and some of the actual results. I considered showing a matrix with the results, but since all the tested languages behave so poorly it didn’t seem useful.
noël
Using the text “noël” with a decomposed Unicode string “noe\u0308l”, I checked the following:
1. Does it print correctly? Yes, most languages are capable of doing this. Though the ideone.com interface seems to break the output (so be careful with testing).
2. What is the reverse? “lëon”, correct? Mostly this fails. The most common result is “l̈eon” (the dieresis is on the ‘l’ instead of the ‘e’). This is what happens without a string class, by just reversing an array of code points.
3. What are the first three characters? Mostly the answer here is “noe”, as opposed to the desired “noë”. This could easily lead into a big discussion about what a character is, but I assume most people would not be happy with the current result. This is again indicative of a string type which merely treats the data as an array of code points.
4. What is the length? The common answer is 5. And yet again, this indicates our string types are merely arrays of characters and not truly handling the text.
For all of these questions, try to consider what should happen if you were editing this text in your favourite word processor or text editor. I generally expect that the ‘ë’ character is handled as a single entity. I don’t expect backspace/delete to just remove part of the letter. I expect copying the first three letters to include the accent.
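To make the gap concrete, here is a minimal Python 3 sketch (Python is my choice for illustration, and the helper `clusters` is hypothetical, not part of any library). It compares plain code-point operations with a rough grouping that keeps combining marks attached to their base character. This only approximates grapheme clusters; it does not implement the full Unicode text segmentation rules.

```python
import unicodedata


def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper: a rough stand-in for grapheme clusters that
    ignores emoji sequences, Hangul jamo and other segmentation rules.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the previous base character
        else:
            out.append(ch)
    return out


decomposed = "noe\u0308l"  # "noël" with a separate combining diaeresis

# Treating the text as an array of code points
naive_length = len(decomposed)
naive_reverse = decomposed[::-1]
naive_first3 = decomposed[:3]

# Treating the text as a sequence of (approximate) clusters
parts = clusters(decomposed)
cluster_length = len(parts)
cluster_reverse = "".join(reversed(parts))
cluster_first3 = "".join(parts[:3])

print(naive_length, cluster_length)
print(naive_reverse, cluster_reverse)
print(naive_first3, cluster_first3)
```

The code-point versions give the length of 5, the misplaced dieresis and the bare “noe” described above, while the cluster versions behave the way a text editor would.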
😸😾
It was a bit weird to find out that Unicode has cats in it (I hope you have a font which shows them — if not, the title of this section is a happy cat and a sad cat, part of the Unicode emoticon set). These characters were chosen since they are outside of the BMP (basic multilingual plane). This spells trouble for languages using UTF-16 encodings (Java, C#, JavaScript).
1. Length? Python unicode correctly reports 2. Those UTF-16 languages tend to report 4: the characters require surrogate pairs.
2. Substring after the first character? Python unicode correctly reports the sad cat “😾”. The UTF-16 languages produce invalid strings with a half-surrogate followed by the sad cat.
3. Reverse? Python unicode gets the correct reverse of “😾😸”. The UTF-16 languages produce invalid strings. With C# I think I uncovered a defect in ideone. It doesn’t even show the invalid string and instead shows no output at all for the entire program! [ideone defect]
Languages using an encoding-agnostic library, like C++, Perl, and normal Python 2 strings, fail here as well. They ignore any encoding and assume the string is an array of 1-byte code points. Python 3 adopted unicode as the default string type, thus fixing some problems. It appears that Perl also has a ‘utf8’ mode which fixes problems for these cats, but not for the “noël” string.
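The numbers above fall out of the encodings themselves. The following Python 3 sketch (an illustration only; it simulates UTF-16 and byte-oriented strings by encoding explicitly rather than running the other languages) counts code points, UTF-16 code units and UTF-8 bytes for the two cats, then cuts the UTF-16 data after the first code unit.

```python
cats = "\U0001F638\U0001F63E"  # happy cat, sad cat

code_points = len(cats)

# What a UTF-16 string type counts: 16-bit code units
utf16 = cats.encode("utf-16-le")
utf16_units = len(utf16) // 2

# What a byte-oriented string type counts: UTF-8 bytes
utf8_bytes = len(cats.encode("utf-8"))

# "Substring after the first character" done on code units:
# drop one 16-bit unit, leaving half of a surrogate pair behind.
tail = utf16[2:]
try:
    tail.decode("utf-16-le")
    tail_is_valid = True
except UnicodeDecodeError:
    tail_is_valid = False

print(code_points, utf16_units, utf8_bytes, tail_is_valid)
```

Each cat needs two UTF-16 code units or four UTF-8 bytes, so any type that indexes by unit or by byte can split a character in half.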
baﬄe
This string contains a ligature character: the “ﬄ” part is a single Unicode code point. Such ligatures exist mainly for compatibility, but they are a good test for case conversion.
1. What is the uppercase? I did not find any language which doesn’t print “BAﬄE”. Notice the ligature remains lowercase. The expected answer is of course “BAFFLE”.
Unicode has a special class of case conversion: this single ligature code point is actually converted to three code points. By not following these additional rules, a language’s uppercase function produces an interesting result: a string converted to uppercase still has lowercase characters in it.
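A quick way to check whether a runtime follows these rules is to compare lengths before and after conversion. This sketch assumes a current Python 3 interpreter, whose `str.upper()` applies the full case mappings; results in other languages and versions may well differ, as the tests above show.

```python
word = "ba\ufb04e"  # "b", "a", the single-code-point ffl ligature, "e"

upper = word.upper()

length_before = len(word)
length_after = len(upper)

# With one-to-one case mapping the lengths would match and the
# ligature would survive; with the full mapping it expands to "FFL".
still_has_lowercase = any(ch.islower() for ch in upper)

print(upper, length_before, length_after, still_has_lowercase)
```

The string grows during conversion, which is exactly why an uppercase function that maps one code point to one code point cannot get this right.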
noël again
A final check I did was to compare two logically equivalent strings with different composition forms. Here “noël” is using the precomposed “ë” character.
1. Is precomposed == decomposed? The answer is no in all tests. However, several languages do offer Unicode normalization libraries. In those languages the normal form of the strings does compare equal. At the time of writing, JavaScript does not have such a library, which is really tragic because it’s primarily a UI language, exactly where you’d want proper Unicode functionality.
It’s tempting to argue that normalization and lexical analysis are not part of the basic string type. But these seem like fundamental operations one would want to do with text. If they aren’t included, what exactly is the purpose of the string type?
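As one example of such a library, Python ships normalization in its standard `unicodedata` module. A minimal sketch of the comparison (the variable names are mine, for illustration):

```python
import unicodedata

precomposed = "no\u00ebl"   # single "ë" code point
decomposed = "noe\u0308l"   # "e" followed by a combining diaeresis

raw_equal = precomposed == decomposed

# Normalize both sides to the same form before comparing
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)
```

The raw comparison fails, and only the explicitly normalized comparisons succeed. The string type itself does nothing to help here.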
It’s broken
I encourage you to run such tests in your favourite language. If you are doing work with international text it is vital that you understand what your “string” type is actually doing. Once you’ve run them, you should reconsider how much that type is really doing for you. In my opinion they’re all broken.
I admit the correct answer is not always clear. Text processing is a difficult topic, and at the very minimum we’d have to cover grapheme clusters (some string classes expose functionality relating to this, and Perl even has a GCString class). This is beyond the scope of this article, but very relevant for a good string type.
The point I made in my previous article becomes more poignant. I’d rather have an array of characters than a broken string class. I don’t put any false expectations on an array of characters: the results it produces for the above tests are very logical. Indeed, an array of Unicode characters performs better on these tests than many of the specialized string classes.
Originally published at mortoray.com on November 27, 2013.