Playing with Strings
In this episode of my Erlang battle-story saga I will talk about something that bugs me a lot, even though I’m aware of how irrelevant it might be to most Erlangers out there. It’s just that I’m picky when it comes to semantics. But, anyway, here is my rant about Erlang strings…
The Many Strings of Erlang
Most languages have one or maybe two types/classes to manage strings. But not Erlang. In Erlang we have, well…
1> String = "this is a string".
"this is a string"
2> StringToo = [$t, $h, $i, $s, $,, $\s, $t, $o, $o].
"this, too"
Strings are the original ones: just lists of integers with some syntactic sugar, pretty-printed as characters.
3> Binary = <<"this is a binary">>.
<<"this is a binary">>
4> BinaryToo = <<$t, $h, $i, $s, $,, " too">>.
<<"this, too">>
Binaries are sequences of bytes. They get syntactic sugar too, so you can write and read them as text.
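To make the sugar explicit, here is a small sketch that can be pasted into the erl shell. It only uses literals and BIFs, and each line is a pattern match that fails loudly if the claim in the comment does not hold:

```erlang
%% A string literal is just a list of character codes.
true = ("abc" =:= [97, 98, 99]).
true = is_list("abc").

%% A binary literal is a sequence of bytes, not a list.
true = (<<"abc">> =:= <<97, 98, 99>>).
true = is_binary(<<"abc">>).
false = is_list(<<"abc">>).

%% Same text, different types: converting between them is explicit.
"abc" = binary_to_list(<<"abc">>).
<<"abc">> = list_to_binary("abc").
```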
5> IOList = ["this", <<" is ">>, ["an", <<" IO ">>] | <<"List">>].
["this",<<" is ">>,["an",<<" IO ">>]|<<"List">>]
6> io:format("~s~n", [IOList]).
this is an IO List
Binaries and strings are fine, but more often than not you find yourself converting and concatenating them a lot. To optimise that, there is a more relaxed way to build strings that can be interpreted by the functions in the io module (and many others): IOLists. IOLists are (maybe improper) lists of strings, binaries, chars or IOLists. But, as you can see, since IOLists are lists, a single binary is not a valid IOList, although it might work well in the same scenarios. To accommodate that fact, we have a broader type…
7> IOData = ["this", <<" is ">>, ["IO"] | <<" data">>].
["this",<<" is ">>,["IO"]|<<" data">>]
8> IODataToo = <<"This is IOData, too">>.
<<"This is IOData, too">>
9> io:format("~s~n", [IOData]).
this is IO data
IOData is basically just IOList or binary.
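As a rough illustration of why this relaxed representation helps, the sketch below (pasteable into the erl shell) glues mixed pieces together by nesting them instead of converting and concatenating. The variable names Header, Name and Greeting are made up for this example:

```erlang
%% Hypothetical pieces of output, in mixed representations.
Header = "Hello, ".
Name = <<"Joe">>.

%% Nesting the pieces builds an iolist without converting or copying them.
Greeting = [Header, Name, $!].

%% iolist_size/1 and iolist_to_binary/1 accept any iodata().
11 = iolist_size(Greeting).
<<"Hello, Joe!">> = iolist_to_binary(Greeting).

%% A bare binary is iodata() as well, even though it is not an iolist().
3 = iolist_size(Name).
```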
So, we have at least four different kinds of strings here but, luckily, each one of them has its own type specified by OTP. If you dig through the Erlang docs you’ll find these are all built-in types, defined as follows:
binary() :: <<_:_*8>>.
string() :: [char()].
iolist() :: maybe_improper_list( byte() | binary() | iolist()
                               , binary() | []).
iodata() :: iolist() | binary().
To the binaries!
Now, IOData (and consequently all the other types) works well with io functions. But, sometimes, you actually need a binary. At that point you have to convert whatever data you have to binary. And again, you have multiple ways to do it:
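10> list_to_binary("this is a string").
<<"this is a string">>
11> iolist_to_binary("this is a string").
<<"this is a string">>
12> binary:list_to_bin("this is a string").
<<"this is a string">>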
But if you check the specs for those functions you’ll find something odd…
-spec list_to_binary(IoList) -> binary() when
      IoList :: iolist().
-spec iolist_to_binary(IoListOrBinary) -> binary() when
      IoListOrBinary :: iolist() | binary().
-spec list_to_bin(ByteList) -> binary() when
      ByteList :: iodata().
So, list_to_binary/1 actually converts iolist() to binary() and iolist_to_binary/1 converts iodata() to binary() (not even calling it by its own name). And the worst offender of all: list_to_bin/1 converts iodata() to binary()!!
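The practical difference between the first two specs can be checked in the erl shell. This is only a sketch with made-up variable names (Nested, Bin); it deliberately leaves binary:list_to_bin/1 out and sticks to the two erlang BIFs:

```erlang
Nested = ["this", <<" is ">>, ["an", <<" IO ">>] | <<"List">>].
Bin = <<"already a binary">>.

%% Both functions flatten an iolist(), nested and improper lists included.
<<"this is an IO List">> = list_to_binary(Nested).
<<"this is an IO List">> = iolist_to_binary(Nested).

%% Only iolist_to_binary/1 also takes a bare binary, the other half of iodata().
Bin = iolist_to_binary(Bin).
{'EXIT', {badarg, _}} = (catch list_to_binary(Bin)).
```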
As I said, I reckon I’m too picky with these things, but can anyone blame newbies for being confused when they ask “hey! how do I convert this iodata to a binary?” and they get an answer like “it’s obvious! You just have to use binary:list_to_bin/1”? And I’m not even talking about converting iodata to a string! For that, you have no choice but to get yourself a binary first, and then turn that binary into a string… Yeah, don’t get me started on unicode, either.
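The two-step conversion described above could be wrapped in a helper like the one below. The module and function names (iodata_strings, to_string/1) are hypothetical, and the sketch treats every byte as one character, so it sidesteps unicode entirely:

```erlang
-module(iodata_strings).
-export([to_string/1]).

%% Hypothetical helper: iodata() -> string().
%% Step 1 flattens the iodata into a single binary,
%% step 2 turns that binary into a list of bytes.
%% Each byte becomes one character, so this is only safe for Latin-1 text.
-spec to_string(iodata()) -> string().
to_string(IOData) ->
    binary_to_list(iolist_to_binary(IOData)).
```

With the IOData value from the shell session above, iodata_strings:to_string(IOData) should give back "this is IO data".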
What’s going on here?
This one is actually pretty simple to explain. If you want a detailed explanation you can check the history of OTP in git, but without actually doing that I’ll make an educated guess and assume that the functions and types were not defined in the order in which I presented them above. They grew organically and therefore, for instance, when iodata() was created, iolist_to_binary/1 was already accepting iolist() | binary() as its input and nobody changed its spec.
Something similar must have happened to the other functions: they were originally used as their names indicate, but then someone needed to support other types of strings as inputs and didn’t want to break backwards compatibility.
In my mind, if I have a function called list_to_binary and I’m using it to convert lists into binaries, but now I realise I have to convert iolists too, I don’t just add support for iolists to the existing function. I create a new one.
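A minimal sketch of what I mean, assuming a hypothetical module called strict_convert. None of these names exist in OTP; the point is only that each function accepts exactly what its name promises:

```erlang
-module(strict_convert).
-export([string_to_binary/1, iodata_to_binary/1]).

%% Hypothetical strict converter: accepts only a flat list of bytes.
-spec string_to_binary([byte()]) -> binary().
string_to_binary(String) ->
    case is_byte_list(String) of
        true -> list_to_binary(String);
        false -> erlang:error(badarg, [String])
    end.

%% Hypothetical separate function for the broader input type.
-spec iodata_to_binary(iodata()) -> binary().
iodata_to_binary(IOData) ->
    iolist_to_binary(IOData).

%% True only for proper lists whose elements are all integers in 0..255.
is_byte_list([Byte | Rest]) when is_integer(Byte), Byte >= 0, Byte =< 255 ->
    is_byte_list(Rest);
is_byte_list([]) ->
    true;
is_byte_list(_) ->
    false.
```

Here strict_convert:string_to_binary(["a", <<"b">>]) fails with badarg instead of quietly accepting an iolist, while strict_convert:iodata_to_binary/1 takes it happily.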
I think the added value of consistent semantics is worth the cost of adding a new function and replacing the calls to the old one wherever it’s required.