Elixir for Rubyists: Charlists |> Binaries |> Strings |> IOLists

One of the first WTF? moments every new Elixir developer experiences is trying to figure out how the hell entering [97,98,99] in IEx comes out as 'abc' (or ~c"abc" on Elixir 1.15 and later).

There are countless questions on Stack Overflow asking why this is happening because, to most people, it makes no sense.

This generally leads to the realization that:

  • A. Elixir treats '' and "" completely differently
  • B. Strings in Elixir are not like they are in Ruby

Hopefully I am able to shed some light for those coming from Ruby on how exactly strings, binaries, bitstrings etc. work in the wonderful world of Elixir (and the BEAM).

‘ & “ are different things

In Ruby:

irb> 'string' == "string"
# => true

Ruby doesn’t care whether you single-quote or double-quote a string (except when it comes to string interpolation and escape sequences).

However, if you try that in Elixir, you get the opposite answer:

iex(1)> 'string' == "string"
# => false

So what’s going on here?

The first thing you should know is that Elixir (like Ruby) uses UTF-8 as its default encoding.

The next thing you should know is that Elixir treats anything between single quotes as a list of codepoints, called a charlist.

So basically, when you enter 'hello', Elixir really treats it as a list of the corresponding Unicode code points, i.e. [104, 101, 108, 108, 111]. In fact, if you enter that list of integers into an IEx session it will output 'hello'.
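To make that concrete, here is a small sketch you can paste into IEx or run as a script. It uses the ~c sigil, which is just another spelling of a single-quoted charlist; the variable name is only for illustration.

```elixir
# ~c"hello" is the sigil spelling of the single-quoted charlist 'hello'.
charlist = ~c"hello"

# A charlist is an ordinary list of integers.
IO.inspect(is_list(charlist))                         # true
IO.inspect(charlist == [104, 101, 108, 108, 111])     # true

# Converting between strings and charlists.
IO.inspect(String.to_charlist("hello") == charlist)   # true
IO.inspect(List.to_string(charlist) == "hello")       # true

# Ask IO.inspect to print the raw integers instead of the printable form.
IO.inspect(charlist, charlists: :as_lists)
```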

So what are they used for?

Well, from the Elixir docs:

In practice, char lists are used mostly when interfacing with Erlang, in particular old libraries that do not accept binaries as arguments.

However, there are some other uses for them.

For example, let’s say you wanted to create a small program that checks whether or not a word is an anagram of another word.

In Elixir this is pretty easy to solve:

defmodule Anagram do
  # Returns the candidates that are anagrams of base.
  def match(base, candidates) when is_list(candidates) do
    base_fingerprint = fingerprint(base)

    candidates
    |> Enum.filter(&(fingerprint(&1) == base_fingerprint))
  end

  # Downcase, convert to a charlist, then sort the code points.
  defp fingerprint(string) do
    string
    |> String.downcase()
    |> String.to_charlist()
    |> Enum.sort()
  end
end

iex(1)> Anagram.match("dog", ["apple", "cow", "god"])
["god"]

That’s all the Elixir code you need to find an anagram of a word.

Let’s run through this line by line.

  1. match/2 is called passing in two arguments. A base word which we want to find anagrams of and a list of words to check (candidates).
  2. Those candidates are then piped through to the Enum.filter function, which keeps only the candidates whose fingerprint matches the fingerprint of the base word.
  3. To check this, fingerprint/1 is invoked on each word. This function takes in the string to check, downcases it and then converts it to a charlist, e.g. "DoG" => "dog" => [100, 111, 103]. Finally, it sorts that list: [100, 111, 103] => [100, 103, 111].
  4. This leaves us with two lists of sorted integers. If these two match, then the words must be anagrams of each other.
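If you want to watch the fingerprint being built, the steps above can be run on their own outside the module. This is only an illustrative sketch; the variable names are mine.

```elixir
# The same pipeline as fingerprint/1, applied to a single word.
base =
  "DoG"
  |> String.downcase()
  |> String.to_charlist()
  |> Enum.sort()

candidate =
  "god"
  |> String.downcase()
  |> String.to_charlist()
  |> Enum.sort()

# Print the sorted code points as integers rather than as a charlist.
IO.inspect(base, charlists: :as_lists)

# Equal fingerprints mean the two words are anagrams of each other.
IO.inspect(base == candidate)
```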

Try to implement something as succinct as that in Ruby! Edit: Probably should have made this rhetorical, nothing’s going to beat Ruby on short syntax.

Ultimately, you probably won’t find yourself using charlists that much in Elixir, but it is definitely a benefit to be aware of them and cases like the above where you can make use of them.

Binaries & BitStrings & Strings

Now just to confuse you. Remember this from before?

'a' == "a" 
# => false

This is pretty much the same as saying [97] == "a" #=> false, which makes sense: a list of integers is not a string.

Well in Elixir, this is true:

<<97>> == "a"
# => true

This is probably a tad confusing considering that just above, [97] != "a".

This is because anything written between << and >> is a binary in Elixir. From the language docs:

A binary is a sequence of bytes. Those bytes can be organized in any way, even in a sequence that does not make them a valid string.

So basically, anything inside << >> is just a comma-delimited sequence of bytes.

A string is also a sequence of bytes. The only difference between the two is that a string’s bytes happen to form a valid UTF-8 sequence.

To see whether or not a binary is valid UTF-8, you can simply run:

iex(1)> String.valid?(<<226, 136, 134>>)
# => true
iex(2)> <<226, 136, 134>>
# => "∆"
iex(1)> String.valid?(<<226, 16, 134>>)
# => false

You can see there that the original 3-byte binary was valid UTF-8 (codepoint 8710, the triangle), but after changing the middle byte from 136 to 16, the binary became invalid UTF-8.
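The difference between bytes and code points is easy to poke at. A short sketch, using the same three bytes as above (the variable names are just for illustration):

```elixir
triangle = <<226, 136, 134>>
broken = <<226, 16, 134>>

# Three bytes, but only one character (one code point).
IO.inspect(byte_size(triangle))                     # 3
IO.inspect(String.length(triangle))                 # 1
IO.inspect(String.to_charlist(triangle) == [8710])  # true

# The ::utf8 modifier encodes a code point into its UTF-8 bytes.
IO.inspect(<<8710::utf8>> == triangle)              # true

# The broken sequence is still a perfectly good binary, just not a string.
IO.inspect(is_binary(broken))                       # true
IO.inspect(String.valid?(broken))                   # false
```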

To see the binary representation of a string, simply open up IEx and call i(string) e.g.

iex(5)> i "dogs"
Term
  "dogs"
Data type
  BitString
Byte size
  4
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded codepoints in it are printable.
Raw representation
  <<100, 111, 103, 115>>
Reference modules
  String, :binary

Look at all that lovely information!

You can also dictate the size of each segment in a binary. By default, each number is stored as an 8-bit integer. To spell that out explicitly you would write:

iex(1)> <<97::size(8)>>
"a"

If we enter a number greater than what fits in 8 bits, it is truncated to its lowest 8 bits. E.g.

iex(1)> <<257>>
<<1>>

But if we increase the size to 16 bits we get the full binary representation:

iex(1)> <<257::size(16)>>
<<1, 1>>
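The size specifier works in both directions: you can use it to build a binary and to pull numbers back out of one with pattern matching. A minimal sketch, with variable names of my own choosing:

```elixir
# Only the lowest 8 bits of 257 survive in a default 8-bit segment.
IO.inspect(<<257>> == <<1>>)                # true
IO.inspect(rem(257, 256))                   # 1

# With 16 bits there is room for the whole number.
IO.inspect(<<257::size(16)>> == <<1, 1>>)   # true

# Matching with the same size reads the two bytes back as one integer.
<<value::size(16)>> = <<1, 1>>
IO.inspect(value)                           # 257

# Or split the same 16 bits into two separate 8-bit integers.
<<high::size(8), low::size(8)>> = <<257::size(16)>>
IO.inspect({high, low})                     # {1, 1}
```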

What’s the benefit of being able to access a string in binary form?

Well, if you remember from my previous post on pattern-matching, there was this syntax for destructuring lists:

iex(1)> [ head | tail ] = [1,2,3,4]
iex(2)> head
# => 1
iex(3)> tail
# => [2,3,4]

This allowed us to grab the head of the list, do something with it then recurse on the tail.

Well, binaries allow us to do that same thing with strings!

Let’s say you wanted to take a string, and remove all the vowels.

Below is an example using binary pattern matching and recursion in Elixir.

(I thought it better to gist this snippet for the nice colors)
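In case the embedded gist doesn't render for you, here is a minimal sketch that matches the walkthrough that follows. It is a reconstruction from that description, not necessarily the original snippet: the module and function names (Binary, strip_vowels, check_vowel) come from the text, while the exact clause bodies and the vowel list are my assumptions.

```elixir
defmodule Binary do
  # Public entry point: start the recursion with an empty accumulator.
  def strip_vowels(string) when is_binary(string), do: strip_vowels(string, "")

  # End case: nothing left to process, so return what has been accumulated.
  defp strip_vowels("", acc), do: acc

  # Match the first byte as head and all remaining bytes as rest.
  # Assumes single-byte (ASCII) characters, as discussed in the text.
  defp strip_vowels(<<head::binary-size(1), rest::binary>>, acc) do
    strip_vowels(rest, acc <> check_vowel(head))
  end

  # Vowels become an empty string; every other letter is returned unchanged.
  defp check_vowel(letter)
       when letter in ["a", "e", "i", "o", "u", "A", "E", "I", "O", "U"],
       do: ""

  defp check_vowel(letter), do: letter
end

IO.puts(Binary.strip_vowels("Ned Kelly was an australian bushranger."))
```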

There is a lot going on here.

Again, if you read my previous article, the best place to start with a tail-recursive function is the end case. This occurs when your function has nothing left to process and needs to terminate.

In our case, it is when we have gone through each letter and there are none left.

The next function head is where the binary pattern matching occurs. Here we are matching head to the first byte (binary-size(1)) of the passed-in string.

Now since this is UTF-8, we could potentially get letters such as æ, which in fact take up two bytes. However, for the purpose of this tutorial, assume that all strings fed into this machine are going to be [0–9a-zA-Z].

We then match rest to the remaining bytes, however many there are, using the ::binary type specifier. Note that a ::binary segment without an explicit size can only be used at the end of a pattern match and not at the beginning or in the middle.

We then take the value matched to head and send it on its merry way to the check_vowel/1 function, which returns an empty string if it is a vowel and the letter itself if it isn’t.

Finally, it concatenates the value returned from check_vowel/1 with the accumulator and calls strip_vowels/2 once again, this time passing in the rest value.

Running this in IEx gives us:

iex(1)> Binary.strip_vowels("Ned Kelly was an australian bushranger.")
"Nd Klly ws n strln bshrngr."

Now this isn’t that hard to replicate in Ruby actually, e.g.:

However, this doesn’t look anywhere near as nice and performs nowhere near as well as the Elixir version. Edit: Without using Regex, meant to be an illustration of recursion (which is probably unfair, comparing Ruby to a functional lang).

We can even tidy the Elixir version up some more to make it even shorter.

Since strings in Elixir are also binaries, we can use string literals between the << and >> of a binary. This allows us to pattern match on specific values. E.g.

iex(2)> <<"Ned", rest::binary>> = "Ned Kelly"
"Ned Kelly"
iex(3)> rest
" Kelly"

So applying the above logic, our vowel stripping machine becomes:
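Here is a sketch of what that could look like, again reconstructed from the description rather than copied from the original gist. The module name BinaryLiteral is made up so it doesn't clash with the earlier version, and it still assumes single-byte characters.

```elixir
defmodule BinaryLiteral do
  # Public entry point: start the recursion with an empty accumulator.
  def strip_vowels(string) when is_binary(string), do: strip_vowels(string, "")

  # End case: nothing left to process.
  defp strip_vowels("", acc), do: acc

  # One clause per vowel: match the literal at the front and drop it.
  defp strip_vowels(<<"a", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"e", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"i", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"o", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"u", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"A", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"E", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"I", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"O", rest::binary>>, acc), do: strip_vowels(rest, acc)
  defp strip_vowels(<<"U", rest::binary>>, acc), do: strip_vowels(rest, acc)

  # Anything else: keep the byte and carry on with the rest.
  defp strip_vowels(<<head::binary-size(1), rest::binary>>, acc) do
    strip_vowels(rest, acc <> head)
  end
end

IO.puts(BinaryLiteral.strip_vowels("Ned Kelly"))
```

The separate vowel check disappears because the function heads do the checking themselves.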

There are plenty more examples of using binaries in Elixir and of their benefits.

IO Lists

IO lists are another powerful feature for dealing with strings in Elixir. More importantly, for dealing with I/O operations and strings. Nathan Long from Big Nerd Ranch explains it well:

An IO list just means “a list of things suitable for input/output operations”, like strings or codepoints. Functions like IO.puts/1 and File.write/2 accept “IO data”, which can be either a simple string or an IO list.

But what is an IO list?

In its simplest form, an IO list looks just like any other list in Elixir:

name = "Harry"
IO.puts ["Hello ", name]
#=> Hello Harry

Now you may be asking what the benefit of the above is, over the below:

name = "Harry"
IO.puts "Hello " <> name
#=> Hello Harry
#...or
IO.puts "Hello #{name}"
#=> Hello Harry

At first glance, it might not look like there is much difference considering they all output the same thing. However, the way in which they reach that outcome is fairly different.

In order to do one of the above concatenation examples the BEAM has to do a few things:

  • Allocate memory in order to create the strings
  • Concatenate the two strings together into a third string
  • Run the garbage collector to clean up the old ones

Whereas, using an IO list the BEAM:

  • Doesn’t concatenate the strings
  • Doesn’t create a new string to store the final result = less RAM
  • Doesn’t have to garbage collect that final string = less CPU

In the words of Nathan Long:

All the BEAM had to do was to ask the OS to copy each byte of data to the file.
The only place where the concatenation happens is in the file itself.

So essentially, instead of having the BEAM create the string, the BEAM asks the OS to grab these chunks of memory and throw them into the file, or in this case, stdout (although correct me if I’m wrong here).

In fact, using IO lists is how the Phoenix framework is able to generate lightning fast templates (when compared with Rails).
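IO lists can be nested as deeply as you like, which makes them handy for building output in a loop. Below is a small sketch; the extra names in the list are made up for the example. IO.iodata_to_binary/1 is only used here to show what the flattened result would be, you don't need it just to print.

```elixir
names = ["Harry", "Ron", "Hermione"]

# Build a nested IO list; no string concatenation happens here.
greetings = Enum.map(names, fn name -> ["Hello ", name, "\n"] end)

# It is still just a list of lists of strings.
IO.inspect(is_list(greetings))            # true

# The total number of bytes can be computed without flattening.
IO.inspect(IO.iodata_length(greetings))   # 37

# IO functions accept the nested structure as it is.
IO.write(greetings)

# Flatten into a single binary only when one is really needed.
flat = IO.iodata_to_binary(greetings)
IO.inspect(flat)
```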

Further Reading

This is just the basics of dealing with strings in Elixir. If you want to know more, then I highly recommend checking out the following:

AND DEFINITELY WATCH THE FOLLOWING:

https://www.youtube.com/watch?v=zZxBL-lV9uA&list=PLE7tQUdRKcyYoiEKWny0Jj72iu564bVFD&index=16