Erlang String Handling

How to be efficient at handling string data in Erlang

The cat is out of the bag. Now, since WhatsApp had success using Erlang, more people will gravitate toward the platform. One of the common things that will happen as a result is that people who do not understand the string handling capabilities of the Erlang BEAM VM will fail to use it correctly. This costs a lot of unneeded clock cycles and this post aims to thwart it.

First of all, since this is rather VM specific, it also applies to Elixir users. The VM basically handles strings the same way, so you should be thinking in the same general direction as I am doing here, for Erlang. Though Elixir, to the best of my knowledge, maps its string type onto binaries.

The first thing to understand is that Erlang is a functional language first and foremost. This means, among other things, data will be immutable by construction and there is no way to circumvent that. It also means certain tricks pertaining to ephermeral mutability is out the window. And it would be outright against the spirit of Erlang to have mutable data in any form.

When you should avoid using Erlang in the first place

Depending on the problem, it may not be efficient to spend your time implementing the code in Erlang. Unless you plan to utilise some of the other Erlang strengths, like robustness, you may be better off by just implementing your system in Go or Clojure. Alternatively, you can write a NIF and then handle the core kernel in C. Don’t use Erlang when another tool solves the job in a way better and way faster way. A rough estimate is that C code is 20-50 times faster for tight computational kernels.

The Erlang language is built for handling more complex interactions where the raw execution speed matter less and the program structure matters more. If your system has no specific kernel which is executing for heavyweight processing, then Erlang might suit your needs. If your system is bound to memory or I/O then Erlang may be very suitable.

You goal is to get most processing out of the emulator loop and onto the BEAM VM core. The runtime core is very efficient at handling large amounts of data and many structures are built such that they can sustain heavy load. A good way to look at this is with the linux perf tool. Run perf on the beam.smp emulator and if beam_emu.c is spending too much time, chances are you are processing heavy. Even processing-heavy applications I have written tend to have 20% time spent in the emulator loop. The consequence is you can’t win much by replacing the interpreter with a JIT or a native code compiler. You can only, at best, squeeze the 20% so it is limited how much faster you can get the code by doing this.

Aside: A JIT would still be useful so more of the Erlang system could be written in Erlang rather than C. But for most code, it is not given a priori it would yield a speedup.

What is a string()?

[$H, $e, $l, $l, $o]

The overhead of this representation is massive. Each Cons-cell use 8 bytes for the code point and 8 bytes for the pointer to the next value. This means that the 5-byte ASCII-representation of “Hello” is 5*16 = 80 bytes in the Erlang representation.

Strings are shared. If we write

L = "World",

followed by

A = "Hello " ++ L.

Then A and L shares the tail, due to persistence and immutability. In some cases this can be utilised to limit the amount of data present in a system, and somewhat help with the massive data blowup you will otherwise see.

Erlang is not unique in having the string-as-integer-list representation. It is the same representation which is used by Haskell. But since no type information is present, you can’t discriminate between a list of integers and a string. You will have to know from the code at hand.

There are a lot of disadvantages of this representation. It results in memory blowup, more cache misses, more data bandwidth use, and so on. On the other hand, all the normal list-processing functions are present and can be used on strings alike. So to manipulate strings, you have access to four modules:

[string, lists, unicode, re]

The module string contains valuable helper functions for manipulation of strings. Most importantly:

  • string:tokens/2 which can be used to split a string into parts, and
  • string:join/2 which can join strings together with a separator.

I normally only use plain strings like these when the data sizes are rather small and we are not on a critical code path. Otherwise, you need to do something else in order to make things work quickly and efficiently.

The lists module contains many functions which are useful for working with strings. As an example, lists:prefix/2 can be used on strings as well.

The re module implements regular expressions. Here is the basic rule:

Do NOT use the re module

Usually, there are better ways to handle data than to supply a regular expression to handle it. Mind you, I utterly love using regexes in editors and I also like using them in lexers. But for most normal data processing, regular expressions are overkill. Also, the regular expression engine in Erlang has been altered so it preempts. This avoids a long-running regex to destroy the VM’s scheduling mechanism.

Typical code will have maybe one or two calls to the re module per 5000 lines of code. If you have more than that, chances are you are trying to program PERL in Erlang. And that is a bad idea.

However, the re module is way faster than the older regexp module which was deprecated.

The binary() type

  • A pointer to the original binary
  • An offset
  • A size

Note that sub-binaries work a bit like Go’s slice construction in this way. The VM is built such that passing around binaries and subbinaries are always efficient. The trick is immutability, which allows the system to pass pointers on binaries rather than passing the binary value itself.

Binaries, like in Go, also has extra capacity in the sense that in some cases a binary can be appended to efficiently without having to copy the binary data to a new place. The system will automatically extend binaries with extra capacity when they are copied, ensuring efficient append.

When programming Erlang, the compiler and VM will automatically generate binaries and sub-binaries for you. Write your code in a straightforward and readable manner first. Then compile your program with


to have the compiler report on which binaries were not optimised in code which is heavily traversed by the program.

Binaries can be pattern matched. This is extremely efficient, but sometimes you can’t write a matching rule since they essentially work from the beginning always. You can’t “search” in a binary until you hit something which matches by a single pattern match. The way to handle this problem is by using the binary module:

  • binary:split/3 is extremely powerful. It is the binary variant of string:tokens/2 but it is returning shared data and so does only produce a small amount of garbage. The split string simply refers into the original binary through sub-binaries. Be very aware of the option “[global]” which will allow you to split the binary into more than two parts.
  • binary:match/3 is your generic search routine for picking out parts deeply in binary data.
  • binary:compile_pattern/1 allows you to build some simple compiled patterns like a weaker (but way faster) regular expression search
  • binary:copy/1 forces a copy of a binary. This is useful if you have a 1 megabyte binary and you have found a 45 byte sequence in it—and you only want that sequence. Then you can copy the sequence which means you don’t hold on to the 1 megabyte binary anymore—in turn allowing the garbage collector to collect it. This is extremely useful if you are cutting input into pieces (with split/3) and storing it at rest for a long time. For instance in ETS.

The iodata() type

  • A list of integers in the range 0..255 is IOData.
  • A binary is IOData.
  • Lists of IOData is IOData.

In particular, this means you can form IOData by collecting IOData as lists. This means string concatenation in the language is O(1). Example:

p(IOData) -> ["<p>", IOData, "</p>"].

The p/1 function given here will wrap IOData in a standard paragraph section in HTML, but it will not reallocate any data, nor will it generate any garbage. The sections “<p>” and “</p>” are constants, so the only allocation that will happen will be for the front of the list, two list elements.

Most IO functions in Erlang understands IOData directly. This means you can avoid having to flatten data, but just send the IOData out over a socket or the like. It avoids costly allocations and copies on the IO pipe in your program. Highly recommended!

A good way to gauge how well thought out a library is is to look at how well it handles IOData. If it doesn’t and requires you to explicitly call iolist_to_binary/1 then chances are the library is not that well written.

Handling unicode() data

-module(z).-export([len/1]).len(B) when is_binary(B) ->
len(B, 0).
len(<<>>, K) -> K;
len(<<_/utf8, Rest/binary>>, K) -> len(Rest, K+1).

This is useful together with the ability to input character strings as UTF-8:

Erlang R16B03-1 (erts-5.10.4) [source-ce3d6e8] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]Eshell V5.10.4 (abort with ^G)
2> z:len(<<”Rødgrød med fløde”/utf8>>).

In Release 17.0, the default will be UTF-8 in Erlang files. This will probably have some deep effects on Erlang source code, but it will ultimately help getting Unicode into Erlang. Release 18.0 is planned to accept unicode atoms as well, to open up the design space.

Attack Vectors

Never never NEVER work on stringly typed data

When data is “stringly” typed, it means that your data has no structure and everything are represented in strings. This is fairly expensive to work on for a system since you are limited to use the string-handling code. Parsing is always expensive and this hurts your processing speed. Some languagues, like awk or perl are built to process these kind of things. But you rather do not want to do this processing in Erlang.

The way you avoid stringly typed data is to take the input and transform it as early as possible into structured data inside your code. That way, you can avoid working on the string data, and you only need it in a few places. The more structure you can attach to the data, the better.

The primary problem is when you have to work with a bad data format. Again, the trick is to turn the bad format into something sensible quickly and then process it as sensible data.

Erlang is designed to work with data that has structure. With structure, you can pattern match which is fast. Without structure, you have to rely on standard techniques and this is almost always going to be a pain in the language. So don’t do it. Convert data into a structured format and then proceed by handling the structure with pattern matches.


Erlang is immutable because it eliminates a large source of programming errors and programming mistakes. After all, the value of an incorrect program is lower than a correct one. This stance is highly unlikely to be changed, since the safety guarantee provided by immutability is part of the Erlang-DNA.

When you have control of data

The man page for inet

erl -man inet / setopts\( RET

describes common socket options you can set on a socket. By setting the packet option you can control how the system decodes inbound data. Most interesting, you can set ASN.1 BER encoding or Line-wise encoding.

If you value speed, you should consider an efficient binary format.


They differ in how they are implemented. The jsx library is implemented in pure Erlang and is the slower of the two. The jiffy library uses a C NIF (Natively Implemented Function) to run the encoder and decoder and is about 10 times faster.

Beware: the jiffy library can’t be used to decode large JSON data sets. The decoder is not a well-behaved NIF and as such it can mess up the schedulers if it is used to decode large data strings. If the JSON takes more than 1ms to decode, you should probably avoid using it. In Release 17.0, we get so-called dirty schedulers which can be used to work around this problem.

The other problem with JSON data is the internal Erlang representation. Right now, there is no isomorphic mapping for JSON objects/dictionaries into Erlang. This will be fixed in Release 17.0 as well, since it includes maps so you can obtain a native mapping of objects/dictionaries into Erlang data. I also have a side project on the run to properly handle JSON to Record encoding in Erlang, but this still ongoing work. And it will take some time before it is fully implemented.

On the other hand, note that JSON will never be a fast interchange format. If you use JSON to move large amounts of data, you are screwed. Plain and simple. You best bet is then to hope data are sent as many small pieces so you can use jiffy on them. Or wait till 17.0 so you can get jiffy in a dirty-scheduler variant.


erl -man file

Note that in order to have fast file I/O you need to open the file in “raw” mode, use binaries, and you can usually also benefit by following some of the advice in the section about performance. The general rule of IOData applies: If you supply IOData, the underlying file driver is able to map this onto a unix pwrite(2) call which is highly efficient.

Not opening in raw mode does have its benefits though, because you can then get the IO subsystem to work. This subsystem allow you to open a file on a foreign file system (on another node) and then operate on that file. If you don’t need the high speed, this is desirable in many situations, should your system span multiple nodes.

Closing off

But these are for another post, or another piece of code.

CS hacker, researcher, and investigator.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store