We Learned Us Some Erlang

Reflections On an Unconventional Language Choice

Ransom Richardson
Tap the Mic

--

There’s nothing like the greenfield of a brand new startup to foster unconventional technical choices. When we started Talko three years ago, one of the riskiest technical decisions we made was to build the backend service in the Erlang programming language.

We had heard good things about the language, such as WhatsApp supporting one million connections on a single machine. We aspired to need to scale like that. Plus, Erlang was designed for telephony and we are building a voice communications system. It seemed to make perfect sense and we wanted to give it a try. The only problem was that none of us had ever written any Erlang code. But we love to learn and are always up for a great challenge!

Three years later we’re still using Erlang and loving it. It’s fast, scalable and reliable. We have just over 25k lines of Erlang code — which is wonderfully concise given that our service implements registration, contact matching, group and call creation and membership, near real-time communications, notifications, user awareness, audio processing and much more.

For those of you who haven’t heard of Erlang, here’s what the official site says about it:

Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability. Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging. Erlang’s runtime system has built-in support for concurrency, distribution and fault tolerance.

Want some of that? Go learn you some Erlang for great good!

The Good

Concurrency

Prior to Talko I had developed a service using C#, and I took Jeffrey Richter’s class on “.NET Threading in C#”. The basic idea was that threads are really expensive, so to build scalable services you need to avoid blocking them. There are some clever ways to write code that looks almost sequential but doesn’t block.

A very astute co-worker wondered why threads need to be so expensive. Wouldn’t it be so much simpler if threads were cheap and you could write code that executed sequentially? Dream on, right?

A few months later I learned that this is exactly how Erlang supports concurrency. The Erlang VM supports millions of lightweight processes running simultaneously. The code in each process executes sequentially. Since data is immutable and processes share no memory, concurrently running processes can’t interfere with each other. Erlang has built-in support for the actor model, which allows you to send messages between processes. Taken together, lightweight processes and the actor model have given us the tools to easily write understandable concurrent code.
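As a rough illustration of that model, here is a minimal sketch in which a parent process spawns N lightweight processes and collects one message from each. The module and function names (concurrency_sketch, run/1, worker/2) are made up for this example and are not taken from the Talko code.

```erlang
-module(concurrency_sketch).
-export([run/1]).

%% Spawn N lightweight processes. Each one squares its own number and
%% sends the result back to the parent as a message.
run(N) ->
    Parent = self(),
    Pids = [spawn(fun() -> worker(Parent, I) end) || I <- lists:seq(1, N)],
    %% Collect one reply per worker. Matching on each worker's pid keeps
    %% the results in the order in which the workers were spawned.
    [receive {Pid, Result} -> Result end || Pid <- Pids].

%% Plain sequential code: no locks and no callbacks.
worker(Parent, I) ->
    Parent ! {self(), I * I}.
```

Calling concurrency_sketch:run(10000) starts ten thousand processes, which stays well within the VM's default process limit. The worker itself contains nothing concurrency-specific beyond the single message send.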

Error Handling

One paper that looked at failures in distributed systems found that

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.

In our use of Erlang we have had to write very little error handling code. Instead, we just follow the Erlang “let it crash” philosophy. If a client sends data that we didn’t explicitly program the service to handle, the process that handles that connection will crash. This isn’t a big deal: after the crash the connection is closed and the client can reconnect.

The lack of shared state in Erlang enables this “let it crash” behavior. Because there is no shared state, we know that the process that crashed is isolated: it couldn’t have been in the middle of some operation that leaves shared state inconsistent. And with the actor model there are no synchronization locks, so the crashed process couldn’t be holding a lock that blocks other processes.

For some of our processes that provide system services, we take advantage of Erlang’s supervisor model to automatically restart a process that has crashed. This takes very little code and results in a robust system.
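A minimal sketch of what “let it crash” plus a supervisor can look like, using the standard OTP supervisor behaviour. Everything named here (crash_sketch, the doubler worker, double/1) is hypothetical and only for illustration. The worker has no error handling at all: bad input simply crashes it, and the supervisor starts a fresh copy.

```erlang
-module(crash_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0, double/1]).

start_link() ->
    supervisor:start_link({local, crash_sketch_sup}, ?MODULE, []).

%% Supervisor callback: restart the worker whenever it dies, allowing
%% at most 5 restarts within 10 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => doubler,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Called by the supervisor, both at startup and after every crash.
start_worker() ->
    Pid = spawn_link(fun loop/0),
    register(doubler, Pid),
    {ok, Pid}.

%% No defensive code: N * 2 crashes the process if N is not a number.
loop() ->
    receive
        {From, Ref, N} ->
            From ! {Ref, N * 2},
            loop()
    end.

%% Client call. Gives up after 500 ms if the worker crashed on our input.
double(N) ->
    Ref = make_ref(),
    doubler ! {self(), Ref, N},
    receive
        {Ref, Result} -> {ok, Result}
    after 500 ->
        timeout
    end.
```

After crash_sketch:start_link(), a call such as crash_sketch:double(oops) kills the worker and times out, while a later crash_sketch:double(4) is answered by the restarted worker. The crash is logged but needs no recovery code in the worker.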

The success of the Erlang error recovery model in practice is told by Joe Armstrong:

Erlang is used all over the world in high-tech projects where reliability counts. The Erlang flagship project (built by Ericsson, the Swedish telecom company) is the AXD301. This has over 2 million lines of Erlang.

The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable … but we did 9.

Why is this? No shared state, plus a sophisticated error recovery model.
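The arithmetic behind those figures is easy to check. The following sketch (the module name nines_sketch is made up, and a 365-day year is assumed) converts a number of nines into allowed downtime per year.

```erlang
-module(nines_sketch).
-export([downtime_seconds_per_year/1]).

%% Allowed downtime, in seconds per 365-day year, for an availability
%% written with the given number of nines (5 means 99.999%).
downtime_seconds_per_year(Nines) when is_integer(Nines), Nines > 0 ->
    SecondsPerYear = 365 * 24 * 60 * 60,
    SecondsPerYear * math:pow(10, -Nines).
```

Five nines works out to a little over five minutes of downtime per year, in line with the figure quoted above. Nine nines allows only about 32 milliseconds per year.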

And More

With Erlang’s support for concurrency and error handling we have been able to write simple sequential code, while getting the benefits of high scalability and reliability. But there are a number of other great features of Erlang.

State machines are in my experience a tremendously underused programming tool. In the past I’ve simplified a great deal of code by rewriting it to use explicit state machines. With Erlang, using state machines is often the path of least resistance, and the code is much more likely to be written the right way the first time.
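To give a flavor of what this looks like, here is a small sketch of a call lifecycle written with OTP’s gen_statem behaviour, using one function per state. The states, the events (dial, answer, hang_up) and the module name are invented for the example. The original service may well have used a different behaviour, such as the older gen_fsm.

```erlang
-module(call_fsm_sketch).
-behaviour(gen_statem).
-export([start_link/0, event/2]).
-export([callback_mode/0, init/1, idle/3, ringing/3, connected/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

%% Send an event and get back the state the machine ended up in.
event(Pid, Event) ->
    gen_statem:call(Pid, Event).

%% One callback function per state.
callback_mode() -> state_functions.

init([]) -> {ok, idle, #{}}.

idle({call, From}, dial, Data) ->
    {next_state, ringing, Data, [{reply, From, ringing}]};
idle({call, From}, _Other, Data) ->
    {keep_state, Data, [{reply, From, idle}]}.

ringing({call, From}, answer, Data) ->
    {next_state, connected, Data, [{reply, From, connected}]};
ringing({call, From}, hang_up, Data) ->
    {next_state, idle, Data, [{reply, From, idle}]};
ringing({call, From}, _Other, Data) ->
    {keep_state, Data, [{reply, From, ringing}]}.

connected({call, From}, hang_up, Data) ->
    {next_state, idle, Data, [{reply, From, idle}]};
connected({call, From}, _Other, Data) ->
    {keep_state, Data, [{reply, From, connected}]}.
```

Each state lists the events it accepts, so the legal transitions can be read straight off the function clauses. In this sketch, any event a state does not expect is ignored and the current state is reported back.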

The Erlang VM uses very little memory. This enables us to run our entire server environment (including at least three separate servers running Erlang) on a laptop. And our code builds very quickly, so our edit, compile, debug cycle is very short.

Erlang’s pattern matching is tremendously powerful and enables concise code. Erlang excels at parsing binary data. Here’s an example of how you can parse an IPv4 packet header and assign all the data to variables:

<<Version:4, IHL:4, TypeOfService:8, TotalLength:16,
  Identification:16, FlagX:1, FlagD:1, FlagM:1, FragmentOffset:13,
  TTL:8, Protocol:8, HeaderCheckSum:16,
  SourceAddress:32,
  DestinationAddress:32,
  Rest/binary>> = Packet.

How could that be any easier?
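The same pattern can sit directly in a function head, where a failed match doubles as input validation. The sketch below is illustrative only: the module name, the map returned and the sample packet (with its checksum left as zero) are all made up for the example.

```erlang
-module(ipv4_sketch).
-export([parse/1, sample/0]).

%% Parse the fixed 20-byte IPv4 header. A binary that is too short, or
%% whose version field is not 4, falls through to the second clause.
parse(<<4:4, IHL:4, _TypeOfService:8, TotalLength:16,
        _Identification:16, _Flags:3, _FragmentOffset:13,
        TTL:8, Protocol:8, _HeaderCheckSum:16,
        S1:8, S2:8, S3:8, S4:8,
        D1:8, D2:8, D3:8, D4:8,
        _Rest/binary>>) ->
    {ok, #{ihl => IHL,
           total_length => TotalLength,
           ttl => TTL,
           protocol => Protocol,
           source => {S1, S2, S3, S4},
           destination => {D1, D2, D3, D4}}};
parse(_Other) ->
    {error, not_ipv4}.

%% A hand-built header: version 4, IHL 5, total length 20, TTL 64,
%% protocol 6 (TCP), 192.168.0.1 -> 10.0.0.2. The checksum is a placeholder.
sample() ->
    <<16#45, 0, 0, 20,
      0, 1, 0, 0,
      64, 6, 0, 0,
      192, 168, 0, 1,
      10, 0, 0, 2>>.
```

Calling ipv4_sketch:parse(ipv4_sketch:sample()) returns the parsed fields in a map, with the addresses already split into four-element tuples. A binary that does not look like an IPv4 header yields {error, not_ipv4}.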

The Bad

Small Community

Erlang has a great community, helpful and friendly overall, but it is just too small. As a result we’ve had to write code ourselves that would already exist for a more popular language. For example, I wrote the Erlang client for DynamoDB and maintain erlcloud, the main Erlang AWS client library. We love being able to contribute to open source, but it is a lot of effort for a small startup, and we wouldn’t have had to write this code ourselves if we had used a more popular language.

Steep Learning Curve

Erlang has a reputation for being hard to learn. It is very different from most other programming languages. It requires you to approach problems in a different way, particularly if you aren’t used to functional programming and immutable state.

We have found that the way we develop in Erlang is very different from how we have developed in the past. For example, we spend very little time in a debugger and make much more use of “printf debugging”. There are several reasons for this: immutable state makes many issues easier to debug, the ease of rebuilding and reloading code makes printf debugging quick and effective, and the Erlang debugger is rudimentary.

The most difficult part of getting started with Erlang for us was figuring out the right tools and procedures to use to build, deploy, run and monitor our Erlang system. We’ve found a system that works for us, but it took a while and may be sub-optimal. (The details are beyond the scope of this post and may be a future topic — let us know if you are interested.)

Not the Right Tool for Everything

We’ve found Erlang to be great for writing our core service code. But for scripting and operations we primarily use Python, which has much better support for string handling and a much larger set of libraries to choose from.

And even for some of our core code, where we are committed to Erlang, it sometimes comes up short. For example, some problems fit very nicely into an object-oriented pattern. In our service we have code to cache information about accounts and calls. In an object-oriented world these would share a base class that implements the common code necessary to deal with things like cache lifetime. In Erlang it is possible to share code in this way, but it feels much less natural than it does in an object-oriented language.
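One way such sharing could look, as a sketch rather than a description of the actual Talko code: the common lifetime logic lives in one module, and each kind of cache supplies its own loader function instead of subclassing. The names cache_sketch, new/2 and fetch/3 are hypothetical.

```erlang
-module(cache_sketch).
-export([new/2, fetch/3]).

%% A cache is plain data: the function that loads a missing entry, the
%% lifetime of an entry in milliseconds, and the entries themselves.
new(LoadFun, TtlMs) when is_function(LoadFun, 1), is_integer(TtlMs) ->
    #{load => LoadFun, ttl => TtlMs, entries => #{}}.

%% Return {Value, UpdatedCache}. Now is the current time in milliseconds,
%% passed in by the caller so the lifetime logic is easy to test.
fetch(Key, Now, Cache = #{load := Load, ttl := Ttl, entries := Entries}) ->
    case Entries of
        #{Key := {Cached, Expires}} when Expires > Now ->
            %% Still fresh: serve the cached value.
            {Cached, Cache};
        _ ->
            %% Missing or expired: load it again and restart its lifetime.
            Loaded = Load(Key),
            NewEntries = Entries#{Key => {Loaded, Now + Ttl}},
            {Loaded, Cache#{entries := NewEntries}}
    end.
```

An account cache and a call cache would then each be created with cache_sketch:new/2 and a different loader. It works, but the caller has to thread the updated cache through every call, which hints at why this feels less natural than inheriting from a base class.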

The Verdict

We first saw Erlang truly live up to its promises when we first tested the scalability of our service. On a modest-sized EC2 instance we were easily able to scale to support hundreds of thousands of client connections. It was amazing how easy it was, but unfortunately after a couple of hours the server would crash. It took a number of days to track down the memory leak, and we were never able to fully explain the memory usage behavior we were seeing. The fix was simple: we just had to pass an additional option to the Erlang SSL library. This is somewhat typical of our experience with Erlang. It is extremely powerful, but it can be hard to get all the details right.

So on the one hand Erlang has gotten a lot of the big, important things about service development right. It has made it easier to develop a scalable and reliable service than in any other language we have experience with. On the other hand we’ve had to do more work on libraries than we would have with a more popular language, and many things are more difficult and frustrating than they need to be.

Overall Erlang has enabled us to quickly build a service which is scalable and reliable. In addition, because Erlang is so different from languages I’ve used in the past, learning it and using it daily has changed the way I think about programming and made me a better developer. Despite the drawbacks I feel more productive in Erlang than I have in any other language and would definitely choose Erlang again.

Let me know if you’ve tried Erlang and how it’s worked for you. Contact me here on Medium or at ransomr@talko.com. I’ll also be speaking at the SF Erlang Factory Conference in March.
