Erlang: No Process Was Harmed in the Making of this Post

Erlang is a functional programming language designed to help produce reliable software that efficiently scales horizontally. Erlang gained a lot of publicity when WhatsApp published a blog post touting their 2 million Erlang-powered TCP connections (2,277,845 to be exact) on a single server, with only 37.9% CPU usage. The cherry on top is that WhatsApp achieved this in 2012, when they employed only 30 people.

Although Ericsson began development on Erlang in 1986, the project wasn’t open sourced until 1998. Joe Armstrong, Robert Virding and Mike Williams are credited with Erlang’s creation. Fun fact: Erlang was developed on DEC’s VAX 11/750. The 11/750 had a CPU clock speed of 3.125 MHz. To give you an idea of the kind of computing power we’re talking about, the iPhone 6 is approximately 65536 times faster than the 11/750, and the Raspberry Pi 3 is about 3840 times faster. Therefore, from a computational power standpoint, it is safe to say that Erlang could run on any device today.

Ericsson is a telecommunications equipment and services company based in Stockholm. Telecommunication companies handle the transmission of signals, so anything from streaming last night’s episode of Game of Thrones to handling emergency “911” calls. In 1986, a large portion of Ericsson’s business was routing wired telephone lines. These systems needed to be extremely reliable, as users tend to be unhappy when their 911 calls don’t go through. Mike Williams defined their problem domain by these criteria:

  • Actions must be performed at a certain point in time or within a certain time.
  • System may be distributed over several computers.
  • The system is used to control hardware.
  • The software system is very large.
  • The system exhibits complex functionality such as feature interaction.
  • The systems should be in continuous operation over many years.
  • Software maintenance (reconfiguration etc.) should be performed without stopping the system.
  • There are stringent quality, and reliability requirements.
  • Fault tolerance both to hardware failures, and software errors, must be provided.
  • The system must be able to handle very large numbers of concurrent activities.

Mike Williams and the two other engineers from Ericsson couldn’t find a language that they felt would allow them to efficiently solve these problems. After some prototyping, they decided to write their own language from scratch, and Erlang was born. Erlang was named after the father of traffic engineering and queueing theory, Agner Krarup Erlang, and his erlang unit, which is defined as:

“The erlang (symbol E) is a dimensionless unit that is used in telephony as a measure of offered load or carried load on service-providing elements such as telephone circuits or telephone switching equipment.”

Erlang is all about being as small and efficient as possible, so being named after a unit of measurement seems fitting. Erlang is first and foremost a functional language, and there are no classes, objects, loops, or even traditional if statements. Instead, everything is handled with recursion and pattern matching. This means that instead of having a function like this:

You’ll get something that looks like this:
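A minimal sketch of such a module, reconstructed from the walkthrough that follows. The starting value of 0 is my assumption; the function names are the ones the walkthrough mentions.

```erlang
-module(count_to_ten).
-export([count_to_ten/0]).

%% Public entry point: start counting (starting at 0 is an assumption).
count_to_ten() -> do_count(0).

%% Base case: once the argument is 10, stop recursing and return it.
do_count(10) -> 10;
%% Default case: recurse with the argument incremented by one.
do_count(N) -> do_count(N + 1).
```

After compiling the module, calling count_to_ten:count_to_ten() from the Erlang shell returns 10.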

So what’s happening in the Erlang code above? In the first line a module is declared, named count_to_ten. The second line exports a function named count_to_ten, so that it can be called from outside the count_to_ten module. The count_to_ten function works by recursively calling do_count and incrementing the argument. You may notice there are two do_count clauses, do_count(10) and do_count(N). do_count(N) handles all the default cases, but once 10 is passed in as the argument, do_count(10) is chosen instead (pattern matching). Instead of making another recursive call, it returns the 10.

It’s also important to note that a lot of people find Erlang’s syntax to be less than friendly. Personally, I agree that Erlang doesn’t have the best syntax, but syntax is a matter of opinion, and judging by the fact that Erlang has been around for about 30 years, Erlang developers eventually get used to it.

Another key aspect of Erlang is that its data is immutable. Erlang doesn’t even have true variables, only constants: once a name is bound to a value, it can’t be bound to a different one. For example, this is illegal:
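A minimal sketch of what goes wrong, using a hypothetical module and function names of my own. Once X is bound, X = X + 1 is treated as a pattern match rather than an assignment, so it fails with a badmatch error at run time. The second function shows the usual workaround of binding a fresh name.

```erlang
-module(single_assignment).
-export([rebind/1, increment/1]).

%% Attempts to "reassign" X. Because X is already bound, `X = X + 1`
%% tries to match X + 1 against X, which fails with a badmatch error.
rebind(X) when is_integer(X) ->
    try
        X = X + 1,
        X
    catch
        error:{badmatch, _} -> cannot_rebind
    end.

%% The idiomatic alternative: bind the new value to a new name.
increment(X) when is_integer(X) ->
    X1 = X + 1,
    X1.
```

Here single_assignment:rebind(1) returns cannot_rebind, while single_assignment:increment(1) returns 2.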

This means every time a value changes, a new variable needs to be declared. OK, immutability is cool and all, but why all the trouble?

Mutability makes concurrency hard.

Conventionally, languages have solved concurrency problems by using shared memory. This means two processes can read and write the same data. To make this kind of concurrency safe, computer scientists have crafted all sorts of clever solutions, and all of a sudden developers had to be concerned about semaphores, mutexes, monitors, spin locks, critical regions, futures, locks, caches, threads, thread-safety, etc. The point being, making concurrent programs is complicated.

What happens in most concurrent programs is that processes take turns reading and writing data. If a process gets to a point where it’s editing some critical piece of shared data, it raises a flag letting everyone else know that they need to wait their turn. Once the process is done, it lowers its flag and the next process can continue its work. What happens when a process fails to raise or lower its flag? Conventionally, this is when a program will freeze or crash. Erlang gets rid of these issues by making sure every process agrees to never alter any data. Now there’s no need for a flag.

However, agreeing not to alter data isn’t quite enough. Processes still need to work together, and that requires some form of shared data or communication. Erlang’s solution is to give each process its own mailbox. When one process needs to share data with another, it does so by sending a message, and the other process receives it from its mailbox. Since no one is editing the data, each process is free to do what it likes with any data it has access to. In computer science this method of concurrency is often called the actor model. It’s a rather elegant solution to the shared data problem. It also means that processes can fail neatly. With the help of Erlang’s standard library, OTP, creating supervisor processes that can watch and handle process failures is trivial. Because no other process was harmed in the making of this failure, failures are isolated to the process. Often the solution is as simple as a process restart.
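The mailbox idea can be sketched in a few lines. The module name, function names and message shapes below are hypothetical and chosen only for illustration. The sketch uses the built-in spawn, !, receive and spawn_monitor primitives rather than an OTP supervisor, so it shows the underlying mechanism, not how a production system would be structured.

```erlang
-module(actor_demo).
-export([start/0, ping/1, crash_demo/0]).

%% Spawns a process that sits waiting for messages in its mailbox.
start() ->
    spawn(fun loop/0).

%% Sends a ping to Pid and waits up to one second for the matching reply.
ping(Pid) ->
    Ref = make_ref(),
    Pid ! {ping, self(), Ref},
    receive
        {pong, Ref} -> pong
    after 1000 ->
        timeout
    end.

%% The receiving side: handle one message, then loop to wait for the next.
loop() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            loop();
        stop ->
            ok
    end.

%% Failure isolation: the worker exits, and the caller only learns about
%% it through a 'DOWN' message. The caller itself keeps running.
crash_demo() ->
    {Pid, MonRef} = spawn_monitor(fun() -> exit(boom) end),
    receive
        {'DOWN', MonRef, process, Pid, Reason} -> {worker_died, Reason}
    after 1000 ->
        timeout
    end.
```

Calling actor_demo:ping(actor_demo:start()) returns pong, and actor_demo:crash_demo() returns {worker_died, boom}. A supervisor builds on the same notification mechanism and typically reacts by restarting the worker.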

Everything about Erlang tightly constrains its developers, but this means Erlang developers are free to write code how they want. They don’t have to think about the whole software architecture, just the function that they’re working on. This goes back to the Erlang unit. A function is the smallest unit of functionality in a software program. When developing in Erlang, developers are free to focus on building these small units, and as long as they take the same parameters and return the correct type, everything works fine. It’s similar to a modern-day microservice architecture, which coincidentally was designed to increase reliability and elegantly handle failures.

Erlang has proven to be extremely effective at writing reliable software. Software reliability is typically measured in terms of availability percentage. “One nine” equates to 90% up-time. “Two nines” is 99% up-time. “Five nines” is 99.999% up-time and is reserved for highly reliable systems. 99.999% up-time means that over the course of a year a system is down for only 5.26 minutes. Even many of AWS’s systems don’t have that kind of up-time. Ericsson’s flagship Erlang product, the AXD301, is an asynchronous transfer mode (ATM) switch that handles encoding and routing data at up to 160 Gbit/s. The AXD301 runs on 2 million lines of Erlang code, and despite all this complexity it managed to achieve nine nines reliability. Nine nines of reliability means that over the course of a year there were only 31.6 milliseconds of downtime. It would take more than 31 years for you to get a full second of total downtime. AWS would kill for that level of reliability. In their 4 hour outage, S&P 500 companies lost a combined $150 million.
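The downtime figures above follow from simple arithmetic. Here is a small sketch, with a hypothetical module name, assuming an average year of 365.25 days (that assumption is what yields the 31.6 millisecond figure).

```erlang
-module(nines).
-export([downtime_seconds_per_year/1]).

%% Seconds in an average year of 365.25 days (an assumption).
-define(SECONDS_PER_YEAR, (365.25 * 24 * 60 * 60)).

%% Allowed downtime per year, in seconds, for a given number of nines.
%% N nines means an unavailable fraction of 10^-N.
downtime_seconds_per_year(Nines) when is_integer(Nines), Nines > 0 ->
    ?SECONDS_PER_YEAR * math:pow(10, -Nines).
```

nines:downtime_seconds_per_year(5) comes to roughly 315.6 seconds, or about 5.26 minutes, and nines:downtime_seconds_per_year(9) comes to roughly 0.0316 seconds.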

Another key aspect of the Erlang programming language not to be underestimated is that it runs on the BEAM virtual machine, also referred to as the Erlang Virtual Machine. Erlang code is compiled to BEAM byte code and runs on the BEAM VM. Without the BEAM VM there is no nine nines reliability. The BEAM virtual machine handles all the scheduling and memory management. It’ll spin up the proper number of scheduler threads (4 threads on a 4 core machine by default) and decide which processes are allowed to run at any given time. Erlang processes are not the same as processes running on your operating system. An Erlang process is extremely small; here’s an extreme example of a process that is just 236 bytes and took around 3 microseconds to spawn. To give you an idea of how small the processes are: when WhatsApp was handling 2.2 million connections on a single node, each connection had to map to at least a single process. Therefore, we can estimate that WhatsApp had more than 2 million processes running concurrently on that machine. Running top on my machine right now reveals about 228 processes, and my Chrome tabs are eating up 97% of my 8 GB of RAM. I don’t think I could safely run 2,000, much less 2,000,000, OS processes on my machine. Point being, BEAM/Erlang processes are fighting in the featherweight division.
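To get a feel for how cheap these processes are, here is a rough sketch with a hypothetical module name. It spawns N idle processes, times the spawning, and asks the VM how much memory one of them uses. The numbers it reports depend on the machine and the Erlang/OTP version, so treat them as ballpark figures and not as a benchmark.

```erlang
-module(spawn_many).
-export([run/1]).

%% Spawns N idle processes and returns a map with how many are alive,
%% the average spawn time in microseconds, and the memory (in bytes)
%% reported for one sample process.
run(N) when is_integer(N), N > 0 ->
    %% Each process just blocks until it is told to stop.
    Idle = fun() -> receive stop -> ok end end,
    {Micros, Pids} =
        timer:tc(fun() -> [spawn(Idle) || _ <- lists:seq(1, N)] end),
    {memory, Bytes} = erlang:process_info(hd(Pids), memory),
    Alive = length([P || P <- Pids, is_process_alive(P)]),
    %% Clean up so the processes do not linger after the measurement.
    lists:foreach(fun(P) -> P ! stop end, Pids),
    #{alive => Alive,
      micros_per_spawn => Micros / N,
      sample_process_bytes => Bytes}.
```

Something like spawn_many:run(10000) is a reasonable starting point. Much larger values can hit the VM’s configured process limit, which can be raised with the +P emulator flag.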

The BEAM VM also handles garbage collection. Once a process finishes and dies off it’ll garbage collect all the data associated with that process (see updates below). This makes for relatively efficient garbage collection, but it is possible to have problems. One example: large binaries are reference counted and shared between processes rather than copied, so such a binary won’t leave memory until every process holding a reference to it has let go of it. Long story short, like anything, it is possible to break Erlang.

Closing Thoughts

Erlang is an interesting language, but it’s not perfect for every problem. Erlang has a very specific set of soft real-time, highly concurrent problems that it is perfect for. I believe these types of highly concurrent problems are only going to get more common as the internet grows. Telecommunications seemed to be the only area that had to deal with massive scaling in the 80s/90s. Now, as IoT is growing and everyone has a supercomputer in their pocket, there are all sorts of problems that deal with a telecommunication level of scale. If you find Erlang interesting, it’s worth looking into Elixir. Elixir lowers the barriers to entry to Erlang. Even Joe Armstrong agrees that Elixir is a well designed solution to Erlang’s shortcomings.

One of the features of Erlang that I didn’t cover is live code reloading. Erlang allows you to hot swap code without restarting your program, but my limited research showed that it isn’t all it promises to be. It can be difficult to implement. However, the technology is there; it’s just a difficult problem. This makes me think a framework that brings hot swapping code to the masses will come eventually.

Disclaimer: I’ve never written any code in Erlang, but hopefully I did a sufficient job of doing my research; if I missed something let me know in the comments below.

Updates

April 12, 2017

I’d like to thank Gabriel Fortin for pointing out that my statement on garbage collection wasn’t 100% accurate. Here’s his correction:

The following is not fully accurate:
“Once a process finishes and dies off it [BEAM VM] will garbage collect all the data associated with that process.”
In fact, garbage collecting happens also while the process is still alive, every X number of “reductions”.

Interesting Links