Erlang : Pi2 ARM cluster vs Xeon VM
A low-cost, energy-efficient ARM cluster prototype shows it could be a viable alternative to a more traditional platform for a real-world Erlang/OTP server application.
After this introduction come a description of the platforms used, a short word on the application used to compare them, the resulting graphs and observations, and a conclusion.
The release by AMD of a 64-bit ARMv8 CPU, the Opteron A1100, piqued my interest. Hardware platforms have become so powerful and costly that they need to be virtualized on multiple levels (computing, storage, network, …), each one an added layer of complexity, in order to be used to the full. This makes the idea of using simple, low-cost and energy-efficient hardware particularly refreshing.
With the right software and tools to handle load-balancing, fail-over and other challenges of distributed computing, that idea has become a perfectly working prototype.
While development boards using the Opteron A1100 are already rolling out, I settled on the Raspberry Pi 2 (model B) as a cost-effective test platform. It runs the server OS (Debian) perfectly. And considering that its 32-bit ARMv7 CPU (the same one as in my smartphone) is an earlier and slower architecture which doesn’t even need any cooling, it fared pretty well.
I should note that the application’s responsiveness was absolutely satisfactory on the Raspberry Pi cluster for regular use, despite the platform’s weaknesses. Given how the test turned out, I’m almost convinced it could handle our current production workload without much trouble.
Raspberry Pi 2 Cluster
Single Raspberry Pi 2 specifications :
- CPU : quad-core Cortex-A7 @ 950 MHz (slightly over-clocked), 1 GB RAM
- Storage : 32GB Samsung EVO+ MicroSD
- Cost : ~30€
- Power consumption : 3.33 W idle, 4.8 W under load
- 100 Mbps Ethernet*
The whole cluster (4 Pi2’s : 1 Apache load balancing proxy, 3 for application cluster) :
- Cost : ~200 €
- Power consumption : 13 W idle, 18 W under load
*The Apache proxy, which has 2 Ethernet ports, also acts as DHCP, DNS, and router (NAT) for the other cluster members, the cluster running its own internal network.
DevX is the application development server. It runs in a VMware virtualized HP BladeSystem environment with StoreVirtual iSCSI SAN storage in an air-conditioned room, and benefits from multiple 24/7 hardware support contracts. Specifications :
- VM : 2 × 2.4 GHz cores (Xeon E5620), 3 GB RAM
- Storage : 50 GB iSCSI
- Server blade power consumption* : 250W (not included: storage and air conditioning)
- Server blade cost* : around 5000€ (not included: support contracts, storage, blade enclosure and air conditioning). RCP of the CPU alone : $391.00
*The server cost and power consumption should be divided by the number of VMs running on the hardware (6).
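To make the footnote concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. Splitting the blade's cost and power evenly across its 6 VMs is an assumption made for illustration; storage, enclosure, support contracts and air conditioning are left out on the DevX side, exactly as in the specifications.

```python
# Figures quoted in the article. The even split across VMs is an assumption.
BLADE_COST_EUR = 5000      # server blade only
BLADE_POWER_W = 250        # server blade only
VMS_PER_BLADE = 6

CLUSTER_COST_EUR = 200     # 4 Raspberry Pi 2 boards
CLUSTER_POWER_LOAD_W = 18  # whole cluster under load


def per_vm_share(total: float, vm_count: int) -> float:
    """Split a blade-level figure evenly across the VMs it hosts."""
    return total / vm_count


vm_cost = per_vm_share(BLADE_COST_EUR, VMS_PER_BLADE)
vm_power = per_vm_share(BLADE_POWER_W, VMS_PER_BLADE)

print(f"DevX share of blade cost  : {vm_cost:.0f} EUR")
print(f"DevX share of blade power : {vm_power:.1f} W")
print(f"Cost ratio  (DevX / cluster) : {vm_cost / CLUSTER_COST_EUR:.1f}x")
print(f"Power ratio (DevX / cluster) : {vm_power / CLUSTER_POWER_LOAD_W:.1f}x")
```

Even with this favourable accounting for DevX, the single VM's share of the blade remains several times the cost of the whole cluster.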
The test application
The application used in this comparison is a real-world server application : the e-justice platform of the Belgian Supreme Administrative Court. It’s based on Erlang/OTP, and can run either in single-agent mode or in distributed multiple-instance mode. The test scenario uses standard operations offered by the application, but is not representative of how a real user would act (it is much faster). The scenario is also heavily dependent on the database (Mnesia), which is why a simple compute test has been added.
Apart from recompilation, the application (including its database) hasn’t been modified in the slightest to run on the ARM instruction set or on a resource-limited hardware platform. Most of the faults the stress-testing brought to light (such as the application drowning in high numbers of concurrent requests) have already been corrected, but that’s a subject for another article.
Particular considerations about both environments
PI2C (the Raspberry Pi 2 cluster) :
+ Application runs in a single Erlang node on each cluster member;
- Very slow storage. Unforgiving random read & write benchmark : 1,819 and 855 kb/s respectively;
- Slower networking. The DB, Mnesia, also has to synchronize itself between cluster members, so under overload it builds up transaction backlogs, which the slow storage aggravates.
DevX :
+ Native environment on which the application was developed;
+ Fast storage. Unforgiving random read & write benchmark : 27,577 and 11,062 kb/s respectively;
+ Single agent : no synchronization with other instances, no networking;
- Application is divided across 2 Erlang nodes (mochiweb server and application core), which adds some latency.
Comparative performance tests
Testing methodology and concepts :
Cycle : a single cycle involves several serial operations such as object creation, browsing, modification and deletion. It reflects what a human user might do, but at a much greater speed.
Agent : An agent executes cycles. Once a cycle has successfully been executed, a new cycle is started. Multiple agents can be executed concurrently. I started each test with 1 agent, doubling the count until reaching 128 or 256, depending on timeout occurrence.
The Python multi-mechanize performance test framework was used to run the tests, generating HTTP queries directed at the target environment’s web-facing Apache proxy. A full test lasts 120 seconds, and starts all its concurrent agents (if any) within the first 10 seconds (this probably favoured DevX a little).
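The test schedule described above can be sketched in a few lines. This is not the multi-mechanize configuration actually used : the function names are hypothetical, and spreading agent start times evenly over the 10-second ramp-up window is an assumption.

```python
from typing import List

RUN_TIME_S = 120.0   # duration of one full test (from the article)
RAMPUP_S = 10.0      # window in which all agents are started (from the article)


def agent_counts(limit: int = 256) -> List[int]:
    """Agent counts for successive test runs: 1, 2, 4, ... up to `limit`."""
    counts = []
    n = 1
    while n <= limit:
        counts.append(n)
        n *= 2
    return counts


def rampup_offsets(agents: int, rampup_s: float = RAMPUP_S) -> List[float]:
    """Start time (in seconds) of each agent within the ramp-up window.

    Even spacing is an assumption made for this sketch.
    """
    step = rampup_s / agents
    return [i * step for i in range(agents)]


for count in agent_counts():
    last_start = rampup_offsets(count)[-1]
    # The last agent to start has the least time left to run cycles.
    print(f"{count:3d} agents, last one starts at {last_start:5.2f} s, "
          f"leaving it {RUN_TIME_S - last_start:6.2f} s of test time")
```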
Performance comparison between PI2C and DevX
This is a direct performance comparison between both environments, PI2C (the Raspberry Pi 2 cluster) and DevX, where each one shows its strength : pure CPU speed for DevX under light load and load balancing for PI2C under heavy load.
- When there’s only a single active agent, DevX’s fast CPU has a clear advantage. The task cannot be split between cluster members, so only 1 out of PI2C’s 3 CPUs is working at any given time;
- With only 2 cores, DevX starts plateauing (100% load) at 4 concurrent agents, while PI2C, with its 12 cores, only plateaus at 32, steadily improving its performance on the way and achieving more than 70 % of DevX’s performance under full load;
- At 128 concurrent agents, both environments start experiencing timeouts : queries take too long to execute and are discarded;
- At 256 concurrent agents scores start to be negative, which means even the first cycle of certain agents failed to finish (this provokes a negative score due to peculiarities with the testing framework).
This shows how PI2C approached the performance levels of DevX : the number of cycles each agent was able to perform remains stable for longer than on DevX.
- From 4 concurrent agents onwards, DevX experiences a steady decline in successful cycles, halving its score at each doubling of concurrent agents;
- PI2C shows a much more gradual decline in cycle counts, which becomes linear at about 16 concurrent agents.
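The halving at each doubling is what a simple saturation model predicts : once the platform is fully loaded, total throughput stays flat and is shared among the agents. The sketch below illustrates that model only. The function name and the figure of 100 cycles for a lone agent are hypothetical, and only DevX's saturation point of 4 agents comes from the observations above.

```python
def cycles_per_agent(agents: int, saturation_agents: int,
                     cycles_single: float) -> float:
    """Cycles completed by each agent under a simple saturation model.

    Below the saturation point every agent runs at full speed; above it the
    total throughput is capped and shared equally among the agents.
    """
    if agents <= saturation_agents:
        return cycles_single
    return cycles_single * saturation_agents / agents


# Hypothetical score of 100 cycles for one agent, saturation at 4 agents.
for agents in (1, 2, 4, 8, 16, 32):
    print(f"{agents:2d} agents : {cycles_per_agent(agents, 4, 100.0):6.1f} cycles each")
```

A real platform degrades less abruptly than this model, which is why PI2C's curve only becomes linear well after its decline has started.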
PI2C : Influence of cluster size
For this test only the PI2C cluster was used. Each line is the same environment but with fewer cluster members, the other ones being deactivated.
- The fewer the cluster members, the faster the application initially is, because fewer database replication and synchronization operations are needed;
- Once past a certain load threshold, the advantage of load balancing kicks in, as the workload is distributed among more cluster members. In other words : the application can scale;
- The increase in performance is not linear : it does not double when the number of cluster members doubles, because of the performance costs of database distribution and replication. This is a known behaviour of Mnesia, and just has to be taken into account (just like the WhatsApp folks did).
Pure compute test
In order to show some pure computing performance, I added a simple Mandelbrot fractal generating function. A cycle becomes a single calculation, liberating PI2C from its slow storage medium and allowing it to unleash its raw, unrestrained, maddening processing power.
Multiple environment configurations are shown here. PI2C is shown with none or some cluster members deactivated, and there’s a DevX on steroids with double the core count.
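For readers unfamiliar with this kind of workload, here is a minimal escape-time Mandelbrot sketch. It is written in Python for illustration and is not the Erlang function used in the test; the function names, the grid bounds and the iteration limit are all assumptions.

```python
from typing import List


def escape_iterations(c: complex, max_iter: int = 100) -> int:
    """Number of iterations of z -> z*z + c before |z| exceeds 2.

    Returns max_iter when the point never escapes within the limit,
    i.e. when it is treated as belonging to the set.
    """
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def mandelbrot_grid(width: int, height: int, max_iter: int = 100) -> List[List[int]]:
    """Escape counts over the region [-2, 1] x [-1, 1] (width, height >= 2)."""
    rows = []
    for y in range(height):
        im = -1.0 + 2.0 * y / (height - 1)
        row = []
        for x in range(width):
            re = -2.0 + 3.0 * x / (width - 1)
            row.append(escape_iterations(complex(re, im), max_iter))
        rows.append(row)
    return rows


# Crude ASCII rendering: '#' marks points that never escaped.
for row in mandelbrot_grid(60, 21, max_iter=50):
    print("".join("#" if n == 50 else "." for n in row))
```

The work is pure CPU with no storage or database access, and each point is independent of the others, which is what makes it easy to spread over cores.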
- The linear performance increase shows nicely how the load is shared among the cluster members and cores (which is expected, this is just a simple calculation);
- PI2C outperforms DevX once 32 concurrent agents are active. To be sure this wasn’t due to some unknown bottleneck (testing PC, network, proxy, …), the same test was run on DevX using 4 cores, effectively doubling its score : no bottleneck;
- Notice the better performance of the 4-core DevX used by a single agent versus the 2-core DevX. 4 cores don’t make DevX work faster, they are still at 2.4 GHz : the difference is just Erlang and the application spreading the load of a single query over more cores (parallel programming). This makes 8+ core ARM CPUs even more interesting;
- PI2C can only spread the tasks of a single query over a single cluster member’s CPU. That’s why each different configuration has virtually the same score when testing with only 1 agent. The same is true for two agents and both configurations with at least 2 members;
- The almost perfect linear increase in performance along with core count or cluster members shows how scalable the Erlang VM is (nothing new here).
Mnesia transaction modes test
This is more of interest to Erlang users. Mnesia, by way of mnesia:activity/2, can be accessed in 4 different ways : with or without transactions, synchronous or asynchronous. Each method has its benefits, basically a performance vs consistency trade-off. PI2C experienced some errors in asynchronous modes, caused by the transaction backlog growing so large that it failed to keep consistency (as is to be expected).
The difference between synchronous and asynchronous modes is clearly irrelevant for DevX (no cluster members to synchronize with), while PI2C exhibits the expected behaviour of performance vs. reliability trade-offs. Without the added weight of transactions and synchronicity, in light-blue, PI2C is able to almost reach DevX’s level of performance (before going down in flames).
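As a quick reference, the four combinations can be laid out as a lookup table. The sketch below is Python, not Erlang, and the helper names are hypothetical. The four atoms are Mnesia's documented access contexts, but pairing them with the article's two axes in this way is my reading of the text.

```python
from typing import NamedTuple


class AccessMode(NamedTuple):
    transactional: bool
    synchronous: bool


# First argument of mnesia:activity/2 for each combination (assumed mapping).
ACCESS_CONTEXTS = {
    AccessMode(transactional=True, synchronous=False): "transaction",
    AccessMode(transactional=True, synchronous=True): "sync_transaction",
    AccessMode(transactional=False, synchronous=False): "async_dirty",
    AccessMode(transactional=False, synchronous=True): "sync_dirty",
}


def access_context(transactional: bool, synchronous: bool) -> str:
    """Name of the Mnesia access context for the given trade-off."""
    return ACCESS_CONTEXTS[AccessMode(transactional, synchronous)]


for mode, atom in ACCESS_CONTEXTS.items():
    # Print the Erlang call shape; Fun stands for the database operation.
    print(f"transactions={mode.transactional!s:5} sync={mode.synchronous!s:5} "
          f"-> mnesia:activity({atom}, Fun)")
```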
- The only noticeable difference for DevX is between « dirty » operations (without transactions) and transactions (with table locks, rollbacks in case of failures, …);
- PI2C DB consistency error rates reached 1-in-100 in async_dirty and 1-in-500 in asynchronous transaction modes (none in synchronized modes, as expected);
- In a cluster there’s a performance hit when synchronicity is enforced, although that difference mostly disappears under heavy load, where synchronicity helps avoid remote node database overloads and guarantees consistency, which avoids useless operations.
With faster storage, the PI2C would certainly have a fighting chance if put against a “traditional” virtualized server, for a fraction of the cost and energy usage. I failed to mention it earlier, but the test application has some serial bottlenecks which cannot be parallelized, such as interactions with a Prolog socket server, and these must have had a negative impact on the cluster’s performance.
The prototype has been able to demonstrate the performance and even the viability of a very modest ARM cluster. I’m quite confident more modern and powerful ARM chips (they keep on coming), with serious storage and networking capabilities, can justify their presence in a server room, introducing a new era of low-cost and energy-efficient server rooms which, depending on where you live, don’t even need to be cooled any more.
PS : For those who’d say “Just go cloud” : we opted for not storing non-public data (these are court proceedings, remember ?) in the cloud.