Meeting Julia, a great new alternative for numerical programming — Part I benchmarking

Computers… How do they work? What do they want? How many unrolled loops and inlined functions can fit all those hard disks?

This is a two-part article presenting the Julia language. Because its promise of C-like performance is such a big issue around it, we’ll make this a two-part article. Here I’ll just talk about performance and present results from a small benchmark I did myself. Later we will discuss more aspects of the language.


How is it even possible? Meet the LLVM

Basically the high performance exhibited by Julia is made possible by the LLVM project. LLVM is a technology suite that can be used for building optimizing compilers, and is notably used in Clang, a C and C++ compiler. Clang is like any other C compiler such as GCC, icc, Turbo C or MSVC. All of these can exhibit great performance, provided all suitable and architecture-specific optimizations are performed during the production of the final machine-level code. And the same is true for any other comparable languages like Fortran, Pascal, Basic, Ada and Algol 68.

What makes LLVM peculiar is that while most compilers are only built with a single language implementation in mind, LLVM was created to have a flexible

LLVM has a badass logo

front-end. Adapting a regular compiler to make use of its low-level optimization and code generation capabilities to implement new language features, or potentially a completely different language, can be quite difficult. In fact, I cannot really cite a single example, it’s easier to see languages that output C language as an intermediate representation (e.g. Cython).

The LLVM changes that by offering a reusable back-end for creating compilers of potentially very different languages. Clang is just a different front-end for LLVM. This really changed the programming language landscape after 2003. And there was a demand. You can see there were many attempts to create new compilers or interpreters based on the LLVM for languages created before that time, Python and Lua being examples. But these after-thought ports don’t seem to have gained much traction, although Numba seems reasonably successful. The story is different with Swift and Rust, two other new languages that show the power of LLVM apart from Clang and Julia.

LLVM enables language designers in a way similar to what happened to the JVM in the past decade, supporting the creation of many new languages different from Java such as Scala, Clojure and more recently Kotlin, not to mention the same attempts to re-implement 90s languages (e.g. JRuby, Jython). There are important differences, though: the JVM is a lot more opinionated, and the LLVM is not even a virtual machine. The similarity lies mostly in the fact that a sort of intermediate representation (IR) is created in both cases, over which the compiler can perform all the optimizations we desire, either in just-in-time (JIT) or ahead-of-time (AOT) fashion.

Julia, created in 2012, is perhaps the first language designed with numerical processing in mind conceived in a post-LLVM world, and that is the reason it can be so good. Unlike Rust or Swift, or Python for that matter, it was conceived with numerical applications in mind. And Julia is also unlike Python, Matlab, R or anything from the 90s when most tools for optimizing low-level and IR code were trapped inside silo-projects, difficult to reuse. LLVM changed that, and we are now reaping the benefits.

My benchmark proposal

Let’s start with a word about the experiment problem I picked for this benchmark. When going for a test like this some people, maybe under the pressure of time or to make their code generally accessible, often end up trying over-simple things like the Fibonacci sequence. Or even a silly “Hello world”. These are not a very natural tests for numerical processing benchmarks in my opinion, and will tell you very little about the languages you are testing.

Of course the ideal benchmark is just your actual application, but a more proper general numerical benchmark for me is something more complicated like LULESH, or whatever. At the very least you should have some large numerical expressions. Really a lot of non-trivial arithmetic. This is what is going to challenge a Numpy vectorized approach, for instance, because it

Leonard Euler

conflicts with the technique of storing lots of intermediate results in memory instead of doing a single pass through the input data. You should also put in some nested function calls to test for function inlining. These are the two main things I had in mind here.

One way to create such a complex expression is to just pick up an approximation of a transcendental function to evaluate. Good old exp(x). This is just what I did, I went all the way through the code of math.h on glibc, and picked up a piece of code used to calculate the exponential function. The code below includes the expression, and is the full program I ran for my benchmark in C.

The code in Julia reads much more like a “script”, with very few keywords or type annotations, even though they can be strictly enforced. It also includes nice handy things like the @elapsed macro for measuring the time for the benchmark.

The Python code relied on the ipython magic %time for the time measurement. I just didn’t find anything else to be as good, but it is clearly bad not being something from the real language… It kind of illustrates Python’s limitations here. The big expression can take a single float, but also works with Numpy arrays, much in the spirit of how you are supposed to use that library. Notice the Numba version is pretty much the same from this, but it has a lot of @jit decorators everywhere.

And finally we have the Scala code. Notice the clear use of map and reduce at lines 40 and 43! A thing of beauty.

To recap, what we are doing is calculating an approximation of the exponential function, then comparing it to the standard library exp, for many numbers between 0.4 and 1.0, where this approximation is valid, and then keeping the largest absolute error. We test from as few as 10³ points all the way to 10⁹. The resulting error in practice is something negligible, around 1e-14.

Results

Now let’s just jump into what you’re all really interested in seeing: the benchmark results! This graph shows the speed (quantity-per-time of input points calculated) for different input lengths and language settings. Here we just want to illustrate how weird cases compare to the core of results with reasonable performance, and also how some methods cannot actually cope with large inputs because at some point they produced large concrete arrays storing intermediate values unnecessarily. These are the ones that stop at 10⁷, except for the pure Python one where I just stopped because I was tired of waiting.

Benchmark performance for different languages and input size. In some cases, the subject could not even handle a very large input. Native Python shows a drastically sluggish performance at the bottom and a buggy C code that did not really calculate anything at the very top.

Now this bar chart finally shows a direct comparison of the cases that matter the most. We can see that Julia really attained a performance on par with C, with a small, daresay negligible three-percent advantage to the Jurassic contender.

Relative speed of different languages, settings and implementations.

Discussion

The first thing to notice from these results is the great performance exhibited by Julia. It really is true it can attain a performance comparable to C, which seems to remain the gold standard. But there are many other interesting facts to take away from this experiment.

I tried two different approaches for the loop in Julia, one an explicit for-loop, and the other using a more functional fold-left. The functional approach was very close to the procedural, what in one way is nice to see if you are enthusiastic about functional programming. On the other hand, it is still disappointing: there is not good reason the two approaches cannot exhibit the same performance. But I would bet some lesser-priority optimization patch in the future can still deliver this to us.Or maybe I missed some setting.

For Scala the situation was worse to FP enthusiasts: first of all you need an explicit iterator to prevent the map over the range to produce a concrete vector in memory. And then a while-loop was still much better than the map-reduce chaining. But in the end it was amazing to see what a great performance Scala could deliver. The JVM really is capable of producing high-performing machine code, as benchmarks demonstrate again and again. And we even used the boxed Double type here, I was actually afraid at first that this could have dire consequences.

If you disregard the fast cases, including the three three last bars, what we see is that Scala was actually the clear winner, followed by Julia, then Python with Numba, and only then we see C again. What is going on here?

These faster cases all required the use of a fast math option. This enables the compiler to utilize special instructions such as vfnmadd213sd from the the FMA3 family, a fused multiply-and-add. These are sort of SIMD instructions, not exactly like the obviously parallel instructions from the SSE family, and they actually may produce results with a few bits of precision less than the more strict slow math is supposed to. That is why it must be enabled explicitly with a decorator in Julia, and the -Ofast flag in GCC and a decorator argument in Numba.

Our benchmark C code disassembled by dissy, showing some FMA instructions.

Scala is probably not using SSE or FMA instructions, and this probably requires some special flag to be turned on in the JVM, assuming the option is even available. I unfortunately do not have the knowledge about this, and I am focusing in Julia at the moment, so I did not investigate it further. But it may be possible that just passing a -Xfastmath flag to the the JVM might actually make Scala jump closer to the podium here. This made a huge difference for all other languages, after all.

It is interesting to see how the C versions actually exhibited a wide variety of performances. The -O0 code was pretty bad, even losing to the second best Scala version. This makes sense because the C compiler is by default very conservative about optimizations, while the JVM has more aggressive default settings. Using -O2 brings the code up to speed, and then one single extra option, -finline-functions made a huge difference. I tried full -O3 too, but it did not improve much further from that. Only moving to -Ofast -march=native -funroll-loops really took the next step, supposedly allowing for optimizations that were already in place in the non-fast Numba, Julia and Scala versions.

For Python, it is clear that native code just doesn’t cut it. Numpy helps out, but you need something like Numba to achieve top performance, although still lagging a bit behind alternatives. If you really need to stick to Python, Numba seems like a good option.

Conclusion

When I was a young programmer, it used to be the case everybody thought compilers were not good enough, and you really needed to get into assembly language to produce high-performing programs. Architectures and compilers became more complex since then, and trusting a good optimizing compiler tends to be a better option to achieve high performance, apart from being aware of things like fast math, SIMD parallelism and caching, not to mention higher-level parallelism like multi-threading.

Welcome to Julia!

These special optimizations lurked within custom compilers for a while, constrained to the tightly coupled implementations of specific languages from GCC, icc or the Java compiler. Things have changed in the previous decade, though, and projects like the LLVM and the JVM itself are enabling new languages to be implemented as a font-end over a highly efficient compiler back-end optimizers, machine-code generators and virtual machines.

Some great examples of languages and compilers created with this front-end approach include Clang, Swift, Rust, Scala (on the JVM and LLVM), Clojure and even things like Iron Python and Elixir (on the Erlang VM). None such languages ever directly targeted numerical processing and scientific programming, though. Julia is the first to fit this niche.

Julia is an LLVM based language developed with high-performance in mind, and it really seems to deliver. We will discuss high-level aspects of the language in a follow-up article. If you are just interested in knowing whether it is true that there is a new language around that can deliver great performance for numerical processing, all I can say is that I looked into it, I tried it, and I am convinced of that. If you are interested in checking out what a modern language can do for scientific computing, I can only recommend you try out Julia today!

You can even just play around with a notebook online , how great is that?

The Julia set in the Julia language