My first Haskell program and how easily I optimized it from 33m to 17s.
Few weeks back, my colleague at work, Eyal Kalerdon, shared his implementation of ray-tracing-in-one-weekend in Rust, which inspired me to try this as well. I, however, chose Haskell for this, thinking I’ll not only learn ray tracing techniques but a new language as well. Haskell has been one of my favorite language for many years, as I often read about it but I never got any opportunity to actually write code in it.
Anyway, the implementation is pretty straightforward, as my only job was to port the C++ implementation from the article (which we both followed) to Haskell.
Parameters And Assumptions
In this article, I’ll execute a program called ray-tracers which is parameterized by 3 arguments, to be passed on the command line and all other parameters (like aspectRatio, the camera settings) are hardcoded in the program. The executable ray-tracers takes the following arguments:
$ ./ray-tracers <image-width> <samples-per-pixel> <rays-per-sample>
We’ll try different values of image-width, keeping the other two parameters fixed, viz. 100 and 50 respectively, in all the experiments. Please note that image-height is computed by the program as:
imageHeight = <image-width> / aspectRatiowhere, aspectRatio = 16.0/9.0 # hardcoded in the program
The First (Naive) Version
The first version, ray-tracers-0.1.0, was naive though in the sense that I consciously didn’t make any attempt to optimize any part of the code (though it is compiled with -O2), e.g Vec is implemented in terms of list simply because standard list operations can be used to implement other functionalities quickly and so on.
I made these design decisions to see how well the naive implementation performs and to get an idea of the changes required to make the program perform really really well. I also wanted to compare the readability of the performant code with the naive implementation. So let’s see how well it performs.
Kaboom! Turns out that our naive program hit the default stack size with image-width set to 100, and I found that GHC has an open bug which says that +RTS -K2M -RTS wont really increase the stack size to 2M. So we have to work within this limit. Let’s run it a couple of times with lower image widths.
As we see the pixels per second statistic stays roughly the same for different values of image-width. Based on that we can extrapolate that 384x216 image would take 33mins roughly which is too bad — even though it is compiled with -O2. So in rest of the article, we’ll see how this scenario improves. 🤞
Our goal is to optimize the code gradually, in step by step manner and see results at each stage — I’ll also share the commit so that you can see the diff at each stage. We’ll use 384 100 50 as command line arguments to our program when comparing performance. However, for inspection and other purposes, we might use smaller sizes so that the program runs quickly.
Machine and Environment
I’ll run all the tests on my MacBook Pro which has these configurations:
Apart from that, I use GHC 8.6.5 and cabal 126.96.36.199.
Use RTS to inspect the memory usage
One of the obvious reason why the program is slow is because it’s currently single-threaded. But it’s not very wise to make a program multi-threaded simply because single-threaded performance is poor. If the single-threaded program is consuming too much memory, then the multi-threaded one will consume even more memory in a shorter time span further degrading performance.
So I added
-rtsopts to ghc-options. It’ll enable us to control the behavior of the runtime system (RTS) when we run the program. So I rebuilt the program with
-rtsopts and ran it, passing
+RTS -s RTS as arguments to the runtime system. The remaining arguments were passed to our program. The option
-s will tell RTS to show some useful stats on the terminal itself as we can see below.
The most interesting info is this line: 1653 MB total memory in use. This is the actual memory consumed and used by our program. 1653 MB is too high for 40x22 size image, isn’t it? Apart from that, also note that more than 4s is spent by the GC — which is almost 20% of total time taken by the program (20s). That actually makes sense — the more memory you consume, the more work the GC will have to do to reclaim it..
Laziness can be a curse sometime: Thunks
Haskell is a non-strict language — most of the expressions are not evaluated right way. Instead the compiler creates thunks to represent the unevaluated expressions, as a result of which the composite expressions, such as
a + b * c — d, do not reduce to their values. Rather, they make graph like structures, and compose further to become even bigger graphs, consuming excessive memory and slowing down the program in the process. That explains 1653 MB memory usage figure we saw in the previous section.
Anyway, lazy computations are good if it is not known that the result will eventually be used. But if we know that a result will be used, then laziness could be a curse. In our case, it is!
So let’s force expressions to evaluate expressions fully, so that we don’t consume too much memory. To do that, I used deepseq library which provides many functionalities, one of which is aptly named as force . It is simple to use: just put
force in front of the expressions you want to evaluate.
These two functions are where we create almost all our thunks. This is good as we have a few places at which to make edits — basically 7
force and 3 more minor yet impactful additions, and then the result (for the same width) is:
Now the memory got reduced to 2M only! That is unbelievable! Also note the time taken by GC, and the productivity reached 97.8% which is very good. Apart from that there are little improvements in time elapsed and pixels per second as well.
In fact, now I can even render 384x216 image:
Almost 98% productivity, 8M memory, and it took 1422s (i.e 24m) which actually makes sense, as there is some improvement in pixels per second. So we improved from 33m (extrapolated) to 24m.
You might be wondering as to the actual size of the changes to achieve this improvement. Well, see for yourself in the link for the commit.
Parallelism and Threadscope
As the memory usage is under control, now we can think of using threads and parallelism. But before that lets enable tracing and see what threadscope has to say about our program behavior without threading. To enable tracing, add
-eventlog to ghc-options, rebuild and then pass
-l to RTS at runtime which should create ray-tracers.eventlog.
Now open the generated eventlog file with threadscope as:
$ threadscope ray-tracers.eventlog
which will show this:
It shows two green bars. As the label says, the thicker bar is about activity and the thiner is about HEC 0. So what exactly are they? Well, HEC stands for Haskell Execution Context and for our understanding we can just say 1 HEC means 1 core. I have 4 cores on my machine with hyper-threading enabled, so I should have 8 HECs, yet one HEC is working, which makes sense as it’s single-threaded. Next, activity is just cumulative effect of all cores. Our interest area here is : HEC.
So let’s add parallelism and for that, do this:
- Add this awesome parallel library.
- Rebuild and then run passing
-Nto the RTS. This options tells RTS to use all available cores (4 on machines, 8 HECs). You can control the number of cores to be used by the program, like
-N2which will use two cores. In my case,
475s, i.e almost 8m. That is a huge improvement:
- 33m → 24m → 8m 🎉
So how big was the change? Well, it is practically this:
That is it. See this commit yourself.
By the way, GC is taking almost same time which is good, but productivity has decreased from 98% to 92.5%. That, however, can be explained by possible lock/thread contention as it’s a multi-threaded program now. Let’s take a look at the eventlog in threadscope:
It looks good. All cores are busy doing work (green) whilst occasionally being occupied by GC waiting (light beige) and GC working (orange). The above isn’t visible enough. So select an area and zoom in repeatedly until we see this:
As we can see, the long light beige bars are GC waiting. Not sure if we can still improve that. But it is amazing to know a tool which gives us such a detailed view of the runtime.
Avoid List in Performance Critical Code Path
In Haskell, lists are actually linked-list — in other words, node-based data structures. Node-based data structures are usually slow because they’re cache-unfriendly — and thus should rather be avoided in tight loops. Instead, if we use record-based structure, like this:
then suddenly we see this:
Almost 93s, i.e 1.5m. That is a huge jump!
- 33m → 24m → 8m → 93s 🎉 🤓
Apart from that, productivity is back to 97.5% again. GC is taking 2 secs only!
The commit can be viewed at here.
Bang! Bang! Strictly!
Haskell allows us to enforce strictness right in the declaration of a type as well. So we don’t always have to rely on things like force, deepseq, and seq. Such strictness is declared with an exclamation mark
! — and it is called bang declaration.
With this, our
Vec now looks like this (view commit) :
and we got a little performance benefit with this (most of the other evaluations are enforced by
force itself tough).
So now the score is:
- 33m → 24m → 8m → 93s → 82s.
Accumulating Parameters & Tail Recursion
Yes, that is right. Accumulating parameters facilitates tail recursion. It does not always seem to work, but with a combination of other factors, it sometimes kicks in, providing a sudden performance boost. The compilers and optimizers can be that unpredictable.
Anyway, with this (see commit), we get this:
And the score is:
- 33m → 24m → 8m → 93s →82s → 82s (no change).
No change. But I hope it kicks in later on.
Now we seem to have reached a juncture where we do not see anything obvious which can be optimized further. I guess now profiling will help us to find hidden things which are still very expensive in terms of time, and possibly allocations. Our priority, however, is time.
In Haskell, profiling is enabled by various approaches:
-prof -fprof-autoto GHC (ghc-options), as per this.
--enable-profiling --enable-executable-profilingto cabal v2-build as per this.
I used the second approach, and then passed
-p to RTS at runtime as usual. I also chose an image-width of
20 , as profiling-enabled program is too slow and we’re not really interested in the performance as such, rather profiling data.
./ray-tracers +RTS -p -RTS 20 100 50 produced a file called ray-tracers.prof which contains very granular data for each function, and at the top, it shows the summary consisting of few most expensive functions. So I got this:
In our program, there are two implementations of the function hit — one is implemented for Sphere and other is for HittableList. In the above report, Sphere’s hit turned out to be most expensive. So if there is anything we need to optimize, it is this function. Now let’s search hit in the same file, to look at the details of this function:
Here the selected line shows the details: the function hit has been called 28853776 times (the fifth column from the right) which implies that even if we could improve this function by some fraction, the resulting impact on the overall program would be huge, intuitively speaking.
So I made the following improvements in the code (see commit).
- Use INLINE aggresively
- Use BANG pattern aggresively
- Use custom loop instead of foldl’
- Use Data.Array and tuple instead of list (which is expensive to traverse).
- Use random >= 1.2 which is significantly faster compared to random-1.1, as now StdGen (which we use) is based on SplitMix.
And couple of other improvements like making the code more type-safe and readable. With all these changes, we see this result:
And now the score became:
- 33m → 24m → 8m → 93s → 82s → 27s 😃
That is again a huge jump! And the productivity as per RTS is 98.9%!
Your Final Weapon : LLVM
At this point, it seems that we cannot do much, as we seem to have used up all the weapons at our disposal. But wait, we have something still left in the armory:
- LLVM and its various optimization flags, and possibly other GHC optimization flags.
With that change, our program’s performance is:
and our final score became:
- 33m → 24m → 8m → 93s →82s → 27s → less than 17s 😆
Also, the productivity is 98.8%, the memory usage is 7 MB and the GC takes 0.194s only. So with all these improvements, I released ray-tracers-0.2.0.
Haskell can perform really well despite being a very high level language. The implementation in Rust (1.43.1) produces the same image in 24s on my machine, and for that I don’t blame the language or the implementation as such. Eyal and I tried to find the issues and it seems that the data parallelism library called
rayon which Eyal used is not performing well enough and the
rayon developers are working on improving the scheduler due to many people having similar complaints with threads not being utilized well enough:
- see this issue: https://github.com/rayon-rs/rayon/pull/746
Once these performance issues are addressed and a new version of the crate is released, Eyal will probably update the code, possibly improving many things in the meanwhile, and if so, I’ll also provide the latest numbers on this blog. So please stay in touch!