A Beginner’s Guide To Optimizing Julia Code
How To Find And Fix Performance Bottlenecks
In this article, I will share some tips to help speed up Julia code. The goal is to help newcomers to the language become familiar with the common tools and techniques they can use to optimize their Julia code.
This article is a follow-up to a previous article where I posed the following problem:
I presented two Julia-based solutions. The first version could process a 6 million row data frame in 14 seconds whereas the second version could do the same in only 218 milliseconds.
This article will explain the steps I went through to optimize the code and the intermediate results I got along the way.
Solution 1: The Naive Solution
Here’s the first attempt at a solution:
We create a variable called `results` to store the counts of all possible pairs of values between columns 1 and 2 before iterating over the data frame to update the values in it (line 5). This takes 11 seconds.
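The original code is only shown as a figure, so here is a minimal sketch of the naive approach. The column names `c1`, `c2`, `c3` and the exact schema are assumptions (the text only tells us columns 1 and 2 hold letters and that a value is accumulated per pair), and `inc!` is used in place of the figure's `results[key] = results[key] + val`:

```julia
using DataFrames, DataStructures

# Hypothetical setup: an uppercase letter in column 1, a lowercase letter
# in column 2, and a number in column 3 accumulated per (c1, c2) pair.
df = DataFrame(c1 = rand('A':'Z', 1_000),
               c2 = rand('a':'z', 1_000),
               c3 = rand(1:10, 1_000))

results = Accumulator{Tuple{Char,Char},Int}()
for (c1, c2, c3) in eachrow(df)
    inc!(results, (c1, c2), c3)  # add c3 to the count for this pair
end
```

An `Accumulator` behaves like a dictionary whose missing keys read as zero, which is why no initialization is needed here.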
The important things to note about the code in Figure 2 are:
- We can time code using the `@btime` macro from the BenchmarkTools package.
- We put a `$` in front of `df` because we don’t want the evaluation of this expression to be part of the code being timed.
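The timing call might look like this (a sketch, reusing the assumed schema from earlier; only the `$df` interpolation is confirmed by the text):

```julia
using BenchmarkTools, DataFrames, DataStructures

df = DataFrame(c1 = rand('A':'Z', 1_000),
               c2 = rand('a':'z', 1_000),
               c3 = rand(1:10, 1_000))

# $df interpolates the value of df into the benchmarked expression, so
# looking up the global variable is not part of what gets timed.
@btime begin
    results = Accumulator{Tuple{Char,Char},Int}()
    for (c1, c2, c3) in eachrow($df)
        inc!(results, (c1, c2), c3)
    end
end
```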
What can we do to improve the execution time? The easiest thing to do is to refer to the Performance Tips page in Julia’s manual. This page lists many techniques to make Julia code run as fast as possible.
Solution 2: Using Performance Tips
Avoid global variables
The `results` variable is a global variable from the perspective of the code we are timing. To fix this, we can encapsulate all the machinery related to our logic into a function that takes only the data frame as an argument.
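A sketch of this refactoring, with the same assumed schema and function name `f` (the name the article uses later):

```julia
using DataFrames, DataStructures

# All the counting machinery now lives inside a function, so `results`
# is a local variable rather than a global.
function f(df)
    results = Accumulator{Tuple{Char,Char},Int}()
    for (c1, c2, c3) in eachrow(df)
        inc!(results, (c1, c2), c3)
    end
    return results
end
```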
This gives the following results:
There’s hardly any difference but at least the code is better organized. Let’s try the next tip.
In this case, we know that the values in columns 1 and 2 will only range between ‘A’ to ‘Z’ and ‘a’ to ‘z’ respectively. Therefore, we can create the keys in `results` and initialize them to 0 up front instead of creating them on the fly.
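A sketch of the pre-allocation; the key space is small (26 × 26 pairs), so creating every key up front is cheap:

```julia
using DataFrames, DataStructures

function f(df)
    results = Accumulator{Tuple{Char,Char},Int}()
    # Pre-create every possible key so none are added while iterating;
    # inc! with 0 inserts the key with a zero count.
    for c1 in 'A':'Z', c2 in 'a':'z'
        inc!(results, (c1, c2), 0)
    end
    for (c1, c2, c3) in eachrow(df)
        inc!(results, (c1, c2), c3)
    end
    return results
end
```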
Here are the results:
The pre-allocation is done in lines 4 to 7 in Figure 4. Unfortunately, this does not result in any significant improvement.
Looks like we have no choice but to whip out a profiler and dive into Julia’s internals.
Solution 3: Use A Profiler
The Juno IDE comes with a profiler that makes it really easy to profile Julia code. All you need to do is call `Juno.@profiler` on the piece of code you’d like to profile, and Juno will produce an interactive flame graph for you to analyze.
For example, running the following script inside a Julia REPL in Juno:
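A sketch of such a script (the data sizes and names are assumptions). `Juno.@profiler` only works inside Juno, so the standard-library `Profile` equivalent is shown for running elsewhere:

```julia
using DataFrames, DataStructures, Profile

function f(df)
    results = Accumulator{Tuple{Char,Char},Int}()
    for (c1, c2, c3) in eachrow(df)
        inc!(results, (c1, c2), c3)
    end
    return results
end

df = DataFrame(c1 = rand('A':'Z', 1_000_000),
               c2 = rand('a':'z', 1_000_000),
               c3 = rand(1:10, 1_000_000))

f(df)            # run once first so compilation time is not profiled
# Inside Juno you would run:  Juno.@profiler f(df)
@profile f(df)   # stdlib equivalent; produces a text report instead
Profile.print(maxdepth = 12)
```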
I encourage the reader to read the Profiling section in the Julia manual to understand how the graph in Figure 6 is constructed.
For the purpose of this article, all you need to know about Figure 6 is:
- Each rectangle represents a line of code
- Hovering over a rectangle will display the details of the code in the top part of the Profiler pane, as illustrated in the annotation
- Clicking on a rectangle allows you to jump to the corresponding code
- The size of each rectangle is proportional to the time spent executing the corresponding line of code
Figure 6 calls out the part of `f` that takes the longest to execute. This is line 21 in the `scratch.jl` script, which corresponds to the line `for (c1, c2, c3) in eachrow(data)`. Note that the rectangles preceding this rectangle, i.e. the ones above it, are not related to `f`, so they can be ignored.
The results in Figure 6 suggest that iterating over the rows of a data frame is expensive. So let’s try rewriting `f` without using data frames:
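One way to do this (a sketch; the article’s exact rewrite is only shown in a figure) is to pull the columns out of the data frame and iterate over a `zip` of them, i.e. an iterator of tuples:

```julia
using DataFrames, DataStructures

function f(data)
    results = Accumulator{Tuple{Char,Char},Int}()
    for (c1, c2, c3) in data   # plain tuples now, not DataFrameRows
        inc!(results, (c1, c2), c3)
    end
    return results
end

df = DataFrame(c1 = rand('A':'Z', 1_000),
               c2 = rand('a':'z', 1_000),
               c3 = rand(1:10, 1_000))

data = zip(df.c1, df.c2, df.c3)   # lazy iterator of (c1, c2, c3) tuples
f(data)
```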
The execution time has dropped from around 11 seconds to just 188 milliseconds!
Can we do better? Well, let’s have a look at the flame graph of this version of `f`:
Notice that the size of the rectangle corresponding to line 21 is insignificant compared to Figure 6. The line that takes the most time to execute is now line 23, which updates the values stored in `results`: `results[key] = results[key] + val`.
This section has shown that using simpler data structures, e.g. switching from data frames to iterators of tuples, can give a significant speed boost. So let’s try switching the type of `results` from an `Accumulator` to something simpler.
What could be simpler than a one-dimensional array?
Solution 4: Simpler Data Structure
The benefit of the `Accumulator` type is that it allows us to use the pair of values from columns 1 and 2 as the key under which to store their count. We lose this flexibility with an array since array indices must be integers.
One way to overcome this limitation is to encode each pair of values as a 16-bit integer: the first byte is the value from column 1 and the second byte is the value from column 2. This works because any letter can be expressed as a sequence of 8 bits.
For example, to encode the value of the pair ‘A’ and ‘b’, we note that the 8-bit value of ‘A’ is 0x41 and for ‘b’ it is 0x62. Note that the 0x prefix denotes a number in hexadecimal form.
Therefore, the index corresponding to the pair ‘A’ and ‘b’ will be 0x4162. Here’s the code to implement this encoding scheme:
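A sketch of the encoding (the function name `encode` is an assumption; the article’s code is only shown in a figure):

```julia
# Pack two 8-bit characters into one 16-bit integer: column 1's value
# goes in the high byte, column 2's value in the low byte.
encode(c1::Char, c2::Char) = (UInt16(c1) << 8) | UInt16(c2)

encode('A', 'b')   # 'A' is 0x41 and 'b' is 0x62, giving 0x4162
```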
And here’s how to use it in `f`:
Since we want to maximize speed, let’s sprinkle in some Performance Annotations. In particular, we will use the `@inbounds` macro to tell Julia to skip array bounds checking in line 6:
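A sketch of the array-based version with the annotation applied (the exact loop layout is assumed; in the article’s figure, the annotated update sits on line 6):

```julia
# results is now a plain Vector{Int} indexed by the 16-bit encoding of
# each pair; typemax(UInt16) slots cover every possible key.
function f(data)
    results = zeros(Int, typemax(UInt16))
    for (c1, c2, c3) in data
        key = (UInt16(c1) << 8) | UInt16(c2)
        @inbounds results[key] += c3   # skip bounds checking on this access
    end
    return results
end
```

`@inbounds` is safe here only because every key the loop can produce is guaranteed to be a valid index; if that invariant is broken, it causes undefined behavior rather than a helpful error.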
With the `@inbounds` macro, the time is 37.476 ms.
Let’s do another round of optimization.
Here’s the flame graph for the latest version of `f`:
The conclusion from Figure 13 is the same as the one in Figure 8: the line that updates the values in `results` takes the most time relative to the other lines in `f`. Since `results` is an array, a line like `results[key] = results[key] + val` should only involve calling the `getindex`, `setindex!`, and `+` functions. However, Figure 13 reveals that there are a few calls to functions in the `iterators.jl` script before the call to `getindex`. This seems unnecessary.
Solution 5: No Iterators
The only iterator left in `f` is `data`. Let’s convert this to an array and see what happens:
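A sketch: `collect` the lazy zip into a plain `Vector` of tuples before passing it to `f` (same assumed schema as before):

```julia
using DataFrames

function f(data)
    results = zeros(Int, typemax(UInt16))
    for (c1, c2, c3) in data
        key = (UInt16(c1) << 8) | UInt16(c2)
        @inbounds results[key] += c3
    end
    return results
end

df = DataFrame(c1 = rand('A':'Z', 1_000),
               c2 = rand('a':'z', 1_000),
               c3 = rand(1:10, 1_000))

# collect materializes the lazy zip iterator into a contiguous array of
# tuples, so the hot loop indexes an ordinary Vector.
data = collect(zip(df.c1, df.c2, df.c3))
f(data)
```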
This solution is faster than the previous solution by about 7 ms.
This is what its flame graph looks like:
Notice that the call to `getindex` happens immediately after the line that updates `results` (line 6 in Figure 14).
The solutions in the preceding sections were based on analyzing the output of a profiler to identify bottlenecks and devising ways to circumvent them. This section presents two other approaches you can consider to optimize your code.
Rephrasing The Problem
It can be helpful to take a step back and think about the problem from a different perspective. This could yield more efficient ways of solving the problem.
In this case, the problem can be expressed in terms of a group by and aggregate operation. This can be accomplished in Julia with just one line:
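With DataFrames.jl, one way to write that one-liner looks like this (a sketch; the column names `c1`, `c2`, `c3` and summing column 3 as the aggregate are assumptions):

```julia
using DataFrames

df = DataFrame(c1 = rand('A':'Z', 1_000),
               c2 = rand('a':'z', 1_000),
               c3 = rand(1:10, 1_000))

# Group by the two letter columns, then sum column 3 within each group.
res = combine(groupby(df, [:c1, :c2]), :c3 => sum)
```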
It’s 10 times slower than the best solution from the previous section, but it is certainly a lot better than the naive solution in terms of both speed and clarity.
Most modern computers come with multiple CPUs/cores, so it can be tempting to skip the analyzing/thinking approach and just rewrite the code to exploit the spare resources.
Julia (since version 1.3.0) has convenient facilities that make it very easy to parallelize code using multi-threading. It’s called task parallelism, and you may refer to Announcing composable multi-threaded parallelism in Julia (Bezanson, Nash and Pamnany, 2019) for more details.
Here’s an attempt at using it to solve our problem:
The code in Figure 17 splits the data into `n` chunks. Each chunk is passed to a separate thread for counting in parallel, and the results from each thread are merged sequentially.
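A sketch of this task-parallel scheme (the chunking approach and the name `f_threaded` are assumptions, not the article’s exact code; Julia must be started with multiple threads, e.g. `JULIA_NUM_THREADS=4`):

```julia
using Base.Threads

function f_threaded(data, n)
    # Split data into n roughly equal chunks.
    chunks = Iterators.partition(data, cld(length(data), n))
    # Count each chunk on its own task; Threads.@spawn schedules the
    # task on one of the available threads.
    tasks = [Threads.@spawn begin
                 results = zeros(Int, typemax(UInt16))
                 for (c1, c2, c3) in chunk
                     key = (UInt16(c1) << 8) | UInt16(c2)
                     @inbounds results[key] += c3
                 end
                 results
             end
             for chunk in chunks]
    # Merge the per-task count arrays sequentially.
    return reduce(+, map(fetch, tasks))
end
```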
The results of splitting the data frame into chunks of 4 or 2 are slightly worse than not splitting at all. This suggests that the work done in each thread is not complex enough to outweigh the overhead of managing threads.
The appendix section in the accompanying notebook shows an example where this approach can give a linear speed-up.
This article has presented a few ways to rewrite Julia code so that it runs faster, namely:
- Read the Performance Tips section of the Julia manual
- Use a profiler to identify bottlenecks
- Rethink your approach to solving the problem
- Parallelize your code
As we have seen, writing fast code involves making trade-offs between speed, convenience, and readability. For example, Solution 5 is very fast, but looking up results in an array is more of a hassle than it would be if the results were stored in a dictionary.
I hope you have found this useful.
I would like to thank the following people from the Julia Slack channel for their help in diagnosing the problems with the naive solution:
- Andrey Oskin
- Andres Riedemann
- Fredrik Ekre
Special shout out to Andrey Oskin for sharing all the amazing and sometimes crazy ways to write fast Julia code.
 Announcing composable multi-threaded parallelism in Julia. Bezanson, Nash and Pamnany. 2019