A Beginner’s Guide To Optimizing Julia Code

How To Find And Fix Performance Bottlenecks


Introduction

In this article, I will share some tips to help speed up Julia code. The goal is to help newcomers to the language become familiar with the common tools and techniques they can use to optimize their Julia code.

I assume the reader can read Julia code and has the Juno IDE installed. The code to reproduce the results presented in this article can be found in this notebook.

Motivation

This article is a follow-up of a previous article [1] where I posed the following problem:

Figure 1: The problem statement

I presented two Julia-based solutions. The first version could process a 6-million-row data frame in 14 seconds, whereas the second version could do the same in only 218 milliseconds.

This article will explain the steps I went through to optimize the code and the intermediate results I got along the way.

Solutions

Here’s the first attempt at a solution:

Figure 2: First attempt at a solution

We create a variable called results to store the counts of all possible pairs of values from columns 1 and 2, then iterate over the data frame to update those counts (line 5). This takes 11 seconds.

The important things to note about the code in Figure 2 (sketched below) are:

  • We can time code using the @btime macro from the BenchmarkTools package.
  • We put a $ in front of df to interpolate it into the benchmark expression, so that the cost of looking up this global variable is not included in the timing.
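
Figure 2 itself is a screenshot, so here is a minimal sketch of what that first attempt might look like. The column names (x1, x2, x3) and the use of an Accumulator from DataStructures are assumptions based on the discussion later in this article:

    using DataFrames, DataStructures, BenchmarkTools

    # `results` lives at global scope -- exactly what the next section fixes.
    results = counter(Tuple{Char,Char})

    # Iterate row by row, adding each row's value to the count for its pair.
    @btime for (c1, c2, v) in eachrow($df)
        inc!(results, (c1, c2), v)
    end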

What can we do to improve the execution time? The easiest thing to do is to refer to the Performance Tips page in Julia’s manual. This page lists many techniques to make Julia code run as fast as possible.

The following tips seem relevant to our code: “Avoid global variables” and “Pre-allocating outputs”. Let’s try them out and see the effect they have on our code.

Avoid global variables

The results variable is a global variable from the perspective of the code we are timing. To fix this, we can encapsulate all the machinery related to our logic into a function that takes only the data frame as an argument.
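
Here is a sketch of that refactoring, under the same assumptions about column names as before:

    # All state is now local to the function, so nothing is global any more.
    function f(df)
        results = counter(Tuple{Char,Char})
        for (c1, c2, v) in eachrow(df)
            inc!(results, (c1, c2), v)
        end
        return results
    end

    @btime f($df)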

This gives the following results:

Figure 3: Avoiding global variables

There’s hardly any difference, but at least the code is better organized. Let’s try the next tip.

Pre-allocating outputs

In this case, we know that the values in columns 1 and 2 will only range from ‘A’ to ‘Z’ and from ‘a’ to ‘z’, respectively. Therefore, we can create the keys in results and initialize them to 0 instead of creating them on the fly.
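
A sketch of the pre-allocated version might look like this (again assuming the column names):

    function f(df)
        results = counter(Tuple{Char,Char})
        # Pre-create all 26 x 26 possible keys so none are added on the fly.
        for c1 in 'A':'Z', c2 in 'a':'z'
            inc!(results, (c1, c2), 0)
        end
        for (c1, c2, v) in eachrow(df)
            inc!(results, (c1, c2), v)
        end
        return results
    end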

Here are the results:

Figure 4: Pre-allocating output

The pre-allocation is done in lines 4 to 7 in Figure 4. Unfortunately, this does not result in any significant improvement.

Looks like we have no choice but to whip out a profiler and dive into Julia’s internals.

The Juno IDE comes with a profiler that makes it really easy to profile Julia code. All you need to do is call Juno.@profiler on the piece of code you’d like to profile and Juno will produce an interactive flame graph for you to analyze.

For example, running the following script inside a Julia REPL in Juno:

Figure 5: Sample script to demonstrate code profiling in Juno
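
Figure 5 is a screenshot, but a rough stand-in for such a script might look like the following. The data frame construction and column names are assumptions; the key part is the Juno.@profiler call at the end:

    using DataFrames, DataStructures

    function f(data)
        results = counter(Tuple{Char,Char})
        for (c1, c2, c3) in eachrow(data)
            inc!(results, (c1, c2), c3)
        end
        return results
    end

    # A 6-million-row data frame of random letter pairs and values.
    df = DataFrame(x1 = rand('A':'Z', 6_000_000),
                   x2 = rand('a':'z', 6_000_000),
                   x3 = rand(1:10, 6_000_000))

    Juno.@profiler f(df)   # produces the interactive flame graph in Juno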

gives:

Figure 6: Output of executing the code in Figure 5

I encourage the reader to read the Profiling section in the Julia manual to understand how the graph in Figure 6 is constructed.

For the purpose of this article, all you need to know about Figure 6 is:

  • Each rectangle represents a line of code
  • Hovering over a rectangle will display the details of the code in the top part of the Profiler pane, as illustrated in the annotation
  • Clicking on a rectangle allows you to jump to the corresponding code
  • The width of each rectangle is proportional to the time spent in the corresponding call

Figure 6 calls out the part of f that takes the longest to execute. This is line 21 in the scratch.jl script, which corresponds to the line for (c1, c2, c3) in eachrow(data). Note that the rectangles preceding this one, i.e. the ones above it, are not related to f, so they can be ignored.

The results in Figure 6 suggest that iterating over the rows of a data frame is expensive. So let’s try rewriting f without using data frames:

Figure 7: No data frame solution
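
A sketch of this data-frame-free version, under the same column-name assumptions: instead of calling eachrow, we zip the three columns into a plain iterator of tuples:

    function f(df)
        results = counter(Tuple{Char,Char})
        data = zip(df.x1, df.x2, df.x3)   # an iterator of (Char, Char, Int) tuples
        for (c1, c2, c3) in data
            inc!(results, (c1, c2), c3)
        end
        return results
    end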

The execution time has dropped from around 11 seconds to just 188 milliseconds!

Can we do better? Well, let’s have a look at the flame graph of this version of f:

Figure 8: Flame Graph of the no data frame solution

Notice that the size of the rectangle corresponding to line 21 is insignificant compared to Figure 6. The line that takes the most time to execute is now line 23, which involves updating the values stored in results, i.e. results[key] = results[key] + val.

This section has shown that using simpler data structures, e.g. switching from a data frame to an iterator of tuples, can give a significant speed boost. So let’s try switching the type of results from an Accumulator to something simpler.

What could be simpler than a one-dimensional array?

The benefit of the Accumulator type is that it allows us to use the pair of values from columns 1 and 2 as the key under which to store its count. We lose this flexibility with an array, since indices for an array must be integers.

One way to overcome this limitation is to encode the pair of values as a sequence of 16 bits: the first byte holds the value from column 1, while the second byte holds the value from column 2. This works because any letter can be expressed as a sequence of 8 bits.

For example, to encode the pair ‘A’ and ‘b’, we note that the 8-bit value of ‘A’ is 0x41 and that of ‘b’ is 0x62. (The 0x prefix denotes a number in hexadecimal form.)

Therefore, the index corresponding to the pair ‘A’ and ‘b’ will be 0x4162. Here’s the code to implement this encoding scheme:

Figure 9: Function to convert a pair of characters into an integer
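
One plausible implementation (the function name encode is my own invention):

    # Put the first character's code point in the high byte and the
    # second character's in the low byte of a 16-bit integer.
    encode(c1::Char, c2::Char) = (UInt16(c1) << 8) | UInt16(c2)

    encode('A', 'b')   # 0x4162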

And here’s how to use it in f:

Figure 10: Tracking the counts using an array instead of an accumulator
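
A sketch of this version, building on the hypothetical encode function above. Since the largest possible key is encode('Z', 'z') == 0x5a7a, a vector of that length is big enough:

    function f(df)
        results = zeros(Int, 0x5a7a)          # one slot per possible encoded pair
        for (c1, c2, v) in zip(df.x1, df.x2, df.x3)
            key = encode(c1, c2)
            results[key] = results[key] + v   # the hot update line
        end
        return results
    end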

Since we want to maximize speed, let’s sprinkle in some Performance Annotations. In particular, we will use the @inbounds macro to tell Julia to skip array bounds checking in line 6:

Figure 11: Same as Figure 10 but with array bounds checking eliminated
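
In terms of the sketch above, only the update line changes:

    # Skip the bounds check on both array accesses inside the hot loop.
    @inbounds results[key] = results[key] + v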

This gives:

Figure 12: Timing for Figure 11

Without the @inbounds macro, the time is 37.476 ms.

Let’s do another round of optimization.

Here’s the flame graph for the latest version of f :

Figure 13: Flame graph for f in Figure 11

The conclusion from Figure 13 is the same as the one in Figure 8: The line that’s updating the values in results is taking the most time relative to other lines in f .

Since results is an array, a line like results[key] = results[key] + val should only involve calls to the getindex, setindex!, and + functions. However, Figure 13 reveals that there are a few calls to functions in the iterators.jl script before the call to getindex. This seems unnecessary.

The only iterator in f is data. Let’s convert it to an array and see what happens:

Figure 14: The no iterator solution
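
In the sketch, this is a one-line change: materialize the zip into a Vector of tuples with collect:

    function f(df)
        results = zeros(Int, 0x5a7a)
        data = collect(zip(df.x1, df.x2, df.x3))   # a Vector of tuples, not a lazy iterator
        for (c1, c2, v) in data
            key = encode(c1, c2)
            @inbounds results[key] = results[key] + v
        end
        return results
    end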

This solution is faster than the previous solution by about 7 ms.

This is what its flame graph looks like:

Figure 15: Flame graph of the no iterator solution

Notice that the call to getindex happens immediately after the line that updates results (line 6 in Figure 14).

Other Approaches

The solutions in the preceding section were based on analyzing the output of a profiler to identify bottlenecks and devising ways to circumvent them. This section will present two other approaches you can consider to optimize your code.

It can be helpful to take a step back and think about the problem from a different perspective. This could yield more efficient ways of solving the problem.

In this case, the problem can be expressed in terms of a group by and aggregate operation. This can be accomplished in Julia with just one line:

Figure 16: The data frame solution
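
With a recent version of DataFrames.jl, that one-liner might look like this (column names assumed as before):

    using DataFrames

    # Group by the (column 1, column 2) pair and sum column 3 within each group.
    combine(groupby(df, [:x1, :x2]), :x3 => sum)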

It’s 10 times slower than the best solution from the previous section, but it is certainly a lot better than the naive solution in terms of both speed and clarity.

Most modern computers come with multiple CPUs/cores, so it can be tempting to skip the analyzing/thinking approach and just rewrite the code to exploit the spare resources.

Julia (since version 1.3.0) has convenient facilities that make it very easy to parallelize code using multi-threading. This is called task parallelism, and you may refer to [2] for more details.

Here’s an attempt at using it to solve our problem:

Figure 17: Solution using task parallelism

The code in Figure 17 splits df into n chunks. Each chunk will be passed to a separate thread for counting in parallel and the results from each thread are merged sequentially.
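
A sketch of this approach, reusing the hypothetical encode function and assumed column names from earlier (the chunking details may differ from Figure 17):

    using Base.Threads: @spawn

    # Count one chunk of (c1, c2, v) tuples into its own array.
    function count_chunk(chunk)
        results = zeros(Int, 0x5a7a)
        for (c1, c2, v) in chunk
            @inbounds results[encode(c1, c2)] += v
        end
        return results
    end

    function f_threaded(df, n)
        rows = collect(zip(df.x1, df.x2, df.x3))
        chunks = Iterators.partition(rows, cld(length(rows), n))
        tasks = [@spawn count_chunk(chunk) for chunk in chunks]  # one task per chunk
        return reduce(+, fetch.(tasks))   # merge the per-task count arrays
    end

Remember that Julia must be started with multiple threads (e.g. by setting the JULIA_NUM_THREADS environment variable) for the spawned tasks to actually run in parallel.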

The results of splitting the data frame into 4 or 2 chunks are slightly worse than not splitting at all. This suggests that the work done in each thread is not complex enough to outweigh the overhead of managing the threads.

The appendix section in the accompanying notebook shows an example where this approach can give a linear speed-up.

Conclusion

This article has presented a few ways to rewrite Julia code so that it runs faster, namely:

  1. Read the Performance Tips section of the Julia manual
  2. Use a profiler to identify bottlenecks
  3. Rethink your approach to solving the problem
  4. Parallelize your code

As we have seen, writing fast code involves making trade-offs between speed, convenience, and readability. For example, Solution 5 (the array-based version) is very fast, but writing code to look up its results is a hassle compared to having them stored in a dictionary.

I hope you have found this useful.

Acknowledgment

I would like to thank the following people from the Julia Slack for their help in diagnosing the problems with the naive solution:

  • Andrey Oskin
  • Andres Riedemann
  • Fredrik Ekre

Special shout out to Andrey Oskin for sharing all the amazing and sometimes crazy ways to write fast Julia code.

References

[1] ___. How Practical Is Python For Prototyping Data Science Projects At Scale? 2020.

[2] Bezanson, Nash, and Pamnany. Announcing composable multi-threaded parallelism in Julia. 2019.
