While Python is reaping the benefits of being around for 30 years, new programming languages are emerging that could challenge Python’s title as “King of Data Science”, the most prominent of which is Julia.
Julia was created to fix the flaws of older languages while combining their strengths. Obviously, it is aiming high — reason enough for us (Daniel and me) to have a closer look and test it in the scope of Motius Discovery.
This post covers what we learned about Julia and how it compares to Python. Before we get into Julia, let’s first start with a quick overview of Python.
Why Python became the King of Data Science
Python is an object-oriented, high-level, interpreted, general-purpose programming language designed for clear, logical, and readable code. It is widely praised as the “King of Data Science”. Why?
When it comes to data science, developers need a flexible and versatile language that is simple to code but still able to handle complex mathematical processes. In addition to the characteristics mentioned above, Python fulfils those requirements as it:
- is cross-platform, thus enabling easy multi-environment setups
- manages memory automatically through garbage collection
- is powered by a large collection of libraries with tools for any kind of challenge
- can call code written in other languages, compensating for its limited focus on raw performance.
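The last point is worth a quick illustration: Python’s standard `ctypes` module can load a compiled C library and call into it directly. A minimal sketch (the fallback library name `libm.so.6` is an assumption that holds on common Linux systems; `find_library` handles other platforms):

```python
import ctypes
import ctypes.util

# Locate and load the C math library; name resolution is platform-specific,
# so we fall back to a common Linux soname if lookup fails.
libm_name = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libm_name)

# Declare the C signature of sqrt: double sqrt(double).
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(9.0))  # calls the compiled C implementation, not Python code
```

This pattern is how much of Python’s data science stack works under the hood: the convenient Python layer delegates heavy numerical work to compiled code.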
Now that we know some background about Python, we can have a look at Julia.
What is Julia?
Julia is an open source, high-level, high-performance, dynamic programming language that represents a fresh approach to technical computing. Thanks to LLVM, Julia uses just-in-time compilation to compile its programs into efficient native code, making it usable across multiple platforms. Due to its focus on high performance, it is well-suited for numerical analysis and computational science and has been gaining traction in the scientific computing and data science communities since its 1.0 release in August 2018.
Furthermore, Julia has made the following design decisions:
- type system with parametric polymorphism
- multiple dispatch as a core programming paradigm
- support for concurrent, composable parallel, and distributed computing
- efficient libraries for a range of mathematical methods
Altogether, these features are supposed to make Julia more expressive, enable overloading, increase the range of possible use cases, and simplify debugging.
You can run Julia in read-eval-print loop (REPL), from the command line, or via a Jupyter notebook. It also comes with a built-in package manager, debugger, and profiler.
So, based on all those promises, we wanted to determine whether Julia will soon rival Python as “King of Data Science”. To do so, we implemented a Variational Autoencoder (VAE) in both Julia and Python and compared the two experiences.
VAEs, along with GANs, are generative models that are becoming increasingly relevant for a wide range of use cases. Their areas of application include generating highly realistic human faces, composing synthetic music, object tracking, and image-to-image translation.
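To give a feel for what a VAE training step optimizes, regardless of language: the loss is a reconstruction term plus a KL divergence that pulls the encoder’s approximate posterior N(μ, σ²) toward the standard normal prior. A minimal, framework-free Python sketch of the closed-form KL term for a diagonal Gaussian (our own illustrative code, not the implementation we benchmarked):

```python
import math

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions.

    mu and logvar are per-dimension lists, as a VAE encoder would emit them.
    """
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv)
        for m, lv in zip(mu, logvar)
    )

# When the posterior matches the prior exactly, the KL term is zero...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# ...and it grows as the encoder's output drifts away from the prior.
print(kl_to_standard_normal([1.0, -1.0], [0.0, 0.0]))  # 1.0
```

Both the Julia and the Python implementations have to express this same objective; the comparison below is therefore mostly about ecosystem and tooling, not the math.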
Julia’s machine learning ecosystem
In constructing the VAE in Julia, we tried three different libraries, namely Flux, MLDatasets, and Augmentor.
Flux is a machine learning library which “lets you use the full power of the Julia language where you need it.” Compared to the two most popular ML frameworks in Python, working with Flux feels more like working with PyTorch than TensorFlow due to its syntax. However, it is not as mature as either of them: we found it lacked flexibility when building custom models and components, and its documentation and examples were often sparse or outdated.
MLDatasets includes some of the most common benchmark datasets, e.g. Iris, MNIST, and CIFAR. Compared to equivalent Python libraries like TensorFlow Datasets, MLDatasets is rather limited, missing, for example, object detection and segmentation datasets.
Augmentor is a real-time image augmentation library that works similarly to imgaug, and we found that it performed just as well. It is simple to chain together a set of transformations which can be applied to a collection of images.
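The chaining idea that Augmentor (and imgaug) expose can be sketched language-agnostically as function composition over an image: build a list of transforms once, then apply it to every sample. A toy Python version with a 2-D list standing in for an image (the transform names are our own, not Augmentor’s API):

```python
def flip_horizontal(img):
    """Mirror each row of the image."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def apply_pipeline(img, transforms):
    """Apply each transform in order, the way augmentation libraries chain ops."""
    for t in transforms:
        img = t(img)
    return img

image = [[1, 2],
         [3, 4]]
pipeline = [flip_horizontal, rotate_90]
print(apply_pipeline(image, pipeline))  # → [[4, 2], [3, 1]]
```

Real augmentation libraries add randomness, batching, and performance on top, but the pipeline-of-transforms structure is the same.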
Now, let’s talk about our findings from our experiments.
Limited machine learning ecosystem and surprisingly slow
While basic workflows, such as image classification, have suitable documentation and examples in Flux, more complex models such as object detection are not referenced at all. But even with the pre-implemented models you have to be careful: some of them make deprecated function calls, which means they are not usable without some initial debugging.
Furthermore, the range of network layers supported in Julia is limited compared to PyTorch or TensorFlow, which makes Julia’s native deep learning ecosystem and library support rather thin. Compared to Python, you cannot build complex workflows as easily. However, there is a workaround for this issue: since Julia supports calling other languages, you can call TensorFlow directly from your Julia code (using either Python or C bindings).
As mentioned previously, Julia was designed for high performance. Unfortunately, we experienced extremely long pre-compilation times throughout our tests. This is because every time you run a Julia script from the command line, all imported libraries are pre-compiled before any of your code runs. Consequently, iterating quickly outside of the REPL is almost impossible. In fact, we experienced waiting times of up to five minutes just to pre-compile some imported libraries like Flux. Clearly, this shows that some third-party libraries are not yet mature enough to be fully optimized for compilation time. So, although Julia is supposed to be faster than Python, issues like this one hamper its speed significantly, making it frustrating for developers.
Luckily, we found a workaround for that issue as well. Instead of running your code from the command line with `julia my_code.jl`, you can first start the REPL by calling `julia` and then load your code with `include("my_code.jl")`. This way, the imported libraries are pre-compiled only on the first call, and the results are cached for successive calls of `include("my_code.jl")` when you change your scripts. However, be careful with this workaround: variables in the global scope are not cleared between calls to `include`, which can introduce bugs, since you are not starting from a clean state each time you run the script.
Just like compilation, training the VAE seemed rather slow in Julia compared to Python, as our comparison below shows. This suggests that a developer would have to spend considerably more time optimizing performance in Julia than in Python, where reasonable performance comes closer to out-of-the-box.
Furthermore, we found that many Julia libraries are missing proper documentation and examples. This makes straying from the “common path” somewhat difficult in Julia, requiring more intimate knowledge of the respective libraries. Additionally, our tests uncovered that the Flux.jl library seems to have excessive memory consumption, significantly higher than PyTorch’s. Finally, we found that the unfamiliar syntax makes Julia even more difficult to pick up for developers who are used to Python or C.
Why Python stays on top
Based on our tests, we can say that, for now, Julia is too immature to push Python off the data science throne. While it has an interesting language design and shows potential in many areas, its limited machine learning ecosystem and community need to grow and mature before it can compete with Python.
However, seeing as Python has been around for three decades while Julia has not even celebrated its 10th birthday (and only its 2nd since the 1.0 release), the future for Julia still looks quite bright. If the ever-growing Julia community keeps improving the ecosystem, it has a real chance not only to become highly relevant for industrial use cases, but possibly also to claim the title of “King of Data Science”.