Rust and its stance in data science
Last updated on 2018–04–04 with a few recent notes and mentions.
This isn’t something that I would do very often, but a call was made, and I would like to take that chance to fill in some ideas with another context in mind. Some of them may be a bit controversial or biased.
As a computer engineer pursuing a PhD in computer science, I often see this as a clash of worlds. Doing actual science and obtaining results fast and productively is extremely important, since we are often evaluated by our scientific publication output. On the other hand, the concerns of building production-ready solutions on top of that state of the art are frequently left as a second priority, which leaves behind technical debt that few research groups care enough to pay off.
So. Data science. To be honest, maybe just “data science” can be too narrow. By technologies in data science, I am referring to those usually employed by researchers in machine learning, statistics, artificial intelligence, and other fields where some level of mathematical computation is employed. They may often involve a cycle where models are designed, trained, measurements are made, observations are taken, parameters are fine-tuned, and back to step 1 or 2 we go.
Without extending the introduction any further, here are the points that, in my opinion, should be considered when working with Rust in these (mostly academic) fields.
Rust is an amazing programming language. Its focus on memory safety, efficiency and performance makes it a great candidate for constructing frameworks and tools for machine learning and data analysis, which can make the best of the available resources in a computer. With that said, let’s stop that thought for a moment and keep in mind that many mature technologies for data science exist today. We have R, with a reasonably wide environment designed for statistics. We have MATLAB (and its free alternative, Octave), which, like it or not, is still extensively used in research and widely taught in science degrees, both inside and outside computer science. We even have Julia, which I like to call MATLAB’s cool younger cousin, and it boasts some interesting perks of its own. And of course, Python currently holds a pretty large piece of the DS cake. Not because it was specifically designed for these purposes, but because the language is simple enough to attract the less code-savvy, and because every library you’d ever need is in there. And many people would rather keep defying gravity than choose a stack without the necessary tools for the job.
By creating new Rust tools for data scientists, we could be taking the unnecessary risk of “competing” with all of the others without a clear reason why someone should be switching other than “just because” (or are you really throwing in the argument that it’s safe and provides fearless concurrency?). Moreover, it’s not like we’re supposed to shape Rust to fit the use cases of data scientists, which could in the worst case lead to the mistake of making just another compiled Python. As well stated in this other Rust 2018 blog post, even these old languages and technologies have their place.
Therefore, I would like to point out that integration should be a major focus for Rust. This works both ways: (1) being able to use non-Rust solutions in Rust; and (2) enabling non-Rust technologies to use software written in Rust. It’s not too bad if we don’t have a pure Rust solution, but having a familiar framework accessible from Rust is important. For example, the Leaf project didn’t quite work out, but we can use TensorFlow today, or at least enough to load saved models and serve them through a Rust stack, thanks to the actively maintained bindings.
This concern isn’t new, and our ecosystem has come a long way towards these goals. FFI is the main road for native interactions. We have bindgen, which translates C APIs into Rust bindings. What we do not quite have yet is an easy way to make bindings from C++ interfaces. The only approach known to work pretty well is not to use C++ APIs at all: just create pure C headers and the respective wrapper implementation. I hope that we can improve on this end. Can we have a look at SWIG, for example?
2018–04–04 Update: If you wish to learn more about writing Rust bindings to C++ libraries, consider reading my story on Taking the long road.
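To make direction (2) a little more concrete, here is a minimal sketch of what the Rust side of a C-compatible wrapper might look like. The function name `ds_mean` and its signature are made up for illustration; they do not come from any existing crate. The idea is simply that a plain pointer-and-length interface is something C, Python (through ctypes or cffi), R and others can all call.

```rust
use std::slice;

/// Hypothetical example function: the mean of `len` doubles starting at `data`,
/// using the C calling convention. Returns NaN for a null pointer or an empty buffer.
///
/// To export it under a stable symbol name from a `cdylib`, one would also add
/// `#[no_mangle]` (spelled `#[unsafe(no_mangle)]` from the 2024 edition onwards).
/// The matching C declaration would be roughly:
///     double ds_mean(const double *data, size_t len);
///
/// # Safety
/// `data` must point to at least `len` readable, properly aligned `f64` values.
pub unsafe extern "C" fn ds_mean(data: *const f64, len: usize) -> f64 {
    if data.is_null() || len == 0 {
        return f64::NAN;
    }
    // Borrow the caller's buffer without copying it.
    let values = unsafe { slice::from_raw_parts(data, len) };
    values.iter().sum::<f64>() / len as f64
}

fn main() {
    // Calling it from Rust here only to show the contract; a real caller
    // would be C, Python, R, or anything else that speaks the C ABI.
    let samples = [1.0, 2.0, 3.0, 4.0];
    let mean = unsafe { ds_mean(samples.as_ptr(), samples.len()) };
    println!("mean = {}", mean); // prints "mean = 2.5"
}
```

The interface deliberately avoids Rust-specific types (slices, `Vec`, `Result`) at the boundary, which is the same "pure C header plus wrapper" approach described above, just seen from the Rust side.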
As last year’s efforts in the Rust ecosystem included an asynchronous network model (yes, that’s Tokio), we can use Rust to turn these models into network services. Oftentimes, the web API can be as simple as sending serialized objects (with serde, of course!) over a well-established network protocol (HTTP, plain TCP, or another network layer abstracting those, such as ZeroMQ or nanomsg). Integration with non-Rust technologies becomes mostly a solved problem at this point. In the process, let’s not forget existing standards and other commonly used formats. Always choose to consume or implement existing standards. For instance, the Khronos group has recently released a provisional specification of the Neural Network Exchange Format (NNEF), intended to harmonise neural network tools and inference engines. From my perspective, if Rust is to have a valuable position in deep learning, it ought to keep NNEF in mind, and perhaps the same goes for formats from well established deep learning frameworks.
2018–04–04 Update: one initiative of writing a pure Rust parser of NNEF files was made last month. Great!
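Going back to the idea of sending serialized objects over a plain byte stream, the sketch below shows the shape of it using only the standard library. Everything here is an assumption made for illustration: the `Prediction` type is hypothetical, and the hand-rolled length-prefixed encoding merely stands in for what a serde-based format would do in a real service. The functions are generic over `Read` and `Write`, so the same code would work on a `TcpStream` as well as on the in-memory buffer used here.

```rust
use std::io::{self, Cursor, Read, Write};

/// Hypothetical message: the scores a model produced for one input.
#[derive(Debug, PartialEq)]
pub struct Prediction {
    pub scores: Vec<f32>,
}

/// Writes a little-endian u32 element count, then each score as 4 little-endian bytes.
pub fn write_prediction<W: Write>(out: &mut W, p: &Prediction) -> io::Result<()> {
    if p.scores.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "too many scores"));
    }
    out.write_all(&(p.scores.len() as u32).to_le_bytes())?;
    for score in &p.scores {
        out.write_all(&score.to_le_bytes())?;
    }
    Ok(())
}

/// Reads back one message written by `write_prediction`.
pub fn read_prediction<R: Read>(input: &mut R) -> io::Result<Prediction> {
    let mut word = [0u8; 4];
    input.read_exact(&mut word)?;
    let count = u32::from_le_bytes(word);
    // Grow the vector as data arrives instead of trusting the declared count.
    let mut scores = Vec::new();
    for _ in 0..count {
        input.read_exact(&mut word)?;
        scores.push(f32::from_le_bytes(word));
    }
    Ok(Prediction { scores })
}

fn main() -> io::Result<()> {
    let sent = Prediction { scores: vec![0.1, 0.7, 0.2] };

    // An in-memory buffer plays the role of the network connection.
    let mut wire = Vec::new();
    write_prediction(&mut wire, &sent)?;

    let received = read_prediction(&mut Cursor::new(wire))?;
    assert_eq!(sent, received);
    println!("round-tripped {} scores", received.scores.len());
    Ok(())
}
```

A fixed, documented byte layout like this one is trivial to decode from Python or any other language, which is the whole point of the integration argument; a standard format would be an even better choice than an ad hoc one.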
Way before we think about making new tools for data scientists and the like, we should consider the means through which we can add solutions written in Rust. Think of it as a sandwich, where we can use Rust to make a native implementation of demanding algorithms, and at the same time serve these solutions with production-ready servers. The scientific value of the approach would be sitting in the middle, which could be written in different languages. This includes exposing non-Rust APIs out of pure Rust solutions.
I have come to realise throughout my years as a PhD student that the wrong shiny tool for the job can make you waste much more time than the right yet not shiny one. So it happens that, although the number of crates and the number of crate creators are steadily increasing, it’s not hard to spot some useful functionalities often employed in data science which are not available. One of them, although not necessarily one that would strike you as a major flaw, is reading and writing files in the HDF5 format. The crates that we have today are either incomplete or very difficult to use. hdf5-rs seems to be the one closest to becoming usable, however, and one of my wishes for 2018 is that a new feature-complete release is made for this particular crate.
> As of 2019–03–12, this HDF5 library, now released as the hdf5 crate, is in a much more usable state!
Of course, there are many other, more popular crates which work today, but would benefit from reaching stabilization. ndarray, for example, may become the cornerstone Rust interface and implementation of multi-dimensional arrays, just like numpy is in a Python environment. That also means it should be accompanied by sparse array implementations, at least in a separate (also stable) crate: could it be you,
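For readers unfamiliar with what a sparse array implementation has to provide, here is a small standard-library-only sketch of the common compressed sparse row (CSR) layout with a matrix-vector product. The `CsrMatrix` type and its methods are hypothetical names written for this post; they are not the API of ndarray or of any existing sparse crate.

```rust
/// Hypothetical CSR matrix: only non-zero values are stored.
pub struct CsrMatrix {
    rows: usize,
    cols: usize,
    /// `row_ptr[r]..row_ptr[r + 1]` is the range of stored entries for row `r`.
    row_ptr: Vec<usize>,
    /// Column index of each stored entry.
    col_idx: Vec<usize>,
    /// Value of each stored entry.
    values: Vec<f64>,
}

impl CsrMatrix {
    /// Builds a CSR matrix from a dense row-major buffer, dropping exact zeros.
    pub fn from_dense(rows: usize, cols: usize, dense: &[f64]) -> Self {
        assert_eq!(dense.len(), rows * cols, "buffer must hold rows * cols values");
        let mut row_ptr = Vec::with_capacity(rows + 1);
        let mut col_idx = Vec::new();
        let mut values = Vec::new();
        row_ptr.push(0);
        for r in 0..rows {
            for c in 0..cols {
                let v = dense[r * cols + c];
                if v != 0.0 {
                    col_idx.push(c);
                    values.push(v);
                }
            }
            row_ptr.push(values.len());
        }
        CsrMatrix { rows, cols, row_ptr, col_idx, values }
    }

    /// Number of stored (non-zero) entries.
    pub fn nnz(&self) -> usize {
        self.values.len()
    }

    /// Computes `self * x`, touching only the stored entries.
    pub fn mul_vec(&self, x: &[f64]) -> Vec<f64> {
        assert_eq!(x.len(), self.cols, "vector length must match column count");
        (0..self.rows)
            .map(|r| {
                (self.row_ptr[r]..self.row_ptr[r + 1])
                    .map(|k| self.values[k] * x[self.col_idx[k]])
                    .sum::<f64>()
            })
            .collect()
    }
}

fn main() {
    // A 3x3 matrix with three non-zero entries.
    let dense = [
        2.0, 0.0, 0.0,
        0.0, 0.0, 3.0,
        0.0, 4.0, 0.0,
    ];
    let m = CsrMatrix::from_dense(3, 3, &dense);
    println!("nnz = {}", m.nnz());
    println!("{:?}", m.mul_vec(&[1.0, 2.0, 3.0]));
}
```

A stable crate would of course need much more than this (other storage layouts, arithmetic, conversions to and from dense arrays), which is exactly why it deserves to live alongside the dense array type rather than inside every project that needs it.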
The community at large can help with this. Have a look at projects on GitHub which are looking for help, especially those still far away from v1.0. The site www.arewelearningyet.com is the de facto aggregation of machine learning tools for Rust developers, and is worth keeping an eye on. See this list of not-yet-awesome things in Rust, most of which are related to mathematics and machine learning. Moreover, consider visiting the ecosystem Working Group, which is focused on the sustainability and maturity of Rust. And yes, don’t feel disinclined to make tools for data scientists. It may sound contradictory to the previous section, but that’s what the following section is for:
I will end with a semi-open question: what makes an ideal tool or library for data scientists? In my opinion, we can outline a few points.
- They provide extendable interfaces, so that more algorithms and components can be easily coupled together in a single script. The Python ecosystem does this by using common data structures and by “mimicking” those interfaces in custom types (namely pandas, to name the most important ones).
- They are easy to use without crippling performance. One way to achieve this is to make interfaces that users are familiar with from other technologies, while retaining what makes the code idiomatic. Can we make plt.plot(x, y) in Rust just as easy? Sure thing!
- They are fast, efficient, and can be hardware-accelerated. Many algorithms employed in machine learning are much faster when run on one or more GPUs. There are no predictions of that changing in the near future. While we can claim that Rust code is pretty well optimised, the difference is less relevant when relying on GPU-accelerated computation APIs such as CUDA and OpenCL. It’s good that we currently have open solutions to this, to some extent, but library developers should not forget to use them. Even if just doing things on the CPU, consider whether SIMD can be used. If you do not wish to deal with low-level intrinsics, how about using a middle-level crate such as faster, or even BLAS and LAPACK bindings?
- If it’s a library or a framework, the programming language used should be good. For a language that is only about three years past 1.0, it’s going pretty well. This bullet point can refer to what so many other Rust2018 blog posts have stated about the future of Rust.
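As a small illustration of the first point in the list above, extendable interfaces in Rust usually come down to a shared trait. The sketch below is only one possible shape for this: the `Transform` trait, the `Center` and `Scale` steps and the `run_pipeline` helper are all hypothetical names invented for this example, not taken from an existing library.

```rust
/// Hypothetical common interface: anything that maps a sample to a new sample.
pub trait Transform {
    fn apply(&self, data: &[f64]) -> Vec<f64>;
}

/// Subtracts the mean from every value.
pub struct Center;

impl Transform for Center {
    fn apply(&self, data: &[f64]) -> Vec<f64> {
        if data.is_empty() {
            return Vec::new();
        }
        let mean = data.iter().sum::<f64>() / data.len() as f64;
        data.iter().map(|v| v - mean).collect()
    }
}

/// Multiplies every value by a constant factor.
pub struct Scale(pub f64);

impl Transform for Scale {
    fn apply(&self, data: &[f64]) -> Vec<f64> {
        data.iter().map(|v| v * self.0).collect()
    }
}

/// Runs the steps in order, feeding each output into the next step.
pub fn run_pipeline(steps: &[Box<dyn Transform>], data: &[f64]) -> Vec<f64> {
    steps
        .iter()
        .fold(data.to_vec(), |acc, step| step.apply(&acc))
}

fn main() {
    // Components from different authors can be coupled together as long as
    // they implement the same trait.
    let steps: Vec<Box<dyn Transform>> = vec![Box::new(Center), Box::new(Scale(2.0))];
    let out = run_pipeline(&steps, &[1.0, 2.0, 3.0]);
    println!("{:?}", out); // prints "[-2.0, 0.0, 2.0]"
}
```

Trait objects are used here for brevity; a library that cares about the performance point would probably prefer generics so that each step can be inlined, at the cost of a slightly less script-like interface.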
Let me know what you think!