Xtensor, C++, and the Julia Challenge
Last month, Simon Danisch launched the Julia Language Challenge, a programming challenge from a member of the Julia community to other programming languages.
The challenge was to create high-level routines for
- NumPy-style broadcasting of operations on n-dimensional arrays
- lazy evaluation of broadcasting expressions
with high performances. This can be easily implemented in Julia thanks to the way Julia dispatches on the most effective implementation based on argument types at (just-in-time) compilation time.
The previous submissions
There have been a few submissions already, in addition to the original Julia implementation:
- a Nim submission by Mamy André-Ratsimbazafy, who presented a very concise solution to the problem in the Nim programming language.
- a Pythran-based solution by Serge Guelton. Pythran is a Python to C++ transpiler that is NumPy-aware and can generate very efficient code. Pythran showed its effectiveness at the FastR vs Julia raytracing benchmark where it performed 55% faster than Julia from a simple NumPy-based Python implementation.
- a pure C++ solution, by Wolf Vollprecht, (co-signing this article), which uses nothing but the standard template library and a C++ compilers.
As Wolf put so well in his post about the pure C++ solution:
as C++ developers, hearing about a strong type system, templates, and specific optimization based on the static knowledge of types hits pretty close to home, doesn’t it?
This challenge was also very appealing to us as the main authors of xtensor, a lazy tensor expression system in C++.
xtensor is a C++14 template library providing
- an extensible expression system enabling lazy NumPy-style broadcasting.
- an API following the idioms of the C++ standard library.
- tools to manipulate array expressions and build upon xtensor.
xtensor exposes an API similar to that of NumPy covering a growing portion of the functionalities. A cheat sheet can be found in the documentation:
However, xtensor internals are very different from NumPy. Built upon modern C++ foundations (template expressions, closure semantics), xtensor is a lazily evaluated library, avoiding the creation of temporary variables and unnecessary memory allocations, even in the case of complex expressions with broadcasting and external API calls.
In addition to the core library, a growing ecosystem of scientific computing packages is built upon xtensor:
- xtensor-blas: linear algebra routines for xtensor expressions linking with a BLAS implementation.
- xtensor-io: APIs to read and write various file formats for images, audio, and NumPy formats.
- xtensor-python: bindings for the Python programming language, allowing the use of NumPy arrays in-place.
- xtensor-r: bindings for the R programming language, allowing the use of R arrays in-place.
- xtensor-julia: bindings for the Julia programming language, allowing the use of Julia arrays in-place, built upon the CxxWrap.jl library.
- xtensor-fftw: bindings to the fftw library, by Patrick Bos.
- xtensor-ros: bindings for ROS, the robot operating system, by Wolf Vollprecht.
- z5py: C++ implementation of the zarr and n5 formats by Constantin Pape.
The xtensor submission to the Julia challenge
The entirety of the xtensor submission to the Julia challenge can be found here.
The project includes a simple README pointing to this article, a CMakeLists.txt for building the package, and a main.cpp file.
1. Lazy broadcasting of a + b - sin(c)
with prescribed a
and b
This part is really trivial to implement with xtensor. In the following code,
xt::xtensor<double, 2> a = xt::random::rand<double>({1000, 1000});
xt::xtensor<double, 1> b = xt::random::rand<double>({1000});
double c = 1.0;// Un-evaluated broadcasting expression
auto expr = a + b - std::sin(c);
auto res = xt::xtensor<double, 2>::from_shape({1000, 1000});
expr
is an un-evaluated expression and does not hold the results of the computation. Evaluation only occurs upon assignment or access. We can then benchmark the simple assignment to a container.
// Benchmark loop
std::cout << "a[1000x1000] + b[1000] - sin(c[])" << std::endl;
std::size_t min_time = 100000000;
for (int i = 0; i < 200; ++i)
{
stopwatch timer; // Create timer
xt::noalias(res) = expr; // Evaluate the expression.
auto elapsed = timer.elapsed(); // Nanoseconds
if (elapsed < min_time)
{
min_time = elapsed; // Keep minimal evaluation time
}
}
In performing this benchmark on various architectures, we found that this implementation was consistently 10 to 12 times faster than the reference Julia implementation, and observed performances that were consistent with the pure C++ implementation and the Pythran submission. The slowness of Julia appears to be caused by a performance bug affecting this case.
2. Operating on non-scalar types
The second part of the challenge was about operating on arrays where the element type was not scalar. This is really natural to C++, which enables operator overloading, and specialization of functions on their argument types. As soon as the required functions and arithmetic operators are implemented for the element type, it will also work with xtensor expressions holding such elements.
Following the Julia reference implementation, we decided on xtensor_fixed
for our Point3
implementation. xtensor_fixed
is a stack-allocated non-resizable xtensor container.
template <class T>
using xpoint = xt::xtensor_fixed<T, xt::xshape<3>>;template <class T>
auto sum_xpoint(const xpoint<T>& t)
{
return std::accumulate(t.cbegin(), t.cend(), T());
}
Similarly to the first section, the expression sqrt(sum(a * b))
is fairly trivial to implement. Here sum
is an element-wise broadcasting of sum_xpoint
.
constexpr std::size_t psz = 1000000;
auto px = xt::xtensor<xpoint<float>, 1>({psz}, {0.5f, 2.1f, 3.2f}),
py = xt::xtensor<xpoint<float>, 1>({psz}, {0.5f, 2.1f, 3.2f});
auto res = xt::xtensor<float, 1>({psz});
auto sum = xt::vectorize(sum_xpoint<float>);// Un-evaluated broadcasting expression
auto expr = xt::sqrt(sum(px * py));
Just like in the previous section, expr
is an un-evaluated expression. It holds no memory. We can now benchmark the assignment, which is when the expression is evaluated.
// Benchmark loop
std::cout << "Benchmarking sqrt(sum(a * b))" << std::endl;
std::size_t min_time = 100000000;
for (int i = 0; i < 200; ++i)
{
stopwatch timer; // Create timer
xt::noalias(res) = expr; // Evaluate the expression.
auto elapsed = timer.elapsed(); // Nanoseconds
if (elapsed < min_time)
{
min_time = elapsed;
}
}
Similarly, when performing this benchmark, this implementation was consistently 2 to 3 times faster than the reference Julia implementation.
This benchmark can actually be improved, as we will see later in this post.
3. SIMD acceleration
Beyond auto-vectorization, xtensor deals with SIMD acceleration by explicitly making use of SIMD compiler intrinsics for a large number of processor architectures. For this purpose, the xtensor team authored xsimd, a standalone collection of high-level wrappers on compiler intrinsics.
xsimd supports arithmetic operator overloading and includes implementations of mathematical functions for batches of floating point types and integer types, with lengths depending on the available instruction sets. It includes utilities to detect of the available instruction sets on the target architecture, and also offers special memory allocators optimizing alignment of elements for their loading in SIMD registries.
The other key to xtensor’s performance is that with un-evaluated expressions, it will automatically detect upon which dimensions operations can be SIMD-accelerated, based on the memory layout of the nodes of the expressions. It will also detect, based on their type, if the dimension of the resulting expression is known at compile time or only at runtime.
We thrive to produce the most efficient assembly for a given architecture and argument types.
4. Bonus Point: multi-threading
In the Julia challenge, the multi-threading of the assignment loop is also mentioned. This is something that can be easily done with xtensor. Since the evaluation of the expression is only performed upon assignment, the assignment loop can easily be parallelized over several threads.
For this purpose, xtensor makes use of the tbb library
To enable multi-threaded assignment with xtensor, simply
- install tbb in the same prefix as xtensor
- build the benchmarks project with the additional macro definition
XTENSOR_USE_TBB
.
and voila! All assignment loops should be multi-threaded.
Outcome of the Julia Language Challenge
We found that the outcome of the challenge was really positive for all parties involved. The sophistication and performances of submissions have been amazing, and triggered many discussions in the community.
The relative slowness of Julia in the benchmarks is due to a performance bug, but we don't think that Julia has any intrinsic flaw that would make this unsolvable, and at least the challenge was useful to expose the issue.
This challenge also forced us to revise some of the assumptions that we were making in the xtensor library. Namely, xtensor relies on xsimd's aligned memory allocator to force alignment of arrays and make the loading in SIMD registers faster. We learned in the second benchmark that for small arrays (like xpoint
), the added padding was causing a lot of memory overhead, and did not have any benefits with large arrays of xpoint
objects.
What next?
xtensor and xsimd don’t exist in a vacuum. The QuantStack team is working on a larger scientific computing stack for C++, and bindings for the various languages of data science (Jupyter, Python, R). This includes interactive computing and data visualization.
C++ as a common denominator for the languages of data science
One of the key ideas in our work is that C++ can be seen as a common denominator for the main languages of data science. Data structures like N-D arrays and dataframes, as well as file formats, or communication protocols, can be implemented once in C++, and wrapped for the different target languages.
It can help long-term sustainability for the R, Python, and Julia scientific communities to collaborate on solving the common challenges together rather duplicating the implementation and maintenance effort for each language.
C++ first
A common situation with Python and R is for some parts of the code to be written in C or C++ for performance reasons, and exposed to the target language at hand. The difficulty with this approach is that the users of the community are not the developers of the tool and contributors find a cliff when they need to contributes to these parts of the code.
With the xtensor stack, we aim at creating a code base that will be directly useful and idiomatic to C++ developers, and useful for the larger community of scientific computing developers.
Another aspect of our approach is to consider C++ as a first-class scientific computing programming language, and not just for close-to-the-metal performance computing.
C++ is an interpreted language
A key ingredient to this vision is the cling C++ interpreter, from CERN. Built on the top of LLVM and Clang libraries, it just-in-time compiles user inputs. The usage of C++ and cling at CERN in the ROOT project to solve real-world research problems is the demonstration that building a scientific computing stack upon C++ is possible.
As detailed in our earlier article on Interactive Workflow for C++ with Jupyter, we presented the foundations of the interactive computing stack that we are building on the top of cling, in which xtensor plays the role of NumPy, xwidgets that of ipywidgets.
- xeus is a C++ implementation of the Jupyter kernel protocol. It is not a kernel itself, but a library that makes the creation of kernels easier.
- xeus-cling is a C++ Jupyter kernel built with xeus, and cling.
- xwidgets is a C++ backend to the Jupyter interactive widgets ecosystem, coming along with xplot, xleaflet, xthreejs, backends to most interactive widget libraries for Project Jupyter.
xtensor, xeus, xwidgets, and the cling project are the foundation for a fully-fledged interactive computing stack.
In the few screenshots and screencasts below, we give a few examples of the xeus-based C++ kernel at play in JupyterLab, making use of interactive widgets, rich mime type display and interactive data visualization packages.
Interestingly the resulting interactive C++ stack compares to Julia in how it relies on LLVM for just-in-time compilation and to generate fast code.
Together with the xtensor stack for N-D arrays and broadcasting, and the interactive data visualization packages leveraging the Jupyter interactive widgets, we may have a compelling story for interactive scientific computing in C++.
Trying it online
If you are interested in trying out the xeus-cling C++Jupyter kernel, you can check out our example binder for xeus-cling.
You can also experiment with the xtensor binder, which shows how xtensor in the C++ notebook really provide an experience similar to that of NumPy in a Python notebook. The xleaflet binder is a good display of the interactive visualization capabilities of the xwidgets stack.
About the Authors (alphabetical order)
Sylvain Corlay, Scientific Software Developer at QuantStack
Johan Mabille, Scientific Software Developer at QuantStack
Wolf Vollprecht, Scientific Software Developer at QuantStack