Making a Case: Rust for Python developers


I couldn’t come up with an appropriate title for this article; sorry about that. “Python developers” encompasses a huge variety of people. I am going to target use cases for a particular set of Python developers: machine learning developers. But it might prove useful for others too. By others, I mean almost anyone coming from a high-level, garbage-collected language like Python, Go, or Java. For example, go through this discussion.

You are likely to see many such discussions hereafter, especially once WebAssembly goes mainstream. You must be wondering who in their right mind would write web apps in Rust, supposedly a systems language. Once you pass the learning stage of Rust, which is relatively steep, you will come to see that it is actually a very productive language. Performance benefits aside, one really underestimated feature of Rust is how elegantly it prevents whole classes of common bugs. Once a program compiles, it is very likely to run without any issues in production, assuming no logical errors in the code.

Rust has its ugly side too, especially if you bring typical paradigms from other OOP languages with you. I will reserve that point for when I introduce Rust properly in future articles. For now, I will concentrate on Rust for data processing.

Woes of using Python…

After the dominance of R, Python now seems to be ubiquitous in the machine learning field. I absolutely adore Python, but everyone at some point has cursed Python for its snail-like speed. Most machine learning and deep learning libraries in Python are actually written in C, Fortran (NumPy is written in both C and Fortran), or C++, so they typically provide very good performance. The problem lies in the data preprocessing step. As long as you are dealing with small data sets, Pandas and co. work like a charm. When the data size increases, people take different routes to mitigate the performance issue.

This is even more annoying in the prototyping stage, where you cannot guess the runtime of certain groupby or merge operations and waste 1–2 hours waiting for them to finish, with no sign of them wrapping up. So people generally resort to Cython or PyPy (which might not be applicable in many situations) to speed things up. Believe me, due to rapid development pressure or a lack of production-level expertise among many ML developers, a lot of this code intended for prototyping ends up in production (another big topic, reserved for another post). Some abandon plain Python and move to Spark, or its lesser-known cousin Dask within Python itself, spending money on cluster infrastructure. This is the place where I feel Rust shines and complements Python quite nicely.

Introducing Rust…

I had been learning Rust for a long time out of sheer interest, but initially I never used it for office work. Then one day an interesting situation came up, and I happened to introduce Rust to my colleagues.

We were analyzing a GitHub data dump. We had to join a bunch of CSV files in that dump, starting with the following four:

  • project_topics.csv — 8.3 MB
  • project_languages.csv — 5.1 GB
  • projects.csv — 19.4 GB
  • watchers.csv — 4.1 GB

Operations include:

  • Group project_topics by the “repo_id” column and aggregate the set of topics
  • Left join the previous result with project_languages on “repo_id” and aggregate the set of languages
  • Left join the previous result with projects on “repo_id” and get the primary language
  • Left join the previous result with watchers on “repo_id” and aggregate the number of watchers

The first two operations can be done quite comfortably in Pandas on a machine with 16 GB of RAM. For some strange reason, which we won’t get into here, peak memory usage went up to 10 GB while joining with project_languages, even though the file itself is just 5 GB.

For the remaining operations, Pandas is obviously not suitable, so we decided to use PySpark.

Here I got an opportunity to demonstrate the usefulness of Rust. I jovially challenged my colleague: before you finish writing your Spark script and get the result out of it, I will finish writing my Rust script and get the result.

Here is the PySpark script:

Here is the Rust program I wrote:
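(The script itself was embedded as a gist. In case it does not render, here is a minimal, self-contained sketch of its overall shape, covering only the first two operations. The struct and field names are hypothetical, not the ones from the original script.)

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical aggregate record for one repository.
#[derive(Default)]
struct Aggregate {
    topic_set: HashSet<String>,
    language_set: HashSet<String>,
}

// Step 1: group project_topics rows (repo_id, topic) into sets of topics.
fn aggregate_topics(rows: &[(u64, &str)]) -> HashMap<u64, Aggregate> {
    let mut agg: HashMap<u64, Aggregate> = HashMap::new();
    for (repo_id, topic) in rows {
        agg.entry(*repo_id)
            .or_insert_with(Aggregate::default)
            .topic_set
            .insert((*topic).to_owned());
    }
    agg
}

// Step 2: left join project_languages rows (repo_id, language) onto the
// previous result, aggregating the set of languages per repository.
fn join_languages(agg: &mut HashMap<u64, Aggregate>, rows: &[(u64, &str)]) {
    for (repo_id, language) in rows {
        if let Some(entry) = agg.get_mut(repo_id) {
            entry.language_set.insert((*language).to_owned());
        }
    }
}

fn main() {
    let topics = [(1, "machine-learning"), (1, "rust"), (2, "web")];
    let languages = [(1, "Python"), (3, "Go")]; // repo 3 is dropped by the left join
    let mut agg = aggregate_topics(&topics);
    join_languages(&mut agg, &languages);
    println!(
        "repo 1: {} topics, {} languages",
        agg[&1].topic_set.len(),
        agg[&1].language_set.len()
    );
}
```

The remaining joins follow the same pattern: read the next file and fold each row into the HashMap.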

Here is the time taken by both scripts to run:

Spark time:

real    24m38.163s
user    0m0.335s    // time couldn't capture the JVM process user stat
sys     0m0.088s

Rust time:

real    2m52.699s
user    2m34.125s
sys     0m12.078s

After I got my result, we still had to wait 15 minutes for the result from the Spark script. It took me a little more than 5 minutes to finish writing the Rust code. The line count might look bigger in the Rust code compared to the Spark code, but most of the lines are very similar functions with slight changes. The Rust code took a mere 2 minutes and 52 seconds to get the result, compared to Spark’s 24 minutes. I haven’t done any optimization in this code; it is as straightforward as it can be.

Disclaimer: I had already used Rust in similar situations multiple times, so I had a very clear idea of what to do before even starting. I even had the required templates and Cargo.toml file ready. That’s why it took me so little time to finish this script. For others it might take longer. But to be fair, learning Spark and Pandas also takes time, so I feel this is a fair comparison.

Surprising conciseness of Rust…

As you can see from the Rust program, it is fairly straightforward. I used HashMap, the equivalent of Python’s dictionary, for aggregation, and HashSet for building the sets of topics and languages. I read the files one by one, performed the required operation, and passed the hash map on to the next function.

The following is the core part of the aggregation. It is actually very simple to follow.

agg                              // the HashMap
    .entry(id)                   // get the entry with key `id`
    .or_insert(aggregate::new()) // if the entry is not present, insert an empty structure
    .topic_set                   // get the topic_set field from the aggregate structure
    .insert(topic);              // insert the topic into the topic_set

Similar code in Python looks like this:

agg = dict()
if id in agg:
    agg[id]["topic_set"].add(topic)
else:
    agg[id] = {
        "topic_set": set(),
        "language_set": set(),
        "primary_language": None,
        "watchers_count": 0,
    }
    agg[id]["topic_set"].add(topic)

As you can see, Rust is pretty pleasant to write, with very good high-level abstractions, except for the “&” and the weird “to_owned” sprinkled here and there. There are many places where I used “unwrap()”. That is not idiomatic Rust: we are letting the program exit whenever an error occurs. I used it here for the sake of an easy introduction, but it is always good practice to handle errors properly. These concepts are a core part of learning Rust, and I will get to them in subsequent articles.
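To illustrate the difference (a sketch, not taken from the original script): here is a field being parsed with unwrap(), which aborts on bad input, versus matching on the Result to decide what a malformed row should mean.

```rust
fn main() {
    // Quick and dirty: unwrap() aborts the program if the field is malformed.
    let good: u64 = "42".parse().unwrap();

    // Proper handling: match on the Result and decide what a bad row means.
    let bad: u64 = match "not-a-number".parse() {
        Ok(v) => v,
        Err(_) => 0, // e.g. skip or default the malformed row instead of crashing
    };

    println!("{} {}", good, bad);
}
```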

Rust is a statically typed language, unlike Python, which means you have to specify types when writing code. But unlike Java and C, Rust has type inference, which deduces the type wherever possible. As you can see in the code, I specified types only in function signatures; in function bodies, Rust infers the necessary types itself. This gives Rust programs a very pleasant feel in general.
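Here is a small standalone example of that (the function and names are mine, for illustration): types appear only in the signature, and everything inside the body is inferred.

```rust
use std::collections::HashMap;

// Types appear only in the signature; the body carries no annotations.
fn count_words(text: &str) -> HashMap<&str, u32> {
    let mut counts = HashMap::new(); // inferred: HashMap<&str, u32>
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1; // the 0 is inferred as u32
    }
    counts
}

fn main() {
    let counts = count_words("rust is fast and rust is fun");
    println!("{}", counts["rust"]);
}
```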

If you still can’t understand some parts of the code, let me know in the comments. I will try to address these concepts in the upcoming series.

Is it worth it…

You might ask: do I have to learn Rust for a mere 15-minute difference? I would say this difference adds up as you move to even bigger data sets. You might not always get a Spark cluster, especially in the prototyping stage, yet sometimes you will be forced to deal with huge data. In that case, instead of going for C or C++, which scare away most Python developers I have seen, I think Rust provides a much better alternative. In addition, if you want to write a high-performance library for Python, the most popular choice so far is C. But with Rust’s excellent FFI capabilities and its amazing abstractions and ecosystem, you can be much more productive while writing libraries.

When you actually look at it, the Rust program is approximately 8.5x faster. The difference becomes more profound as the size of the data set increases. Let me show you another example that highlights this.

I got a bunch of data to analyze for a POC I was working on: 4 tar files, collected over 4 days, worth 5.5 GB. When I extracted them, they turned out to be a whopping 128 GB; obviously there was a lot of redundancy in those files. I decided to analyze the first day’s data, which was the smallest, with 43 files adding up to just 9 GB. I can’t share the data details for obvious reasons, but I can give the gist of the problem.

There were 9 columns in each CSV file. I had to group by 3 columns, window the timestamp column into 30-minute buckets, and sum another column. If it were just a groupby, I could have used Pandas’ chunked reading and it would have been really fast. But some join operations were also involved, so I had to go for other options. Again, my choices were Spark and Rust.

Let’s compare the code and results (you can try to replicate the results by producing mock data for the following code):

Spark code:

Rust code:
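(Again, the code was embedded as a gist. In case it does not render, here is a dependency-free sketch of the core idea: bucket each timestamp to the start of its 30-minute window, then group and sum. The String triple stands in for the three grouping columns, which I can’t disclose.)

```rust
use std::collections::HashMap;

const WINDOW_SECS: u64 = 30 * 60; // 30-minute windows

// Bucket a Unix timestamp to the start of its 30-minute window.
fn window_start(ts: u64) -> u64 {
    ts - ts % WINDOW_SECS
}

// Group rows by (three key columns, window start) and sum the value column.
fn window_sum(
    rows: &[(String, String, String, u64, f64)],
) -> HashMap<(String, String, String, u64), f64> {
    let mut agg = HashMap::new();
    for (a, b, c, ts, value) in rows {
        let key = (a.clone(), b.clone(), c.clone(), window_start(*ts));
        *agg.entry(key).or_insert(0.0) += *value;
    }
    agg
}

fn main() {
    let rows = vec![
        ("x".to_string(), "y".to_string(), "z".to_string(), 100, 1.5),
        ("x".to_string(), "y".to_string(), "z".to_string(), 1700, 2.5),
    ];
    // Both timestamps fall into the window starting at 0.
    let agg = window_sum(&rows);
    println!(
        "{}",
        agg[&("x".to_string(), "y".to_string(), "z".to_string(), 0)]
    );
}
```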

Spark time:

real    20m8.024s
user    0m0.304s
sys     0m0.051s

Rust time:

real    1m37.523s
user    4m11.576s
sys     0m9.232s

Again, a 12x speedup on my machine compared to Spark. The logic is trivially parallelizable, so I used the amazing Rayon library’s “par_iter()” to parallelize the process. That’s why the “real” time is less than the “user” time. There are a bunch of ways to optimize Spark to make it a bit faster, but even then the results were just as astonishing. Now imagine the time difference for the 128 GB of data as a whole: assuming linear scaling, 23 minutes vs 286 minutes.
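Rayon’s par_iter() distributes the per-item work across a thread pool with essentially a one-line change. The same idea, sketched with only the standard library (std::thread::scope) and the file processing replaced by a dummy per-chunk sum, looks roughly like this:

```rust
use std::thread;

// Split the input into chunks and process each chunk on its own thread,
// mimicking what rayon's par_iter() does for us automatically.
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let n = n_threads.max(1);
    let chunk_size = ((data.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("{}", parallel_sum(&data, 4)); // 5050
}
```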

Note:

A few things to note. When running on an HDD, the Rust code was a bit slower. For example, see the time for the first Rust program when the files were read from an HDD:

real    5m7.349s
user 4m17.373s
sys 0m4.444s

The “real” time is noticeably higher than “user + sys”, which tells us that a lot of the time the CPU was simply waiting for file I/O; the disk could not keep up with the processing speed. The process here is I/O bound. Here is the htop screen of the Rust process.

htop screen while running Rust code

As you can see, CPU utilization for the Rust process was only around 30%. By switching to an SSD, the “real” time became almost equal to “user + sys” and the process became CPU bound. The same speedup was not observed for the Spark code, however; its CPU utilization was already saturated even on the HDD.

htop screen while running Spark code

Final thoughts…

My point is not that you should always use Rust instead of Spark. Spark is a very powerful beast when it comes to distributed processing. But for data that fits comfortably on a single machine, C, C++, or Rust is most likely going to be much faster. To overcome the overheads introduced by distributed frameworks, you would need a lot of machines just to match the speed of single-machine code. A similar sentiment has been expressed by Frank McSherry in his article; please go through it, as it provides interesting insights.

In terms of productivity, you can see that the Rust code is pretty easy to write. For typical join and groupby operations, the logic is going to be very similar to what we saw earlier, and you can reuse the same code in many situations.

In my opinion, Rust is far more approachable for people coming from high-level programming languages than C/C++ is. With proper abstractions, it makes writing low-level code as easy as in a high-level language. In the following series of articles, I will try to introduce Rust with a focus on productivity and on avoiding the common pitfalls and desperation that usually accompany learning Rust for the first time.