FastSpark: A New Fast Native Implementation of Spark from Scratch
TL;DR: Here is the code to explore.
It all started during my hobby research on various distributed schedulers and distributed computing frameworks. Naturally, Spark came into the picture. I was already somewhat familiar with Spark internals, having used it for over 3 years. It struck me then that one of the primary reasons it became hugely successful is not just its speed and efficiency; it is its very intuitive APIs. This is the same reason Pandas is extremely popular. Otherwise, if raw performance were the only concern, there are arguably better alternatives, such as Flink, Naiad, or HPC frameworks like OpenMP.
What I like most about Spark is that it is a general-purpose distributed framework. When you deal with unstructured data or more complex tasks, RDDs feel very natural. These days Spark is all about DataFrames and SQL, and they are universally preferred over RDDs. DataFrames almost always deliver better performance than RDDs, even though RDDs are the building blocks of the Spark ecosystem. How can that be? Is it the query optimizer? Take the simplest of queries: you should be able to define the optimal data and compute flow yourself using RDDs, yet the DataFrame implementation will most likely beat yours. The reason is the magic happening inside the Tungsten engine.

Spark relies on RAM for performance, and the JVM becomes resource-hungry very quickly under Spark's typical workloads. Tungsten circumvents this by managing raw memory directly through "sun.misc.Unsafe". The implication is that the DataFrame API is not as flexible as RDDs: it can deal only with a fixed set of predefined data types, so you cannot use arbitrary structs/objects in DataFrames and operate on them. Most real-world analytical workloads can be expressed within that constraint, but there are still use cases where RDDs are more convenient to work with.
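To make the flexibility argument concrete, here is a minimal, hypothetical sketch of an RDD-like abstraction in Rust: arbitrary element types flow through user-supplied closures, which is exactly what a Tungsten-style engine with fixed column layouts gives up. The names (`Rdd`, `Event`) are illustrative only, not the actual FastSpark or Spark API, and real RDDs are lazy and distributed rather than an eager in-memory vector.

```rust
// Hypothetical, minimal RDD-like abstraction: any element type T works,
// unlike a DataFrame engine restricted to a fixed set of column types.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    user: String,
    payload: Vec<u32>, // arbitrary nested data, awkward to model as flat columns
}

// A toy "RDD": just a vector plus a few combinators (real RDDs are lazy
// and partitioned across machines; this sketch is eager and local).
struct Rdd<T>(Vec<T>);

impl<T> Rdd<T> {
    fn map<U>(self, f: impl Fn(T) -> U) -> Rdd<U> {
        Rdd(self.0.into_iter().map(f).collect())
    }
    fn filter(self, f: impl Fn(&T) -> bool) -> Rdd<T> {
        Rdd(self.0.into_iter().filter(|x| f(x)).collect())
    }
    fn collect(self) -> Vec<T> {
        self.0
    }
}

fn main() {
    let events = Rdd(vec![
        Event { user: "a".into(), payload: vec![1, 2, 3] },
        Event { user: "b".into(), payload: vec![] },
    ]);
    // Closures over arbitrary structs: the RDD-style flexibility.
    let totals: Vec<(String, u32)> = events
        .filter(|e| !e.payload.is_empty())
        .map(|e| (e.user, e.payload.iter().sum()))
        .collect();
    println!("{:?}", totals); // prints [("a", 6)]
}
```

The point of the sketch is that `map` and `filter` are generic over any `T`, so the engine cannot assume anything about memory layout; a DataFrame engine trades that generality for tightly packed, schema-known rows it can manage off-heap.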
I got a project idea: test the feasibility of implementing Spark in a native language and, if feasible, explore how efficient it can be in terms of performance and resource management. I know that Spark has been heavily optimized over the years, so I didn't hope for any drastic difference in performance; if there was a difference, it would most likely be in RAM usage. I also wanted it to be very general-purpose, just like Spark. I decided to use Rust for the implementation. Well, there aren't many alternatives out there. C++ is, well…