Spark versus cuDF and Dask

Amilton Pimenta
Published in DataLab Log
Dec 19, 2019 · 5 min read

When working with large amounts of data, we often spend most of our time analyzing and preparing it. The purpose of this article is to compare the performance of two technologies that are widespread in the big data universe: I will run the same commands on Spark and on cuDF to see which is faster.

Apache Spark is a general-purpose cluster computing system that delivers speed through in-memory computation. cuDF, on the other hand, runs on the GPU: whereas a CPU uses a few cores focused on sequential serial processing, a GPU has thousands of smaller cores designed for massively parallel work.

Environment

For this test, I'll use a Cloudera Hadoop environment with six data nodes running Spark 2.3, and new servers we recently purchased with Tesla V100 GPU cards.

Spark Environment
Nvidia Environment

To get started, I prepared a dataset with just over 150 million rows and a few columns. The purpose here was to have a sizable database to generate some numbers.

Parquet files are much smaller than delimited text files.
CSV and Parquet files

I stored this dataset in two formats, Parquet and CSV, to evaluate the early reading stage, when data first arrives from the various source systems. It is well known that the Parquet format brings many advantages, but we do not always receive data in it; it is very common to receive raw data as delimited text files.
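As an illustration, a rough sketch of how the two copies might be written with Spark (the DataFrame name, output paths and separator are assumptions, not the actual ones used in the tests):

```python
# Hypothetical sketch: writing the same ~150-million-row dataset in both formats.
# "df" and the output paths are illustrative.
df.write.mode("overwrite").csv("/data/benchmark/csv", header=True, sep=";")
df.write.mode("overwrite").parquet("/data/benchmark/parquet")
```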

Reading Data

As a first test, I read the files in both environments and in both formats; the results are below.

Spark Read csv and counting records
cuDF read csv and counting records
Spark read parquet and counting records
cuDF read parquet and counting records
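Roughly, the read-and-count step looks like this in each environment (paths, separator and variable names are illustrative, not the exact ones from the screenshots):

```python
# Spark (PySpark 2.3)
spark_csv = spark.read.csv("/data/benchmark/csv", header=True, sep=";", inferSchema=True)
spark_parquet = spark.read.parquet("/data/benchmark/parquet")
print(spark_csv.count(), spark_parquet.count())

# cuDF (single GPU)
import cudf
gdf_csv = cudf.read_csv("/data/benchmark/file.csv", sep=";")
gdf_parquet = cudf.read_parquet("/data/benchmark/file.parquet")
print(len(gdf_csv), len(gdf_parquet))
```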

Read times do not differ much between the two technologies, since both are bound by disk I/O. Because Parquet files are very compact, they read much faster than CSV files. The new GPU servers also came with SSDs, which speeds up this kind of reading.

The count operation, however, was much faster on the GPU than on Spark: a few milliseconds instead of seconds. Here we start to see the processing power of the GPU.

Group by mean()

The group by operation is used constantly in early analysis. The idea here was to run a few aggregation commands, summarize the data and compare execution times.

Spark group by CSV example
Spark group by parquet example
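The Spark side of this test is roughly the following (the column names "customer_id" and "value" are assumptions for illustration):

```python
# Sketch of the Spark group by + mean aggregation; column names are illustrative.
from pyspark.sql import functions as F

spark_csv = spark.read.csv("/data/benchmark/csv", header=True, sep=";", inferSchema=True)
spark_csv.groupBy("customer_id").agg(F.mean("value").alias("mean_value")).show(10)
```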

However, when I tried to do the same in the cuDF environment, I ran into problems.

cuDF group by csv example
cuDF group by csv kernel restarting

The same error occurs with the Parquet DataFrame, so I won't repeat it here.
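For reference, the failing single-GPU call was roughly this (the column name is illustrative); with a little over 150 million rows the kernel died, presumably because a single 32 GB GPU cannot hold the data plus the aggregation intermediates:

```python
# Single-GPU cuDF attempt; this is the call that crashed the kernel in my tests.
import cudf

gdf = cudf.read_csv("/data/benchmark/file.csv", sep=";")
gdf.groupby("customer_id").mean()
```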

All was not lost: I could try Dask, since this environment has two GPUs with 32 GB of memory each.

Local CUDA Cluster
dask cudf group by csv example
dask cudf group by parquet example
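A sketch of the multi-GPU setup, assuming the dask_cuda package is installed (paths and column names remain illustrative):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster()   # starts one worker per visible GPU (two V100s here)
client = Client(cluster)

# The same group by, now partitioned and spread across both GPUs.
ddf = dask_cudf.read_csv("/data/benchmark/file.csv", sep=";")
mean_by_group = ddf.groupby("customer_id").mean().compute()
```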

Summary of execution times

Execution times for the mean command

Group by Max()

Having confirmed that Dask makes it possible to run group by operations on the same dataset that was used in Spark, I ran a few more commands.

Spark group by max operations
Dask cuDF group by max csv
Dask cuDF group by max parquet
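The max aggregation follows the same pattern in both environments (column names are illustrative, and the Spark DataFrame and Dask cluster are assumed to have been set up as in the earlier sketches):

```python
from pyspark.sql import functions as F

# Spark
spark_parquet.groupBy("customer_id").agg(F.max("value").alias("max_value")).show(10)

# Dask + cuDF
ddf_parquet = dask_cudf.read_parquet("/data/benchmark/parquet")
max_by_group = ddf_parquet.groupby("customer_id").max().compute()
```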

Summary of execution times

Execution times for the max command

Conclusion

Processing data on a GPU is clearly much faster than on a CPU, but we have to consider data volumes and actual needs. Data scientists and engineers are known to spend a lot of time preparing data before it ever reaches a machine learning model. In my tests, execution times improved a lot, but I also ran into problems with the volume of data.
While writing this article and hitting problems with cuDF, I discovered Dask and saw that those scalability issues can be solved with that framework. GPU servers are much more expensive than Hadoop servers when compared individually, but I believe significant time and cost savings are possible when GPU servers are used properly. cuDF is evolving every week, and in a short time we will have simple ways to process large volumes of data on it.

References

[1] https://rapids.ai/start.html

[2] https://docs.rapids.ai/start

[3] https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html

[4] https://rapidsai.github.io/projects/cudf/en/0.10.0/dask-cudf.html

[5] https://docs.dask.org/en/latest/dataframe.html

[6] https://spark.apache.org/docs/2.3.0/

[7] https://data-flair.training/blogs/apache-spark-ecosystem-components/

[8] https://www.tomshardware.com/reviews/gpu-graphics-card-definition,5742.html
