Spark versus cuDF and Dask

Amilton Pimenta
Published in DataLab Log
Dec 19, 2019 · 5 min read

When working with large amounts of data, we often spend most of our time analyzing and preparing it. The purpose of this article is to compare the performance of two technologies that are widespread in the big data universe: I will run the same commands on Spark and on cuDF to see which is faster.

Apache Spark is a general-purpose cluster computing system that delivers speed through in-memory computation. cuDF, on the other hand, runs on the GPU: whereas a CPU uses a few cores focused on sequential serial processing, a GPU has thousands of smaller cores designed for massively parallel work.

Environment

For this test, I'll use a Cloudera Hadoop environment with six data nodes running Spark 2.3, and new servers we recently purchased with Tesla V100 GPU cards.

Spark Environment
Nvidia Environment

To get started, I prepared a dataset with just over 150 million rows and a few columns. The purpose here was to have a sizable database to generate some numbers.

Parquet files are much smaller than delimited text files.
CSV and Parquet files

I stored this dataset in two formats, Parquet and CSV, to evaluate the early reading stage, when data first arrives from the various source systems. It is well known that the Parquet format brings many advantages, but we do not always receive data in it; it is very common to receive raw data as delimited text files.
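As an illustration, a rough sketch of how the two copies might be written with Spark (the DataFrame name, output paths and separator are assumptions, not the actual ones used in the tests):

```python
# Hypothetical sketch: writing the same ~150-million-row dataset in both formats.
# "df" and the output paths are illustrative.
df.write.mode("overwrite").csv("/data/benchmark/csv", header=True, sep=";")
df.write.mode("overwrite").parquet("/data/benchmark/parquet")
```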

Reading Data

As a first test, I read the files in both environments and in both formats; the results are below.

Spark Read csv and counting records
cuDF read csv and counting records
Spark read parquet and counting records
cuDF read parquet and counting records
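Roughly, the read-and-count step looks like this in each environment (paths, separator and variable names are illustrative, not the exact ones from the screenshots):

```python
# Spark (PySpark 2.3)
spark_csv = spark.read.csv("/data/benchmark/csv", header=True, sep=";", inferSchema=True)
spark_parquet = spark.read.parquet("/data/benchmark/parquet")
print(spark_csv.count(), spark_parquet.count())

# cuDF (single GPU)
import cudf
gdf_csv = cudf.read_csv("/data/benchmark/file.csv", sep=";")
gdf_parquet = cudf.read_parquet("/data/benchmark/file.parquet")
print(len(gdf_csv), len(gdf_parquet))
```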

Read times do not differ much between the two technologies, since both are bound by disk I/O. Because Parquet files are very compact, they read much faster than CSV files. The new GPU servers also came with SSDs, which speeds up this kind of reading.

The count operation, however, was much faster on the GPU than on Spark: a few milliseconds instead of seconds. Here we start to see the processing power of the GPU.

Group by mean()

The group by operation is used constantly in early analysis. The idea here was to run a few aggregation commands, summarize the data and compare execution times.

Spark group by CSV example
Spark group by parquet example
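The Spark side of this test is roughly the following (the column names "customer_id" and "value" are assumptions for illustration):

```python
# Sketch of the Spark group by + mean aggregation; column names are illustrative.
from pyspark.sql import functions as F

spark_csv = spark.read.csv("/data/benchmark/csv", header=True, sep=";", inferSchema=True)
spark_csv.groupBy("customer_id").agg(F.mean("value").alias("mean_value")).show(10)
```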

However, when I tried to do the same in the cuDF environment, I ran into problems.

cuDF group by csv example
cuDF group by csv kernel restarting

The same error occurs with the Parquet DataFrame, so I won't repeat it here.
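For reference, the failing single-GPU call was roughly this (the column name is illustrative); with a little over 150 million rows the kernel died, presumably because a single 32 GB GPU cannot hold the data plus the aggregation intermediates:

```python
# Single-GPU cuDF attempt; this is the call that crashed the kernel in my tests.
import cudf

gdf = cudf.read_csv("/data/benchmark/file.csv", sep=";")
gdf.groupby("customer_id").mean()
```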

All was not lost: I could try Dask, since this environment has two GPUs with 32 GB of memory each.

Local CUDA Cluster
dask cudf group by csv example
dask cudf group by parquet example
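A sketch of the multi-GPU setup, assuming the dask_cuda package is installed (paths and column names remain illustrative):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster()   # starts one worker per visible GPU (two V100s here)
client = Client(cluster)

# The same group by, now partitioned and spread across both GPUs.
ddf = dask_cudf.read_csv("/data/benchmark/file.csv", sep=";")
mean_by_group = ddf.groupby("customer_id").mean().compute()
```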

Summary of execution times

Execution times for the mean command

Group by Max()

Having confirmed that Dask makes it possible to run group by operations on the same dataset that was used in Spark, I ran a few more commands.

Spark group by max operations
Dask cuDF group by max csv
Dask cuDF group by max parquet
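The max aggregation follows the same pattern in both environments (column names are illustrative, and the Spark DataFrame and Dask cluster are assumed to have been set up as in the earlier sketches):

```python
from pyspark.sql import functions as F

# Spark
spark_parquet.groupBy("customer_id").agg(F.max("value").alias("max_value")).show(10)

# Dask + cuDF
ddf_parquet = dask_cudf.read_parquet("/data/benchmark/parquet")
max_by_group = ddf_parquet.groupby("customer_id").max().compute()
```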

Summary of execution times

Execution times for the max command

Conclusion

Processing data on a GPU is clearly much faster than on a CPU, but we have to consider data volumes and actual needs. Data scientists and engineers are known to spend a lot of time preparing data before it ever reaches a machine learning model. In my tests, execution times improved a lot, but I also ran into problems with the volume of data.
While writing this article and hitting problems with cuDF, I discovered Dask and saw that those scalability issues can be solved with that framework. GPU servers are much more expensive than Hadoop servers when compared individually, but I believe significant time and cost savings are possible when GPU servers are used properly. cuDF is evolving every week, and in a short time we will have simple ways to process large volumes of data on it.

References

[1] https://rapids.ai/start.html

[2] https://docs.rapids.ai/start

[3] https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html

[4] https://rapidsai.github.io/projects/cudf/en/0.10.0/dask-cudf.html

[5] https://docs.dask.org/en/latest/dataframe.html

[6] https://spark.apache.org/docs/2.3.0/

[7] https://data-flair.training/blogs/apache-spark-ecosystem-components/

[8] https://www.tomshardware.com/reviews/gpu-graphics-card-definition,5742.html
