BigQuery, Spark or Dataflow? A Story of Speed and Other Comparisons

Sourygna Luangsay
Appsbroker CTS Google Cloud Tech Blog
14 min read · Feb 10, 2023

Are you starting a new project soon and wondering which of these technologies you should use?

How do these processing engines compare when it comes to execution speed and cost?

And if you are a developer, what are the other considerations you should be aware of?

In this post, I use the TPC-DS standard benchmark to make a fair comparison between BigQuery, Spark (on Dataproc Serverless) and Dataflow. Hopefully this content will help you choose the tool that best fits your use case and your team, so that you can make the most of Google Cloud’s Big Data processing capabilities.

Introduction

Recently, I read a benchmark that surprised me quite a bit. First, because the Apache Beam team is quite honest: Beam/Dataflow is quite a bit slower than Spark, and they don’t try to hide it. Second, because I was not expecting Spark to be that much faster than Dataflow (TL;DR: check slide #51).

As a consultant, I always strive to recommend the best tools to my customers: ones that not only run very fast, but also don’t send your Google Cloud bill through the roof. And because I enjoy developing, I also like to take the “developer experience” into consideration. The Beam benchmark, although quite complete, says nothing about these last two aspects.

So, time to get my hands dirty again and just run the benchmarks myself!

Description of the benchmarks

Time is money they say…
Well, let’s benchmark both of these concepts!

The benchmark setup is as follows:

  • Query #3 of the TPC-DS framework
  • I use the 1TB dataset from the TPC benchmark page (yes, I like Big Data; what is more, I have seen that the other datasets are too small to show meaningful differences between the tools)
  • I compare 3 tools:
    - Apache Beam / Dataflow, written in Java
    - Spark on top of Dataproc Serverless, written in Java
    - BigQuery. I added this one because if you run a SQL query, BigQuery is the obvious candidate. And I really like BigQuery.
  • Everything runs on Google Cloud
  • Both the source and output data are in BigQuery (instead of Parquet files as in the Beam benchmark). As I said, I like BigQuery. So I usually try to centralise all my data there.
  • I not only check the running time of each execution, but also what it costs in Google Cloud
  • I don’t reuse the generic code of the TPC benchmark page. Instead, I wrote more specialised code for Query #3. What better way to talk about the developer experience than developing everything from scratch yourself?
  • I tune the configuration, code and/or query in order to get the best results from a time and cost perspective. When modifying the query, the logical query does stay the same, but I try to influence the “Query Plan”.
  • When computing the costs, I don’t take into account any “low threshold discounts” (similar to “the first 5TB are free” etc)
  • When computing the runtime, I not only include the time to “do the real computation”, but also the time it takes to create the cluster (be it Dataflow or Dataproc) and to shut down the VMs at the end. The startup time to create the cluster is nearly always the same (both for Dataflow and Dataproc Serverless): between 50 and 70 seconds, independently of the different configurations/tunings I have made.
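Keep in mind that this means comparing two quite different pricing models: BigQuery on-demand charges for the amount of data scanned by the query, whereas Dataflow and Dataproc Serverless essentially charge for the compute resources (vCPUs and memory, plus extras such as Dataflow Shuffle) multiplied by the time the job holds them. That is also why the cluster startup and shutdown time matters for the bill.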

First, let’s prepare the source data

No need to generate the test data yourself! The benchmark page nicely shares it from a Google Storage bucket:

gsutil ls gs://beam-tpcds/datasets/
gs://beam-tpcds/datasets/parquet/
gs://beam-tpcds/datasets/text/

Text here means “CSV”. With all the issues I have suffered with the CSV format, I obviously pick Parquet.

When diving deeper into this “parquet” folder, you will see a folder for each dataset size. Let’s pick the 1TB one and load the 3 source tables for Query #3 into a new “tpcds_1TB” dataset:

bq mk -d --data_location=<TODO select a region> tpcds_1TB

bq load --source_format PARQUET tpcds_1TB.store_sales gs://beam-tpcds/datasets/parquet/nonpartitioned/1000GB/store_sales/part*.snappy.parquet
bq load --source_format PARQUET tpcds_1TB.date_dim gs://beam-tpcds/datasets/parquet/nonpartitioned/1000GB/date_dim/part*.snappy.parquet
bq load --source_format PARQUET tpcds_1TB.item gs://beam-tpcds/datasets/parquet/nonpartitioned/1000GB/item/part*.snappy.parquet

Next step: give it a first shot with BigQuery!

A SQL query walks into a bar and sees 2 tables. She walks up to them and asks: “Can I join you?”

The kind of thing we have to invent to make SQL feel fun…

I won’t pass judgement on SQL itself, but I do feel that BigQuery is super fun to work with!

When I have to do an analysis of data, that is always the first tool I will pick! (I have even once tried to find the name of my soon-to-be-born daughter by loading a lot of data into a BigQuery table. That was one of the few times where BigQuery failed me…)

Source: https://timewithai.files.wordpress.com/2019/05/image-8.png?w=1806&h=1016

So, obviously I started this benchmark by running the official TPC-DS Query #3 directly on BigQuery.

Very easy to develop: you just take the official query, copy-paste it, change the name of your BigQuery dataset and you are good to run it:

SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) sum_agg
FROM tpcds_1TB.date_dim dt, tpcds_1TB.store_sales, tpcds_1TB.item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 128
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100

Results:
execution time: 1 second
cost: 0.5 USD

The cost exposed above assumes we run the query with the default “on demand” model.

We can reduce this cost to 0.08 USD if we use BigQuery Flex Slots (yes, 0.08 USD takes into account that you need to commit for at least 60 seconds).
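As a rough sanity check (assuming the on-demand price of about 5 USD per TB scanned that applied at the time): 0.5 USD corresponds to roughly 100 GB of columns actually billed across store_sales, item and date_dim, far less than the full 1TB thanks to columnar storage. The Flex Slots figure follows from the per-second slot pricing and its 60-second minimum: the smallest 100-slot commitment costs about 4 USD per hour, i.e. in the order of 0.07 USD per minute, consistent with the 0.08 USD above.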

What about the developer experience?
Great, I would say. It took me just 10 seconds to add the BigQuery dataset name and get a working query.

Let’s be a bit more critical. SQL does have some downsides:

  • It is not a language that benefits from concepts like “objects, encapsulation, inheritance” etc., which means that creating libraries and reusing code can be quite difficult. Although we must admit that DBT has reduced that pain a lot
  • Unit testing is really missing. Maybe it is because BI people (the ones who usually develop with SQL) don’t have the same coding habits as Java developers. Still, I would really appreciate a good framework to run fast, automated tests in SQL. DBT again can help with testing, and I have managed to write some integration tests against BigQuery in the past (not using DBT, but Airflow…). But I wish we could do that with unit tests instead of integration tests; unfortunately, Google does not provide any official BigQuery emulator. A minimal example of such an integration test is sketched below.
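To give an idea of what I mean, here is a minimal sketch of such an integration test in Java, using the google-cloud-bigquery client and JUnit. This is purely illustrative (not the Airflow-based setup I mentioned): the query is a simplified variant of Query #3 and it assumes the tpcds_1TB dataset created earlier. It needs a real project and dataset to run, which is precisely why it is an integration test and not a unit test:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class Query3IntegrationTest {

  // Simplified variant of Query #3, for illustration only
  private static final String QUERY =
      "SELECT dt.d_year, SUM(ss_ext_sales_price) sum_agg "
          + "FROM tpcds_1TB.date_dim dt, tpcds_1TB.store_sales "
          + "WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND dt.d_moy = 11 "
          + "GROUP BY dt.d_year";

  @Test
  public void simplifiedQuery3ReturnsRows() throws Exception {
    // Runs against the real BigQuery service: slow, billed, and dependent on the data
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(QUERY).build());
    assertTrue(result.getTotalRows() > 0, "expected the aggregation to return at least one row");
  }
}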

Moving on to the tricky part: Dataflow

Out of the 3 tools we compare, Dataflow is the newest one. And it comes with a lot of promises and interesting features on paper. Are these promises fulfilled? Let’s find out!

Before diving into the code, let’s justify why having Dataflow (and later Spark) run a SQL transformation that reads from a BigQuery table and writes back to BigQuery is not a dummy exercise made up just for the purpose of a benchmark.

Or, as it is stated in this article:

I can’t think of a use case that would require using Beam/Dataflow to pull data from BigQuery in order to run SQL over it. You’d be much better off just querying it directly in BigQuery!

I usually agree with the author of this blog, but here I can’t. I think running a SQL query inside Dataflow (even if your input and output are BigQuery) can be quite interesting, because you can extend the query with advanced/complex UDF code. Of course, it is possible to develop a UDF directly in BigQuery, but then you have to code it either in SQL or in JavaScript, with the downsides that SQL can be quite limited, and that more than one developer in the world does not hold JavaScript close to their heart, to put it politely.
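To make this concrete, here is a hedged sketch of how Beam SQL lets you register a Java UDF and call it from the query (the UDF, its name and the shortened query are purely illustrative and not taken from my repository):

import org.apache.beam.sdk.extensions.sql.BeamSqlUdf;
import org.apache.beam.sdk.extensions.sql.SqlTransform;

// Any Java code (including your favourite libraries) can live inside the UDF
public class CleanPrice implements BeamSqlUdf {
  public static Double eval(Double price) {
    return (price == null || price < 0) ? 0.0 : price;
  }
}

// Later, when applying the SQL transform:
// rows.apply(SqlTransform
//     .query("SELECT CLEAN_PRICE(ss_ext_sales_price) FROM PCOLLECTION")
//     .registerUdf("CLEAN_PRICE", CleanPrice.class));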

Enough talking, here’s the code I developed for the benchmark: https://github.com/Cloud-Technology-Solutions/DataflowQuery3

As I said before, I had to do some tuning on the code and configuration in order to get the best outcome. The code above is the version that provides the best results, which are:

Results:
execution time: 395 seconds
cost: 1.28 USD

What about the developer experience?
Lots of things to say here!

Source: https://gonewiththetwins.com/new/good-the-bad-and-the-ugly-1968/

The good:

  • You can launch a Dataflow job just by pushing the “Run” button of your IntelliJ (or your favorite IDE). Something I really appreciate as a developer.
  • All the job metrics are integrated into the (Dataflow) Google Console. You can easily spot how many rows have been processed, what is the longest step of the DAG, how busy were the CPUs etc. Quite useful when you want to tune things!

The bad:

  • I had to turn off the Shuffle Service. It did reduce the execution time, but only by a mere 12%, very far from the numbers usually claimed for this feature. On the other hand, the cost increased a lot when using the Shuffle Service: it was nearly multiplied by a factor of 3!
  • I also saw no benefit at all from using Dataflow Runner v2. On the contrary, my code ran 45% slower with it. I appreciate that it will give a Data Science team (coding in Python) the possibility to integrate their piece of code into the main Java Dataflow pipeline of a Data Engineering team (to be honest, I feel a bit doubtful about the code ownership and responsibility aspects… but let’s just focus on the technological aspect for now). However, if Runner v2 comes with a performance decrease (what is more, it enforces the Shuffle Service, so the cost is also higher), then I guess I’ll be a bit afraid of the latest versions of Apache Beam if this runner becomes the default one.
  • Dataflow does not push down the filters (the “WHERE conditions” of the query) to the BigQuery Storage API. To get that optimisation, I had to duplicate the filters and manually pass them to the Dataflow step where I call the BigQuery Storage Read API, as you can see in the InputReader.getRows() method (a fuller sketch of this read follows the snippet below):
if (filter == null) {
  return reader;
} else {
  // predicate pushdown
  return reader.withRowRestriction(filter);
}
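For context, here is a hedged sketch of the Storage API read that this filter ends up attached to (the table reference, selected fields and row restriction are illustrative, the variable names differ from the actual repository, and imports are omitted):

BigQueryIO.TypedRead<TableRow> reader = BigQueryIO.readTableRows()
    .from("my-project:tpcds_1TB.store_sales")             // illustrative table reference
    .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)  // use the BigQuery Storage Read API
    .withSelectedFields(Arrays.asList("ss_sold_date_sk", "ss_item_sk", "ss_ext_sales_price"));

// The manual "predicate pushdown": the WHERE conditions are passed as a row restriction
PCollection<TableRow> storeSales =
    pipeline.apply("ReadStoreSales", reader.withRowRestriction("ss_sold_date_sk IS NOT NULL"));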

The ugly:

  • The deserialisation of the BigQuery rows is a real pain to code. I am quite happy with the current version of my code, and I have spared you the more complex versions where I was creating dedicated classes for each table, using some inheritance and generics. It took me quite some time to get it working properly, and the Apache Beam documentation hasn’t really helped me here.
    When I say I am quite happy, it’s only because it is much better than my previous versions. The deserialisation code is still very complex: you have to define the Schema for each table, do some proper casting, and then put all the tables in a Tuple. Can’t the framework just get all this metadata directly from BigQuery and do all of this for me? Spark does a much better job here!
  • Speaking about Schema: you cannot use the DECIMAL type. Unless you want to see this wonderful exception:
java.lang.UnsupportedOperationException: Data type: DECIMAL not supported yet!

This means that I had to switch back to the DOUBLE type in the InputReader.getStoreSales() method:

.addNullableField("ss_item_sk", INT64)
// Unfortunately Decimal is not supported!
.addNullableField("ss_ext_sales_price", DOUBLE)

Too bad if you are working for a Finance team and you need exact precision in your computations…

  • The Apache Beam SqlTransform operator does not optimise the join operation (meaning: choosing between hash joins and shuffle joins). I had to manually modify the order of the tables:
-- FROM  date_dim dt, store_sales, item  // creates bad Join performance
FROM store_sales, item, date_dim dt

By simply moving the biggest table (in this case, the “fact” table) to the first position, I decreased the execution time by a factor of 3. Once again, the metadata about the size of each table is easy to get from BigQuery. Why can’t Apache Beam do this optimisation itself?? The sketch below shows how the tables and the reordered query come together in the SqlTransform.
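For reference, a hedged sketch of that assembly (variable names are illustrative and imports are omitted): each PCollection of Rows is registered under the table name used in the SQL, and the reordered query is passed to SqlTransform:

PCollectionTuple tables = PCollectionTuple
    .of(new TupleTag<>("store_sales"), storeSalesRows)
    .and(new TupleTag<>("item"), itemRows)
    .and(new TupleTag<>("date_dim"), dateDimRows);

PCollection<Row> result = tables.apply("Query3", SqlTransform.query(
    "SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, "
        + "SUM(ss_ext_sales_price) sum_agg "
        + "FROM store_sales, item, date_dim dt "  // biggest table first
        + "WHERE dt.d_date_sk = store_sales.ss_sold_date_sk "
        + "AND store_sales.ss_item_sk = item.i_item_sk "
        + "AND item.i_manufact_id = 128 AND dt.d_moy = 11 "
        + "GROUP BY dt.d_year, item.i_brand, item.i_brand_id "
        + "ORDER BY dt.d_year, sum_agg DESC, brand_id LIMIT 100"));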

Last but not least: Spark on Dataproc

So far we have seen BigQuery, and then Dataflow. The last one in our comparison could be nothing other than the most popular Big Data processing engine: Spark!

Spark is often run on top of a Hadoop cluster, and Google Cloud Dataproc offers an easy way to get such a cluster. Last year, Google made this experience even smoother with the introduction of Dataproc Serverless, which is the service I am using for the Spark benchmarks (for more information on Dataproc Serverless, my colleague Ash has written a blog post that I invite you to read; and if you want to try it with Python, you can also look here).

Here is the code I have developed for Spark:
https://github.com/Cloud-Technology-Solutions/SparkQuery3

As for Dataflow, I have launched several runs and done a bit of tuning, and I am presenting here the best results.

Results:
execution time: 148 seconds
cost: 0.17 USD

What about the developer experience?

Source: https://www.howtogeek.com/403438/do-the-people-you-follow-on-social-media-spark-joy/

In general, my experience in developing with Spark has been much more enjoyable than with Apache Beam:

What I liked about Spark:

  • It was much easier and faster for me to develop in Spark than in Apache Beam (even though I have much more experience coding with Beam). For instance, getting the data out of the BigQuery table was straightforward and I nearly just had to do a copy-paste from the Spark documentation:
Dataset<Row> dataset = session
    .read()
    .format(Parameters.READ_FORMAT)
    .load(fullTableName);

dataset.createOrReplaceTempView(table);
  • Spark is fast. Very fast. And also efficient. I was able to run the benchmarks with the 100GB dataset just in “local mode” (i.e. running Spark in my IntelliJ instead of launching it on Dataproc; see the sketch after this list), something impossible with Apache Beam. This really speeds up my coding experience when I want to run quick tests before spending more time deploying my job to the Cloud.
  • Tuning was easy. I did have to tweak the amount of CPU and RAM of the driver and executors a bit to get the best runtime/cost balance, but it was not necessary to disable features as was the case with Dataflow.
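The “local mode” mentioned above is just a matter of how the SparkSession is built; here is a minimal sketch (the app name is illustrative):

SparkSession session = SparkSession.builder()
    .appName("tpcds-query3-local")
    .master("local[*]")  // run Spark inside the IDE, no Dataproc cluster needed
    .getOrCreate();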

Drawbacks I have encountered:

  • You cannot launch the Spark job on Dataproc directly from IntelliJ. You first have to build the JAR, then upload it to GCS and finally run the “gcloud dataproc” command. A small detail, but it makes the experience less enjoyable
  • Dataproc Serverless has a lower degree of integration with the Google Cloud Console than Dataflow, which means getting some metrics is harder or sometimes not possible (for instance, I could not find out from the metrics how many bytes were read from the BigQuery tables). Sure, you can add a standalone “Spark History Server” and get more details, but it means having to maintain a Dataproc cluster (always running? At least every time you want to run some Spark Serverless jobs…) and you would still not get as much information as in the Dataflow console
  • My biggest complaint might be the way the Spark BigQuery connector “overwrites” the data in BigQuery. If the output table already exists (even if there is no data inside), then the BigQuery connector performs some additional SQL “merge” operations on the output table.
    This can be seen in the BigQuery console by using the new “Lineage” feature:

The consequence of that (apart from increasing your costs, because you now run some extra BigQuery queries in addition to your Spark job) is that the final BigQuery table is no longer ordered (even though Spark SQL did apply the ordering specified in the SQL query). I usually don’t worry about the order of the rows in my BigQuery tables (in my view, ordering should be applied at the visualisation layer), but still, I find this implementation quite dirty.
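For reference, a hedged sketch of the write path that triggers this behaviour (the output table and bucket names are illustrative; the exact code lives in the repository linked above):

result.write()
    .format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  // indirect write through a GCS staging bucket
    .mode(SaveMode.Overwrite)                        // overwriting an existing table is where the extra "merge" jobs show up
    .save("tpcds_1TB.query3_results");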

To summarise

If we put all the results together, we get this:

  • BigQuery: 1 second, 0.5 USD on demand (0.08 USD with Flex Slots)
  • Spark on Dataproc Serverless: 148 seconds, 0.17 USD
  • Dataflow: 395 seconds, 1.28 USD

As for the costs, I must admit that I was surprised to see that Spark/Dataproc could be cheaper than BigQuery (if you use the default “on demand” billing mode…).

When talking about the costs, we must not forget the “developer experience” attached to each of these tools: having a team efficient with the chosen technical solution will obviously impact a lot your TCO and your “time to market”.

As for performance, these results are in line with the original benchmarks where we had the following execution times:

  • Spark: between 100 and 200 seconds (sorry for not being precise, but the slide #51 does not allow me to know the exact number)
  • Dataflow: between 1300 and 1400 seconds (to do a fair comparison with my benchmarks, we have to look at the “unlimited workers” execution on slide #53)

We can see that my results show quite an improvement on Dataflow (which runs 3 times faster in my benchmarks than in the original benchmarks). I don’t know if this is due to the fact that I use a more recent version of Dataflow or because my inputs/outputs are BigQuery instead of Parquet files.

I also suspect that my results are better in the case of Spark: looking at the Spark 1GB result on slide #50, I guess their runtime doesn’t take into account the time to create the Hadoop cluster (an EMR cluster in their case, instead of a Dataproc cluster in mine). If I were to use the same approach to compute the “runtime” in my benchmarks, then Spark would run in just 87 seconds instead of 148 (the 61-second difference being the cluster startup and shutdown time mentioned earlier).

Final words

If you had to remember just 3 things out of this long blog post (congrats, you’ve reached the end!), I would propose:

  • When it comes to speed, BigQuery is always the winner: fastest execution and fastest development
  • Spark/Dataproc outperforms:
    - Dataflow in terms of runtime, costs and developer experience
    - BigQuery from a cost perspective (if you use “on demand” BigQuery mode)

However, don’t forget…

Source: https://i.pinimg.com/originals/29/e4/ab/29e4ab462c3f7163d37431f7922ca2b9.png

Don’t trust me!

Better said: do your own research and don’t take the above conclusions for granted.

This blog post just reflects my personal experience with BigQuery, Dataflow and Spark, running specifically TPC-DS Query #3.
You have access to the source code I used, so feel free to challenge these results, or to try one of the 98 other TPC-DS queries.

And if you really want to make the best decision before starting the development of a new project, don’t forget that the best benchmarks are the ones using your own use case and your own data!

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
