BigQuery vs Synapse — who is currently winning the battle?

Photo by Mateusz Wacławek on Unsplash

Data Warehouse — a place where the magic happens. A place where you can throw massive amounts of data from various origins, write some SQL and draw meaningful conclusions — and this is all without stressing out about your CPU power.

The definition of a data warehouse evolved tremendously in the last ten years. Companies started to give up the traditional, obsolete service-oriented architectures and joined the rush movement into cloud computing. As part of this change, such products as Google BigQuery, Azure Synapse and Amazon Redshift appeared on the market, offering extensive data analysis for a reasonable price.

To not become a believer of one tool, it is sometimes worthwhile to revisit the market, and compare your main weapon with the other alternatives. While working on projects for a wide range of clients, where each one has different needs and preferences, I have recently had a chance to dive deeper into Synapse. Upon project completion, I have decided to compare it to BigQuery, where my previous experience was based, and share some thoughts based on what I discovered/observed. Taking into respect multiple categories, we will try to answer the question — is one product surpassing the other in 2022?

Cost

So let’s start with what the most individuals want to know first — the price. And although predicting it with a fine-toothed comb is virtually impossible, the choice of the used pool type is going to influence the end price the most, therefore I am going to focus on this aspect.

NOTE: Wanting to make the costs clear and transparent to everybody, this section turned out to be the longest. Having in mind not everybody is interested in such thorough descriptions, I have prepared a shortened version of costs in form of a table, after which you are officially allowed to move to the next section.

* Prices with respect to London’s region i.e. ‘uk-south’ in Azure and ‘europe-west2’ in GCP

For the ones wanting to dive deeper and know all the details — here you are!

DEDICATED POOL - Synapse:

In this option, we decide to pay in advance for the specific pool performance we want to use. The power however, is very flexible and can be adjusted to the current need anytime. The hourly cost of the smallest cluster — DW100c is going to cost us in this moment £1.40, while such a monster as DW30000c will be even £421.14! Charges are only applied when the cluster is up and running, therefore, forgetting to pause it when is idle may cost you a pretty penny. Using Azure functions, we can automate that process, pausing compute instances at a fixed time or after a specified amount of time.

SERVERLESS POOL - Synapse and BigQuery:

Serverless is another possibility our queries can run. It is the native environment for BigQuery, which was designed and built from the beginning as the serverless platform, whereas in Synapse it is an option. The payment models of this mode are very similar in both services and are very cost-efficient. Instead of paying for the resources deployed, you pay per-query — and the more computationally heavy it is, the higher amount you need to pay. In this option you do not care about pausing clusters, as the charge is only made after executing a query. The price of processed TB is slightly cheaper in Synapse at £5, when its £6.25 in BigQuery. Both services offer the first TB of processed data for free and the quota renews every month! This benefit, however, will be dropped out from Synapse in the end of December 2022.

The other cost we need to add on to both services is saving and storing the tables. In case of BigQuery tables are saved internally, which means any data used for analysis is priorly imported directly into BigQuery. Synapse on the other hand does not support internal tables in the serverless option, and tables are saved externally on Data Lake Storage Gen2. The advantage of internal tables is that running queries against them (especially if tables are very large) is measurably faster in comparison to the external tables. The disadvantage is the slightly higher price; 1GB of stored data in BigQuery is going to cost monthly ~£0.023, whereas in Synapse ~£0.016. However, the pricier service allows storing the first 10GB of data per month charge-free. Both cases assume the table is stored in the closest region (in this case, London), is frequently accessed and modified.

ALTERNATIVE PAYMENT OPTIONS - Synapse and BigQuery:

For those whose pockets are not the bottomless pits and want to have a greater control over the bills, both services offer something to prevent ending up a bankrupt. In BigQuery you can opt for flat-rate pricing option and pay for the purchased amount of BigQuery slots (processing power) on the annual, monthly or flex commitment plans. Synapse on the other hand offers ‘safe limits’ … which cause a Budget Exhausted error when crossed.

Trial period:

  • The trial period in Synapse lasts 30 days, during which we can enjoy the free $200 credit given us by Microsoft to explore all of the tools Azure has to offer. Some of them remain free up to one year. That looks even better in Google, where the new users receive free $300 credit valid for all of the cloud tools for 90 days!

Student pricing plan:

  • Azure offers students a very attractive option to use the cloud, giving them $100 credit, even without the need to provide a credit card! Although Google offers students their qwiklabs where they can learn the fundamentals of cloud computing, there is not really an option for more experienced young users wanting to create and manage their own projects at a reduced price.

Summing the costs section up, the pricing plan in BigQuery looks slightly more attractive from the one in Synapse, therefore BigQuery is the winner in this category.

Stats so far: Synapse: 0, BigQuery: 1

Speed

This factor is the toughest to measure as it depends on plenty of factors to be considered and in the end, probably everybody would need to do their own benchmark. Mine was based at the serverless option, as this is the one we run the most projects on in Datasparq. What I was mainly interested in was, how the tool deals with the complex operations on big datasets and conversely — how long does a simple SELECT * take?

To check this out, I started big and ran a query to first create an artificial 120 million row dataset based on CROSS JOINs, then ran a rolling window on it and finally save the result table. This is the operation I was worried about the most in Synapse because first — it is very computationally heavy, and second — unlike BigQuery, Synapse does not allow users to save tables internally and forces saving them externally in .parquet files. My fears however, turned out to be totally unfounded, and Synapse surprised me very positively, completing the operation in a very similar time to BigQuery. The only major speed difference I have spotted between these two services was running the SELECT * statement at the newly created datasets. When BigQuery was repeatably completing the operation within 15–20 seconds, the process of listing rows in Synapse was going on even 10 minutes after BigQuery was done!! I found this circumstance very interesting, as contrary to my guesswork, saving results to the external tables was taking around 30 times shorter than producing the result-set.

Concluding, we do not need to worry about our increasing dataset, in either of these services. Both of them can perform complex operations on a large amount of data very smoothly. However, not both of them can handle retrieving rows to the result tables equally well, therefore the speed classification belongs to BigQuery.

Stats so far: Synapse: 0, BigQuery: 2

Simplicity

Leaving the underlying technology of the tool behind, one of the most detrimental parameters the team should be thinking of is how quickly the final solution can be implemented and delivered. That is why simplicity turns out to be a very deterministic component with a high weight.

Realistically, it does not take more than 5 minutes to set up BigQuery. What we need to do to start using BigQuery is to create a bucket in Google Cloud Storage (GCS) and then load files into it. Afterwards, we can directly head up to BigQuery to create a dataset and a table under which we can point our file located in GCS. This is the bookish way, however, BigQuery offers a more lazy option as well being able to upload the file directly there, point the file located on GDrive, or even on the solutions of competitors such as Amazon S3 or Azure Blob storage!

To run the first query in Synapse, on the other hand, we are required a prior set up of a resource group, storage account and finally a Synapse Workspace, which altogether takes considerable longer. The point goes to BigQuery.

Stats so far: Synapse: 0, BigQuery: 3

Running queries in Python:

At some stage, while building a data engineering pipeline we reach to the point where running queries needs to be automated. From the abundance of scripting languages we most likely choose Python, thanks to its simplicity and a great support of Data Engineering and Data Science tools.

Running SQL queries against BigQuery from the Python perspective is very simple and is possible through the BigQuery Python client running the query() method just after specifying the project, dataset and table names.

The same operation is much more complicated to achieve in Synapse and requires a prior set-up of ODBC drivers, which is different for every UNIX architecture. BigQuery receives another point from me.

Stats so far: Synapse: 0, BigQuery: 4

Syntax/limitations:

The immediate difference we spot right away is — SQL in BigQuery does not equal SQL in Synapse. The default query language in the latter one is Microsoft Transact-SQL (T-SQL), but if you want to write queries in Kusto Query Language (KQL), you are more than welcome. BigQuery, on the other hand bases upon Google Standard SQL dialect, which ceaselessly since 2016 is a standard.

This all involves the small changes like the way of defining BOOLEAN types (BIT (1, 0) in Synapse, BOOLEAN (TRUE, FALSE) in BigQuery), through the lack of few useful methods existing only in BigQuery (such as EXCEPT, REGEXP_REPLACE, USING, etc) up to the way of saving external tables.

From the perspective of a BigQuery user, a lack of these methods is already noticeable in an attempt of selecting every column apart from THIS ONE. Let’s consider this case: your source table has 250 columns, but within your CTE, you only want to list 230 of them. Doing this in GCP would look as simple and neat as this: SELECT * EXCEPT(column_1, column_2, column_3, …, column_20) FROM source_table.

Absence of an EXCEPT method (nor its equivalent) forces us to list all of them manually within the query. To not make it sound that dramatic, Synapse has a feature listing them down for us (right click on the source file, and choosing “Select TOP 100 rows” ). As all 250 columns were selected, we now need to manually remove the 20 we are not interested in. Sadly, such a bypass does not work within the CTE, and either we have the patience level above the average human being listing all of the columns manually, or we need to think of some smarter solution I have not found yet.

Many people will find the syntax of SQL in BigQuery more user-friendly, as in contrast to Synapse, it hides the underlying complexity and users can focus on writing the actual SQL straight away. For example — if we want to create an external table from the CSV file in BigQuery, all we need to do is create a dataset and then run the following statement:

CREATE EXTERNAL TABLE sample_dataset.sample_table OPTIONS (format = ‘CSV’,uris = [‘gs://bucket/our_source_csv_file.csv’])

Synapse on the other hand, requires prior creation of a specific Database and File Format. On top of that, it is mandatory to define schema within the CREATE statement. And although Synapse can write them for you (when you click on the name of the .csv file within the Data tab), all these fields will require supervision and probably modification to the desired filed type and length. In the scenario of having a big number of columns, things are getting pretty exhausting again.

CREATE EXTERNAL TABLE [sample_dataset].[sample_table]([Field1] nvarchar(256),[Field2] nvarchar(256),[Field3] bigint,(...)[Field70] float)
WITH (LOCATION = 'DummyData.csv',DATA_SOURCE = [sample_datasource_dfs_core_windows_net],FILE_FORMAT = [CSVFormat])

Although this kind of granularity may seem redundant, and only complicate something that could have been done easier, it also offers more customisability options, the practised user can take advantages from. If only Synapse was not lacking the Except statement, I would grant both services by one point, however, from the perspective of BigQuery user, this lack causes just too much inconvenience.

Stats so far: Synapse: 0, BigQuery: 5

Summary:

Losing the scoring competition, Synapse seems like a bad choice to start working with, which it’s not. It’s still very powerful tool with a decent speed that can be used with a great effect in various data analysis operations. With BigQuery, however, you can achieve the same thing quicker, simpler and probably cheaper. Nevertheless, unless you are an individual, the choice you make, will most likely be determined on the IT architecture your company already use..

For more insight follow datasparq-technology, and find us on LinkedIn, Twitter, and our website!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store