Published in Starschema Blog
Testing SQL Pool Performance in Azure Synapse Analytics

Azure Synapse Analytics (SA) went GA (general availability) last December, introducing a remarkable range of functionality: a PaaS database and a complete platform for data engineers, data scientists and business users, with both SQL and Spark capabilities.

Photo by Tobias Fischer on Unsplash

The feature set of SA is considerably richer than that of a “plain” old Azure SQL Database, but what benefits do we get from the smallest dedicated pool? With this question in mind, I decided to run a test to see if a very moderate-sized pool (DW100c) can compete against an Azure SQL Database scaled to a similar price point.

Of course, I know that SA is a massively parallel processing (MPP) database and Azure SQL Database is not, and I could have made the same technical shootout between an SQL database and the former SQL Data Warehouse, but the new pricing and software bundle have made the comparison even more timely and appropriate. Right now, if you are using an on-demand SQL pool or on-demand Spark cluster, Synapse’s pricing can seem very competitive. In my experience, 400–800 DTU databases are usually the minimum in an enterprise environment, but that’s precisely why I’m interested to see whether the smallest SQL pool can achieve comparable, or at least decent, performance.

So I spun up the databases. For Azure SQL Database, I used a standard 800 DTU tier, which is approximately the same price as the Synapse DW100c instance noted above. I also tried a Premium P2 250 tier, which confirmed my assumption, but more on that later. Together, these give us an idea of how well, or how poorly, the small Synapse dedicated SQL pool performs.

I used Microsoft’s New York Taxicab dataset for the test.

The original code was tailor-made for Synapse, so I just ran the scripts and then ingested some of the tables (Trip, Date and Time) into the SQL database using a pipeline from the Synapse workspace. Sidenote: the data ingestion took around eight minutes in Synapse and 91 minutes in the 800 DTU SQL database — yes, the Trip data was massive and ultimately consumed 32 GiB in the SQL Database as a non-compressed, row-storage table.
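For reference, loading a table into a dedicated SQL pool can be done with a single COPY INTO statement. This is only a sketch: the storage path and credential below are illustrative placeholders, not the ones used in the test.

```sql
-- Load Parquet files from blob storage into the Trip table.
-- The URL and identity below are placeholders.
COPY INTO [dbo].[Trip]
FROM 'https://<storageaccount>.blob.core.windows.net/nyctaxi/trip/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```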

Initially, I did not use any indexes, in order to see the raw performance difference. As you will see, it was immediately obvious that at this data volume there is no real competition between the two (three, counting the Premium tier) databases. I executed each SQL statement multiple times; the times below are averages.
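If you want to reproduce the measurements, the engine’s built-in timing output is enough on the Azure SQL Database side (a sketch; on the Synapse side, elapsed times can be read from the sys.dm_pdw_exec_requests DMV instead):

```sql
-- Report parse/compile and elapsed execution time per statement.
SET STATISTICS TIME ON;

SELECT COUNT(1) AS r_count FROM [dbo].[Trip];

SET STATISTICS TIME OFF;
```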

The first one is a record count:

SELECT COUNT(1) AS r_count FROM [dbo].[Trip]

Synapse gave the 100M-record answer in around three seconds.
The Azure SQL Database @ DTU800 responded in 200+ seconds — the variance was huge, and sometimes it took over 10 minutes to finish. The Premium P2 250, however, came in at around 77 seconds.

SELECT [DateID],
       COUNT(1)
FROM [dbo].[Trip]
GROUP BY [DateID]
ORDER BY [DateID]

Synapse finished in 10 seconds, Azure SQL @ DTU800 in 62 seconds, Premium P2 250 in 51 seconds.

At this point, the comparison felt unfair. As the ingestion times already suggested, standard-tier DTUs are heavily I/O-limited: even when a query runs with decent performance, loading data into a standard-tier database remains a struggle. Premium is somewhat better, but at a significantly higher cost. To level the field, let’s bring out the big weapon for the standard tier — columnar storage — and include those results as well:

SELECT COUNT(1)
FROM [dbo].[Trip] t
LEFT JOIN [dbo].[date] d ON t.dateid=d.dateid

Synapse finished in seven seconds, Azure SQL @ DTU800 in 103 seconds (and in <1 second with columnar storage), while Premium P2 250 finished in 54 seconds.
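The columnar-storage numbers come from converting the heap table into a clustered columnstore index, which in Azure SQL Database is a one-statement change (the index name is illustrative):

```sql
-- Compress and reorganize the Trip heap into columnar storage.
-- This also shrinks the on-disk footprint considerably.
CREATE CLUSTERED COLUMNSTORE INDEX cci_trip ON [dbo].[Trip];
```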

SELECT COUNT(1)
FROM [dbo].[Trip] t
LEFT JOIN [dbo].[date] d ON t.dateid=d.dateid
LEFT JOIN [dbo].[time] pcktm ON t.pickuptimeid=pcktm.timeid
LEFT JOIN [dbo].[time] drptm ON t.dropofftimeid=drptm.timeid

Synapse finished in 44 seconds, Azure SQL @ DTU800 in 266 seconds (and in around three seconds with columnar storage). Premium P2 250 finished in 60 seconds.

SELECT pcktm.hournumber,
       AVG(CAST(t.TripDurationSeconds AS BIGINT)) / 60 AS Avg_tripduration_min
FROM [dbo].[Trip] t
LEFT JOIN [dbo].[date] d ON t.dateid=d.dateid
JOIN [dbo].[time] pcktm ON t.pickuptimeid=pcktm.timeid
GROUP BY pcktm.hournumber
ORDER BY pcktm.hournumber

Synapse finished in seven seconds, Azure SQL @ DTU800 in 60 seconds (and four seconds with columnar storage) and Premium P2 250 in 59 seconds.

So, what does this tell us? First, the Synapse DW100c offers very good value: it has decent performance, but, being an MPP database, it comes with its own caveats. For example, joining huge tables that are distributed on different keys forces cross-node data movement and degrades performance noticeably.
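The distribution caveat is about how tables are created in the dedicated pool: a hash-distributed fact table joined on its distribution key avoids data movement, while joining on a different key triggers shuffle moves. A sketch, with the column list abbreviated:

```sql
-- Large fact table: hash-distribute on the join key used most often.
CREATE TABLE [dbo].[Trip]
(
    DateID        INT NOT NULL,
    PickupTimeID  INT NOT NULL,
    DropoffTimeID INT NOT NULL
    -- remaining columns omitted for brevity
)
WITH
(
    DISTRIBUTION = HASH(DateID),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension tables are typically created with
-- DISTRIBUTION = REPLICATE, so joins against them
-- never need data movement.
```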

Azure SQL Databases are quite I/O-limited, even at 800 DTU. Still, as I showed above, columnar storage does the trick here as well, albeit at the cost of ETL performance, since daily updates and deletes carry higher overhead on columnstore tables. At a higher price point, the Premium tier provides additional resources (IOPS, larger sizes) that can make a positive difference in performance, but it’s the richer feature set that really makes it an appealing option.


Andras Kovacs

Andras is a Solution Architect at Starschema, specializing in BI, SQL and DataLakes … and finding solutions.
