Azure Synapse Analytics Serverless SQL Pool Guidelines

How to achieve the best performance and a smooth experience when working with serverless SQL pools to build successful data lakes

Patrick Pichler
Creative Data
4 min read · Aug 20, 2021

Introduction

With the introduction of the serverless SQL pool as part of Azure Synapse Analytics, Microsoft has provided a very cost-efficient and convenient way to drive value from data residing in lakes using simple T-SQL statements. It lets you build logical analytical models by querying and joining data across heterogeneous sources, making complex data integration pipelines obsolete in many cases. Thanks to its serverless nature, you don’t even need to provision it explicitly; it is part of every Azure Synapse Analytics workspace by default. All you have to do is query data on demand, and you are charged according to the amount of data your queries process. Yet the flexibility in how data can be stored and queried requires you to stick to some conventions to properly use all of its features and functionalities. Otherwise, the once promising serverless query engine can end up causing high costs and poor performance. To avoid this, Microsoft has already shared some of its best practices¹. This article extends them while intentionally keeping the information brief and easy to read. It should serve as a quick overview of everything you need to know to get the most out of serverless SQL pools right from the beginning.
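To illustrate how little is needed to get started, here is a minimal sketch of such an on-demand query; the storage account, container and folder names are placeholders you would replace with your own:

```sql
-- Ad-hoc query against Parquet files in the lake, no provisioning required.
-- The storage URL is a placeholder for your own account, container and path.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [rows];
```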

Guidelines

✔️ Provision all Azure services involved in the same region and close to your client applications in case of local access

✔️ Keep your underlying Azure storage account free from any other heavy workloads during query execution

✔️ Make use of views instead of external tables if you wish to cast simple structures over your data lake, as they generally allow more flexibility
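A minimal sketch of such a view, assuming a user database already exists (views cannot be created in master) and using placeholder storage paths:

```sql
-- Lightweight view casting a schema over Parquet files in the lake.
CREATE VIEW dbo.vw_sales
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];
```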

✔️ Use Parquet/Delta as your primary file/table format
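For Delta, the serverless SQL pool can read a Delta Lake table directly when OPENROWSET is pointed at the table’s root folder; the path below is again a placeholder:

```sql
-- Query a Delta Lake table; BULK points at the folder containing _delta_log.
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/delta/sales/',
        FORMAT = 'DELTA'
    ) AS [result];
```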

✔️ In case you need to build upon CSV files, create statistics manually and change the parser version to 2.0 if your data allows it
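A sketch of both options with made-up column names; note that HEADER_ROW is only available with parser version 2.0:

```sql
-- CSV query using the faster parser version 2.0 with explicit column types.
SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/csv/sales/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (order_id INT, customer VARCHAR(100), amount DECIMAL(10, 2)) AS [rows];

-- Manually created statistics on a frequently filtered column.
EXEC sys.sp_create_openrowset_statistics N'
    SELECT customer
    FROM OPENROWSET(
        BULK ''https://mystorageaccount.dfs.core.windows.net/datalake/csv/sales/*.csv'',
        FORMAT = ''CSV'',
        PARSER_VERSION = ''2.0'',
        HEADER_ROW = TRUE
    ) WITH (customer VARCHAR(100)) AS [rows]
';
```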

✔️ Partition your data according to upcoming query patterns and keep file sizes reasonable; between 100 MB and 10 GB per file is recommended

✔️ Make use of the filepath() function for partition pruning in case of building upon CSV files or views created over Parquet data
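A partition-pruning sketch, assuming the data is laid out in year=/month= folders; each wildcard in the path maps to a filepath() ordinal:

```sql
-- Only folders matching the filepath() predicates are actually read.
SELECT r.filepath(1) AS [year], r.filepath(2) AS [month], COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r
WHERE r.filepath(1) = '2021'
  AND r.filepath(2) IN ('07', '08')
GROUP BY r.filepath(1), r.filepath(2);
```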

✔️ Use Shared Access Signatures (SAS) for authentication as the response time is usually better compared to Azure AD pass-through authentication
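A sketch of setting this up in a user database; the SAS token, credential name and data source name are placeholders:

```sql
-- A master key is required before database scoped credentials can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL SasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2020-08-04&ss=b&srt=co&sp=rl&sig=<placeholder>';

CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
    LOCATION = 'https://mystorageaccount.dfs.core.windows.net/datalake',
    CREDENTIAL = SasCredential
);
```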

✔️ To enable interactive analytics on top of your data lake, load data further into analytical databases such as Analysis Services or Power BI

✔️ Materialize views using CREATE EXTERNAL TABLE AS SELECT (CETAS) for recurring queries
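A CETAS sketch that writes the result of a recurring aggregation back to the lake as Parquet; the data source from the previous example, the file format name and the source view are assumptions:

```sql
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- The query result is persisted as Parquet files under the given location.
CREATE EXTERNAL TABLE dbo.SalesByMonth
WITH (
    LOCATION = 'curated/sales_by_month/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT [year], [month], SUM(amount) AS total_amount
FROM dbo.vw_sales_partitioned   -- hypothetical source view
GROUP BY [year], [month];
```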

✔️ Use Latin1_General_100_BIN2_UTF8 (database config or WITH clause) as your default collation to avoid unexpected text conversion errors and to leverage predicate push-down
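Both variants sketched, with a placeholder database name and a made-up column:

```sql
-- Database-wide default collation.
ALTER DATABASE MyLakeDb COLLATE Latin1_General_100_BIN2_UTF8;

-- Or per column in the WITH clause of an individual query.
SELECT *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/*.parquet',
        FORMAT = 'PARQUET'
    )
    WITH (customer VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8) AS [rows];
```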

✔️ Use Azure Data Studio (ADS) or SQL Server Management Studio (SSMS) for querying huge data sets, since the web-based Azure Synapse Studio is not made for returning large amounts of data, or consider limiting result sets

✔️ If you use Apache Spark, consider creating databases and Parquet tables in Spark since they will automatically be available in the serverless SQL pool, even when the Spark pool is turned off
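Roughly, it works as sketched below; the database, table and storage names are placeholders:

```sql
-- Created from a Spark pool (Spark SQL):
--   CREATE DATABASE sparkdb;
--   CREATE TABLE sparkdb.sales USING PARQUET
--     LOCATION 'abfss://datalake@mystorageaccount.dfs.core.windows.net/sales';

-- The database is synchronized automatically and the table appears under the
-- dbo schema, queryable from the serverless SQL pool without further setup:
SELECT TOP 10 * FROM sparkdb.dbo.sales;
```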

✔️ Keep an eye on all triggered SQL requests, either via the “SQL requests” section within Azure Synapse Studio or via T-SQL using the sys.dm_exec_requests view
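A simple monitoring sketch based on that view:

```sql
-- Currently running and queued requests on the serverless SQL pool.
SELECT session_id, status, command, start_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID
ORDER BY total_elapsed_time DESC;
```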

✔️ Limit the budget for the amount of data being processed by your queries to avoid unexpected cost explosions
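Limits can be set per day, week and month in TB; a sketch using the cost-control procedure and DMV available in serverless SQL pools (names as I recall them, so double-check against the current documentation):

```sql
-- Queries that would exceed an active limit are rejected.
EXEC sp_set_data_processed_limit @type = N'daily',   @limit_in_tb = 1;
EXEC sp_set_data_processed_limit @type = N'weekly',  @limit_in_tb = 5;
EXEC sp_set_data_processed_limit @type = N'monthly', @limit_in_tb = 10;

-- Check current consumption against the configured limits.
SELECT * FROM sys.dm_external_data_processed;
```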

✔️ Use the QPI library for inspecting queries or troubleshooting performance

Conclusion

The resources provided by the serverless SQL pool, including the attached data lake storage, should be used wisely; otherwise you could suffer a double penalty: long response times and unhappy users on the one hand, and high costs caused by unnecessary, inefficient full scans of large amounts of data on the other. It is therefore very important to be aware of the optimizations above so you are on the right track from the very beginning. Organizing and partitioning the data according to the queries’ search predicates is probably the most important factor to bear in mind, as it makes serverless SQL pools far more efficient by ensuring only relevant data is read. I will do my best to keep these guidelines up to date as the service evolves, new features get added, and new findings are made. Hopefully, this helps you create a successful data lake on Azure.

Resources

[1] Microsoft. 2021. Best practices for serverless SQL pool in Azure Synapse Analytics. [ONLINE] Available at: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool. [Accessed 19 August 2021].
