SYNAPSE LINK FOR COSMOS DB: KEY LESSONS LEARNT

Best practices for using Synapse Link for Cosmos DB with Synapse serverless SQL pools

RK Iyer · Published in Microsoft Azure · Dec 3, 2022


❑ Background

In the last blog, we saw how Synapse serverless SQL pools and Spark pools, together with Synapse Link, provide a cost-effective analytics solution for analyzing NoSQL data stored in Cosmos DB.

We looked at the internals of Synapse Link for Cosmos DB, its benefits, and common use case patterns. In this blog, I will share best practices for using Synapse Link, based on my implementation experience across different customer projects.

I have captured the best practices and key learnings across four key pillars: Performance, Operational Excellence, Security, and Cost Optimization.

Please note that this blog focuses on the serverless SQL pool in Azure Synapse Analytics and Synapse Link for Cosmos DB.

❑ Performance

  • Ensure that your Azure Cosmos DB analytical store is in the same region as your serverless SQL pool.
  • Ensure that your client application (Power BI, Azure Analysis Services) is in the same region as your serverless SQL pool.
  • If you are returning a large amount of data (bigger than 40–80 GB), consider using a caching layer such as Azure Analysis Services or Power BI import mode.
  • It has been observed that serverless SQL pools can’t provide an interactive experience in Power BI DirectQuery mode if your queries are complex or process a large amount of data. It is recommended to run a proof of concept to validate the performance.
  • If you are filtering data using string columns, make sure that you are using the OPENROWSET function with an explicit WITH clause that has the smallest possible types (for example, don’t use VARCHAR(1000) if you know that the property has up to 5 characters). See the OPENROWSET sketch after this list.
  • Although Synapse Studio is an easy-to-use tool, it uses the HTTP protocol over a standard internet connection to retrieve data. Synapse Studio is good for small result sets, but for retrieving large amounts of data it is recommended to use client tools built on the latest native drivers, such as SQL Server Management Studio (SSMS) or Azure Data Studio (ADS).
  • Use a UTF-8 collation as the default database collation, or set it explicitly as the column collation, to avoid UTF-8 conversion issues. The Latin1_General_100_BIN2_UTF8 collation provides the best performance when you filter data using string columns; it also appears in the OPENROWSET sketch after this list.
  • If you have repeating queries that might be cached, use CETAS (CREATE EXTERNAL TABLE AS SELECT) to store query results in Azure Data Lake Storage. You can use CETAS to materialize frequently used parts of queries, like joined reference tables, to a new set of files, and then join to this single external table instead of repeating common joins in multiple queries (see the CETAS sketch below).

As CETAS generates Parquet files, statistics are automatically created when the first query targets the external table, which improves performance for subsequent queries against the table generated with CETAS.
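To make this concrete, here is a minimal CETAS sketch for a serverless SQL pool. The storage URL, data source, file format, table, and Cosmos DB connection details are all hypothetical placeholders; adjust them to your environment.

    -- One-time setup (hypothetical names): target location and output format for CETAS.
    CREATE EXTERNAL DATA SOURCE MyLake
    WITH (LOCATION = 'https://mystorage.dfs.core.windows.net/mycontainer');

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

    -- Materialize a frequently joined reference set once as Parquet files,
    -- with an external table created on top of them.
    CREATE EXTERNAL TABLE dbo.DimCustomer
    WITH (
        LOCATION = 'cetas/dim_customer/',
        DATA_SOURCE = MyLake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT customerId, customerName, region
    FROM OPENROWSET(
        PROVIDER = 'CosmosDB',
        CONNECTION = 'Account=myaccount;Database=mydb;Key=<account key>',
        OBJECT = 'Customers'
    ) AS customers;

    -- Subsequent queries join to dbo.DimCustomer instead of re-reading Cosmos DB.
    -- (Key shown inline only for brevity; prefer a credential, as covered under Security.)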
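And here is a minimal OPENROWSET sketch illustrating the explicit WITH clause, narrow string types, and the UTF-8 collation recommendations above. The database name, container, columns, and connection string are hypothetical.

    -- One-time: create the database with a UTF-8 collation as its default (hypothetical name).
    CREATE DATABASE CosmosAnalytics COLLATE Latin1_General_100_BIN2_UTF8;

    SELECT orders.orderId, orders.customerName, orders.orderTotal
    FROM OPENROWSET(
        PROVIDER = 'CosmosDB',
        CONNECTION = 'Account=myaccount;Database=mydb;Key=<account key>',
        OBJECT = 'Orders'
    )
    WITH (
        orderId      VARCHAR(10),   -- smallest type that fits the property
        customerName VARCHAR(50) COLLATE Latin1_General_100_BIN2_UTF8,
        orderTotal   FLOAT
    ) AS orders
    WHERE orders.customerName = 'Contoso';  -- string filter benefits from the BIN2 UTF-8 collation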

❑ Operational Excellence

  • Analytical store data is not backed up and therefore cannot be restored directly; however, you can rebuild your analytical store by re-enabling Synapse Link on the restored container. Check the analytical store documentation for more information.

Also note that Synapse Link currently isn’t fully compatible with continuous backup mode; again, the analytical store documentation has the details.

  • Serverless SQL pools have constraints, and the best way to scale the workload is to scale out by creating more workspaces that can be used in parallel.

It is highly recommended to do performance testing based on concurrent queries and expected query patterns to know the limits.

❑ Security

  • You can achieve network isolation using separate managed private endpoints for each store, within managed virtual networks in Azure Synapse workspaces.
  • You can encrypt the data across the transactional and analytical stores using the same customer-managed keys, in an automatic and transparent manner.
  • Azure Synapse Link only supports configuring customer-managed keys using your Azure Cosmos DB account’s managed identity. You must configure your account’s managed identity in your Azure Key Vault access policy before enabling Azure Synapse Link on your account.
  • Do not use Azure Cosmos DB account keys inline in Spark jobs or SQL scripts. Instead, pre-create SQL credentials that store the account keys and reference them in the OPENROWSET function, as shown in the sketch below. Refer to the OPENROWSET credentials documentation for more details.
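A minimal sketch of the credential approach in a serverless SQL pool, with hypothetical names and a placeholder secret:

    -- Server-level credential that stores the account key once (placeholder secret).
    CREATE CREDENTIAL MyCosmosDbCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = '<your Cosmos DB account key>';

    -- Queries reference the credential instead of embedding the key inline.
    SELECT TOP 10 *
    FROM OPENROWSET(
        PROVIDER = 'CosmosDB',
        CONNECTION = 'Account=myaccount;Database=mydb',
        OBJECT = 'Orders',
        SERVER_CREDENTIAL = 'MyCosmosDbCredential'
    ) AS orders;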

❑ Cost Optimization

  • The analytical store follows a consumption-based pricing model where you’re charged for three meters: storage (the volume of data retained in the analytical store), analytical write operations (the auto-sync from the transactional store), and analytical read operations (from Azure Synapse Analytics Spark pool and serverless SQL pool run times).

Analytical store pricing is separate from the transactional store pricing model. There’s no concept of provisioned RUs in the analytical store.
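As a purely illustrative sketch of how these meters combine (the rates are placeholders; at the time of writing, operations are metered per 10,000, but check the Azure Cosmos DB pricing page for current figures), a rough monthly estimate looks like:

    monthly cost ≈ (GB retained × storage rate per GB)
                 + (write operations ÷ 10,000 × write rate per 10,000 ops)
                 + (read operations ÷ 10,000 × read rate per 10,000 ops)

So a hypothetical workload that retains 1 TB, auto-syncs 50 million writes, and issues 200 million analytical reads per month is metered as 1,024 GB of storage, 5,000 units of write operations, and 20,000 units of read operations.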

  • Use the Azure Cosmos DB capacity planner to get an estimate of your analytical storage and write operations costs.
  • A proof of concept around the analytical workload will provide a finer estimate of analytical read operations costs.

I would like to thank Rodrigo Souza for reviewing the content.

I hope this blog helped you understand the best practices for Synapse Link for Cosmos DB.

Happy Learning!!!

Please Note — All opinions expressed here are my personal views and not of my employer.
