How Switching to Azure Synapse Link Delivers Cost Savings and Enhanced Performance

Nadeem Khan (NK) · Published in LearnWithNK · 6 min read · Sep 7, 2024

In this blog, we’re diving into how switching to Azure Synapse Link can really save you money and boost your data performance. We’ll walk through the main strategies Azure Synapse Link uses to make read operations faster and storage more efficient. By the end, you’ll see exactly how these optimizations can bring real-world benefits to your data management.

How does Azure Synapse Link provide optimized read operations?

Before looking at how Azure Synapse Link provides optimized read operations, let us first understand what an optimized read operation actually requires.

What is needed for an optimized read operation?

For any dataset, optimized read operations depend on four things:

Smaller Datasets
With smaller datasets, less data is transferred over the network, which means faster reads. To keep datasets small, choose the correct data types and compress your files. This approach not only saves storage space but also enhances overall performance.
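
As a rough sketch of what this looks like in practice (the storage paths, column names, and schema below are made up for illustration), a Synapse Spark notebook could declare precise types and write a compressed columnar copy like this:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DecimalType, StringType

# Declaring precise types avoids schema inference and bloated all-string columns.
schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("order_date", DateType()),
    StructField("customer", StringType()),
    StructField("amount", DecimalType(18, 2)),
])

orders = (
    spark.read                      # 'spark' is the session available in a Synapse notebook
    .option("header", "true")
    .schema(schema)
    .csv("abfss://raw@mydatalake.dfs.core.windows.net/orders/*.csv")
)

# Parquet is columnar and compressed (snappy by default), so the same rows
# occupy far less space than the source CSVs and transfer faster over the network.
orders.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/orders/"
)
```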

Optimized Filtration
Your processing engine should pull only the data you actually need, and ideally this should happen at the storage level rather than during processing. For instance, if you need just 3 columns out of 20, the projection should be applied right at the storage layer. The same principle applies to row filtering: handling it at the storage level is more efficient because it reduces the amount of data travelling across the network, leading to faster reads. You can achieve this by using the correct collation and optimized, column-oriented file formats.
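
As a small hedged example (the table path and column names are invented), this is what storage-level filtering looks like with a columnar format in PySpark: the projection and the predicate are pushed down to the files instead of being applied after everything has been loaded.

```python
# With a columnar format, Spark pushes the column projection and the row filter
# down to the files, so only 3 of the 20 columns and only matching rows cross the network.
sales = (
    spark.read
    .parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/")
    .select("invoice_id", "customer_id", "amount")   # column pruning
    .where("amount > 1000")                          # predicate pushdown
)

sales.explain()  # PushedFilters and ReadSchema in the plan confirm the pushdown
```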

Smaller Scans
To boost efficiency further, store the appropriate metadata alongside your data files and ensure each file is optimally sized to distribute the data evenly. This setup allows the processing engine to quickly identify and read only the necessary files, skipping those that aren’t needed.

Parallel Read and Write Operations
The processing engine should be able to read efficiently without being affected by ongoing write activity. To achieve this, read operations must be isolated from write processes, so that both run smoothly.

Now that we know what an optimized read requires, let us see how Azure Synapse Link delivers it.

How does Azure Synapse Link do it?

Smaller Datasets
Synapse Link runs DeltaLakeConversion jobs to create Delta Parquet files from timestamped incremental CSV files. These Delta Parquet files are compressed and can be nearly ten times smaller than the original CSV files, helping to save storage space.
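
Conceptually, the conversion does something along these lines. This is a simplified sketch, not the actual Synapse Link job code, and the storage account and folder names are placeholders:

```python
# Read one timestamped batch of incremental CSVs exported by Synapse Link.
incremental = (
    spark.read
    .option("header", "true")
    .csv("abfss://synapse-link@mydatalake.dfs.core.windows.net/customers/2024-09-07T10.00.00Z/*.csv")
)

# Append the batch to the compressed Delta Parquet table for that entity.
(
    incremental.write
    .format("delta")
    .mode("append")
    .save("abfss://synapse-link@mydatalake.dfs.core.windows.net/deltalake/customers/")
)
```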

Optimized Filtration
Delta Parquet files are partitioned by year, allowing for efficient filtering of specific rows. Additionally, since Delta Parquet files are an extension of Parquet files, which are column-oriented, you can fetch only the required columns. Both row-level and column-level filtering happen at the storage level, which boosts efficiency.
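
For example (the path is a placeholder, and I am assuming a partition column named PartitionYear; the real column name depends on your Synapse Link setup), a read like this touches only the partition folders and columns it needs:

```python
customers_2024 = (
    spark.read
    .format("delta")
    .load("abfss://synapse-link@mydatalake.dfs.core.windows.net/deltalake/customers/")
    .where("PartitionYear = 2024")           # row-level filtering: only the 2024 partition is scanned
    .select("accountnum", "name", "city")    # column-level filtering: Parquet reads just these columns
)
```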

Smaller Scans
The Delta Parquet format stores metadata for each file in transaction logs, enabling the processing engine to read only the necessary files and skip the rest. Additionally, Azure Synapse Link runs DeltaLakeMaintenance jobs that use Delta Lake's OPTIMIZE command to compact smaller files into a single, optimally sized Parquet file, usually between 128 MB and 1 GB. This consolidation leads to smaller, more efficient scans.
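
If you wanted to run the same compaction yourself from a Synapse Spark notebook, it would look roughly like this (the table path is a placeholder, and the exact syntax depends on the Delta Lake version in your Spark runtime):

```python
from delta.tables import DeltaTable

table_path = "abfss://synapse-link@mydatalake.dfs.core.windows.net/deltalake/customers/"

# SQL form: compact many small files into fewer, optimally sized ones.
spark.sql(f"OPTIMIZE delta.`{table_path}`")

# Equivalent programmatic form on recent Delta Lake versions.
DeltaTable.forPath(spark, table_path).optimize().executeCompaction()
```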

Parallel Read and Write Operations
Delta Parquet provides Snapshot Isolation, which allows the processing engine to read only from successfully committed snapshots. Meanwhile, write operations add new records to a different snapshot. This separation ensures that both read and write operations can run in parallel without impacting each other.
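
You can observe this behaviour yourself. In the sketch below (the path is a placeholder), a normal read resolves to the latest committed snapshot even while new writes land, and time travel lets you pin a read to an earlier one:

```python
table_path = "abfss://synapse-link@mydatalake.dfs.core.windows.net/deltalake/customers/"

# A read always sees the last successfully committed snapshot,
# even if Synapse Link is committing a new version at the same moment.
latest = spark.read.format("delta").load(table_path)

# Time travel: pin a read to an earlier committed snapshot.
as_of_version_5 = spark.read.format("delta").option("versionAsOf", 5).load(table_path)

# The transaction log records every committed version.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)
```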

Link to Fabric supports V-Order, which offers further read optimization by arranging data in a way that enhances performance. However, V-Order is not available in Azure Synapse Link.

How does Azure Synapse Link optimize Data Lake storage cost and performance?

Let us first lay out the factors responsible for cost and performance:

Factors

Data Storage Strategy

The idea seems pretty straightforward: store less data, and you’ll save money. But obviously, we can’t just go around deleting important data to cut costs. Instead, we should store data in a compressed format and get rid of any unnecessary redundancy. On top of that, it’s a good idea to have a cleanup process that automatically deletes unused files based on certain conditions, so we’re only keeping what we really need.
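
As one possible sketch of such a cleanup (this is not something Synapse Link ships; the account, container, folder, and the 30-day rule are all hypothetical), the azure-storage-file-datalake SDK can delete files that have not been touched for a while:

```python
from datetime import datetime, timedelta, timezone
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical cleanup: remove files in a staging folder untouched for 30 days.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-sas-token>",
)
filesystem = service.get_file_system_client("staging")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for item in filesystem.get_paths(path="temp-exports", recursive=True):
    if not item.is_directory and item.last_modified < cutoff:
        filesystem.delete_file(item.name)
```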

Data Access Strategy

Storing data in a compressed format is a smart move for saving costs, plus it speeds things up by reducing the amount of data that needs to be transferred over the network. It’s also important to pick the right file extensions and keep everything well-organized. We want our files to be optimized for both reading and writing, so downstream systems can work smoothly without getting in each other’s way. That way, multiple systems can access the storage without causing any problems.

Another key point is to develop a strategy for accessing data only when it’s actually needed. Accessing data too often can get expensive if we’re not careful, so we should keep an eye on how frequently we’re making requests to the storage account. This way, we can manage costs while still keeping everything running efficiently.

How does Azure Synapse Link do it?

Data Storage Strategy

Azure Synapse Link uses DeltaLakeConversion Spark jobs to convert timestamped incremental CSV files into Delta Parquet format, making them almost ten times smaller than the original CSVs. The same job automatically deletes the CSVs once the data has been captured in Delta Parquet.

In addition, Synapse Link schedules DeltaLakeMaintenance jobs to keep things running smoothly. These jobs reorganize smaller files into optimally sized larger ones and delete snapshots older than the retention period, using the OPTIMIZE and VACUUM operations to keep your data efficient and clutter-free.
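
If you ran the equivalent housekeeping by hand on one table, it would look roughly like this (the path and retention window are illustrative; Synapse Link schedules and parameterizes the real jobs for you):

```python
table_path = "abfss://synapse-link@mydatalake.dfs.core.windows.net/deltalake/customers/"

# OPTIMIZE: compact small files into optimally sized ones.
spark.sql(f"OPTIMIZE delta.`{table_path}`")

# VACUUM: physically remove data files no longer referenced by any snapshot
# within the retention window (168 hours = 7 days in this example).
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```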

Data Access Strategy

The Delta Parquet files created by Azure Synapse Link are optimized for both reading and writing. They offer snapshot isolation, which means you can read from the last successfully committed snapshot while Azure Synapse Link continues writing to different snapshots.

Additionally, Azure Synapse Link allows users to set how often Spark jobs read and write to the data lake through the Synapse Link Profile, with the shortest interval being 15 minutes. This gives you better control over how the Data Lake is accessed and managed.

The answer

Yeah, switching to Azure Synapse Link can definitely make a difference! By converting those big CSV files into compressed Delta Parquet files, you can shrink your storage footprint, and with it your storage costs, by nearly ten times. Performance improves too, since efficient filtering and smaller scans make reads faster. You won’t have to worry about slowdowns either, because parallel read and write operations keep everything running smoothly. And with maintenance tasks like OPTIMIZE and VACUUM, your data stays organized and clutter-free, which only adds to the efficiency.

Summary

  • Azure Synapse Link automatically schedules two Spark jobs: DeltaLakeConversionJob and DeltaLakeMaintenanceJob.
  • The Delta Lake Conversion job handles converting timestamped incremental CSVs into Delta Parquet format and deletes the original CSVs once the data is safely stored in Delta Parquet.
  • The Delta Lake Maintenance job takes care of optimizing and cleaning up the data. It reorganizes smaller Parquet files into optimally sized files and deletes any snapshots older than the retention period through the OPTIMIZE and VACUUM processes.
  • To optimize cost and performance, it’s a good idea to create multiple Synapse Link Profiles based on how frequently you want to export data from your tables.

If you found this guide helpful, don’t forget to clap 👏 and share it with your network! Have questions or experiences with D365 F&O data exports? Drop a comment below; I’d love to hear your thoughts and help clarify any confusion. Stay tuned for more in-depth blogs on Azure Synapse Link, where we’ll dive deeper into its inner workings and share key insights, especially if you’re migrating from Export to Data Lake. For the latest updates and expert tips on Microsoft’s evolving data solutions, follow me on LinkedIn, GitHub and Medium.

Thanks for reading. Happy Learning.
