Optimizing Copy Activity Data in Microsoft Fabric

This blog post shows how you can optimize data in Microsoft Fabric after running a copy activity pipeline in the Data Warehouse experience.

Cryptorators
thecloudtech
3 min read · Jul 10, 2023

I have been trying out Microsoft Fabric lately and playing with its different features. Today, I will share my experience and insights on using Microsoft Fabric, a powerful platform for data and AI, and show you how to copy data into Microsoft Fabric and optimize it for better performance.

OneLake is a cloud-based data lake service that lets you store and access large amounts of structured and unstructured data. Microsoft Fabric is a unified analytics platform that runs on top of OneLake and can perform various operations on the data, such as transformations, aggregations, machine learning, and more.

To understand how you can copy data into OneLake and access it, I ran a copy activity pipeline, a graphical tool that lets you configure the source, destination, and settings of your data transfer. For this example, I used the sample taxi data provided in Microsoft Fabric, which is about 2 GB in size.

One of the settings you need to pay attention to when configuring the copy activity is the V-Order option. I highly recommend enabling it, as it allows faster reads for Microsoft Fabric compute engines.
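
In the copy activity itself this is just a checkbox, but you can also work with V-Order from a Fabric notebook. Below is a minimal sketch; the Spark and table property names are assumptions based on Fabric's Spark runtime, and the table name taxidata is hypothetical, so verify both against the current documentation.

```python
# A minimal sketch: checking and enabling V-Order from a Fabric notebook.
# The property names below are assumptions; verify them in the Fabric docs.

# Check whether V-Order writing is enabled for the current Spark session.
print(spark.conf.get("spark.sql.parquet.vorder.enabled", "not set"))

# Enable V-Order for all writes in this session.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Or enable it per table (hypothetical table name "taxidata").
spark.sql(
    "ALTER TABLE taxidata "
    "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
)
```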

Another setting you need to consider is the table format. Microsoft Fabric uses Delta Lake as its unified table format, which supports ACID transactions, schema evolution, and time travel. You can either create a table directly from the copy activity pipeline, or write your data as files and create a table later. I configured the pipeline to create a table directly.
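
If you take the "files first" route instead, you can turn those files into a Delta table from a notebook. This is only a sketch: the source path Files/raw/taxidata and the table name taxidata are hypothetical placeholders for wherever your copy activity landed the data.

```python
# A minimal sketch for the "write files first" route, assuming the copy
# activity wrote raw Parquet files to a hypothetical Lakehouse path.
df = spark.read.parquet("Files/raw/taxidata")

# Register the data as a managed Delta table so it appears under Tables/.
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("taxidata"))
```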

After running the copy activity pipeline, I checked the copied data and noticed that it had created several Parquet files of different sizes, ranging from 20 MB to 57 MB.

Files created by the copy activity
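
If you want to do the same check yourself, here is one way to inspect the files from a notebook. The table name taxidata and the Tables/taxidata path are assumptions; DESCRIBE DETAIL is standard Delta Lake SQL, while mssparkutils is specific to Fabric/Synapse notebooks.

```python
# A minimal sketch for inspecting the table's files (table name assumed).
# DESCRIBE DETAIL reports the number of data files and their total size.
spark.sql("DESCRIBE DETAIL taxidata").select("numFiles", "sizeInBytes").show()

# List the individual Parquet files under the table folder. mssparkutils is
# Fabric/Synapse-specific; the path assumes the notebook's default lakehouse.
from notebookutils import mssparkutils

for f in mssparkutils.fs.ls("Tables/taxidata"):
    print(f.name, f"{f.size / (1024 * 1024):.1f} MB")
```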

Since Microsoft Fabric uses Delta Lake as its unified table format, it's recommended to keep file sizes closer to 1 GB each for optimum performance. To achieve compaction, you need to run the OPTIMIZE command, which merges the table's small files into larger ones. The default target file size when you run the OPTIMIZE command is 1 GB.
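
You can run OPTIMIZE from a Fabric notebook as shown in this sketch (again assuming the hypothetical table name taxidata). The VORDER clause is a Fabric-specific extension for re-applying V-Order during compaction, so I have left it commented out.

```python
# Compact the small Parquet files into larger ones (table name assumed).
spark.sql("OPTIMIZE taxidata")

# Fabric also supports re-applying V-Order as part of compaction:
# spark.sql("OPTIMIZE taxidata VORDER")
```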

The OPTIMIZE command created larger files

After running the OPTIMIZE command, check the file sizes again and you will see new, larger files in addition to the existing smaller ones. The smaller files are no longer referenced by the table, but they are not deleted automatically.

This brings us to cleaning up the files that are no longer referenced by our table. For that, you need to run the VACUUM command, which removes unreferenced files older than the retention threshold (7 days by default). If your copy activity pipeline ran within the last 7 days, you will not see any files being removed.
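
Here is a sketch of running VACUUM from a notebook, again with the hypothetical table name taxidata. The DRY RUN form is handy because it shows which files would be removed without actually deleting anything.

```python
# Preview which unreferenced files would be deleted, without removing them.
spark.sql("VACUUM taxidata DRY RUN").show(truncate=False)

# Actually remove unreferenced files older than the retention threshold
# (7 days by default; table name assumed).
spark.sql("VACUUM taxidata")
```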

Conclusion

In this post, I showed you how to compact the data written by a copy activity in Microsoft Fabric and optimize it for better performance. You learned about the copy activity pipeline, the V-Order option, the Delta Lake table format, and the OPTIMIZE and VACUUM commands.

I hope you found this post useful and informative. Please follow me for more similar content and leave your feedback in the comments section below.


#medium #cloud #technicalblog