Published in MLearning.ai
Optimize Parquet file size in Spark and ingest into Azure Data Explorer using Azure Synapse Spark

Optimize Parquet file sizes into ~1 GB chunks for analytics

Prerequisites

An Azure Synapse workspace with a Spark pool

from pyspark.sql import SparkSession
# mssparkutils is available by default in Synapse notebooks; the explicit import is:
from notebookutils import mssparkutils
# Azure storage access info
blob_account_name = 'xxxxxxx' # replace with your storage account name
blob_container_name = 'xxxxxxxx' # replace with your container name
blob_relative_path = '' # replace with your relative folder path
linked_service_name = 'BenchMarkLogs' # replace with your linked service name
#blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(linked_service_name)
blob_sas_token = mssparkutils.credentials.getSecret("iamkeys", "benchmarklogs")
# Allow Spark to access the Blob container remotely via the SAS token
wasb_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasb_path)
df = spark.read.parquet(wasb_path)
# Bin output files at ~1 GiB (1073741824 bytes = 1 GiB)
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")
# Cap read partitions at 1 GiB as well
spark.conf.set("spark.sql.files.maxPartitionBytes", "1073741824")
print(spark.conf.get("spark.microsoft.delta.optimizeWrite.binSize"))
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
df.write.mode("overwrite").parquet("abfss://containername@storagename.dfs.core.windows.net/bechmarklogs1/")
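The two 1073741824 values above are simply 1 GiB expressed in bytes. A quick local sketch (the dataset size here is made up for illustration) of the arithmetic, including how many ~1 GiB Parquet files a write of a given total size should roughly produce:

```python
import math

GIB = 1024 ** 3  # 1073741824 bytes, the binSize / maxPartitionBytes used above

# Hypothetical total dataset size: 10.5 GiB
total_bytes = 10 * GIB + 512 * 1024 ** 2

# With ~1 GiB bins, expect roughly ceil(size / 1 GiB) output files
num_files = math.ceil(total_bytes / GIB)
print(GIB)        # 1073741824
print(num_files)  # 11
```

This is only an estimate: compression ratios and row-group boundaries mean real file counts and sizes will vary around the 1 GiB target.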
kustoappid = mssparkutils.credentials.getSecret("iamkeys", "kustoappid")
kustosecret = mssparkutils.credentials.getSecret("iamkeys", "kustosecret")
kustotenant = mssparkutils.credentials.getSecret("iamkeys", "kustotenant")
df1 = spark.read.parquet("abfss://containername@storagename.dfs.core.windows.net/bechmarklogs1/")
# Option 1: write to Azure Data Explorer with the Kusto Spark connector,
# authenticating with the AAD app credentials fetched from Key Vault above
df1.write \
    .format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", "clustername.region") \
    .option("kustoDatabase", "Benchmark") \
    .option("kustoTable", "logspark") \
    .option("kustoAadAppId", kustoappid) \
    .option("kustoAadAppSecret", kustosecret) \
    .option("kustoAadAuthorityID", kustotenant) \
    .option("tableCreateOptions", "CreateIfNotExist") \
    .mode("Append") \
    .save()
# Option 2: the same write through a Synapse linked service, so no
# credentials are handled in the notebook
df1.write \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("spark.synapse.linkedService", "linkedsvcname") \
    .option("kustoDatabase", "Benchmark") \
    .option("kustoTable", "logspark") \
    .option("tableCreateOptions", "CreateIfNotExist") \
    .mode("Append") \
    .save()
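Both write paths share the database, table, and table-creation options. One way to keep the two variants in sync is to factor the common settings into a dict passed through `DataFrameWriter.options`; this is just an illustrative refactor of the snippets above, not part of the connector API:

```python
# Kusto write options shared by both connector paths above.
common_options = {
    "kustoDatabase": "Benchmark",
    "kustoTable": "logspark",
    "tableCreateOptions": "CreateIfNotExist",
}

# Usage sketch (the cluster/linked-service option is the only per-path difference):
#   df1.write.format("com.microsoft.kusto.spark.datasource") \
#       .options(**common_options) \
#       .option("kustoCluster", "clustername.region") ...
print(sorted(common_options))
```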
