JSON Parsing Using Variant Data Type in Databricks SQL

Yuriy Margulis
DBSQL SME Engineering
Jul 9, 2024
https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark

Introduction

Databricks recently introduced a new data type, variant, available in Databricks Runtime 15.3 and above. This runtime will be available in Databricks SQL warehouses in August 2024. The variant data type is used for storing semi-structured data, such as JSON. This blog explains the reasons behind adding this data type and shows how to use it.

Tough choice: performance vs. flexibility

We have always been able to parse JSON strings in Databricks SQL. We had two imperfect ways of doing that:

  1. Not flexible: Parse the JSON string and load it into a table. In this case, either the table’s schema has to be defined in advance, or we have to rely on schema inference, which has its limitations. This approach doesn’t handle the flexibility of the JSON format well: missing data elements, changing types, and duplicated element names on different levels may cause errors and require manual intervention.
  2. Slow: Load the JSON into a table as a string and parse this string each time the table is queried. This approach is very flexible but slows down every query against such a table, as sketched below.
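A minimal sketch of the second approach, assuming a hypothetical table raw_events with a string column payload: the JSON stays flexible, but every query pays the cost of parsing the string.

select get_json_object(payload, '$.user.id') user_id
,get_json_object(payload, '$.event.type') event_type
from raw_events;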

Previously we had to choose between flexibility and performance. The introduction of the variant data type offers a solution that addresses both issues: we can now store the JSON as a whole, much like a string, and don’t need to parse it during query execution.

Performance

According to internal benchmarking by Databricks, the variant data type is 8 to 20 times more performant than a JSON string; please refer to the variant type introduction presentation at the Databricks Data+AI Summit 2024. Further plans for performance improvement through variant shredding are also highlighted in that presentation.

Open source

Unlike competing implementations, the Databricks variant data type is fully open-sourced, with no proprietary data lock-in.

According to the Databricks Engineering blog Introducing the Open Variant Data Type in Delta Lake and Apache Spark, the variant data type, binary expressions, and binary encoding format have already been merged into Apache Spark, and variant support will be included in Spark 4.0 and Delta Lake 4.0.

Complex JSON example

The observability platform Datadog natively supports Databricks. Datadog offers a templated dashboard for monitoring Databricks cluster performance. This dashboard can be exported as a JSON file, copied, customized, and imported back into Datadog. The Datadog Databricks dashboard template is a complex JSON document containing nested arrays and structs:

{"title":"Databricks Overview","description":"This Dashboard provides a high-level overview of your Databricks clusters, alongside detailed Spark metrics and logs, so you can monitor your jobs in real-time.\n\n- [Monitor Databricks with Datadog](/blog/databricks-monitoring-datadog)\n- [Databricks integration docs](https://docs.datadoghq.com/integrations/databricks/?tab=driveronly)\n- [Spark integration docs](https://docs.datadoghq.com/integrations/spark/?tab=host)\n","widgets":[{"id":2863712559104604,"definition":{"title":"New group","banner_img":"https://trello-attachments.s3.amazonaws.com/5c747d70330c3550f87cd622/609ee046c72dcc1731c5fffe/ed221b6aa89e83bd08717b397038ea00/Databricks-Agent_integration_210601_FINAL.png","show_title":false,"type":"group","layout_type":"ordered","widgets":[{"id":8008972579154892,"definition":{"type":"note","content":"This dashboard tracks the current status of your Spark jobs, tasks, and stages in your Databricks clusters. It also allows you to track the resource utilization of your nodes, which point to possible areas for optimization.\n\nSee the [Datadog Databricks integration documentation](https://docs.datadoghq.com/integrations/databricks/?tab=driveronly) to find the suitable init script for your Databricks cluster environment.\nClone this template dashboard to make changes and add your own graph widgets.","background_color":"white","font_size":"14","text_align":"left","vertical_align":"top","show_tick":false,"tick_pos":"50%","tick_edge":"left","has_padding":true},"layout":{"x":0,"y":0,"width":4,"height":6}}]},"layout":{"x":0,"y":0,"width":4,"height":12}},{"id":3285082222899335,"definition":{"title":"Resource overview","background_color":"vivid_purple","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":1618329382596682,"definition":{"title":"Jobs Running","title_size":"16","title_align":"left","time":{"live_span":"1h"},"type":"query_value","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"aggregator":"last","data_source":"metrics","name":"query1","query":"sum:spark.job.count{status:running,$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"scalar"}],"precision":0},"layout":{"x":0,"y":0,"width":2,"height":2}},{"id":7890892749181099,"definition":{"title":"Stages Running","title_size":"16","title_align":"left","time":{"live_span":"1h"},"type":"query_value","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"aggregator":"last","data_source":"metrics","name":"query1","query":"sum:spark.stage.count{status:active,$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"scalar"}],"precision":0},"layout":{"x":2,"y":0,"width":2,"height":2}},{"id":9194139560070256,"definition":{"title":"Clusters by name","title_size":"16","title_align":"left","type":"hostmap","requests":{"fill":{"q":"avg:system.cpu.idle{$databricks_cluster_name,$spark_node} by {host}"}},"node_type":"host","no_metric_hosts":true,"no_group_hosts":true,"scope":["$databricks_cluster_name","$spark_node"],"style":{"palette":"green_to_orange","palette_flip":false}},"layout":{"x":4,"y":0,"width":4,"height":4}},{"id":5152565111194215,"definition":{"title":"Tasks Running","title_size":"16","title_align":"left","time":{"live_span":"1h"},"type":"query_value","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"aggregator":"last","data_source":"metrics","name":"query1","query":"sum:spark.job.num_tasks{status:running,$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"scalar"}],"custom_unit":" 
","precision":0},"layout":{"x":0,"y":2,"width":2,"height":2}},{"id":3618572502468341,"definition":{"title":"Executors","title_size":"16","title_align":"left","time":{"live_span":"1h"},"type":"query_value","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"aggregator":"last","data_source":"metrics","name":"query1","query":"sum:spark.executor.count{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"scalar"}],"precision":0},"layout":{"x":2,"y":2,"width":2,"height":2}}]},"layout":{"x":4,"y":0,"width":8,"height":5}},{"id":824869120026588,"definition":{"title":"Cluster resource usage","background_color":"purple","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":3036208416490322,"definition":{"title":"CPU percent utilization","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:system.cpu.system{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":0,"width":4,"height":3}},{"id":4289444962903740,"definition":{"title":"System load","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:system.load.1{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:system.load.5{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:system.load.15{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":4,"y":0,"width":4,"height":3}},{"id":8128390498428924,"definition":{"title":"System 
memory","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:system.mem.free{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"purple","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:system.mem.total{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"grey","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":3,"width":4,"height":3}},{"id":4310912565479094,"definition":{"type":"note","content":"Monitor resource usage on your clusters to see how your jobs are performing. \n\nIf CPU utilization and system load are consistently high, you should consider [re-configuring your clusters](https://docs.databricks.com/clusters/cluster-config-best-practices.html) to fit your workload.","background_color":"gray","font_size":"14","text_align":"left","vertical_align":"top","show_tick":false,"tick_pos":"50%","tick_edge":"left","has_padding":true},"layout":{"x":4,"y":3,"width":4,"height":3}}]},"layout":{"x":4,"y":5,"width":8,"height":7}},{"id":7574260271112913,"definition":{"title":"Spark events and logs","background_color":"yellow","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":8862537986282615,"definition":{"type":"note","content":"The Datadog init scripts support logs configuration to collect and monitor your [Databrick's driver logs](https://docs.microsoft.com/en-us/azure/databricks/clusters/clusters-manage#driver-logs). \n\nYou can also modify the init scripts to configure additional logs instances to read from other log files in your environment. 
","background_color":"white","font_size":"14","text_align":"left","vertical_align":"top","show_tick":false,"tick_pos":"50%","tick_edge":"left","has_padding":true},"layout":{"x":0,"y":0,"width":12,"height":2}},{"id":2957620172826090,"definition":{"title":"Spark Application Event Counts","title_size":"16","title_align":"left","time":{"live_span":"1d"},"type":"event_timeline","query":"spark has status $spark_node $databricks_cluster_name $app_name"},"layout":{"x":0,"y":2,"width":12,"height":2}},{"id":4462948139477416,"definition":{"title":"","title_size":"16","title_align":"left","requests":[{"response_format":"event_list","query":{"data_source":"logs_stream","query_string":"service:databricks $databricks_cluster_name $spark_node $app_name","indexes":[],"storage":"hot","sort":{"order":"desc","column":"timestamp"}},"columns":[{"field":"status_line","width":"auto"},{"field":"timestamp","width":"auto"},{"field":"host","width":"auto"},{"field":"service","width":"auto"},{"field":"content","width":"compact"}]}],"type":"list_stream"},"layout":{"x":0,"y":4,"width":12,"height":5}}]},"layout":{"x":0,"y":12,"width":12,"height":10}},{"id":7808976241892319,"definition":{"title":"Spark node resources","background_color":"vivid_orange","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":6223483355829335,"definition":{"title":"Driver disk used","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.driver.disk_used{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":0,"width":4,"height":2}},{"id":7442954647313823,"definition":{"title":"Driver memory used","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.driver.memory_used{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":4,"y":0,"width":4,"height":2}},{"id":3346335625888625,"definition":{"type":"note","content":"Monitor the resource consumption on your Spark driver and worker nodes.\n\nIf there is high resource consumption on your clusters, consider choosing a larger Driver or Worker type for the cluster. 
Read more about configuring your [Databricks clusters](https://docs.databricks.com/clusters/configure.html).","background_color":"gray","font_size":"14","text_align":"left","vertical_align":"center","show_tick":true,"tick_pos":"50%","tick_edge":"left","has_padding":true},"layout":{"x":8,"y":0,"width":4,"height":4}},{"id":5781216438631193,"definition":{"title":"Executor disk used","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.executor.disk_used{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":2,"width":4,"height":2}},{"id":9198156898567786,"definition":{"title":"Executor memory used","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.executor.memory_used{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":4,"y":2,"width":4,"height":2}}]},"layout":{"x":0,"y":22,"width":12,"height":5,"is_column_break":true}},{"id":9309742160723324,"definition":{"title":"Spark job status","background_color":"orange","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":2863420659357408,"definition":{"type":"note","content":"Monitor your Spark jobs, stages, and tasks by status.\n\nIncrease in stage failures may signify there is a problem with processing a Spark task like incorrect Spark configuration or code errors. 
Correlate the failures with your driver logs to observe for an exception.","background_color":"gray","font_size":"14","text_align":"left","vertical_align":"center","show_tick":true,"tick_pos":"50%","tick_edge":"right","has_padding":true},"layout":{"x":0,"y":0,"width":3,"height":4}},{"id":6849390036288253,"definition":{"title":"Jobs by status","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.count{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"bars"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.count{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"purple","line_type":"solid","line_width":"normal"},"display_type":"bars"},{"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"avg:spark.job.count{status:failed,*,databricks_cluster_name:databricks-demo-2,*,$spark_node,$app_name,$databricks_cluster_name}"}],"response_format":"timeseries","style":{"palette":"warm","line_type":"solid","line_width":"normal"},"display_type":"bars"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":3,"y":0,"width":4,"height":4}},{"id":7340109218599876,"definition":{"title":"Job stages","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_active_stages{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_failed_stages{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"warm","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_skipped_stages{$spark_node,$databricks_cluster_name,$app_name}.as_count()"}],"response_format":"timeseries","style":{"palette":"grey","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":7,"y":0,"width":5,"height":2}},{"id":1996515998286553,"definition":{"title":"Job 
tasks","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_tasks{status:running,$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_completed_tasks{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"purple","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_skipped_tasks{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"grey","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"on_right_yaxis":false,"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.job.num_failed_tasks{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"warm","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":7,"y":2,"width":5,"height":2}}]},"layout":{"x":0,"y":27,"width":12,"height":5}},{"id":1379622522881311,"definition":{"title":"Spark stage metrics","background_color":"orange","show_title":true,"type":"group","layout_type":"ordered","widgets":[{"id":2881826412136297,"definition":{"title":"Shuffle Read/Write Records","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:spark.stage.shuffle_read_records{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"avg:spark.stage.shuffle_write_records{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":0,"width":6,"height":3}},{"id":3253303748856452,"definition":{"title":"Shuffle Read/Write 
Bytes","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.shuffle_read_bytes{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.shuffle_write_bytes{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.shuffle_write_bytes{*,databricks_cluster_name:databricks-demo-2,*,$spark_node,$app_name,$databricks_cluster_name}"}],"response_format":"timeseries","display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":6,"y":0,"width":6,"height":3}},{"id":7279970450902530,"definition":{"title":"Stage Input/Output Records","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.input_records{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.output_records{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":0,"y":3,"width":6,"height":3}},{"id":4518783810371237,"definition":{"title":"Stage Input/Output 
Bytes","title_size":"16","title_align":"left","show_legend":false,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"time":{"live_span":"4h"},"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.input_bytes{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"},{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"metrics","name":"query1","query":"sum:spark.stage.output_bytes{$spark_node,$databricks_cluster_name,$app_name}"}],"response_format":"timeseries","style":{"palette":"dog_classic","line_type":"solid","line_width":"normal"},"display_type":"line"}],"yaxis":{"include_zero":true,"label":"","scale":"linear","min":"auto","max":"auto"},"markers":[]},"layout":{"x":6,"y":3,"width":6,"height":3}}]},"layout":{"x":0,"y":32,"width":12,"height":7}}],"template_variables":[{"name":"spark_node","prefix":"spark_node","available_values":[],"default":"*"},{"name":"databricks_cluster_name","prefix":"databricks_cluster_name","available_values":[],"default":"*"},{"name":"app_name","prefix":"app_name","available_values":[],"default":"*"}],"layout_type":"ordered","notify_list":[],"reflow_type":"fixed","id":23404}

We will use this JSON to illustrate the variant Databricks SQL expressions added or updated along with the variant data type, as well as common ETL JSON-processing techniques. The JSON is displayed here on one line to save space; to review it in a human-readable format, we recommend using one of the free online JSON formatters or Visual Studio Code.

Variant expressions

The following expressions were added to Databricks SQL (schema_of_variant and schema_of_variant_agg, variant_get and try_variant_get, variant_explode, is_variant_null) or modified (cast, parse_json and try_parse_json, to_json) to handle the variant data type; a short sketch follows the list:

  • cast: to and from variant; the :: shorthand is also available
  • is_variant_null: checks whether a variant value is a variant null. Returns true if and only if the input is a variant null, and false otherwise, including for SQL null (variant null and SQL null are different)
  • parse_json and try_parse_json: JSON string to variant
  • schema_of_variant: returns the schema of a variant as a string
  • schema_of_variant_agg: returns the merged schema, in SQL format, of a variant column
  • to_json: variant to JSON string
  • variant_get and try_variant_get: extract a sub-variant from a variant according to a path, and then cast the sub-variant to the requested type
  • variant_explode: table function to unnest a variant, similar to posexplode
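A minimal sketch exercising several of these expressions on a small JSON literal (not the dashboard JSON); the literal and column aliases are illustrative only:

select schema_of_variant(parse_json('{"a": 1, "b": [2, 3], "c": null}')) sch
,variant_get(parse_json('{"a": 1, "b": [2, 3], "c": null}'), '$.b[1]', 'int') second_b
,try_variant_get(parse_json('{"a": 1}'), '$.missing', 'int') absent_key
,is_variant_null(variant_get(parse_json('{"c": null}'), '$.c')) variant_null_check
,to_json(parse_json('{"a": 1, "b": [2, 3], "c": null}')) back_to_string;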

We will show how to use some of these functions to parse and unnest the Datadog dashboard JSON above. We use a Databricks SQL notebook on a Runtime 15.3 all-purpose cluster with Photon enabled (AWS, r6id.xlarge for both driver and workers, 4 workers, assigned access mode, Unity Catalog enabled, no autoscaling).

Load JSON to a table

To load the sample JSON into a table, we first have to create a catalog, a schema, and a volume (or use existing ones). This can be done in Databricks SQL using create commands or in the Databricks UI. In our example we created the catalog variant_cat, schema variant_sch, and volume variant_vol in SQL:

create catalog if not exists variant_cat;
create schema if not exists variant_cat.variant_sch;
create volume if not exists variant_cat.variant_sch.variant_vol;

For convenience, we switch to this catalog and schema and create a managed Delta table variant_tbl with a field variant_fld of type variant to store the JSON.
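The use statement, shown here for completeness, with the table definition following:

use variant_cat.variant_sch;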

create or replace table variant_tbl (variant_fld variant);

Using the UI, we upload the JSON, saved as a file (DatabricksDatadogDashboard.json), to the volume variant_vol. Finally, we load this file into the variant_tbl table using the copy into command:

copy into variant_tbl 
from '/Volumes/variant_cat/variant_sch/variant_vol/DatabricksDatadogDashboard.json'
fileformat = json
format_options ('singleVariantColumn' = 'variant_fld')
copy_options ('mergeSchema' = 'true');

The other way to load JSON into a table is to use an insert into command:

insert into variant_tbl (variant_fld) values ('some json string'::variant);

This would work for a simpler JSON string, but in our case the JSON is too long and contains special characters, so the first approach is preferable.

A select * from this table now returns two rows: the dashboard JSON loaded with copy into and the test string we just inserted.

To delete the row we just inserted, a simple delete from variant_tbl where variant_fld = "some json string" will not work; we get a type mismatch error ("The `=` does not support ordering on type "VARIANT""). The right way to delete the record is to cast the variant to a string first:

delete from variant_tbl where variant_fld::string = "some json string"

Unnest JSON’s top-level keys into columns

First, we’d like to find out what the top-level keys in our JSON are. We use the variant_explode table function:

select distinct key
from variant_tbl
,lateral variant_explode(variant_fld)
order by 1;

As a result of this query, we get the keys description, id, layout_type, notify_list, reflow_type, template_variables, title, widgets.

Here we use the key field returned by the variant_explode table function. It returns three fields, pos, key, and value, which makes it similar to the posexplode table function, which also returns pos (position).
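A quick sketch showing all three returned fields on the same table; the value column is summarized with schema_of_variant to keep the output readable:

select pos
,key
,schema_of_variant(value) value_schema
from variant_tbl
,lateral variant_explode(variant_fld)
order by pos;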

The other, less elegant way to achieve the same result is to use a regular explode:

select keys.*
from variant_tbl
,lateral explode(map_keys(variant_fld::map<string,string>)) keys;

This comparison demonstrates the power and convenience of the variant_explode expression.

We would like to present our variant in a tabular format with top-level keys as columns.

First, we demonstrate the use of the variant_get function by explicitly calling it:

select variant_get(variant_fld,'$.description', 'string') description
,variant_get(variant_fld,'$.id', 'int') id
,variant_get(variant_fld,'$.layout_type','string') layout_type
,variant_get(variant_fld,'$.notify_list') notify_list
,variant_get(variant_fld,'$.reflow_type') reflow_type
,variant_get(variant_fld,'$.template_variables') template_variables
,variant_get(variant_fld,'$.title') title
,variant_get(variant_fld,'$.widgets') widgets
from variant_tbl;

The result is:

We usually don’t need to call the variant_get function explicitly; we can navigate the variant using the ":" notation and get the same result in a simpler and more elegant way:

select variant_fld:description::string
,variant_fld:id::int
,variant_fld:layout_type::string
,variant_fld:notify_list
,variant_fld:reflow_type
,variant_fld:template_variables
,variant_fld:title
,variant_fld:widgets
from variant_tbl;

In both cases, if the type is not specified, the field defaults to variant. In our example the first three fields are cast to the specified types, while all other fields are returned as variant. Please note that in the results above the value of the field reflow_type is in double quotes but the value of the field layout_type is not; that indicates that reflow_type is a variant, while layout_type is a string.
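A quick way to confirm this, assuming the table and field names from above: typeof returns the SQL type of an expression, while schema_of_variant returns the schema stored inside a variant.

select typeof(variant_fld:layout_type::string) layout_type_sql_type
,typeof(variant_fld:reflow_type) reflow_type_sql_type
,schema_of_variant(variant_fld:reflow_type) reflow_type_variant_schema
from variant_tbl;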

Unnest JSON along the known path

Let’s unnest deeper levels of the JSON using chained variant_explode table functions:

select a.value:name::string
,b.value:definition:widgets:[0]:layout:height::int
,c.value:layout:height::int
,c.value:id::string as nested_id
from variant_tbl
,lateral variant_explode (variant_fld:template_variables) a
,lateral variant_explode(variant_fld:widgets) b
,lateral variant_explode(b.value:definition:widgets) c

In this example, we explode the two top-level variants template_variables and widgets. From template_variables we parse the path template_variables:value:name. Widgets have another lower-level key with the same name; this nested widgets element is an array of variants. First, we parse the path widgets:definition:widgets:layout:height from only one element of this array (index 0). After the next explode, we can parse the same path from all elements of the nested widgets array. The very last explode is for this nested widgets array itself; it allows us to get the same path in a simpler and more concise manner. We also parse a nested_id from this level, which we may need later.

Here is the result of this query:

Inline explode in combination with variant_explode

Sometimes we may need to present data in tabular format without explicitly listing the fields. This technique is called "flattening" in ETL. For this type of task, we use the inline table function in Databricks SQL. This function requires an array of structs as an argument, and the variant function schema_of_variant helps us prepare that argument. Let’s get the schema of the top-level variant template_variables:

select schema_of_variant(variant_fld:template_variables) from variant_tbl;

We get ARRAY<STRUCT<available_values: ARRAY<VOID>, default: STRING, name: STRING, prefix: STRING>>, which is an array of structs, but we can’t use it directly:

  • First, we have to unnest this column to get elements of type STRUCT<available_values: ARRAY<VOID>, default: STRING, name: STRING, prefix: STRING> and wrap them with the array SQL function
  • Second, we have to replace VOID with STRING

with cte as (select value 
from variant_tbl
,lateral variant_explode(variant_fld:template_variables))
select inline(array
(value::struct < available_values: array<string>
,default: string
,name: string
,prefix: string
>
)
)
from cte;

Result:

Now let’s look at a more complex example: parsing the top-level field widgets. We get the schema of this field the same way, using the same schema_of_variant function. To make the example more interesting, we also add a few chained inlines. As an example, we use a regular explode this time instead of variant_explode, which works as well:

with cte as (select explode(variant_fld:widgets::array<variant>) level1 
from variant_tbl)
,cte1 as (select inline(array(level1::STRUCT<definition: STRUCT<background_color: STRING, banner_img: STRING, layout_type: STRING, show_title: BOOLEAN, title: STRING, type: STRING, widgets: ARRAY<STRUCT<definition: STRUCT<background_color: STRING, content: STRING, custom_unit: STRING, font_size: STRING, has_padding: BOOLEAN, legend_columns: ARRAY<STRING>, legend_layout: STRING, markers: ARRAY<STRING>, no_group_hosts: BOOLEAN, no_metric_hosts: BOOLEAN, node_type: STRING, precision: BIGINT, query: STRING, requests: VARIANT, scope: ARRAY<STRING>, show_legend: BOOLEAN, show_tick: BOOLEAN, style: STRUCT<palette: STRING, palette_flip: BOOLEAN>, text_align: STRING, tick_edge: STRING, tick_pos: STRING, time: STRUCT<live_span: STRING>, title: STRING, title_align: STRING, title_size: STRING, type: STRING, vertical_align: STRING, yaxis: STRUCT<include_zero: BOOLEAN, label: STRING, max: STRING, min: STRING, scale: STRING>>, id: BIGINT, layout: STRUCT<height: BIGINT, width: BIGINT, x: BIGINT, y: BIGINT>>>>, id: BIGINT, layout: STRUCT<height: BIGINT, is_column_break: BOOLEAN, width: BIGINT, x: BIGINT, y: BIGINT>>))
from cte)
select *
except (cte1.definition
,cte1.layout
,a.widgets
,c.definition
,c.layout)
from cte1
,lateral inline (array(definition)) a
,lateral inline (array(layout)) b
,lateral inline (widgets) c
,lateral inline (array(c.layout)) d
,lateral inline (array(c.definition)) e;

The result looks like this:

Dynamic parsing

Let’s do the inline parsing dynamically; for that, let’s switch to Python, since Databricks notebooks allow us to mix cells. We parse the JSON using its schema and rename the fields, which may be needed, for example, to differentiate same-named elements on various levels and to remove special symbols. To suppress special symbols, enclose field names in backticks:

%python

# Parsing
sch = spark.sql("select replace(schema_of_variant(variant_fld),'VOID','STRING') from variant_tbl").first()[0]
df = spark.sql(f"select inline(array(variant_fld::{sch})) from variant_tbl")
print("After parsing:\n",df.schema.names)

# Renaming
for x in df.schema.names:
    df = df.withColumnRenamed(x,'`{}`'.format('top_level_'+x))
print("After renaming:\n",df.schema.names)

Result:

After parsing:
['description', 'id', 'layout_type', 'notify_list', 'reflow_type', 'template_variables', 'title', 'widgets']
After renaming:
['`top_level_description`', '`top_level_id`', '`top_level_layout_type`', '`top_level_notify_list`', '`top_level_reflow_type`', '`top_level_template_variables`', '`top_level_title`', '`top_level_widgets`']

Appendix A

Editing a large Datadog dashboard JSON manually can be time-consuming and error-prone. Let’s see if we can automate this process with Python. As an example, we would like to obfuscate all ids in our JSON: we exported it from a real-life Databricks-to-Datadog integration, and although the ids are meaningless, it is good practice to obfuscate them. Here is how we do it using the variant data type and SQL in PySpark:

%python

import random

sql = """
with cte as (select variant_fld:id::string id0
,t.value:id::string id1
,w.value:id::string id2
from variant_cat.variant_sch.variant_tbl
,lateral variant_explode(variant_fld:widgets) t
,lateral variant_explode(t.value:definition:widgets) w)
select id0 from cte
union
select id1 from cte
union
select id2 from cte

"""

df = spark.sql(sql)

lst = [x[0] for x in df.collect()]

f = open('/Volumes/variant_cat/variant_sch/variant_vol/DatabricksDatadogTemplate.json', "r")
s = f.read()
f.close()

for x in lst:
    s = s.replace(x,str(random.randint(10000000000000000, 99999999999999999))[:len(x)])


f = open('/Volumes/variant_cat/variant_sch/variant_vol/DatabricksDatadogDashboard.json', "w")
f.write(s)
f.close()

Appendix B

GitHub repository with the notebook containing the example code, and the JSON file.

Happy Databricks coding, enjoy the variant!
