Use Databricks’ spark-xml to parse nested XML and create CSV files.

Tenny Susanto
Feb 9, 2017 · 1 min read

I needed to parse some XML files with nested elements and convert them to CSV files so they could be consumed downstream by another team. Databricks has two very cool libraries just for that: spark-xml makes parsing XML data so much easier using Spark SQL, and spark-csv makes it a breeze to write out CSV files. Here’s a quick demo using spark-shell; include the packages like so:

spark-shell --packages com.databricks:spark-xml_2.10:0.4.1,com.databricks:spark-csv_2.10:1.5.0

Here’s the XML file I’m parsing.

Note how there are many <LineItem> elements per <Transaction>. I needed to flatten out all the line items, repeating the values from the parent elements for each line item. That is exactly what the explode function does. Also note how to query an XML attribute versus an XML element (look at OperatorID in the query below): spark-xml exposes attributes with a leading underscore (e.g. _OperatorName) and an element’s own text as _VALUE.
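The actual POSLog file isn’t reproduced here, so as a point of reference, here’s a hypothetical fragment that mirrors the fields referenced in the query below (element names match the query; the values are made up):

```xml
<!-- Hypothetical fragment, not the real POSLog data -->
<Transaction>
  <RetailStoreID>100</RetailStoreID>
  <WorkstationID>3</WorkstationID>
  <!-- OperatorName is an attribute; 42 is the element's text (_VALUE) -->
  <OperatorID OperatorName="Jane">42</OperatorID>
  <CurrencyCode>USD</CurrencyCode>
  <RetailTransaction>
    <ReceiptDateTime>2014-09-30T06:35:00</ReceiptDateTime>
    <TransactionCount>1</TransactionCount>
    <LineItem>
      <SequenceNumber>1</SequenceNumber>
      <Tax><TaxableAmount>9.99</TaxableAmount></Tax>
    </LineItem>
    <LineItem>
      <SequenceNumber>2</SequenceNumber>
      <Tax><TaxableAmount>4.50</TaxableAmount></Tax>
    </LineItem>
  </RetailTransaction>
</Transaction>
```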

In spark-shell, run the following:

import org.apache.spark.sql.functions.explode

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Transaction")
  .load("/user/tsusanto/POSLog-201409300635-21.xml")

// one output row per LineItem, with the parent Transaction values repeated
val flattened = df.withColumn("LineItem", explode($"RetailTransaction.LineItem"))

val selectedData =
  $"RetailStoreID",
  $"WorkstationID",
  $"OperatorID._OperatorName" as "OperatorName", // XML attribute
  $"OperatorID._VALUE" as "OperatorID",          // XML element value
  $"CurrencyCode",
  $"RetailTransaction.ReceiptDateTime",
  $"RetailTransaction.TransactionCount",
  $"LineItem.SequenceNumber",
  $"LineItem.Tax.TaxableAmount")
, false) // false = don't truncate column values
  .option("header", "true")
  .mode("overwrite")
  .save("POSLog-201409300635-21_lines")
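To build intuition for what explode is doing above: it behaves like a flatMap over the nested array, emitting each parent row once per child element with the parent’s columns repeated. Here’s a plain-Scala sketch of the same flattening, using hypothetical case classes (not the real POSLog schema):

```scala
// Hypothetical simplified records mirroring the Transaction/LineItem nesting
case class LineItem(sequenceNumber: Int, taxableAmount: Double)
case class Transaction(retailStoreID: String, lineItems: Seq[LineItem])

val txns = Seq(
  Transaction("store-1", Seq(LineItem(1, 9.99), LineItem(2, 4.50))),
  Transaction("store-2", Seq(LineItem(1, 12.00)))
)

// Equivalent of withColumn("LineItem", explode(...)) followed by a select:
// parent columns are repeated for every child element
val flattened = txns.flatMap { t =>, li.sequenceNumber, li.taxableAmount))
}
// store-1 appears twice in the output, once per line item
```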
