Use Databricks’ spark-xml to parse nested XML and create CSV files.
I needed to parse some XML files with nested elements and convert them to CSV files so they could be consumed downstream by another team. Databricks has two libraries just for that: spark-xml makes parsing XML data with Spark SQL much easier, and spark-csv makes it a breeze to write the results out as CSV. Here’s a quick demo using spark-shell; include both packages like so:
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1,com.databricks:spark-csv_2.10:1.5.0
Here’s the XML file I’m parsing.
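(The full file is too long to reproduce here, so below is a trimmed sketch of its shape, inferred from the fields used in the query further down. The element and attribute names match the query; the values are just placeholders.)

<Transaction>
  <RetailStoreID>99</RetailStoreID>
  <WorkstationID>3</WorkstationID>
  <OperatorID OperatorName="Jane">42</OperatorID>
  <CurrencyCode>USD</CurrencyCode>
  <RetailTransaction>
    <ReceiptDateTime>2014-09-30T06:35:00</ReceiptDateTime>
    <TransactionCount>1</TransactionCount>
    <LineItem>
      <SequenceNumber>1</SequenceNumber>
      <Tax><TaxableAmount>9.99</TaxableAmount></Tax>
    </LineItem>
    <LineItem>
      <SequenceNumber>2</SequenceNumber>
      <Tax><TaxableAmount>4.50</TaxableAmount></Tax>
    </LineItem>
  </RetailTransaction>
</Transaction>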
Note that there are many <LineItem> elements per <Transaction>. I needed to flatten out all the line items, repeating the values from the parent elements for each one; the explode function does exactly that. Also note how querying an XML attribute differs from querying an XML element (look at OperatorID in the query below: attributes get a leading underscore, and _VALUE holds the element’s own text).
In spark-shell, run these:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._            // enables the $"column" syntax
import org.apache.spark.sql.functions.explode

// Read the file, treating each <Transaction> element as one row
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Transaction")
  .load("/user/tsusanto/POSLog-201409300635-21.xml")

// explode turns the LineItem array into one row per line item,
// repeating the parent Transaction's values on every row
val flattened = df.withColumn("LineItem", explode($"RetailTransaction.LineItem"))

// OperatorID._OperatorName reads the OperatorName attribute (attributes are
// prefixed with an underscore), while OperatorID._VALUE reads the element's text
val selectedData = flattened.select(
  $"RetailStoreID",
  $"WorkstationID",
  $"OperatorID._OperatorName" as "OperatorName",
  $"OperatorID._VALUE" as "OperatorID",
  $"CurrencyCode",
  $"RetailTransaction.ReceiptDateTime",
  $"RetailTransaction.TransactionCount",
  $"LineItem.SequenceNumber",
  $"LineItem.Tax.TaxableAmount")

selectedData.show(3, false)

// Write the flattened rows out as CSV with a header line
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode("overwrite")
  .save("POSLog-201409300635-21_lines")
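If you want to verify how spark-xml inferred the nesting before flattening, printSchema is handy. The output below is a sketch based on the sample structure above, not verbatim from this file:

df.printSchema()
// root
//  |-- RetailStoreID: long (nullable = true)
//  |-- WorkstationID: long (nullable = true)
//  |-- OperatorID: struct (nullable = true)
//  |    |-- _VALUE: long (nullable = true)
//  |    |-- _OperatorName: string (nullable = true)
//  |-- CurrencyCode: string (nullable = true)
//  |-- RetailTransaction: struct (nullable = true)
//  |    |-- ReceiptDateTime: string (nullable = true)
//  |    |-- TransactionCount: long (nullable = true)
//  |    |-- LineItem: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)

One thing to keep in mind: save() writes a directory named POSLog-201409300635-21_lines containing part-* files (one per partition), not a single CSV file. If the downstream team needs a single file, call coalesce(1) on the DataFrame before writing, or merge the part files afterwards.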