Using Cloud CDF to transform XML to JSON

Shashank Tilwalli
Google Cloud - Community
5 min readJul 14, 2022

As of today Dataplex only allows you to store and organize data from Cloud Storage buckets & BigQuery. Dataplex supports formats such as Parquet, Avro, ORC, JSON & CSV.

But what about a format like XML? Or what if one has stored all its files in SFTP server instead of BigQuery or Cloud Storage? How would Dataplex support (Remember Dataplex only supports BQ or Cloud Storage).

In those cases, one of the ways is to use Google Cloud Data Fusion (CDF). Google CDF is a visual point & click solution for building ELT/ETL data pipelines. CDF is a managed solution that is powered by CDAP. These integrations or pipelines can be triggered manually or can be scheduled/automated. You can read more about it in the documentation links given below.

Recently I was working with a customer who has been exploring Dataplex but had a lot of their data stored in XML on an Sftp Server. We wanted to architect a pipeline which would read data from this SFTP Server, transform it to JSON & then write it to Google Cloud Storage or BigQuery. This article will show you how easy it is to use Cloud Data Flow to build such a pipeline.

Pre-requisites

  1. Cloud Data Fusion
  2. SFTP server with credentials
  3. Temporary GCS bucket with a sample XML File
  4. Destination GCS bucket in Google Cloud Storage

Deploying FTP Connector from the Hub

  1. On the Cloud Data Fusion Home Page click on HUB
  2. Search for FTP & Click on FTP Plugin v 3.0.0. This would deploy the FTP connector which you can later use to build your ETL process.

Building the ETL Flow

  1. From the Google Cloud Data Fusion select Wrangle. Think Wrangler as a transformation tool with predefined parsers.
  2. Select Cloud Storage Default from GCS and select the temporary bucket where you have saved your sample XML file & select a sample XML file from the bucket.

3. Change the data-type to String . By default when you select the sample file, the default format is byte. The parser accepts String as the data type which is why we are changing the type to String

4. On the same tab, also select Parse -> XML to JSON & select ‘Depth’ to 1. Cloud data fusion has many built in parsers. You can not only transform XML to JSON, you could also parse CSV, change data to Upper Case/Lower case, Manipulate dates etc.

5. Click on Create Pipeline -> Batch Pipeline. Also note that at this stage you should be able to see the corresponding JSON of the sample XML that you had selected.

6. Select ‘GCS’ from Sink and select ‘FTP’ from Source.

7. Delete the existing source .i.e. GCS File and set FTP as the source. Complete the wiring. At this stage your canvas should look like this.

8. Click on Source FTP & supply your SFTP details in the properties. To make it simple i am giving the full path (Ex: sftp://username:password@hostname:22/path/to/xmlfile.xml) of the XML file that I wish to transform & Click Validate

9. Go back to canvas and click on Wrangler properties. You should see that the directive is automatically set to setting the type to String & parsing XML to Json. Also note the output schema should be JSON schema that the XML will transform to.

10. Back on the canvas, Click on Cloud Storage at Sink. Provide information like bucket path where JSON needs to be uploaded, format as JSON & content-type as application/json

11. On the canvas and click on Preview->Run, you should now be able to convert XML to JSON whose source is SFTP

Summary

As you can see with a few clicks in the user interface how easy it was to build such a pipeline. Cloud Data Fusion offers more than 150+ connectors that can be used to implement various ELT/ETL use cases. To get started you can either use the Google Cloud Data Fusion or install the CDAP framework on your laptop. I would recommend using the local install for the development as it is much more flexible and allows you to debug complex flows. You can always export and import the projects once developed.

Below are some of the links that will give more information

  1. CDAP Open Source framework
  2. Google Cloud Data Fusion Documentation
  3. Google Dataplex Documentation

--

--