Transform JSON & XML Data and perform REST calls in DataStage — a Walkthrough of the Hierarchical Data Stage

Alexander Polus

Published in

IBM Data Science in Practice

5 min readJul 18, 2022

DataStage on Cloud Pak for Data v4.5: New Features Release

It’s hard to believe it’s been seven months since the re-architected DataStage was released on Cloud Pak for Data in…

medium.com

You have JSON or XML data that you need to process or you would like to perform REST calls. The Hierarchical Data stage provided by IBM DataStage allows for processing of hierarchical data (data that comes in the form or records organized into a tree-like structure, or parent-child structure) which includes two industry-standards, JSON and XML documents. For each type, this stage offers both parsing and composing functionality. Let’s take a look at the Hierarchical Data stage’s components in detail including for JSON and XML data processing. We’ll also examine how it can be used to create a connection to a Representational State Transfer (REST) service.

Setting up the flow: Adding the Hierarchical Data Stage to your canvas and using the Assembly Editor

When you add a Hierarchical Data stage to a DataStage Flow, you will utilize a sub-flow design paradigm to separate the details of the Hierarchical Data stage from the rest of the parent flow. By double clicking the Hierarchical Data stage, you open its stage details, where you can select Edit assembly to interact with the stage’s assembly sub-flow.

The Hierarchical Stage and its tear sheet

The Hierarchical Data stage’s Assembly Editor reuses the flow design palette from the parent canvas with drop-down menus and stages, including options for the Input and Output step, Transformation steps, Web Services (REST step), and the Parser and Composer functions.

The Input and Output Steps

The Hierarchical Data stage includes an input step and an output step within its assembly as unique entry and exit points for the sub-flow. The Hierarchical Data stage supports multiple input and output links.

View of the Assembly Editor including an input step, REST step, JSON parser, and output step

Between the input and output step is when you would define a REST step or set up a reusable connection for multiple REST steps, parse or compose the JSON/XML input or output, and apply any transformations before passing the final tabular data outside of the Hierarchical Data Stage’s assembly. The input and output steps are automatically created in the Assembly Editor when a Hierarchical Stage is brought to the canvas, but are also present in the palette in case either the input or output is deleted and must be re-added. The output step offers some additional functionality that we’ll look at later.

Web Services: REST Steps

One of the most common ways of connecting to applications is via REST (Representational State Transfer) APIs. You can drag and drop a REST step from the Hierarchical Data stage palette onto a link (this adds the step to the flow). Double-click it and a dialog box will pop up where you can select an HTTP method along with the URL for the service (you can also create a reusable connection so that while the job is running the connection is shared to invoke multiple REST services created by the multiple steps which use the same reusable connection). Furthermore, on the Security Tab, you can select from a set of authentication protocols and supply the associated credentials. Keep in mind that the URL for the REST endpoint can be parameterized to allow more flexibility for mutable queries. The Request and Response tabs are where you can specify the request body and response body of the REST step. The Mappings tab contains the parameters or elements in the connection and request which need to be mapped to fixed values or input data items.

The REST step tear sheet, offering selections for HTTP Method, the URL to contact, and more

Parsers and Composers

A GET call to a REST API (as an example) can provide response data. Certain REST endpoints will return JSON data. Fortunately, JSON is one of two standard formats that the Hierarchical Stage can parse or compose (the other being XML). Opening the JSON Parser’s dialog, you first see the ability to choose between three JSON sources: a string set (a node that contains JSON data which can be coming from a previous input step or can be a chunked node from a previous parser step), a single file, or a file set (read multiple JSON files that are based on the same schema). Next, you Select the string set or file(s) that you want to process. Finally, you Browse for an available data schema corresponding to our data (a JSON schema describing the JSON data that you want to process), or add a new one via the Open libraries tab. This step will advance the data in tabular form.

The parser step offering string and file options for the data source

Transformations

The tabular data can now be transformed using a subset of DataStage’s usual stages, including: Aggregate, HJoin, H-Pivot, Order join, Regroup, Sort, Switch, Union, and V-Pivot. Other stages or transformations should be applied on the main canvas, outside the scope of the Assembly Editor and Hierarchical Data stage.

The Output Step

Once the data and schema have been selected, use the output step to map our output starting at our selected level of hierarchy. The values with a checkmark on the right will be mapped to output.

To review what the output will look like, and to edit the schema of the output table, visit the Output column tab:

The Output column tab previewing the tabular schema

Here you can edit column names, data types, and more. Once the output step is complete, our Hierarchical Data stage is ready to process your hierarchical data.

Get Started

If you would like to try the Hierarchical Data stage for JSON or XML document processing (or performing REST calls as part of your data pipelines), you can get started with the DataStage on Cloud Pak for Data as a Service Multicloud Data Integration trial. A DataStage instance is automatically provisioned for you and you can run your first DataStage job in less than 5 minutes! You can also use the Hierarchical Data stage with the UI shown in this blog post in DataStage on Cloud Pak for Data v4.5 software.

DataStage Use Cases: https://www.ibm.com/products/datastage/use-cases

DataStage is an integral part of a Data Fabric: IBM was named a leader in The Forrester Wave™: Enterprise Data Fabric, Q2 2022