IBM Cloud Pak for Data v5.0 DataStage User Experience: New Features and Improvements

Ryan Pham
11 min read · Jun 18, 2024


Bringing new features to market for DataStage within the IBM Cloud Pak for Data platform is challenging for both on-prem and Cloud environments: both require user-friendly interfaces, intuitive workflows, and comprehensive support to empower users to harness the platform’s rich data capabilities. With the release of IBM Cloud Pak for Data v5.0, the DataStage team has once again delivered on this commitment, ensuring that the new features meet the evolving needs of data professionals through a cohesive and optimized experience across both on-prem and Cloud platforms.

Support for folders

Folder support has finally arrived on the CPD platform, allowing you to better organize your assets. To get started, you must opt in to this feature: go to the project’s Manage > General > Controls page and click the Enable folders button.

While this feature is in beta, you can create, rename, move, and delete folders. For migration, you must import your 11.7 .isx or CPD .zip files. The Assets page left navigation panel now displays a Folders tab that you can click to view your assets by folder. For example, you might expand folders such as Root/Folder-A and Root/Folder-A/Subfolder-1; the count to the right of each folder represents the number of assets it contains, such as Root/Folder-A/Subfolder-1 having 11 assets. Clicking a folder shows the assets contained within it. You can also use the folder breadcrumbs to navigate back to a previous folder. This feature will mature in later point releases of DataStage in Cloud Pak for Data v5.x.
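If you’re curious how the per-folder counts could be computed, here is a minimal Python sketch; the asset paths and the direct-count rule are illustrative assumptions, not DataStage internals.

```python
from collections import Counter

# Hypothetical asset paths; in the UI, each asset lives in exactly one folder.
asset_paths = [
    "Root/Folder-A/Flow1",
    "Root/Folder-A/Subfolder-1/Flow2",
    "Root/Folder-A/Subfolder-1/ParamSet1",
]

# Tally the direct assets per folder. Whether assets in subfolders also roll
# up into the parent folder's count is left out of this sketch.
counts = Counter("/".join(path.split("/")[:-1]) for path in asset_paths)

for folder, n in sorted(counts.items()):
    print(f"{folder}: {n} asset(s)")
# Root/Folder-A: 1 asset(s)
# Root/Folder-A/Subfolder-1: 2 asset(s)
```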

Asset relationship viewer and impact analysis

Say you were to rename a parameter set. Without knowing which assets use that parameter set, you would have no idea which flows could be impacted or need to be updated; you could potentially break hundreds of flows. With relationships, you have insight into the dependencies between DataStage assets.

DataStage in Cloud Pak for Data v5.0 automatically registers “uses” relationships between DataStage assets, as well as the inverse “is used by” relationships. For example:

  • A DataStage job uses a DataStage flow; the inverse is that a flow is used by a job.
  • A DataStage flow uses a parameter set; the inverse is that a parameter set is used by a DataStage flow.
  • A DataStage flow uses a DataStage subflow; the inverse is that a subflow is used by a DataStage flow.
  • A DataStage subflow uses a connection; the inverse is that a connection is used by a DataStage subflow.

Click an asset’s overflow menu from the Assets view and choose View relationships. The same action may be available when editing an asset, from either the toolbar or the asset (i)nfo panel. Once clicked, the asset relationship viewer launches in a new browser tab.

This example shows the relationships for a job, MyFlow.DataStage job, and all the assets it uses. Use the context switch to toggle between viewing the Uses and Is used by relationships. The left panel shows a hierarchy of the assets used: here, MyFlow.DataStage job uses MyFlow, which uses MyParamSet and MySubflow, and MySubflow in turn uses MyConnection. You can Open an asset or view its relationships. The data table lets you view all related assets or only first-level ones. Use the Find text field to search for a particular asset.

The View relationships action will also be available when performing other asset actions, such as deleting a subflow. On the confirmation modal, you can click the View relationships button to launch the asset relationship viewer and review the impact of proceeding with the action.
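Conceptually, these relationships form a directed graph, and impact analysis is a traversal of the inverse “is used by” edges. Here is a minimal Python sketch using the asset names from the example above; the data structures are illustrative, not the DataStage implementation.

```python
from collections import defaultdict

# "uses" edges between assets; the inverse "is used by" graph is derived.
uses = {
    "MyFlow.DataStage job": ["MyFlow"],
    "MyFlow": ["MyParamSet", "MySubflow"],
    "MySubflow": ["MyConnection"],
}

is_used_by = defaultdict(list)
for asset, dependencies in uses.items():
    for dependency in dependencies:
        is_used_by[dependency].append(asset)

def impacted_by(asset: str) -> set:
    """Return every asset that directly or transitively uses `asset`."""
    impacted, stack = set(), [asset]
    while stack:
        for user in is_used_by[stack.pop()]:
            if user not in impacted:
                impacted.add(user)
                stack.append(user)
    return impacted

# Renaming MyParamSet would impact the flow and, transitively, the job:
print(impacted_by("MyParamSet"))
# {'MyFlow', 'MyFlow.DataStage job'}
```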

New features for Import

You can now choose which asset types to import when importing a .isx or .zip file using New asset > DataStage flow > Local file. By default, all asset types are included. This is very useful when you’re given a file and only want to import the DataStage flows, parameter sets, pipelines, and subflows. Look for this feature to add options for fine-grained control and enhancements to Download in later point releases of DataStage in Cloud Pak for Data v5.x.
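To picture what the type filter does, here is a small Python sketch that filters a hypothetical export manifest by asset type before import; the manifest structure and type identifiers are assumptions for illustration.

```python
# Hypothetical manifest of an exported .zip; the structure and the type
# identifiers are illustrative assumptions, not the actual archive format.
manifest = [
    {"name": "Flow1", "type": "data_intg_flow"},
    {"name": "ParamSet1", "type": "parameter_set"},
    {"name": "Conn1", "type": "connection"},
    {"name": "Subflow1", "type": "data_intg_subflow"},
]

# Keep only the asset types the user selected in the import dialog.
selected_types = {"data_intg_flow", "parameter_set", "data_intg_subflow"}
to_import = [asset for asset in manifest if asset["type"] in selected_types]

print([asset["name"] for asset in to_import])
# ['Flow1', 'ParamSet1', 'Subflow1']
```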

New features for DataStage flow compile

Compile has several exciting changes for DataStage in Cloud Pak for Data v5.0. On the Assets > DataStage flows view, the data table now shows you the Last compiled details for each flow; a sketch after the list below shows one way these states can be derived. The state can be:

  • Compiled — the flow has been compiled successfully, and the compiled version reflects the latest flow.
  • Not compiled — the flow has never been compiled successfully.
  • Stale — the flow has been compiled successfully, but the current flow has since been modified and not re-compiled.
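As a minimal sketch, assuming each flow tracks a last-compiled and a last-modified timestamp (an assumption, not the actual implementation):

```python
from datetime import datetime
from typing import Optional

def compile_state(last_compiled: Optional[datetime],
                  last_modified: datetime) -> str:
    """Derive the Last compiled state from two per-flow timestamps."""
    if last_compiled is None:
        return "Not compiled"
    if last_modified > last_compiled:
        return "Stale"
    return "Compiled"

print(compile_state(None, datetime(2024, 6, 1)))                  # Not compiled
print(compile_state(datetime(2024, 6, 2), datetime(2024, 6, 1)))  # Compiled
print(compile_state(datetime(2024, 6, 1), datetime(2024, 6, 2)))  # Stale
```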

On the Assets > DataStage flows view, you can also [x] bulk-select and one-click Compile to recompile the selected flows.

Pushdown support

DataStage’s primary processing model is Extract, Transform, and Load (ETL), in which data is read into memory, processed, and then written to a target. By default, all jobs run in ETL mode in DataStage. Using the DataStage flow Canvas > Settings > Compile tab, you can run your job in an optimized mode using SQL Pushdown, which moves data processing work to the source or target databases (a sketch after the list below illustrates the difference):

  • Pushdown to source (TETL) pushes transformation work down to the source database, performing as much work as possible before extraction. If the DataStage flow can be only partially converted to SQL, the remaining work is performed in the chosen environment.
  • Pushdown to target (ELT) pushes transformation work down to the target database, performing as much work as possible after loading. The remaining work is performed in the chosen environment. When the analysis determines that the DataStage flow can be converted to SQL, ELT mode is used, and DataStage compiles the flow to SQL. When the analysis determines that the DataStage flow can only be partially converted to SQL, ETL and ELT modes are used as needed. See https://www.ibm.com/docs/en/cloud-paks/cp-data/4.8.x?topic=jobs-elt-run-mode for additional information.
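To make the distinction concrete, here is a rough Python sketch contrasting the same filter-and-aggregate logic executed in engine memory (ETL) with the equivalent SQL pushed to the database (ELT). The SQL is illustrative only; it is not what the DataStage code generator actually emits.

```python
rows = [("East", 100), ("West", 250), ("East", 75)]

# ETL mode: extract the rows, then transform them in engine memory.
def etl_totals(rows):
    totals = {}
    for region, amount in rows:
        if amount >= 100:  # filter stage
            totals[region] = totals.get(region, 0) + amount  # aggregate stage
    return totals

# ELT mode: the same logic expressed as SQL for the database to execute
# (illustrative SQL, not generated DataStage code).
ELT_SQL = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount >= 100
GROUP BY region
"""

print(etl_totals(rows))  # {'East': 100, 'West': 250}
```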

Canvas Design-time Improvements

We continue to improve the user experience when editing a DataStage flow. On the DataStage flow Canvas > Settings > Design tab, the following features have been added:

  • Save frequency allows you to control when the flow is saved. When you close a stage Details card, the changes are automatically saved by default. However, selecting Apply changes temporarily does not persist changes made to a flow until you click the Canvas toolbar Save action. This allows you to switch between stages and links, make your desired changes, and then click Save to persist all your changes.
  • Column metadata change propagation toggle lets you decide whether changes made to column metadata, such as changing a column COL_1 from CHAR(100) to VARCHAR(256), are propagated downstream. When toggled off, changes made to column metadata will not be propagated downstream. Even if you leave the Canvas setting toggled on, you can still go into a stage and turn it off for that stage. When a stage’s column metadata change propagation is toggled off, changes made to column metadata will not flow beyond that stage.
  • Flow connection parameter value session cache allows you to clear any parameterized flow connection values you entered during a successful Preview data, Test connection, or browse connection.

Notification when DataStage flow is modified by someone else

As we move toward a more collaborative user experience, we’ve decided not to implement asset locking. In 11.7, when you edited a job, the job was locked and other users could not edit it. If you needed access to the job and couldn’t find who had the lock, you had to get an administrator involved to force an unlock. Instead, the DataStage flow Canvas now notifies you when someone else has modified the flow you are viewing. Click the Reload button to refresh your canvas with the latest version. If you choose not to and Save, your change will overwrite the other user’s change. For now, you must coordinate your changes with the other user. Expect additional improvements in this area in future CPD v5.x releases, including rolling this feature out to all DataStage asset types.
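The notify-and-reload behavior is similar in spirit to optimistic concurrency control. Here is a minimal sketch, assuming each save compares the revision you loaded against the server’s current revision; the field names and behavior details are illustrative assumptions.

```python
class StaleFlowError(Exception):
    """Raised when someone else modified the flow since you loaded it."""

server = {"flow": {"revision": 3, "body": "their change"}}

def save_flow(loaded_revision: int, body: str, force: bool = False):
    current = server["flow"]["revision"]
    if loaded_revision != current and not force:
        # This is where the canvas would show the notification;
        # Reload would fetch the latest revision.
        raise StaleFlowError("Flow was modified by someone else; reload first.")
    server["flow"] = {"revision": current + 1, "body": body}

try:
    save_flow(loaded_revision=2, body="my change")
except StaleFlowError as err:
    print(err)

# Saving anyway overwrites the other user's change, as described above.
save_flow(loaded_revision=2, body="my change", force=True)
```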

New DataStage flow and subflow Replace action

When working on a subflow, you will typically create a duplicate of the subflow so as not to break existing flows. You then edit the duplicate, make changes, and test. And now what? You could edit all the flows that use the original subflow and replace it with the new subflow, but that is a lot of work.

Using the new Assets > DataStage flow or DataStage subflow > Replace action, you can replace a DataStage flow or DataStage subflow with a different asset. This action is also available on the canvas toolbar. First, you’ll be prompted to choose the asset to replace with; in this example, Subflow1 will be replaced with Subflow 2. In addition, you can choose whether or not to recompile the DataStage flows that use the subflow. The compile step, while optional, is recommended because DataStage flows that reference the subflow need to be recompiled to pick up the new subflow. When replacing a DataStage flow, you will want to select [x] Compile if you updated the parameters used in the flow.

Job run improvements

Two job run-related improvements to call out:

  • Job runs now have a Job priority queue. You can choose whether your job run is Low, Medium (default), or High priority. This is useful when you have many queued jobs: higher-priority job runs are executed before lower-priority ones (see the sketch after this list). You can specify a project-level default using Manage > DataStage > Settings, or set the priority within the DataStage flow Canvas > Settings > Run or on the Jobs > Job Details > Edit configuration > Settings page.
  • You can now re-run a job in one click from Jobs > Job Details by clicking the Run job icon. The Settings page appears, allowing you to configure the run or its runtime parameters.
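The queueing behavior resembles a classic priority queue that falls back to first-in, first-out within a priority level. Here is a minimal Python sketch; the tie-breaking rule is an assumption of the sketch, not documented behavior.

```python
import heapq
import itertools

# Lower number = higher priority; Medium is the default, as in the UI.
PRIORITY = {"High": 0, "Medium": 1, "Low": 2}
order = itertools.count()  # preserves submission order within a priority

queue = []

def submit(job_name: str, priority: str = "Medium"):
    heapq.heappush(queue, (PRIORITY[priority], next(order), job_name))

submit("nightly-load", "Low")
submit("adhoc-report")            # Medium by default
submit("sla-critical", "High")

while queue:
    _, _, job_name = heapq.heappop(queue)
    print(job_name)
# sla-critical, adhoc-report, nightly-load
```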

Job run metrics for stages and links

For a DataStage flow job, you can now view the job run metrics, including the average throughput, rows written, rows read, and elapsed time for the flow overall, as well as between two stages. These metrics are updated in real time to reflect the job run execution. If configured, they are also persisted in the DataStage metrics repository. On a DataStage flow Canvas, click the Run metrics icon to open the Run metrics tab. You can also view these metrics from Jobs > Job run details. Use the dropdown to filter the metrics shown (In progress, Completed, or Failed) or the Find text input to find a particular stage or link.
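As a rough illustration of the kind of numbers the panel reports, here is a small sketch that derives average throughput from rows written and elapsed time; the exact formula DataStage uses is an assumption here.

```python
from dataclasses import dataclass

@dataclass
class LinkMetrics:
    rows_read: int
    rows_written: int
    elapsed_seconds: float

    @property
    def throughput(self) -> float:
        """Average rows per second (the exact formula the run metrics
        panel uses is an assumption of this sketch)."""
        if self.elapsed_seconds == 0:
            return 0.0
        return self.rows_written / self.elapsed_seconds

metrics = LinkMetrics(rows_read=1_000_000, rows_written=1_000_000,
                      elapsed_seconds=25.0)
print(f"{metrics.throughput:,.0f} rows/sec")  # 40,000 rows/sec
```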

DataStage metrics repository

We also added support for persisting the job run stage and link metrics to a repository, such as PostgreSQL. You can configure your repository from the project’s Manage > DataStage > Metrics repository tab. If you update the connection properties, click the Test connection button first; if the test is successful, Save will be enabled. You can then query the tables associated with the metrics repository and look for patterns in your runtime results to optimize execution.
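Once the metrics land in PostgreSQL, ordinary SQL is all you need to mine them. Here is an illustrative Python sketch using psycopg2; the connection details and the table and column names are assumptions, so check the schema your metrics repository actually uses.

```python
import psycopg2  # pip install psycopg2-binary

# Connection details and the table/column names below are illustrative
# assumptions; consult the schema of your configured metrics repository.
conn = psycopg2.connect(host="localhost", dbname="dsmetrics",
                        user="dsuser", password="...")

with conn, conn.cursor() as cur:
    # Example analysis: the ten slowest stages averaged across runs.
    cur.execute("""
        SELECT stage_name, AVG(elapsed_seconds) AS avg_elapsed
        FROM job_run_stage_metrics
        GROUP BY stage_name
        ORDER BY avg_elapsed DESC
        LIMIT 10
    """)
    for stage_name, avg_elapsed in cur.fetchall():
        print(f"{stage_name}: {avg_elapsed:.1f}s")
```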

Column metadata improvements

We’ve made several improvements to working with column metadata, allowing you to get things done more quickly. First, column editing is now available on the Stage > Input tab. Before DataStage for Cloud Pak for Data v5.0, you could only modify columns on the Output tab. No more! With this improvement, you can modify the input column metadata without opening the upstream stage from which the input link flows. Column metadata changes made on an input link are applied to that link.

Second, you can now bulk-edit columns. On a Stage > Input or Output > Columns tab, click Edit to launch the column metadata tearsheet. Next, bulk-select the columns you want to modify and click the Edit pencil icon. A details card shows the properties for the selected columns. In this example, we’re changing the columns FIRSTNME VARCHAR(12) and LASTNAME VARCHAR(15) to both be VARCHAR(256). After clicking the Apply button, the selected columns are updated.
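Conceptually, the bulk edit applies one set of property changes to every selected column. Here is a minimal sketch using the columns from the example above; the column record structure is an illustrative assumption.

```python
# Columns on a link, modeled as simple records (structure is illustrative).
columns = [
    {"name": "FIRSTNME", "type": "VARCHAR", "length": 12},
    {"name": "LASTNAME", "type": "VARCHAR", "length": 15},
    {"name": "EMPNO",    "type": "CHAR",    "length": 6},
]

def bulk_edit(columns, selected, **changes):
    """Apply the same property changes to every selected column."""
    for column in columns:
        if column["name"] in selected:
            column.update(changes)

# Change FIRSTNME and LASTNAME to VARCHAR(256) in one operation.
bulk_edit(columns, {"FIRSTNME", "LASTNAME"}, type="VARCHAR", length=256)
print(columns)
```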

Additionally, you can bulk add and edit Stage key columns. For example, when adding a Sort stage key, you can bulk-select multiple columns to add, choose their Sort order, and optionally set additional properties. You can also bulk-select existing keys and change their properties, such as changing all sort keys from descending to ascending sort order.

Lastly, on a Stage > Output > Columns tab, you can turn [x] Column metadata change propagation on or off. You can also control it at any individual stage, on top of the Canvas > Settings toggle mentioned earlier. When toggled off, column metadata changes made in prior stages stop at this stage.
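Putting the canvas-level and stage-level toggles together, a change travels downstream until a stage with propagation turned off absorbs it. Here is a small sketch of that rule; the stage names and structure are illustrative.

```python
# Stages along a link path, each with its own propagation toggle.
stages = [
    {"name": "Source",    "propagate": True},
    {"name": "Transform", "propagate": False},  # propagation turned off here
    {"name": "Target",    "propagate": True},
]

def apply_downstream(stages, change: str):
    """Push a column metadata change downstream until a stage with
    propagation toggled off absorbs it (a sketch of the rule above)."""
    for stage in stages:
        print(f"{stage['name']}: applied {change}")
        if not stage["propagate"]:
            break  # the change does not flow beyond this stage

apply_downstream(stages, "COL_1 CHAR(100) -> VARCHAR(256)")
# Source and Transform apply the change; Target never sees it.
```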

New Connectors and features

DataStage continues to be a leader in the number of supported data sources. With over 90 connector types supported, DataStage in Cloud Pak for Data v5.0 added the following connector types to the DataStage flow Canvas palette:

  • Apache Derby
  • DataStax Enterprise
  • IBM Planning Analytics
  • Looker
  • Microsoft Azure Databricks
  • MinIO

Other noteworthy changes

There were hundreds of other improvements to the DataStage user experience. Some additional noteworthy changes:

  • Extended column metadata support for Timezone and Microseconds+Timezone.
  • Connections can now be #parameterized#.
  • Record ordering and key columns for Db2, JDBC Teradata, ODBC optimized connectors, Oracle, and Snowflake.
  • Reject link support for SCAPI-based connectors.
  • Stored procedure and proxy option for Google BigQuery.
  • WIF authentication support for Google Cloud Storage and Google Cloud Pub/Sub.
  • You will be prompted to provide the parameter values when using a parameterized flow connection. If the Preview data, Test connection, or browse connection action is successful, those values are cached for the remainder of the Canvas session, and you will not be prompted to specify them again. Clear the cached values using the DataStage flow Canvas > Settings > Design > Clear cache button.
  • Parameter sets and data definitions can be renamed.

Summary

These are just a few of the hundreds of new features and improvements that DataStage has delivered in IBM Cloud Pak for Data v5.0. We continue to bring new features to market for both on-prem and Cloud environments for DataStage within the IBM Cloud Pak for Data platform, providing user-friendly interfaces and intuitive workflows that empower users to harness its rich data capabilities.

Author: Michael Pauser
