DataStage on Cloud Pak for Data v4.5: New Features Release

Stephanie Valarezo
IBM Data Science in Practice
Jun 30, 2022 · 9 min read

It’s hard to believe it’s been seven months since the re-architected DataStage was released on Cloud Pak for Data in v4.0.2.

Today, we released DataStage on Cloud Pak for Data v4.5 software, which brings pipeline orchestration for DataStage, UI improvements and additions (including additional stages and connectors), and expanded administration options, including auto-scaling and dynamic workload management.

Developers build flows using the DataStage Elyra-based canvas.

Here are six areas where we’ve improved DataStage for you with today’s release of Cloud Pak for Data v4.5.

1. New! Watson Studio Pipelines is available through the Early Access Program for Pipeline Orchestration

Watson Studio Pipelines provides orchestration capabilities for DataStage. With Watson Pipelines, you can orchestrate DataStage jobs in a controlled way: you can enforce a job execution order, take different courses of action depending on whether a job succeeds, wait for specific jobs (or for all jobs) to complete, and run particular sections in a loop. You can also run Bash scripts in a pipeline (a small sketch follows the mapping list below) and notify users by sending e-mail notifications. The Watson Studio Pipelines service is available for use beginning with the v4.5 release. Once the service is enabled (via the IBM Early Access Program), users can import existing DataStage sequence jobs with the migration service, which translates them into pipelines. Users can also create new pipelines with the nodes below (check out the new expression editor with type-ahead capabilities).

Stop! What is Watson Pipelines, again?

Watson Pipelines is a Cloud Pak for Data service used by DataStage, Watson Studio, Data Refinery, AutoAI experiments, and more. Rather than create a separate job-sequencing capability that only DataStage would use, we partnered with the Watson Pipelines service to bring DataStage functionality to it. This means you can orchestrate notebooks alongside DataStage job activities. Watson Studio Pipelines is built on Kubeflow Pipelines with a Tekton runtime and allows DataStage developers to orchestrate their DataStage jobs alongside AI workloads.

What exactly can you do with Watson Pipelines for DataStage job orchestration? Here are the available capabilities:

DataStage Sequence Job activity → Watson Pipeline flow

  • Job Activity → Run DataStage flow
  • Execute command → Run Bash script
  • Nested condition → Run Bash script (today)
  • Sequencer (all) → Wait for all results
  • Sequencer (any) → Wait for any results
  • Terminator → Terminate pipeline
  • Start/end loop → Loop in sequence
  • User variable → Set user variables
  • Wait for file → Wait for file
  • Notification → Send email
  • Exception handler → Handle errors (add a response to continue the flow on error; an icon is shown on nodes linked to the Handle errors flow)
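
As a small illustration of the Execute command → Run Bash script mapping above, here is a minimal sketch of the kind of script a Run Bash script node could execute. The directory path and messages are hypothetical; what matters is the exit status, which lets the pipeline take a different course of action on failure.

    #!/bin/bash
    # Minimal sketch of a script a "Run Bash script" node might run.
    # A non-zero exit status signals failure so the pipeline can branch
    # to its error-handling path.
    set -euo pipefail

    STAGING_DIR=/mnt/staging   # hypothetical mounted directory

    if [ ! -d "$STAGING_DIR" ]; then
      echo "Staging directory $STAGING_DIR is missing" >&2
      exit 1
    fi
    echo "Staging directory is ready; downstream DataStage jobs can run."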

For more information about the Watson Pipelines Early Access Program, reach out to your IBM team.

2. Enhancements to the visual data pipeline editor and projects experience

When developers build a data flow with DataStage, they use a drag-and-drop (no code) visual editor. The Asset Browser we introduced in 4.0.2 was enhanced so users can add subflows (previously called shared containers) or navigate the DataStage persistent volumes to add data assets directly to their flows.

We added link decorations so you can see job progress and any errors directly on the canvas, including processed row counts and other metrics. You can download flows with their dependencies from the canvas (and import them into other projects). Users can create message handlers from the Logs panel with an easy-to-use button (promote log messages to warnings, demote them to informational, or suppress them from the logs). In addition, we added the ability to execute a shell command as a before- or after-job subroutine, or to generate a report as an after-job subroutine.
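
For example, an after-job subroutine that executes a shell command could archive the report a job generates; a minimal sketch is below, assuming hypothetical directories under /ds-storage (a location accessible to the runtime instance).

    #!/bin/bash
    # Illustrative after-job shell command: move generated reports into an archive.
    # The directories are assumptions; use paths your runtime instance can access.
    REPORT_DIR=/ds-storage/reports
    ARCHIVE_DIR=/ds-storage/reports/archive

    mkdir -p "$ARCHIVE_DIR"
    mv "$REPORT_DIR"/*.txt "$ARCHIVE_DIR"/ 2>/dev/null || echo "No reports to archive."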

The Projects experience is enhanced so users can easily navigate to the different asset types in the side panel (Flows contains DataStage flows and Pipelines, Configurations contains Parameter sets, and DataStage components contains Data definitions). Administrators can efficiently perform bulk operations (such as delete).

The Jobs tab in Projects provides the ability to filter jobs by Schedule and Run Status (Active runs, finished runs, etc.).

3. Scalability and security enhancements

When you install the DataStage service in Cloud Pak for Data, a parallel engine DataStage instance is automatically created for you. This parallel engine instance is the runtime environment (data plane) that DataStage jobs run on. By default, it consists of a px-runtime pod (the conductor) and multiple replicas of px-compute pods (the compute). The conductor (px-runtime) pod is used to start up jobs, determine resource assignments, and create processes on one or more processing nodes. The conductor acts as a coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error.

Conductor and compute pods can be scaled both horizontally and vertically. For example, you can increase availability and resiliency by increasing the number of replicas of the conductor (px-runtime) pod.
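
Because the conductor and compute are ordinary Kubernetes pods, a quick way to see how many replicas a runtime instance currently has is to list them from the cluster. The namespace variable and name filter below are assumptions; substitute the project where DataStage is installed.

    # List the parallel engine pods (conductor and compute) for a runtime instance.
    # Pod names vary by instance; px-runtime and px-compute are the components
    # described above.
    oc get pods -n ${PROJECT_CPD_INSTANCE} | grep -E 'px-(runtime|compute)'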

Performance and scalability benefits with DataStage dynamic workload management and auto-scaling

Dynamic workload management is automatically configured out of the box. With it enabled, DataStage generates a parallel configuration file at job runtime based on the resources available in the runtime instance and the runtime instance's environment definition. If auto-scaling is enabled, additional compute pods are deployed for the runtime instance as needed to meet bursts in workload. In addition, a job that runs in an auto-scaled environment automatically runs across the available compute pods without developer intervention.
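
To make this concrete, below is a sketch of what a generated parallel configuration file can look like, with one logical node per available compute pod. The host names and paths are invented for illustration; the real file is produced automatically at job runtime, so you never have to maintain it by hand.

    {
      node "node1"
      {
        fastname "px-compute-0"
        pools ""
        resource disk "/ds-storage/datasets" {pools ""}
        resource scratchdisk "/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "px-compute-1"
        pools ""
        resource disk "/ds-storage/datasets" {pools ""}
        resource scratchdisk "/scratch" {pools ""}
      }
    }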

Only users who have been given access to a runtime instance can run jobs on it (access can be granted to individual users or to user groups). Additionally, administrators can manage which mounted persistent volumes are available to each instance, controlling which files each instance and its users can access.

As additional components and stages were added, the cpdctl dsjob plug-in was enhanced to provide a CLI- and API-based way to manage assets such as projects, environments, flows, jobs, DataStage components (including parameter sets, subflows, and table definitions), project and flow import/export, and Hierarchical, Build, Custom, and Wrapped stage management. The cpdctl dsjob plug-in is also used to manage pipelines, including listing pipeline flows, running pipelines, getting pipeline logs, getting pipeline versions, listing pipeline runs, and importing/exporting pipelines.
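
As a rough illustration of the CLI, the commands below show the kind of calls the dsjob plug-in supports. The project and job names are placeholders, and the exact flags can vary by release, so treat them as assumptions and check cpdctl dsjob --help for the authoritative syntax.

    # List the DataStage jobs in a project (project name is illustrative).
    cpdctl dsjob list-jobs --project my-project

    # Run a job and wait up to 300 seconds for it to finish.
    cpdctl dsjob run --project my-project --job my-job --wait 300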

4. Added stages, connectors, and new Transformer functions

DataStage continues to be an extensible service for developers. We added support for the Java Integration stage, which lets you bring your own Java code to DataStage; the Build stage, which lets you create your own C++ operator for DataStage; and the Wrapped stage, which lets you specify a UNIX command to be run as a DataStage stage.
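
As an example of what the Wrapped stage can wrap, any filter-style UNIX command that reads rows on standard input and writes rows to standard output is a candidate; the awk one-liner below is purely illustrative.

    # An illustrative UNIX command that could be wrapped as a DataStage stage:
    # keep only the rows of a comma-separated stream whose third column is "ACTIVE".
    awk -F',' '$3 == "ACTIVE"'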

The Hierarchical data stage has a new user interface in DataStage on Cloud Pak for Data. It provides JSON and XML parsing and composing capabilities, an interface to interact with REST services, and powerful hierarchical data transformations (you can read/write files in /ds-storage or any other location accessible to the DataStage runtime instance). We also added the Complex Flat File stage and support for Stored Procedures to the SAP ASE (Sybase), SQL Server, and Azure SQL connectors.

The Transformer stage has hundreds of built-in functions that developers can use to transform data in their pipelines. We have enhanced the Transformer to facilitate customer migrations, eliminate the need for custom routines, and ensure this stage continues to be one of the most powerful, valuable stages for data engineers. New functions added to the Transformer stage include (but are not limited to): MD5 and stringtomd5 for cryptographic hashing, base64enc to convert an input string to its Base64-encoded form, base64dec to decode a Base64-encoded string, isbase64 to check whether a given string is Base64 encoded, a reverse-string function, a search-and-replace function, a time zone conversion function, and a function that removes unprintable characters.
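
To illustrate the semantics of a few of these functions (not their DataStage expression syntax), here are rough shell equivalents built on standard utilities; the input strings are arbitrary.

    # Shell equivalents of some of the new Transformer functions (semantics only).
    printf 'hello' | md5sum                            # MD5 / stringtomd5: cryptographic hash
    printf 'hello' | base64                            # base64enc -> aGVsbG8=
    printf 'aGVsbG8=' | base64 -d                      # base64dec -> hello
    printf 'hello' | rev                               # reverse string -> olleh
    printf 'hello world' | sed 's/world/DataStage/'    # search and replace
    printf 'hi\x01there' | tr -cd '[:print:]'          # remove unprintable characters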

Our QualityStage modernization is underway. With the 4.5 release, we added migration support for the Match Frequency and One- and Two-Source Match stages to facilitate customer migration. Between 4.0.2 and today, we have already delivered the Standardize, Address Verification Interface, and Investigate stages. The ability to modify Match stages is coming soon!

Newly added DataStage connectors include Google Cloud Pub/Sub, IBM MQ, Microsoft Azure Cosmos DB, Microsoft Azure SQL Database, Generic S3, and Teradata (the “optimized” connector uses the native database client libraries). In addition, we introduced the ability to switch between DataStage native connectors (which leverage the database client libraries) and Java-based connectors via a simple switch in the stage property editor.

5. Built-in migration service enhancements

The import and conversion of existing parallel and sequence jobs are paramount to the adoption of DataStage in Cloud Pak for Data. Our team built a migration microservice designed solely to facilitate moving existing DataStage projects, with millions of DataStage jobs, to DataStage in Cloud Pak for Data. This microservice was introduced in DataStage v4.0.2, and in the months since then we have continued to add functionality (including improving sequence job migration support). We added plug-in migration support so Db2 Enterprise, Oracle Enterprise, ODBC Enterprise, and other legacy “plug-in” stages are migrated to the modern connectors by the migration service. We also added support to migrate Stored Procedure stages used with SQL Server and SAP ASE (Sybase), and we enhanced before/after SQL support during migration. In addition, flows that share common connections migrate to a single connection asset, which improves connection sharing. For migrated connector information, parameterization is supported, so local job parameters and parameter set references in connection assets are evaluated at runtime. Connector properties that are found to be missing during migration are given local parameters.

6. New! DataStage Server to Parallel conversion utility is now GA!

The new DataStage architecture on Cloud Pak for Data contains the best-in-class parallel engine. Customers who want to migrate DataStage Server jobs to DataStage on Cloud Pak for Data can use the Server to Parallel conversion utility to first convert the Server jobs to parallel jobs and then migrate them to Cloud Pak for Data.

Our goals in creating the conversion utility were to minimize manual conversion effort as much as possible, to prioritize generating functionally accurate parallel jobs, and to let MettleCI add DevOps value through unit test generation and automation.

For more information about the Server to Parallel conversion utility, review the latest technote and get in touch with your IBM team.

Get Started and bring the DataStage Parallel Engine to modern data architectures

Re-architecting DataStage to bring the parallel engine into a containerized architecture in software was achieved in Cloud Pak for Data v4.0.2. This shift, together with the modular infrastructure (runtime, compute, migration, canvas, and other microservices), has made DataStage development highly efficient and simple to manage. For our customers, this means a highly scalable architecture that is easy to deploy anywhere. We continue to work on the native Kubernetes capabilities that lend themselves to DataStage, including managing the running compute resources in the environment, scaling the number of compute pod replicas up or down based on workload, and evenly balancing jobs across the infrastructure with dynamic workload balancing.

New features and functionality continue to be pushed weekly in IBM DataStage on Cloud Pak for Data as a Service.

  • Get started with DataStage on Cloud Pak for Data as a Service today with the Multicloud Data Integration trial — a DataStage instance is automatically provisioned for you and you can run your first DataStage job in less than 5 minutes!

In the next couple of months, we will continue to bring features and new capabilities to DataStage developers on Cloud Pak for Data in our monthly releases as we drive towards our goal of supporting today’s mission-critical workloads. Additional stages and connectors are being designed and developed (such as for data quality use cases). DataStage is an integral part of a Data Fabric — IBM was named a leader in The Forrester Wave™: Enterprise Data Fabric, Q2 2022.

Let us know what you think — we want to hear your feedback on what we can do better — so reach out to us on our Community Forum. To find out more about DataStage on Cloud Pak for Data, reach out to your trusted IBM team.
