DataStage Modernization with Cloud Pak for Data

--

IBM DataStage and the InfoSphere Information Server platform have historically been deployed to handle large-scale enterprise workloads and perform mission-critical functions. To ensure a seamless move to a modernized, AI- and cloud-ready architecture, IBM has two imperatives for its modernization: 1. Allow clients' operations to function uninterrupted. 2. Provide an even higher level of resiliency, scale, automation, and operational efficiency on a modernized platform.

What is “DataStage Modernization”?

IBM is modernizing the architecture of its flagship ETL tool, DataStage. By “modernizing” we mean containerizing DataStage and making the solution microservice-based.

Today, businesses are increasingly moving their applications to the cloud. Operating on the cloud requires cloud-native architecture principles, hence the re-architecture of DataStage.

Containers separate applications from the environment they actually run in, so container-based applications can be deployed easily and consistently regardless of whether the target environment is an on-premises data center, a public or private cloud, or even a personal laptop. This agility of containerized deployment is crucial for enterprises continuing their journey to the cloud: by using containerized applications, you can design an application once and run it anywhere.

Automatic workload balancing and best-of-breed parallel engine

The key feature of DataStage on Cloud Pak for Data that allows customers to take advantage of the highly parallel engine is the ability to scale. This feature lets customers scale horizontally and seamlessly by elastically scaling the compute instances available for processing. The example picture below shows a scenario where 6 jobs are submitted to run; they are all scheduled on the first compute instance. Now a second workload arrives with 4 more jobs. Traditionally, we have a few options for these additional jobs: queue them until the first 6 jobs complete, or try to fit them onto the first compute instance, which would ultimately impact the overall performance of all 10 jobs. The new option available as part of Cloud Pak for Data is that the platform can elastically scale up another compute instance and automatically reschedule those 4 jobs to run on it, without the user having to specify any additional hardware requirements. This means you no longer need static hardware assignments for your entire DataStage workload and can react more elastically to resource requests.
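
To make the scheduling idea concrete, here is a toy sketch of the queue-versus-scale decision described above. It is purely illustrative and not IBM's actual scheduler; the per-instance capacity of 6 jobs is an assumption chosen to match the example.

```python
# Toy model of elastic workload balancing (illustrative only, not IBM's scheduler).
from dataclasses import dataclass, field

INSTANCE_CAPACITY = 6  # assumption: each compute instance runs up to 6 concurrent jobs

@dataclass
class Instance:
    name: str
    jobs: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.jobs) < INSTANCE_CAPACITY

def schedule(jobs, instances):
    """Place each job on an instance with room; scale up elastically when none has room."""
    for job in jobs:
        target = next((inst for inst in instances if inst.has_room()), None)
        if target is None:
            # Instead of queueing the job or overloading compute-1, add another instance.
            target = Instance(name=f"compute-{len(instances) + 1}")
            instances.append(target)
        target.jobs.append(job)
    return instances

# 6 jobs fill compute-1; the next 4 trigger a scale-up onto compute-2.
for inst in schedule([f"job-{n}" for n in range(1, 11)], [Instance("compute-1")]):
    print(inst.name, inst.jobs)
```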

Modernized, containerized DataStage can be consumed in two ways: DataStage SaaS, and DataStage on IBM Cloud Pak for Data.

1. DataStage SaaS on Cloud Pak for Data as a Service

IBM Cloud Pak for Data services delivered as a service on IBM Cloud modernize how you collect, organize, and analyze data and infuse AI, with no installation, management, or updating required. Using Cloud Pak for Data as a Service with IBM Cloud Satellite gives customers the as-a-service experience they want while keeping their data where it needs to be, with the ability to build a DataStage flow once and run the job in any cloud.

2. DataStage on IBM Cloud Pak for Data

A fully containerized enterprise insights platform built on Red Hat OpenShift. Current on-premises customers who want to continue running DataStage on-premises or single-tenant can run it on-premises, on AWS, or on Azure. On top of that, there are two added benefits to consuming DataStage on Cloud Pak for Data (referred to as the DataStage cartridge): 1. A bundled MettleCI license, the solution for true CI/CD for DataStage. 2. Dual entitlement, which lets you use both the legacy DataStage solution and the new DataStage cartridge on Cloud Pak for Data, giving you the ability to migrate at your own pace.

Re-imagine DataStage with DataStage Flows

If you use DataStage SaaS or DataStage on Cloud Pak for Data, you will design and run DataStage jobs with a new and modern UI: no more Windows client. You will create a DataStage Flow (a design asset) by importing existing DataStage jobs or by creating new DataStage Flows using a modernized, web-based canvas. The canvas is easy to use and ready to build data pipelines for ground to cloud, cloud to cloud, or cloud to ground.

Jobs are runtime assets, meaning multiple jobs can be associated with a single DataStage Flow. Each job can have its own schedule, default parameters, and so on.
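
A minimal sketch of that one-to-many relationship, using illustrative names and fields rather than the actual DataStage object model:

```python
# Illustrative model of design-time vs. runtime assets (field names are assumptions).
from dataclasses import dataclass, field

@dataclass
class DataStageFlow:           # design asset: built once on the canvas
    name: str

@dataclass
class Job:                     # runtime asset: many jobs can reference one flow
    flow: DataStageFlow
    schedule: str              # e.g. a cron expression
    parameters: dict = field(default_factory=dict)

flow = DataStageFlow("load_customers")
nightly = Job(flow, schedule="0 2 * * *", parameters={"ENV": "prod"})
hourly_smoke_test = Job(flow, schedule="0 * * * *", parameters={"ENV": "test"})
```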

Design a job from scratch, and you will notice you do not spend a lot of time mapping columns, thanks to auto-column propagation. The image above demonstrates how DataStage automatically propagates a new column to downstream stages. Also, check out how we drop a stage onto a link.

When you execute a flow, the job log panel allows developers to filter, search, and save logs directly from the designer. The links are interactive, so if you click on a stage, the canvas is re-centered on that stage.

Job management and scheduling are accessible via the GUI. You can also schedule and run jobs through APIs, as sketched below. Orchestration Flows will replace sequences.
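
For example, here is a minimal sketch of triggering a job run over REST. It assumes a Cloud Pak for Data cluster exposing the Watson Data jobs API; the endpoint, payload, and all placeholder values should be checked against the API reference for your release.

```python
# Hedged sketch: start a DataStage job run via REST (verify the endpoint and
# payload against your cluster's API documentation; placeholders are not real values).
import requests

HOST = "https://<cpd-cluster>"   # placeholder
PROJECT_ID = "<project-id>"      # placeholder
JOB_ID = "<job-id>"              # placeholder
TOKEN = "<bearer-token>"         # obtained from the platform's auth endpoint

resp = requests.post(
    f"{HOST}/v2/jobs/{JOB_ID}/runs",
    params={"project_id": PROJECT_ID},
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_run": {}},        # assumed minimal body; parameter overrides would go here
)
resp.raise_for_status()
print(resp.json())               # run metadata, including the run ID to poll for status
```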

Migration to Next-Generation DataStage

You will use the ISX import service, available natively in next-generation DataStage, to migrate existing supported jobs. It provides a comprehensive migration path from traditional DataStage versions 11.7 and 11.5 to next-generation DataStage. Existing DataStage resources, such as parallel jobs, parameter sets, and table definitions, are exported as an .isx file via the export tool. Using the migration tool, all the exported resources in the .isx file are imported into next-generation DataStage as new resource assets. These new resources have new life cycles managed in the next-gen DataStage UI.
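
As a rough sketch of the import side, assuming the DataStage migration endpoint (/data_intg/v3/migration/isx_imports in IBM's public API docs; verify the path and parameters for your release), an exported .isx archive could be uploaded programmatically like this:

```python
# Hedged sketch: upload an .isx export to the next-generation DataStage
# migration service. The endpoint path and parameters are assumptions to be
# verified against the official API reference; placeholders are not real values.
import requests

HOST = "https://<cpd-cluster>"   # placeholder
PROJECT_ID = "<project-id>"      # placeholder
TOKEN = "<bearer-token>"         # placeholder

with open("export.isx", "rb") as isx:
    resp = requests.post(
        f"{HOST}/data_intg/v3/migration/isx_imports",
        params={"project_id": PROJECT_ID, "file_name": "export.isx"},
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/octet-stream",
        },
        data=isx,                # raw .isx archive as the request body
    )
resp.raise_for_status()
print(resp.json())               # import record; poll it to track migration status
```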

Get Started Using Next-Generation DataStage: Try DataStage SaaS Now

--

Umaporn Juealaong
IBM Cloud Pak Tips and Good practices

Customer Success Manager at IBM. Always eager to learn, share, and expand knowledge. I'm a coffee lover and also love traveling.