DataStage on Cloud Pak for Data v4.0.2: World-class parallel engine for data engineering use cases
Imagine you see a beautiful, wild mustang with powerful, copper-colored legs trotting along a lake, pounding the rocky sand, letting out a breath every couple of steps. You innately know he is just one of those horses for whom sprinting is as easy as breathing. Imagine putting a saddle on the horse, feeling his speed, and galloping for miles and miles without slowing down. Nothing’s holding you both back.
Now, think about taking that horse and weighing him down with saddlebag after saddlebag, forcing him to eat a different kind of food than what he was destined to consume, and giving him water that will never be as pure as that crystal blue lake. These restrictions are unnatural to him, and he starts to slow down, burdened and thrown off his natural stride. Trying to keep him in racing shape is a fool’s errand. A mustang should not be burdened with these kinds of restrictions. A mustang is meant to run.
To get the mustang back to the way he was at the lake, you go back to the basics, the same things that made him so effortlessly fast in the beginning. The extras you thought were useful only made it harder to keep him running for miles.
For 20+ years, IBM DataStage has been the high-performance, world-class mustang for mission-critical workloads where the best throughput is needed. However, we knew that to address a changing data landscape, hybrid cloud workloads, and new data engineering scenarios, we needed to make bold changes in DataStage. Our approach was to respect the history and heart of DataStage that has made it race for 20+ years: the mustang, the parallel engine. So we rebuilt the components around the parallel engine, making them modular, easy to manage, and delightful for developers making this transition. Our vision was to ensure that developers who fell in love with the powerful mustang would have a seamless path forward while exciting the data engineers who are designing the critical workloads of tomorrow.
Here are six areas where we’ve improved DataStage for you with last week’s release on Cloud Pak for Data v4.0.2.
Update (June 30, 2022): DataStage on Cloud Pak for Data v4.5 is now available! Read more about how we’ve improved DataStage with this release.
1. New developer-centered visual pipeline editor to construct data flows
Since before our open public Beta launched as a SaaS service last December, our designers and developers have been hard at work building a drag-and-drop console for a web-based experience. We chose to build our new visual pipeline editor, which creates data integration pipelines called flows, on an open-source project (Elyra). Our designers iterated as our development team added functionality, and Beta testers provided valuable input that helped shape the product leading up to the General Availability of IBM DataStage on Cloud Pak for Data as a Service in June 2021. This developer canvas is now available with DataStage on Cloud Pak for Data v4.0.2.
When developers build a data flow, they use a visual pipeline editor that interacts with a set of microservices and APIs under the covers. These microservices and APIs power the canvas, which provides the connectors and modular functions. The flow is compiled into executable code and then executed as jobs by the runtime components.
There is a new Asset Browser node in the palette. When developers add the Asset Browser node to the canvas, they can browse files, schemas, and tables to select the assets they need, and DataStage does the rest of the work: creating the link to the connection and populating the asset metadata in the flow. In addition, rich data visualizations are built into DataStage so you can preview data and gather insights on data trends and patterns.
2. Highly scalable runtime with isolated workloads
DataStage in Cloud Pak for Data utilizes the best-of-breed, parallel (mustang) engine. Out of the box, DataStage uses a distributed, containerized MPP architecture, which provides performance benefits because the workload is balanced evenly across the available compute resources.
In this release, you can provision multiple DataStage instances within the same Cloud Pak for Data namespace. A DataStage parallel engine (PX) instance is the runtime environment that DataStage jobs run on. Administrators can restrict access to instances and allow only designated developers and workloads to run on each one (for example, marketing uses its own instance, in separate containers from the accounting instance or the lakehouse team’s instance).
Administrators have full control, with virtually unlimited scaling (horizontal and vertical) of each PX instance. In addition, no rework is needed for developers to run their jobs in smaller or larger environments (a design-once, run-anywhere paradigm). Developers can process a million files or billions of records as easily as a single file, and scale a job to run with additional compute as needed.
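Because a PX instance’s compute runs as pods, horizontal scaling follows the standard Kubernetes pattern of adjusting replica counts. The following is a minimal sketch of that pattern using the official Kubernetes Python client; note that DataStage manages PX scaling through its own instance settings, and the workload name ("ds-px-runtime") and use of a Deployment below are illustrative assumptions, not DataStage internals.

```python
# A minimal sketch of horizontal scaling with the official Kubernetes
# Python client. The workload name and Deployment kind are illustrative
# assumptions; DataStage itself exposes PX scaling via instance settings.
from kubernetes import client, config

config.load_kube_config()        # or config.load_incluster_config() in-cluster
apps = client.AppsV1Api()

# Scale the (hypothetical) PX runtime workload to four compute pods;
# jobs are then balanced across all replicas.
apps.patch_namespaced_deployment(
    name="ds-px-runtime",
    namespace="cpd-instance",
    body={"spec": {"replicas": 4}},
)
```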
3. Performance starting with data connectivity
DataStage enables both depth and breadth in data connectivity by supporting native connectivity as well as JDBC/ODBC. The native connectors unlock powerful data integration features such as bulk load, eliminate configuration hassles by being ready to use out of the box, perform well, and support additional features specific to data integration pipelines. Generic JDBC/ODBC support lets you connect to a wide variety of data sources, eliminating data silos across database systems. With DataStage on Cloud Pak for Data, you can also quickly bring your own JDBC drivers via a no-code interface, enabling developers to connect to Kafka, Cassandra, Hive, remote file systems through FTP, Cloud Object Storage, Salesforce.com, files, RDBMS tables, cloud data warehouses, NoSQL data stores, enterprise/web apps, mainframe databases, and more.
4. Powerful pre-built functions
DataStage on Cloud Pak for Data v4.0.2 has core functions, available as nodes or stages for developers, which are critical for data engineering use cases (Sort, complex Lookup, Join, Funnel, Filter, Aggregation, string comparisons, datatype conversions, etc.). A DataStage flow is constructed by linking these nodes (stages and connectors) together into any number of pipeline permutations. In addition, developers can group stages and links into modular components called subflows, making parts of a flow reusable in other flows.
One crucial stage is the Transformer, a processing stage with Date and Time, Key Break Detection, Logical, Mathematical, Null Handling, Number, Raw, String, Type Conversion, and Utility functions, plus the ability to declare and use your own variables, looping, and access to system variables. The Transformer experience is enhanced and now includes a powerful new derivation editor for constructing complex transformation logic. In addition, the derivation editor provides in-line documentation and examples to facilitate quicker development.
A new auto-typing feature automatically populates the data type based on the output of your derivation. Auto-column propagation is another unique feature that allows you to modify metadata once and have those updates automatically propagate throughout the flow.
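To make the idea concrete, here is a minimal Python sketch of the logic a single Transformer derivation can express, combining Null Handling and String functions. In the derivation editor this would be one expression along the lines of If IsNull(lnk_in.name) Then "UNKNOWN" Else Trim(UpCase(lnk_in.name)); the link and column names are illustrative assumptions.

```python
# A Python sketch of the logic one Transformer derivation expresses.
# The column name is an illustrative assumption, not a fixed schema.
def derive_display_name(name):
    # Null Handling: substitute a default when the incoming value is NULL.
    if name is None:
        return "UNKNOWN"
    # String functions: trim surrounding whitespace and normalize case.
    return name.strip().upper()

print(derive_display_name(None))               # -> UNKNOWN
print(derive_display_name("  ada lovelace "))  # -> ADA LOVELACE
```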
5. Built-in migration microservice
The ability to import and convert existing parallel and sequence jobs is paramount to the direction of containerized DataStage. Our team built a migration microservice designed solely to facilitate the migration of existing DataStage projects, containing millions of DataStage jobs, to the new experience. This microservice is an integrated component of DataStage v4.0.2. It translates existing DataStage parallel jobs and their dependencies into the new JSON-based DataStage flows along with the supporting assets (table definitions, parameter sets, shared containers, jobs, connection information, etc.). It also compiles the flows in the new environment. In the target environment, administrators can edit and define additional environment variables at the environment level.
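Because translated jobs land as JSON flows, they are easy to inspect, diff, and version. Below is a heavily simplified sketch of the general shape such a flow might take; the field names are illustrative assumptions, not the documented flow schema.

```python
# A heavily simplified, illustrative sketch of a migrated flow's JSON shape.
# Field names are assumptions for illustration, not the documented schema.
import json

flow = {
    "doc_type": "pipeline",
    "pipelines": [{
        "nodes": [
            {"id": "src",  "op": "connector",   "label": "Db2 source"},
            {"id": "xfrm", "op": "transformer", "label": "Cleanse names"},
            {"id": "tgt",  "op": "connector",   "label": "Warehouse target"},
        ],
        "links": [
            {"from": "src",  "to": "xfrm"},
            {"from": "xfrm", "to": "tgt"},
        ],
    }],
}

print(json.dumps(flow, indent=2))  # human-readable and diff-friendly
```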
In addition, we have partnered with Data Migrators to provide MettleCI to DataStage customers and, through this offering, to bring DevOps and CI/CD practices to migrations from stand-alone DataStage to containerized DataStage on Cloud Pak for Data.
6. Delightful enhancements for developers
Developers can use a fully documented set of APIs, SDKs, and a command-line interface (CLI) to manage different parts of a DataStage deployment on Cloud Pak for Data. First, with the CLI (cpdctl dsjob), you can list projects, flows, jobs, hardware specifications, and runtime environments, run jobs, print job logs, and more. The CLI is a lightweight utility that lets developers and administrators orchestrate jobs from their laptops or their enterprise scheduler of choice, as the sketch below shows. The canvas also gives developers full interactive execution logs to debug, build, and modify flows.
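Here is a minimal Python sketch of that orchestration idea, shelling out to cpdctl dsjob from a script a scheduler could invoke. The subcommand and flag names follow the capabilities described above but are illustrative; check cpdctl dsjob --help for the exact syntax in your version, and note that the project and job names are made up.

```python
# A minimal sketch of orchestrating a DataStage job via the cpdctl dsjob CLI.
# Flag names are illustrative; verify against `cpdctl dsjob --help`.
import subprocess

def run_job(project: str, job: str) -> None:
    # Start the named job in the named project.
    subprocess.run(
        ["cpdctl", "dsjob", "run", "--project", project, "--job", job],
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )

run_job("sales-warehouse", "load_customers_daily")  # hypothetical names
```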
The extensible framework of open APIs and SDKs (Python, Java, Node.js) is new and allows developers to interact with DataStage to create, compile, and run flows programmatically. DataStage flows are design-time assets that capture data integration logic in an open JSON format, so developers can now construct flows without even opening the canvas. These flows can easily be committed to source-code repositories and used in CI/CD pipelines, and developers can easily compare flows for changes and updates.
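As an example of the flows-as-JSON workflow, the sketch below pulls a flow definition over HTTP with the generic requests library and writes it in a diff-friendly form for committing to a repository. The endpoint path, query parameter, and function name are illustrative assumptions rather than the documented SDK surface; consult the API reference for the real routes.

```python
# A minimal sketch of exporting a flow's JSON for source control.
# The route and query parameter are assumed for illustration only.
import json
import requests

def export_flow(base_url: str, token: str, project_id: str, flow_id: str,
                out_path: str) -> None:
    resp = requests.get(
        f"{base_url}/data_intg/v3/data_intg_flows/{flow_id}",  # assumed route
        headers={"Authorization": f"Bearer {token}"},
        params={"project_id": project_id},
    )
    resp.raise_for_status()
    # Stable key ordering keeps diffs small for code review and CI/CD.
    with open(out_path, "w") as f:
        json.dump(resp.json(), f, indent=2, sort_keys=True)
```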
Get Started and bring the DataStage Parallel Engine to modern data architectures
Re-thinking DataStage to bring the parallel engine into a containerized architecture is part of our continuing effort to support the needs of developers and data engineers around the world. Our team believes that this shift and the modular infrastructure (runtime, compute, migration, canvas, and other microservices) will make DataStage efficient, highly scalable, simple to manage, and easy to deploy anywhere. In addition, the native capabilities of Kubernetes lend themselves well to DataStage: it can elastically manage the running compute resources in the environment, scaling the number of compute pod replicas up or down based on the workload and evenly balancing jobs across the infrastructure with dynamic workload balancing.
We have already seen how rapidly we can innovate with the containerized architecture, delivering new features and functionality to developers weekly in IBM DataStage on Cloud Pak for Data as a Service.
Get started with DataStage on Cloud Pak for Data as a Service today with the Multicloud Data Integration trial — a DataStage instance is automatically provisioned for you and you can run your first DataStage job in less than 5 minutes!
In the next couple of months, we will continue to bring new features and capabilities to DataStage developers on Cloud Pak for Data as we drive towards our goal of supporting tomorrow’s mission-critical workloads. For example, Watson Studio Pipelines, built on Kubeflow Pipelines with a Tekton runtime, will allow DataStage developers to orchestrate their DataStage jobs (the migration microservice will facilitate migrating sequence jobs from existing DataStage environments to the new Watson Pipeline Orchestration Flow). Additional stages and connectors are being designed and developed (such as for data quality use cases). Let us know what you think: we want to hear your feedback on what we can do better, so reach out to us on our Community Forum. To find out more about DataStage on Cloud Pak for Data, reach out to your IBM team.
Additional reading:
DataStage is an integral part of a Data Fabric — IBM was named a leader in The Forrester Wave™: Enterprise Data Fabric, Q2 2022.