Topology Of DataStage With Kubernetes

Sravan Cynixit
Published in Quick Code · Feb 4, 2020 · 3 min read

IBM® InfoSphere® Information Server products that are installed in a containerized environment provide an experience that is simpler, better performing, more scalable, and less prone to error for processes such as in-place upgrades.

The deployment of InfoSphere DataStage® in a containerized environment makes use of two components: Docker and Kubernetes. The deployment of IBM BigIntegrate uses only Docker.

Docker provides the platform for application deployment by using Docker images and containers, while Kubernetes automates deploying, scaling, and operating application containers. The engine, services, and repository tiers are deployed as containers within their respective pods.
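
For example, once the deployment is up, the per-tier pods can be listed with kubectl. The pod names below are illustrative placeholders, not the actual names created by the installer:

[root@propeller1 ~]# kubectl get pods -n test-1
NAME             READY   STATUS    RESTARTS   AGE
is-engine-0      1/1     Running   0          2d
is-services-0    1/1     Running   0          2d
is-xmeta-0       1/1     Running   0          2d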

Basic concepts: Docker

Docker containers provide a lightweight virtual platform that comes with a base operating system such as CentOS. Compared to traditional application-on-host environments, containers are smaller and faster. Docker containers provide virtualization at the operating system level. The Docker image already includes the artifacts for InfoSphere DataStage or IBM BigIntegrate (for example, the Kerberos ticket for the IBM BigIntegrate service on Hadoop), configured to work right away. By using standard Docker commands, the Docker container of InfoSphere DataStage or IBM BigIntegrate can be brought up or taken down.
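
As a minimal sketch, those standard lifecycle commands look like the following; the image and container names are placeholders, not the actual IBM image names:

# Start a container in the background from a loaded image (names are placeholders)
docker run -d --name ds-engine datastage-engine:latest

# Take the container down and remove it
docker stop ds-engine
docker rm ds-engine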

Basic concepts: Kubernetes

Kubernetes is an open source platform for managing containerized services. Kubernetes can easily orchestrate and deploy containers. Support tools for the Kubernetes platform, including monitoring, are widely available.

Topology for DataStage with Kubernetes

InfoSphere DataStage, which includes the Data Flow Designer UI (Client Tier), the APIs that support it (Server Tier), and the execution engine (Engine Tier), runs in a Kubernetes-enabled Docker container environment. Kubernetes functions as a "container orchestrator" or "cluster manager" and provides mechanisms for deploying, maintaining, and scaling applications. Kubernetes places containers on nodes and enables pods to find each other. Kubernetes features basic monitoring, logging, health checking, and automatic recovery from failure.
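
To illustrate the health checking and automatic recovery mentioned above, a Kubernetes Deployment can declare a liveness probe so that a failed container is restarted. This is a hypothetical sketch; the names, image, and port are placeholders, not the actual InfoSphere DataStage manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ds-engine            # placeholder name
  namespace: test-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ds-engine
  template:
    metadata:
      labels:
        app: ds-engine
    spec:
      containers:
      - name: engine
        image: datastage-engine:latest   # placeholder image
        livenessProbe:                   # Kubernetes restarts the container if this probe fails
          tcpSocket:
            port: 31538                  # placeholder port
          initialDelaySeconds: 60
          periodSeconds: 30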

The following diagram illustrates the topology.

Figure 1. InfoSphere DataStage containerized topology

Installation artifacts for InfoSphere DataStage

The deployment of InfoSphere DataStage includes the following artifacts:

Namespace
A namespace provides a scope within which names must be unique. Multiple instances of IBM InfoSphere Information Server suites, representing several dedicated environments, can each run in their own namespace, for example, DEV / TEST / PROD.

Persistent volumes
Persistent volumes function as a slice of the storage that is provisioned for use by the containers. The file system type can be NFS, GlusterFS, or others.

Persistent volume claim
A persistent volume claim is a request for storage from a persistent volume. You can have one or many claims for storage from a persistent volume until the persistent volume runs out of space. The following example shows the persistent volume claims for the namespace test-1:


[root@propeller1 YAML_Scripts]# kubectl get pvc -n test-1
NAME                    STATUS   VOLUME            CAPACITY   ACCESS MODES
xmeta-pv-volume-claim   Bound    xmeta-pv-volume   10Gi       RWO
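
A claim like the one shown above is typically created from a YAML manifest along these lines. This is a minimal sketch; binding to a pre-provisioned volume by name is an assumption based on the volume shown in the output:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xmeta-pv-volume-claim
  namespace: test-1
spec:
  accessModes:
    - ReadWriteOnce          # matches the RWO access mode shown above
  resources:
    requests:
      storage: 10Gi
  volumeName: xmeta-pv-volume   # bind to the specific persistent volume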

Topology for IBM BigIntegrate

The following diagram illustrates the topology for IBM BigIntegrate.

Figure 2. IBM BigIntegrate containerized topology

The InfoSphere Information Server engine tier Docker container (the conductor) is installed on the Hadoop edge node alongside the YARN client. All of the other InfoSphere Information Server Docker tiers can be on the edge node or outside the cluster. No Kubernetes cluster is used, because a Hadoop cluster already exists. InfoSphere Information Server binaries are present on all data nodes that will run InfoSphere DataStage jobs; if the binaries do not already exist on a data node, they are copied to it at job run time by using HDFS.
