CP4D-WKC (3.5.6) Installation Issue: 0072-iis Module Stuck Leading to DeadlineExceeded
Introduction
IBM® Cloud Pak for Data (CP4D) is a cloud-native solution that enables you to put your data to work quickly and efficiently. It does this by enabling you to connect to your data, govern it, find it, and use it for analysis. Cloud Pak for Data also enables all your data users to collaborate from a single, unified interface that supports many services designed to work together.
Watson Knowledge Catalog (WKC) is one of the components in CP4D that provides a secure enterprise Catalog management platform that is supported by a data governance framework. A Catalog connects people to the data and knowledge that they need. The data governance framework ensures that data access and data quality are compliant with your business rules and standards.
This blog outlines an issue related to the Zookeeper component that surfaced while installing Watson Knowledge Catalog (WKC) on Cloud Pak for Data 3.5.6 for a customer, and the workaround required to get past it. The environment specification is given in the Environment section below.
The blog aims to provide a prescriptive guide to follow if a similar issue occurs in a comparable environment.
The Environment
Platform
Red Hat OpenShift Container Platform (version 4.6.39)
Cluster
- 3 master nodes, each with 8 CPU cores and 32 GB RAM
- 6 worker nodes, each with 16 CPU cores and 64 GB RAM
Virtualization
VMware vSphere (Client v7.0.2)
Cloud Pak
Cloud Pak for Data, version 3.5.6
NFS
Nutanix Files on CentOS 7
Ownership
Exports Configuration
OpenShift Project
Name: cpd35
ID Ranges
0072-iis Post Installation Job Issue
Symptoms
The installation of the 0072-iis module remained stuck indefinitely.
Cause
A scheduled job had been pending for more than 3 hours, as shown in the log snippet below.
time="2021-08-28T09:27:36Z" level=info msg="0072-iis Resource Status: Job: 3/4 - Pending: [iis-post-delete-job]"
oc describe job iis-post-delete-job
Volumes:
iis-post-delete-scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: iis-post-delete-config
Optional: false
Events:
Type     Reason            Age                  From            Message
Normal   SuccessfulDelete  177m                 job-controller  Deleted pod: iis-post-delete-job-lb8j6
Warning  DeadlineExceeded  177m (x2 over 177m)  job-controller  Job was active longer than specified deadline
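The DeadlineExceeded event is raised by the job controller when a Job stays active longer than its spec.activeDeadlineSeconds. As a minimal sketch, the configured deadline can be read straight from the Job object. The function name job_deadline_seconds is hypothetical, and the default namespace cpd35 is assumed from the Environment section.

```shell
# Print the activeDeadlineSeconds configured on a Job (empty output if unset).
# Assumption: the Job lives in the cpd35 project unless another namespace is given.
job_deadline_seconds() {
  local job="$1"
  local ns="${2:-cpd35}"
  oc get job "$job" -n "$ns" -o jsonpath='{.spec.activeDeadlineSeconds}'
}
```

For example, `job_deadline_seconds iis-post-delete-job` would print the deadline in seconds, provided the Job object still exists.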
Upon further investigation, we found that two of the PVCs associated with this module (namely 0072-iis-en-dedicated-pvc and iis-secrets-pv) were stuck in the Terminating state and were not being deleted. This caused the job to hang.
Note: In the screen capture, the status is Bound; however, it was Terminating when the issue occurred.
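To spot PVCs in this state, one option is to filter the STATUS column of `oc get pvc`. The following is a rough sketch rather than the procedure used during the original investigation; the function name list_terminating_pvcs is hypothetical, and the namespace defaults to cpd35 as an assumption taken from the Environment section.

```shell
# Print the names of PVCs whose STATUS column reads "Terminating".
# Assumption: default `oc get pvc` column layout (NAME first, STATUS second).
list_terminating_pvcs() {
  local ns="${1:-cpd35}"
  oc get pvc -n "$ns" --no-headers | awk '$2 == "Terminating" { print $1 }'
}
```

Running `list_terminating_pvcs cpd35` prints one PVC name per line, or nothing if no claim is stuck.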
Workaround
The workaround was to patch the PVCs in question to remove their finalizers, as follows:
oc patch pvc 0072-iis-en-dedicated-pvc -p '{"metadata":{"finalizers":null}}'
oc patch pvc iis-secrets-pv -p '{"metadata":{"finalizers":null}}'
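Removing finalizers bypasses the protection that normally holds a PVC until nothing is using it, so it may be worth guarding the patch. The sketch below applies the same patch only to PVCs that already carry a deletionTimestamp (that is, ones already marked for deletion). The function name remove_pvc_finalizers and the explicit namespace argument are assumptions for illustration, not part of the original workaround.

```shell
# Remove finalizers from the named PVCs, skipping any that are not already
# marked for deletion. Usage: remove_pvc_finalizers <namespace> <pvc>...
remove_pvc_finalizers() {
  local ns="$1"
  shift
  local pvc ts
  for pvc in "$@"; do
    # A non-empty deletionTimestamp means a delete was requested for this PVC.
    ts="$(oc get pvc "$pvc" -n "$ns" -o jsonpath='{.metadata.deletionTimestamp}')"
    if [ -z "$ts" ]; then
      echo "skip $pvc: not marked for deletion" >&2
      continue
    fi
    oc patch pvc "$pvc" -n "$ns" -p '{"metadata":{"finalizers":null}}'
  done
}
```

For the case described here, the call would be `remove_pvc_finalizers cpd35 0072-iis-en-dedicated-pvc iis-secrets-pv`.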