Behind the Scenes: Applying a DeploymentConfig to an Openshift Cluster

Rishabh Singh
17 min read · Jan 30, 2024


When learning about Kubernetes we come across a range of components, their purposes, and how they are all connected. The theory comes in handy while working with a Kubernetes cluster, but theory alone might not stick and can be fleeting.

I love to understand a complex system by breaking it into pieces, working closely with each piece, and putting them back together like a jigsaw puzzle. This gives me experiential learning, helping me connect the theory and the flow diagrams to the actual process.

Perhaps I could best describe my experience of doing mathematics in terms of entering a dark mansion. You go into the first room and it’s dark, completely dark. You stumble around, bumping into the furniture. Gradually, you learn where each piece of furniture is. And finally, after six months or so, you find the light switch and turn it on. Suddenly, it’s all illuminated and you can see exactly where you were. Then you enter the next dark room

~ Andrew Wiles

In this article I will attempt to trace the path from creating a DeploymentConfig to the ultimate creation of the container. I will use an Openshift cluster (the Red Hat build of Kubernetes), where we will see how the creation of a Keycloak DeploymentConfig leads to a Keycloak Pod.

Index:
- Primitives
  - Cgroup
  - Namespaces
  - Overlay Filesystem
  - runc
  - CRI-O
- What happens when we apply a Deployment manifest?
- Hands-On view: Creating DeploymentConfig
- Hands-On view: Cgroup and Namespace
- Conclusion

Primitives:

Cgroup

cgroup is a Linux primitive; in the context of Kubernetes, cgroups are used to allocate or restrict resources such as CPU, memory, IO and PIDs. When we set resource limits in a Deployment object, Kubernetes ultimately creates a cgroup slice for the Pod that enforces those limits.
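As an illustration (not part of the deployment used later in this article), requests and limits could be added to the DeploymentConfig with oc set resources; Kubernetes and CRI-O then translate these values into cgroup settings under the Pod's slice:

# Illustrative only: set requests/limits on the DeploymentConfig created later in this article.
# The limits end up as cgroup values (memory limit, CPU quota) under the Pod's slice.
oc set resources dc/test-keycloak \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi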

Namespaces

Namespaces are another Linux primitive, one that was in common use even before containers. Namespaces partition kernel resources, allowing a process to have its own isolated view of otherwise global system resources.

Creating a container is, in large part, isolating the mount points, process IDs, networking and IPC resources of one process from other processes.

For brevity I am not delving deeper into the subject of namespaces. Below is a sample unshare command that can be used to isolate the IPC, UTS, network, PID and mount namespaces. pivot_root can then be used to pivot the root to the current directory, which should contain a root filesystem, so that the shell behaves like a container.

sudo unshare -puinm --fork --mount-proc bash   # new PID, UTS, IPC, network and mount namespaces
sudo pivot_root . old_root                     # inside the new shell: make the current directory the new root
PATH=/usr/bin:$PATH                            # reset PATH for the new root
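For the above to work, the current directory must already be a mount point containing a root filesystem and a directory for the old root. A minimal preparation sketch (the rootfs path is illustrative):

# Assumption: ./rootfs holds an extracted container root filesystem (e.g. exported from an image).
sudo mount --bind rootfs rootfs   # pivot_root requires the new root to be a mount point
mkdir -p rootfs/old_root          # pivot_root needs a directory to park the old root
cd rootfs                         # now run the unshare / pivot_root commands shown above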

Overlay filesystem

An overlay filesystem is a type of filesystem where we can overlay one filesystem on top of another. Container images use the overlay filesystem to create the root filesystem for the container. Its biggest advantage is the reusability it provides: image layers can be shared across containers.

The overlay filesystem combines a lower filesystem with an upper filesystem to create a merged filesystem. The crucial point to note is that if an object is present in both the lower and the upper filesystem, only the upper filesystem's object will be visible.

A sample command for creating an overlay filesystem:

mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... merged_dir
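A small self-contained sketch (directory names are illustrative) that shows the merge and the fact that the upper layer wins:

mkdir -p lower upper work merged
echo "from lower" > lower/a.txt
echo "from lower" > lower/b.txt
echo "from upper" > upper/b.txt        # b.txt exists in both layers
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
cat merged/a.txt                       # from lower
cat merged/b.txt                       # from upper; the upper layer shadows the lower one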

runc

runc is the low-level engine that actually spawns a container, provided it is given an OCI bundle with all the required configuration. In my previous article I gave an overview of how we could spawn a Keycloak container using runc directly; here we do the same thing, but through an Openshift cluster.
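As a rough sketch of what CRI-O later does for us, a container can be started by hand from an OCI bundle (the paths and container name are illustrative):

# Assumption: bundle/rootfs already contains an extracted root filesystem.
cd bundle
runc spec                      # generates a default OCI config.json in the bundle
sudo runc run keycloak-test    # spawns a container from the bundle in the current directory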

CRI-O

Kubernetes has something called the Container Runtime Interface (CRI), which defines the specification for integrating the kubelet with a container runtime. CRI-O is one implementation of the CRI; it in turn drives an OCI runtime such as runc.
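On an Openshift node we can talk to CRI-O directly over its socket (the default CRI-O endpoint) to confirm which runtime sits behind the CRI:

# Default CRI-O socket path on the node; crictl reports the runtime name and version.
crictl --runtime-endpoint unix:///var/run/crio/crio.sock version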

Pods are a Kubernetes concept: one or more containers sharing the same IPC, network and (optionally) PID namespaces and living in the same cgroup.

What happens when we apply a Deployment manifest?

In this section we will see how applying a Deployment/DeploymentConfig YAML results in the creation of a Pod and its containers, and how the creation request is processed by the different cluster components.

Openshift cluster components and their functions:
  • The request to create the DeploymentConfig goes through the API server, and the DeploymentConfig object is stored in the etcd database.
  • The Openshift Controller Manager creates a ReplicationController associated with the newly created DeploymentConfig.
  • The replication controller loop compares the desired number of replicas with the current number. It finds that no Pod is running and therefore creates a Pod definition in the etcd database. The Pod remains in the Pending state until it is scheduled.
  • The Kubernetes scheduler assigns the Pod to a specific worker node.
  • The kubelet, the node agent running on each node, realizes that the Pod has to be created on its own node. The kubelet receives the PodSpec through the API server and ensures that the defined containers are running on the node. It initiates container creation by forwarding the request to the CRI-O daemon.
  • The Container Network Interface (CNI) receives an ADD request from the container runtime, which enables connectivity between the container and the Openshift network. The CNI uses the containernetworking/plugins package to set up this connectivity.
  • CRI-O uses the containers/image and containers/storage packages to pull the image and unpack it into the container's root filesystem: https://github.com/opencontainers/image-spec/blob/main/spec.md
  • CRI-O then generates the OCI runtime specification; this runtime spec is used to actually start the container: https://github.com/opencontainers/runtime-spec/blob/main/spec.md
  • The OCI runtime (runc in this case) starts the container using the OCI bundle created in the previous step.
  • The Pod switches to the Running state when the containers that are part of the Pod start running.

Hands-On view: Creating DeploymentConfig

Now that we have some idea about the steps that take us from a DeploymentConfig manifest to a container, let's get a more involved view from the Openshift cluster.

1. We will first apply the DeploymentConfig as below, which sends a request to the API server (client debug logging is enabled here so that the underlying REST calls are visible).

$ oc process -f keycloak.yaml \
-p KEYCLOAK_ADMIN=admin \
-p KEYCLOAK_ADMIN_PASSWORD=admin \
-p NAMESPACE=rhbk-deployment \
-p APPLICATION_NAME=test-keycloak| oc create -f -
I0126 17:34:23.061911 130569 loader.go:373] Config loaded from file: /home/rishabh/.kube/config
I0126 17:34:23.151147 130569 round_trippers.go:463] POST https://api.rishabhcluster1.xxxxx.com:6443/apis/apps.openshift.io/v1/namespaces/rhbk-deployment/deploymentconfigs?fieldManager=kubectl-create&fieldValidation=Ignore
I0126 17:34:23.151157 130569 round_trippers.go:469] Request Headers:
I0126 17:34:23.151163 130569 round_trippers.go:473] User-Agent: oc/4.13.0 (linux/amd64) kubernetes/e251b5e
I0126 17:34:23.151173 130569 round_trippers.go:473] Authorization: Bearer <masked>
I0126 17:34:23.151176 130569 round_trippers.go:473] Accept: application/json
I0126 17:34:23.151180 130569 round_trippers.go:473] Content-Type: application/json
I0126 17:34:23.250471 130569 round_trippers.go:574] Response Status: 201 Created in 99 milliseconds
deploymentconfig.apps.openshift.io/test-keycloak created

The openshift-apiserver audit log (on the master node) records a ResponseComplete event for the creation of the DeploymentConfig named test-keycloak.

$ cat /var/log/openshift-apiserver/audit.log | grep test-keycloak | grep verb | grep create | jq

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "37cce84b-9003-4770-a025-da409e8ec481",
  "stage": "ResponseComplete",
  "requestURI": "/apis/apps.openshift.io/v1/namespaces/rhbk-deployment/deploymentconfigs?fieldManager=kubectl-create&fieldValidation=Ignore",
  "verb": "create",
  "user": {
    "username": "kube:admin",
    "groups": [
      "system:cluster-admins",
      "system:authenticated"
    ],
    "extra": {
      "scopes.authorization.openshift.io": [
        "user:full"
      ]
    }
  },
  "sourceIPs": [
    "10.74.212.197",
    "10.130.0.1"
  ],
  "userAgent": "oc/4.13.0 (linux/amd64) kubernetes/e251b5e",
  "objectRef": {
    "resource": "deploymentconfigs",
    "namespace": "rhbk-deployment",
    "name": "test-keycloak",
    "apiGroup": "apps.openshift.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 201
  },
  "requestReceivedTimestamp": "2024-01-26T12:04:23.219072Z",
  "stageTimestamp": "2024-01-26T12:04:23.242562Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"cluster-admins\" of ClusterRole \"cluster-admin\" to Group \"system:cluster-admins\""
  }
}

2. The ReplicationController is created by the Openshift Controller Manager as soon as the DeploymentConfig is created.

$ oc project openshift-controller-manager
$ oc logs controller-manager-xxxx
I0126 12:04:23.273401 1 event.go:285] Event(v1.ObjectReference{Kind:"DeploymentConfig",
Namespace:"rhbk-deployment", Name:"test-keycloak", UID:"75ade0a8-e180-4dce-a826-fe2cb10b5224", APIVersion:"apps.openshift.io/v1",
ResourceVersion:"23447799", FieldPath:""}): type: 'Normal' reason: 'DeploymentCreated'
Created new replication controller "test-keycloak-1" for version 1

The replication controller loop (running in the kube-controller-manager) compares the desired and the current number of pods; it finds that no pod is running and therefore creates one.

$ oc project openshift-kube-controller-manager
$ oc logs kube-controller-manager-master-x.rishabhcluster1.xxxxx.com
I0126 12:04:25.264455 1 replica_set.go:577] "Too few replicas" replicaSet="rhbk-deployment/test-keycloak-1" need=1 creating=1
I0126 12:04:25.282983 1 event.go:294] "Event occurred" object="rhbk-deployment/test-keycloak-1" fieldPath="" kind="ReplicationController" apiVersion="v1" type="Normal" reason="SuccessfulCreate" message="Created pod: test-keycloak-1-qjdzp"

We can list the DeploymentConfig, ReplicationController and Pod objects from inside the etcd pod as shown below:

$ oc project openshift-etcd
$ oc rsh etcd-master-x.rishabhcluster1.xxxxxxx.com
sh-4.4# etcdctl get / --prefix --keys-only | grep -i test-keycloak
/kubernetes.io/controllers/rhbk-deployment/test-keycloak-1
/kubernetes.io/pods/rhbk-deployment/test-keycloak-1-qjdzp
/openshift.io/deploymentconfigs/rhbk-deployment/test-keycloak
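The keys above only prove the objects exist; the values are stored in a binary envelope (protobuf for most built-in resources), so dumping one directly is not very readable. A sketch, still from inside the etcd pod:

# Print the raw stored value for our DeploymentConfig; fully decoding it would need a tool
# such as auger, but the key fields are usually recognizable even in the raw output.
etcdctl get /openshift.io/deploymentconfigs/rhbk-deployment/test-keycloak --print-value-only | head -c 300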

3. The kube-scheduler binds the pod to a specific node. The DeploymentConfig rollout uses a deployer pod (test-keycloak-1-deploy) to create the actual Keycloak pod. We will focus only on the test-keycloak-1-qjdzp pod, which is the actual pod; the deployer pod goes through the same pod-creation process.

$ oc project openshift-kube-scheduler
$ oc logs openshift-kube-scheduler-guard-master-x.xxxxxx.com
I0126 12:04:23.302190 1 schedule_one.go:266] "Successfully bound pod to node" pod="rhbk-deployment/test-keycloak-1-deploy" node="worker-0.rishabhcluster1.xxxxxx.com" evaluatedNodes=6 feasibleNodes=3
I0126 12:04:25.290493 1 schedule_one.go:266] "Successfully bound pod to node" pod="rhbk-deployment/test-keycloak-1-qjdzp" node="worker-0.rishabhcluster1.xxxxxx.com" evaluatedNodes=6 feasibleNodes=3
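The binding can also be confirmed from the pod object itself:

# The NODE column should show worker-0.rishabhcluster1.xxxxxx.com, matching the scheduler log above.
oc -n rhbk-deployment get pod test-keycloak-1-qjdzp -o wide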

4. The kubelet finds that the test-keycloak pod has been assigned to its node and starts the actual process of container creation. The SyncLoop ADD event initiates it.
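The kubelet and CRI-O entries below come from the worker node's journal; one way to collect them (assuming the standard kubelet and crio systemd units on an RHCOS node):

oc debug node/worker-0.rishabhcluster1.xxxxxx.com
chroot /host
journalctl -u kubelet -u crio | grep test-keycloak-1-qjdzp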

Jan 26 12:04:25 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:04:25.289978    1649 kubelet.go:2098] "SyncLoop ADD" source="api" pods=[rhbk-deployment/test-keycloak-1-qjdzp]

# CRI-O receives the pod network configuration and the interface (10.128.3.202/23) is added to the Pod

Jan 26 12:04:25 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:25.614789032Z" level=info msg="Got pod network &{Name:test-keycloak-1-qjdzp Namespace:rhbk-deployment ID:dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c UID:5a6b7593-905e-406b-9886-46353bd5174f NetNS:/var/run/netns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Jan 26 12:04:25 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:25.614834628Z" level=info msg="Adding pod rhbk-deployment_test-keycloak-1-qjdzp to CNI network \"multus-cni-network\" (type=multus)"
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: 2024-01-26T12:04:27Z [verbose] Add: rhbk-deployment:test-keycloak-1-qjdzp:5a6b7593-905e-406b-9886-46353bd5174f:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59"}],"ips":[{"version":"4","interface":0,"address":"10.128.3.202/23","gateway":"10.128.2.1"}],"dns":{}}
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: I0126 12:04:27.417540 257860 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"rhbk-deployment", Name:"test-keycloak-1-qjdzp", UID:"5a6b7593-905e-406b-9886-46353bd5174f", APIVersion:"v1", ResourceVersion:"23447836", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.202/23] from openshift-sdn
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:27.437674129Z" level=info msg="Got pod network &{Name:test-keycloak-1-qjdzp Namespace:rhbk-deployment ID:dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c UID:5a6b7593-905e-406b-9886-46353bd5174f NetNS:/var/run/netns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:27.437844438Z" level=info msg="Checking pod rhbk-deployment_test-keycloak-1-qjdzp for CNI network multus-cni-network (type=multus)"
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:04:27.438281 1649 kubelet.go:2105] "SyncLoop UPDATE" source="api" pods=[rhbk-deployment/test-keycloak-1-qjdzp]


# CRI-O creates the container using the /runtime.v1.RuntimeService/CreateContainer RPC
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:27.447731103Z" level=info msg="Creating container: rhbk-deployment/test-keycloak-1-qjdzp/test-keycloak" id=3ce437be-ed70-4427-bc7b-d19702cc7e9f name=/runtime.v1.RuntimeService/CreateContainer
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:04:27.497834 1649 kubelet.go:2136] "SyncLoop (PLEG): event for pod" pod="rhbk-deployment/test-keycloak-1-qjdzp" event=&{ID:5a6b7593-905e-406b-9886-46353bd5174f Type:ContainerStarted Data:dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c}

# Container created
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:27.576238185Z" level=info msg="Created container 000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4: rhbk-deployment/test-keycloak-1-qjdzp/test-keycloak" id=3ce437be-ed70-4427-bc7b-d19702cc7e9f name=/runtime.v1.RuntimeService/CreateContainer

# Container started
Jan 26 12:04:27 worker-0.rishabhcluster1.xxxxxx.com crio[1614]: time="2024-01-26 12:04:27.587014151Z" level=info msg="Started container" PID=257939 containerID=000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4 description=rhbk-deployment/test-keycloak-1-qjdzp/test-keycloak id=85c5a0ac-322d-42f7-9fa6-2b012e004031 name=/runtime.v1.RuntimeService/StartContainer sandboxID=dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c
Jan 26 12:04:28 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:04:28.501702 1649 kubelet.go:2136] "SyncLoop (PLEG): event for pod" pod="rhbk-deployment/test-keycloak-1-qjdzp" event=&{ID:5a6b7593-905e-406b-9886-46353bd5174f Type:ContainerStarted Data:000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4}
Jan 26 12:04:28 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:04:28.502349 1649 kubelet.go:2208] "SyncLoop (probe)" probe="readiness" status="" pod="rhbk-deployment/test-keycloak-1-qjdzp"
Jan 26 12:05:05 worker-0.rishabhcluster1.xxxxxx.com hyperkube[1649]: I0126 12:05:05.825043 1649 kubelet.go:2208] "SyncLoop (probe)" probe="readiness" status="ready" pod="rhbk-deployment/test-keycloak-1-qjdzp"

5. The CNI plugin handles the ADD operation to attach the created pod sandbox to the network and assign its interface IP.

$ oc project openshift-sdn

I0126 12:04:23.730061 24418 pod.go:535] CNI_ADD rhbk-deployment/test-keycloak-1-deploy got IP 10.128.3.201, ofport 458
I0126 12:04:25.710001 24418 pod.go:535] CNI_ADD rhbk-deployment/test-keycloak-1-qjdzp got IP 10.128.3.202, ofport 459
I0126 12:05:07.725829 24418 pod.go:571] CNI_DEL rhbk-deployment/test-keycloak-1-deploy

6. CRI-O is the CRI implementation used by the Openshift cluster, so we can list the container using crictl, a command-line tool for CRI-compatible container runtimes.

# crictl ps | grep test-keycloak 
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
000a59eeefb32 355cf15f973ac0dd192200deb60dfe71f3886f7f613f2034def26102eff29767 About an hour ago Running test-keycloak 0 dd01f93c77e26 test-keycloak-1-qjdzp

crictl inspect gives us some interesting information, such as the following:


# crictl inspect 000a59eeefb32
{
"status": {
..... "name": "test-keycloak"
},
"state": "CONTAINER_RUNNING",
"createdAt": "2024-01-26T12:04:27.536352467Z",
"startedAt": "2024-01-26T12:04:27.586995636Z",
"finishedAt": "0001-01-01T00:00:00Z",
.....
"imageRef": "registry.redhat.io/rhbk/keycloak-rhel9@sha256:7a4e96d7b7b1d25bcc5ce00ea6c5d8d609e9c3368b60972c3995011565ffe5c8",
.....
"mounts": [
{
"containerPath": "/etc/hosts",
.....
],
"logPath": "/var/log/pods/rhbk-deployment_test-keycloak-1-qjdzp_5a6b7593-905e-406b-9886-46353bd5174f/test-keycloak/0.log"
},
"info": {
"sandboxID": "dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c",
"pid": 257939,
"runtimeSpec": {
"ociVersion": "1.0.2-dev",
"process": {
"user": {
......
"args": [
"/opt/keycloak/bin/kc.sh",
"start-dev"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm",
"HOSTNAME=test-keycloak-1-qjdzp",
"NSS_SDB_USE_CACHE=no",
"KEYCLOAK_ADMIN=admin",
"KEYCLOAK_ADMIN_PASSWORD=admin",
.......
"root": {runc
"path": "/var/lib/containers/storage/overlay/92f83153a72546a08ac96cff1140e00ef52a4fc155d095818d7ab3850b7df0c8/merged"
},
.......
"cgroupsPath": "kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice:crio:000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4",
"namespaces": [
{
"type": "pid"
},
{
"type": "network",
"path": "/var/run/netns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59"
},
{
"type": "ipc",
"path": "/var/run/ipcns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59"
},
{
"type": "uts",
"path": "/var/run/utsns/d84334e6-e060-4ca4-9b1c-169b8f7eaf59"
},
{
"type": "mount"
}
],

We can identify the root filesystem on which the container was spawned. This root filesystem is the merged directory generated by combining the different layers of the image.

"root": {
"path": "/var/lib/containers/storage/overlay/92f83153a72546a08ac96cff1140e00ef52a4fc155d095818d7ab3850b7df0c8/merged"
},

Let's look at the content under the merged directory. It has the Keycloak file structure (basically KEYCLOAK_HOME) required to start a Keycloak instance.

[root@worker-0 ~]# cd /var/lib/containers/storage/overlay/92f83153a72546a08ac96cff1140e00ef52a4fc155d095818d7ab3850b7df0c8/merged
[root@worker-0 merged]# ls -rlt
total 0
drwxr-xr-x. 1 root root 6 Aug 9 2021 srv
lrwxrwxrwx. 1 root root 8 Aug 9 2021 sbin -> usr/sbin
dr-xr-x---. 1 root root 23 Aug 9 2021 root
drwxr-xr-x. 1 root root 6 Aug 9 2021 mnt
drwxr-xr-x. 1 root root 6 Aug 9 2021 media
lrwxrwxrwx. 1 root root 9 Aug 9 2021 lib64 -> usr/lib64
lrwxrwxrwx. 1 root root 7 Aug 9 2021 lib -> usr/lib
drwxr-xr-x. 1 root root 6 Aug 9 2021 home
dr-xr-xr-x. 1 root root 6 Aug 9 2021 boot
lrwxrwxrwx. 1 root root 7 Aug 9 2021 bin -> usr/bin
dr-xr-xr-x. 1 root root 6 Aug 9 2021 afs
dr-xr-xr-x. 1 root root 6 Jan 4 18:13 sys
dr-xr-xr-x. 1 root root 6 Jan 4 18:13 proc
drwxr-xr-x. 1 root root 144 Jan 4 18:13 usr
drwxr-xr-x. 1 root root 18 Jan 4 18:13 dev
drwxr-xr-x. 1 root root 219 Jan 4 18:14 var
drwxr-xr-x. 1 root root 22 Jan 4 18:14 opt
drwxr-xr-x. 1 root root 18 Jan 26 12:04 etc
drwxr-xr-x. 1 root root 27 Jan 26 12:04 run
drwxrwxrwt. 1 root root 54 Jan 26 12:04 tmp
[root@worker-0 merged]# cd opt/keycloak/
bin/ conf/ data/ lib/ providers/ themes/
[root@worker-0 merged]# cd opt/keycloak/bin/
[root@worker-0 bin]# ls -rlt
total 36
-rwxrwxr-x. 1 core root 918 Jan 4 17:20 kcreg.sh
-rw-rw-r--. 1 core root 319 Jan 4 17:20 kcreg.bat
-rwxrwxr-x. 1 core root 898 Jan 4 17:20 kcadm.sh
-rw-rw-r--. 1 core root 298 Jan 4 17:20 kcadm.bat
-rwxrwxr-x. 1 core root 5020 Jan 4 17:20 kc.sh
-rwxrwxr-x. 1 core root 5896 Jan 4 17:20 kc.bat
-rwxrwxr-x. 1 core root 1015 Jan 4 17:20 federation-sssd-setup.sh
drwxrwxr-x. 3 core root 131 Jan 4 17:32 client
[root@worker-0 bin]#

7. Ultimately it is runc that is called by CRI-O to spawn the container, using the OCI bundle created in the previous steps. We can list the container with the runc CLI as well, grepping for the container ID or process ID identified in the crictl inspect output.

[root@worker-0 bin]# runc list | grep 257939 
000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4 257939 running /run/containers/storage/overlay-containers/000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4/userdata 2024-01-26T12:04:27.536352467Z root

[root@worker-0 bin]# runc ps 000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4
UID PID PPID C STIME TTY TIME CMD
1000810+ 257939 257917 0 12:04 ? 00:01:07 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M
-XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8
-Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom
-XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20
-XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.security=ALL-UNNAMED
-Dkc.home.dir=/opt/keycloak/bin/..
-Djboss.server.config.dir=/opt/keycloak/bin/../conf
-Djava.util.logging.manager=org.jboss.logmanager.LogManager
-Dquarkus-log-max-startup-records=10000
-cp /opt/keycloak/bin/../lib/quarkus-run.jar
io.quarkus.bootstrap.runner.QuarkusEntryPoint --profile=dev start-dev
[root@worker-0 bin]#
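runc can also dump the OCI state of the container (bundle path, pid, status) as JSON:

runc state 000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4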

8. Once the Keycloak container is running and ready, the Keycloak pod also switches to the Running status.

$ oc get pods
NAME READY STATUS RESTARTS AGE
test-keycloak-1-qjdzp 1/1 Running 0 44h

Hands-On view: Cgroup and Namespace

cgroup

Openshift allocates resources to Pods through cgroups mounted under /sys/fs/cgroup. The cgroups are created under kubepods.slice: there is a slice for each pod, and within the pod's slice there are sub-directories for each container, such as crio-<container_id>.scope.

We can identify the cgroup path for our specific container using crictl inspect, as shown below:


[root@worker-0 ~]# crictl inspect 000a59eeefb32 | grep cgroupsPath
"cgroupsPath": "kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice:crio:000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4",
[root@worker-0 ~]#
[root@worker-0 bin]# systemd-cgls -u kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
Unit kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice (/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice):
├─crio-conmon-000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4.scope
│ └─257917 /usr/bin/conmon -b /run/containers/storage/overlay-containers/000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4/userdata -c 000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4 --exit-dir /var/run/cr>
└─crio-000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4.scope
└─257939 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitO>

The complete directory structure looks like the following, where we can see the slice under each resource controller (CPU, memory, and so on). If we go into a specific controller's sub-directory, for example memory, we can see the current utilization and the memory limit that might be set at the DeploymentConfig level.

[root@worker-0 bin]# find /sys/fs/cgroup/ -name kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice                                     
/sys/fs/cgroup/hugetlb/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/blkio/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/rdma/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/perf_event/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice
/sys/fs/cgroup/systemd/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice

#Going into the memory subdirectory

[root@worker-0 kubepods-besteffort-pod5a6b7593_905e_406b_9886_46353bd5174f.slice]# ls -rlt
total 0
-rw-r--r--. 1 root root 0 Jan 26 12:04 tasks
.......
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.oom_control
-r--r--r--. 1 root root 0 Jan 26 12:04 memory.numa_stat
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.move_charge_at_immigrate
-r--r--r--. 1 root root 0 Jan 26 12:04 memory.memsw.usage_in_bytes
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.memsw.max_usage_in_bytes
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.memsw.limit_in_bytes
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.memsw.failcnt
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.max_usage_in_bytes
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.limit_in_bytes
........
-rw-r--r--. 1 root root 0 Jan 26 12:04 cgroup.procs
--w--w--w-. 1 root root 0 Jan 26 12:04 cgroup.event_control
-rw-r--r--. 1 root root 0 Jan 26 12:04 cgroup.clone_children
drwxr-xr-x. 2 root root 0 Jan 26 12:04 crio-dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c
drwxr-xr-x. 2 root root 0 Jan 26 12:04 crio-conmon-000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4.scope
drwxr-xr-x. 2 root root 0 Jan 26 12:04 crio-000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4.scope

#Going into specific keycloak container subdirectory --- crio-<container_id>.scope

[root@worker-0 crio-000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4.scope]# ls -rlt
total 0
-rw-r--r--. 1 root root 0 Jan 26 12:04 tasks
.......
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.soft_limit_in_bytes
----------. 1 root root 0 Jan 26 12:04 memory.pressure_level
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.oom_control
.......
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.max_usage_in_bytes
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.limit_in_bytes
.......
--w-------. 1 root root 0 Jan 26 12:04 memory.force_empty
-rw-r--r--. 1 root root 0 Jan 26 12:04 memory.failcnt
-rw-r--r--. 1 root root 0 Jan 26 12:04 cgroup.procs
--w--w--w-. 1 root root 0 Jan 26 12:04 cgroup.event_control
-rw-r--r--. 1 root root 0 Jan 26 12:04 cgroup.clone_children
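From inside the container's crio-<container_id>.scope directory we can read the actual values. Note the besteffort component in the slice name: no requests or limits were set for this pod, so it falls into the BestEffort QoS class and the memory limit stays at the cgroup v1 "no limit" value:

# Run inside the crio-<container_id>.scope directory of the Keycloak container.
cat memory.usage_in_bytes   # current memory usage of the container, in bytes
cat memory.limit_in_bytes   # shows the cgroup v1 no-limit value (9223372036854771712) when no limit is set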

Namespace

Each Pod is isolated from the other Pods running in the Openshift cluster and has its own isolated kernel resources. This isolation is achieved using namespaces. Let's see the different namespaces that were created for our Pod.

We can identify the list of namespaces associated with the container using lsns, passing the PID of the container process.

[root@worker-0 ~]# lsns -p 257939
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 172 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531837 user 171 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026532594 uts 1 257939 1000810000 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Ds
4026532595 ipc 1 257939 1000810000 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Ds
4026532597 net 1 257939 1000810000 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Ds
4026532665 mnt 1 257939 1000810000 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Ds
4026532668 pid 1 257939 1000810000 java -Dkc.config.built=true -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Ds
[root@worker-0 ~]#

In the above lsns output we can see the container's namespaces: mnt, pid, ipc, net and uts. (The user and cgroup namespaces are shared with the host, as they belong to PID 1.)

Let's see the list of PIDs in the pid namespace. Since the ps command is not available in the container, I am using the /proc directory to list the processes.

With nsenter we pass the PID of the container's process as the target (-t) and use -p to enter its PID namespace. The -r option sets the root directory to the root directory of the target process.

[root@worker-0 ~]# nsenter -t 257939 -p -r ls -1 /proc/ | egrep "([0-9]+)"
1
743
[root@worker-0 ~]#

# The second PID is the command we just ran inside the namespace; the container itself has only one long-running process. Let's look at its details:

[root@worker-0 ~]# nsenter -t 257939 -p -r cat /proc/1/cmdline
java-Dkc.config.built=true-Xms64m-Xmx512m-XX:MetaspaceSize=96M-XX:MaxMetaspaceSize=256m-Dfile.encoding=UTF-8-Dsun.stdout.encoding=UTF-8-Dsun.err.encoding=UTF-8-Dstdout.encoding=UTF-8-Dstderr.encoding=UTF-8-XX:+ExitOnOutOfMemoryError-Djava.security.egd=file:/dev/urandom-XX:+UseParallelGC-XX:MinHeapFreeRatio=10-XX:MaxHeapFreeRatio=20-XX:GCTimeRatio=4-XX:AdaptiveSizePolicyWeight=90--add-opens=java.base/java.util=ALL-UNNAMED--add-opens=java.base/java.util.concurrent=ALL-UNNAMED--add-opens=java.base/java.security=ALL-UNNAMED-Dkc.home.dir=/opt/keycloak/bin/..-Djboss.server.config.dir=/opt/keycloak/bin/../conf-Djava.util.logging.manager=org.jboss.logmanager.LogManager-Dquarkus-log-max-startup-records=10000-cp/opt/keycloak/bin/../lib/quarkus-run.jario.quarkus.bootstrap.runner.QuarkusEntryPoint--profile=devstart-dev[root@worker-0 ~]#

We can nsenter into the UTS namespace (-u) to check the hostname of the container.

[root@worker-0 ~]# nsenter -t 257939 -u hostname               
test-keycloak-1-qjdzp
[root@worker-0 ~]#

To enter the network namespace we can use -n, and if we list the IP routes we can see the Pod network.

[root@worker-0 ~]# nsenter -t 257939 -n ip route
default via 10.128.2.1 dev eth0
10.128.0.0/14 dev eth0
10.128.2.0/23 dev eth0 proto kernel scope link src 10.128.3.202
172.30.0.0/16 via 10.128.2.1 dev eth0
224.0.0.0/4 dev eth0
[root@worker-0 ~]#
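The same network namespace is also reachable through its handle under /var/run/netns (the path we saw earlier in the runtimeSpec), so the ip netns tooling works as well:

# The namespace file name comes from the "network" entry in the crictl inspect output above.
ip netns exec d84334e6-e060-4ca4-9b1c-169b8f7eaf59 ip addr show eth0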

We can also list the mount points of the container using the -a option, which enters all the namespaces of the target process.

[root@worker-0 ~]# nsenter -t 257939 -a  -r cat /proc/self/mountinfo | grep -E 'overlay|tmpfs'                            
3010 2718 0:290 / / rw,relatime - overlay overlay rw,context="system_u:object_r:container_file_t:s0:c27,c28",lowerdir=/var/lib/containers/storage/overlay/l/GZEGLUHSNCMWHAZRCAHMAM4YIG:/var/lib/containers/storage/overlay/l/7RQFKCPUYME4KJS7XA7GXWJYZX,upperdir=/var/lib/containers/storage/overlay/92f83153a72546a08ac96cff1140e00ef52a4fc155d095818d7ab3850b7df0c8/diff,workdir=/var/lib/containers/storage/overlay/92f83153a72546a08ac96cff1140e00ef52a4fc155d095818d7ab3850b7df0c8/work,volatile
3012 3010 0:293 / /dev rw,nosuid - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k,mode=755
3016 3015 0:296 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",mode=755
3029 3012 0:286 / /dev/shm rw,nosuid,nodev,noexec,relatime master:594 - tmpfs shm rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k
3030 3010 0:24 /containers/storage/overlay-containers/dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c/userdata/resolv.conf /etc/resolv.conf rw,nosuid,nodev,noexec master:29 - tmpfs tmpfs rw,seclabel,mode=755
3031 3010 0:24 /containers/storage/overlay-containers/dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c/userdata/hostname /etc/hostname rw,nosuid,nodev master:29 - tmpfs tmpfs rw,seclabel,mode=755
3032 3010 0:24 /containers/storage/overlay-containers/dd01f93c77e26569a698ee02d675d67723188d492f10c6e507cff11c0199130c/userdata/.containerenv /run/.containerenv rw,nosuid,nodev master:29 - tmpfs tmpfs rw,seclabel,mode=755
3033 3010 0:24 /containers/storage/overlay-containers/000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4/userdata/passwd /etc/passwd rw,nosuid,nodev,noexec master:29 - tmpfs tmpfs rw,seclabel,mode=755
3036 3010 0:24 /containers/storage/overlay-containers/000a59eeefb320b610a55f2c8ddf806701d282a64a58df58551a7ee551a538b4/userdata/run/secrets /run/secrets rw,nosuid,nodev - tmpfs tmpfs rw,seclabel,mode=755
3038 3036 0:285 / /run/secrets/kubernetes.io/serviceaccount ro,relatime - tmpfs tmpfs rw,seclabel,size=6995260k
2656 3011 0:297 / /proc/acpi ro,relatime - tmpfs tmpfs ro,context="system_u:object_r:container_file_t:s0:c27,c28"
2658 3011 0:293 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k,mode=755
2659 3011 0:293 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k,mode=755
2660 3011 0:293 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k,mode=755
2674 3011 0:293 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,context="system_u:object_r:container_file_t:s0:c27,c28",size=65536k,mode=755
2716 3011 0:298 / /proc/scsi ro,relatime - tmpfs tmpfs ro,context="system_u:object_r:container_file_t:s0:c27,c28"
2719 3015 0:299 / /sys/firmware ro,relatime - tmpfs tmpfs ro,context="system_u:object_r:container_file_t:s0:c27,c28"
[root@worker-0 ~]#

Conclusion

This article is my labor of love to understand the internal workings of Pod/container creation in an Openshift cluster. I have tried to stitch together my understanding through logs and command-line utilities like crictl, runc and nsenter. This has been an enriching experience that helped me look past the abstractions and delve into the concrete workings of Kubernetes. I hope you found this article useful. Happy Learning!!

Rishabh Singh

Support Engineer at Red Hat | Writes about Security, Cloud Native Development | Philomath