Behind the Scene: Creating a Service in an Openshift Cluster
In this article we will try to understand how creating a Service exposes a list of Pods in an Openshift cluster. A Deployment object can spawn any number of Pods, and it is the Service object that tracks the list of Pods using labels and abstracts those Pods behind a single Service endpoint.
Before we create a Service, let's discuss some basic networking ideas within Openshift.
Different Networks in Openshift
- Pod Network (clusterNetwork): This is the subnet from which the IP addresses for Pods are allocated.
- Service Network (serviceNetwork): This is the subnet from which the IP addresses for Services are allocated.
- Host Network: The host network is the network over which the Openshift/Kubernetes cluster is installed.
We can see the configuration for clusterNetwork and serviceNetwork in the network.config/cluster object.
$ oc describe network.config/cluster
.....
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Network Type:   OpenShiftSDN
  Service Network:
    172.30.0.0/16

Pod network
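To make the clusterNetwork and Host Prefix values above concrete, here is a small sketch using Python's standard ipaddress module. It shows how many per-node /23 pod subnets fit into the 10.128.0.0/14 cluster network, and which node subnets the two Pod IPs we will see later fall into. This is only an illustration of the CIDR arithmetic, not anything Openshift itself runs.

```python
import ipaddress

# Values from `oc describe network.config/cluster` above.
cluster_network = ipaddress.ip_network("10.128.0.0/14")
host_prefix = 23  # each node receives one /23 slice of the cluster network

# Number of per-node pod subnets available in the cluster network.
node_subnets = list(cluster_network.subnets(new_prefix=host_prefix))
print(len(node_subnets))   # 512 per-node subnets

# The two Pod IPs seen later in the SDN logs land in different node subnets.
pod_a = ipaddress.ip_address("10.128.3.240")
pod_b = ipaddress.ip_address("10.129.3.57")
subnet_of = lambda ip: next(s for s in node_subnets if ip in s)
print(subnet_of(pod_a))    # 10.128.2.0/23
print(subnet_of(pod_b))    # 10.129.2.0/23
```

Since the two Pods sit in different /23 subnets, they must be on different nodes, which is why vxlan will matter later.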
During Pod creation, the container runtime calls the CNI plugin with ADD to create the network interface for the container. The default CNI provider in an Openshift v4.13 cluster is OpenShift SDN.
Below are the OpenShift SDN logs where each Pod is assigned a specific IP from the pod network.
I0227 09:47:56.491415 1744 pod.go:535]
CNI_ADD rhbk-deployment/test-keycloak-1-9bk6q got IP 10.128.3.240, ofport 995

# Second Pod, which was added later:
I0227 10:15:08.694737 1965 pod.go:535]
CNI_ADD rhbk-deployment/test-keycloak-1-zfk2g got IP 10.129.3.57, ofport 2187

The container runtime receives and maintains the network interface of the container.
Feb 27 09:47:58 worker-0.rishabhcluster.xxxxxxxxx crio[1450]: 2024-02-27T09:47:58Z [verbose]
Add: rhbk-deployment:test-keycloak-1-9bk6q:13ab1f06-c825-4512-a2d9-ae0e2d46db31:
openshift-sdn(openshift-sdn):
eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/195728bd-2a65-46de-9a1a-e272cc13d43d"}],
"ips":[{"version":"4","interface":0,"address":"10.128.3.240/23","gateway":"10.128.2.1"}],"dns":{}}
Feb 27 09:47:58 worker-0.rishabhcluster.xxxxxxxxx crio[1450]:
I0227 09:47:58.083680 493602 event.go:282]
Event(v1.ObjectReference{Kind:"Pod", Namespace:"rhbk-deployment",
Name:"test-keycloak-1-9bk6q", UID:"13ab1f06-c825-4512-a2d9-ae0e2d46db31",
APIVersion:"v1", ResourceVersion:"22230049", FieldPath:""}):
type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.240/23]
from openshift-sdn

OpenShift SDN Network Devices:
Bridge network Device: The bridge is directly connected with the Pod veth endpoints. Any packet from a Pod will first reach the bridge.
Tunnel Interface: The tunnel interface helps in connecting with the external network. The bridge network device delegates decision making and Network Address Translation to the tunnel interface. The tunnel interface utilizes iptables rules to enable access to the external network through SNAT and to select a Pod IP through DNAT. In this article we will go through these iptables rules.
Vxlan: An Openshift cluster has multiple nodes. Vxlan comes into the picture when a Pod on one node tries to connect with a Pod on another node.
The Vxlan, Tunnel and Pod veth endpoints are all connected with the Bridge.
# Vxlan --------------------------------------------------------
# journalctl | grep vxlan
Feb 01 10:35:58 worker-0.rishabhcluster.xxxxx NetworkManager[1118]: <info> [1706783758.5817] manager: (vxlan0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/6)
Feb 01 10:35:58 worker-0.rishabhcluster.xxxxx ovs-vswitchd[1105]: ovs|00038|bridge|INFO|bridge br0: added interface vxlan0 on port 1

# Tun0 ----------------------------------------------------------
# journalctl | grep tun0
Feb 01 10:35:58 worker-0.rishabhcluster.xxxxx NetworkManager[1118]: <info> [1706783758.6180] manager: (tun0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/9)
Feb 01 10:35:58 worker-0.rishabhcluster.xxxxx ovs-vswitchd[1105]: ovs|00040|bridge|INFO|bridge br0: added interface tun0 on port 2

# POD -----------------------------------------------------------
$ cid=`crictl ps | grep test-keycloak | awk '{print $1}'`
$ pid=`crictl inspect $cid | grep -w pid | grep -v type | tr "," " " | awk '{print $2}'`
$ netnsid=`ip netns identify $pid`
$ vethid=`ip link show | grep -B1 $netnsid | grep veth | awk -F"@" '{print $1}' | awk '{print $2}'`
# journalctl | grep $vethid | grep br0
Feb 27 09:47:56 worker-0.rishabhcluster.xxxxx ovs-vswitchd[1105]: ovs|06933|bridge|INFO|bridge br0: added interface veth78bf8546 on port 995
Packet Flow:
We now have some background on the different devices involved in the OpenShift SDN CNI. The OpenShift SDN documentation gives a good overview of the packet flow between the different devices:
Now suppose first that container A is on the local host and container B is also on the local host. Then the flow of packets from container A to container B is as follows:
eth0 (in A’s netns) → vethA → br0 → vethB → eth0 (in B’s netns)
Next, suppose instead that container A is on the local host and container B is on a remote host on the cluster network. Then the flow of packets from container A to container B is as follows:
eth0 (in A’s netns) → vethA → br0 → vxlan0 → network [1] → vxlan0 → br0 → vethB → eth0 (in B’s netns)
Finally, if container A connects to an external host, the traffic looks like:
eth0 (in A’s netns) → vethA → br0 → tun0 → (NAT) → eth0 (physical device) → Internet
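The three quoted flows can be written down as hop lists for quick reference. This is just a restatement of the documentation excerpt above in code form, not a real API:

```python
# Device hops for the three flows quoted above (sketch, not an API).
def flow(dst: str) -> list[str]:
    if dst == "local-pod":       # Pod A -> Pod B, same node
        return ["eth0(A)", "vethA", "br0", "vethB", "eth0(B)"]
    if dst == "remote-pod":      # Pod A -> Pod B, different nodes
        return ["eth0(A)", "vethA", "br0", "vxlan0", "network",
                "vxlan0", "br0", "vethB", "eth0(B)"]
    # Pod A -> external host
    return ["eth0(A)", "vethA", "br0", "tun0", "NAT",
            "eth0(physical)", "Internet"]

print(" -> ".join(flow("remote-pod")))
```

Note that br0 appears on every path: the bridge is the first hop after the Pod's veth in all three cases.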
kube-proxy
kube-proxy is the Kubernetes network proxy running on each node of the Openshift cluster, and it is managed by the Cluster Network Operator. It is kube-proxy's responsibility to manage the rules which forward connections to the endpoints referenced by a Service.
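What those rules achieve can be modeled in a few lines. The sketch below is a toy model (not kube-proxy's real implementation, which programs iptables rather than running code in the data path): a connection addressed to a Service's ClusterIP:port is forwarded to one of that Service's endpoints. The addresses are the ones that appear later in this article.

```python
import random

# Toy model of the forwarding behaviour kube-proxy sets up via iptables.
services = {
    ("172.30.133.86", 8080): [      # test-keycloak ClusterIP
        ("10.128.3.240", 8080),     # test-keycloak-1-9bk6q
        ("10.129.3.57", 8080),      # test-keycloak-1-zfk2g
    ],
}

def forward(dst_ip: str, dst_port: int) -> tuple:
    """Pick an endpoint for a connection addressed to a ClusterIP."""
    endpoints = services.get((dst_ip, dst_port))
    if endpoints is None:
        raise LookupError("no Service for this destination")
    return random.choice(endpoints)  # the DNAT target

print(forward("172.30.133.86", 8080))
```

In the real cluster this dictionary lookup is a chain of iptables rules, which we will walk through below.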
Creating the Service
Let's now create a Service exposing port 8080 of Red Hat Build of Keycloak. In the previous article we started the Red Hat Build of Keycloak Pods, and with the Service exposed we will be able to reach the Keycloak Pods on port 8080 from within the Openshift cluster.
To create the Keycloak Service we can utilize a sample Template containing only a Service object.
$ cat keycloak-service.yaml
kind: Template
apiVersion: template.openshift.io/v1
metadata:
  name: keycloak
  annotations:
    description: An example template for trying out Keycloak on OpenShift
    iconClass: icon-sso
    openshift.io/display-name: Keycloak
    tags: keycloak
    version: 22.0.7
objects:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      description: The web server's http port.
    labels:
      application: '${APPLICATION_NAME}'
    name: '${APPLICATION_NAME}'
  spec:
    ports:
    - port: 8080
      targetPort: 8080
    selector:
      deploymentConfig: '${APPLICATION_NAME}'
parameters:
- name: APPLICATION_NAME
  displayName: Application Name
  description: The name for the application.
  value: keycloak
  required: true
#Creating the Service
$ oc process -f keycloak-service.yaml \
-p APPLICATION_NAME=test-keycloak | oc create -f -
service/test-keycloak created

As soon as the Service is created, an endpoint is also created. The endpoint is a collection of Pod IPs with the exposed port.
$ oc get endpoints
NAME ENDPOINTS AGE
test-keycloak 10.128.3.240:8080,10.129.3.57:8080 6s

To check the Pod IP addresses, use -o wide:
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-keycloak-1-9bk6q 1/1 Running 0 33m 10.128.3.240 worker-0.rishabhcluster.xxxxxx <none> <none>
test-keycloak-1-deploy 0/1 Completed 0 34m 10.128.3.239 worker-0.rishabhcluster.xxxxxx <none> <none>
test-keycloak-1-zfk2g 1/1 Running 0 6m46s 10.129.3.57 worker-1.rishabhcluster.xxxxxx <none> <none>

The Service is allocated an IP address from the serviceNetwork. This Cluster IP is reachable from within the Openshift cluster, and it abstracts all the running Pods behind itself.
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-keycloak ClusterIP 172.30.133.86 <none> 8080/TCP 2m36s

The abstraction that hides all Pod IP addresses behind a single Service Cluster IP is managed using iptables rules. In the next section we will see the list of iptables rules created as part of Service creation.
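Before moving on to the iptables rules, the endpoint selection seen above can be modeled in a few lines of Python. The pod data below is a simplified, hand-written stand-in for the `oc get pods` output (the deployer-pod label is illustrative): the endpoints controller selects running Pods whose labels match the Service selector and pairs their IPs with the target port.

```python
# Sketch of how the Endpoints list is derived from the Service selector.
pods = [
    {"name": "test-keycloak-1-9bk6q", "ip": "10.128.3.240",
     "labels": {"deploymentConfig": "test-keycloak"}, "phase": "Running"},
    {"name": "test-keycloak-1-zfk2g", "ip": "10.129.3.57",
     "labels": {"deploymentConfig": "test-keycloak"}, "phase": "Running"},
    {"name": "test-keycloak-1-deploy", "ip": "10.128.3.239",
     "labels": {"deployer": "test-keycloak-1"}, "phase": "Succeeded"},
]
selector = {"deploymentConfig": "test-keycloak"}  # from the Template above
target_port = 8080

endpoints = [
    f'{p["ip"]}:{target_port}'
    for p in pods
    if p["phase"] == "Running"
    and all(p["labels"].get(k) == v for k, v in selector.items())
]
print(",".join(endpoints))  # 10.128.3.240:8080,10.129.3.57:8080
```

The completed deployer Pod is excluded both because its labels do not match and because it is no longer running, which matches the `oc get endpoints` output above.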
iptables rules
Now that we have our Service created, we need to review the flow when a Pod, or a node from the host network (worker, master), tries to connect with the Service. These are the only two use cases, because the Cluster IP of a Service is accessible only from within the Openshift cluster.
Use Case 1 : Pod tries to connect with the Service
When a Pod tries to connect to the Cluster IP of the Service, the packet is intercepted by the PREROUTING chain.
An iptables chain is traversed linearly until a terminating target is reached, so the packet moves to KUBE-SERVICES. The KUBE-SERVICES chain has a sub-chain for every Service created. We will check only the test-keycloak Service that we have created.
KUBE-SVC-HX33NKLYYNUE7FCK is the chain created specifically for the test-keycloak Service.
- If we go inside this chain we will see KUBE-MARK-MASQ and two chains, one for each Pod. Since the request is coming from a Pod (pod network, tun0), KUBE-MARK-MASQ will not be traversed.
- Out of the two chains, one is selected with a random probability of 0.5. This is the iptables way of load balancing requests.
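The probability trick deserves a closer look. iptables rules are evaluated top to bottom, so to spread traffic evenly over N endpoints, the first rule matches with probability 1/N, the next with 1/(N-1) of the remaining traffic, and the last always matches. A small simulation of this scheme (a sketch of the `statistic` match behaviour, not iptables code):

```python
import random
from collections import Counter

endpoints = ["10.128.3.240:8080", "10.129.3.57:8080"]

def select(endpoints):
    """Walk the rules top-down, as iptables does, with 1/(N-i) probabilities."""
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if random.random() < 1.0 / (n - i):  # 0.5 then 1.0 for two Pods
            return ep
    return endpoints[-1]

counts = Counter(select(endpoints) for _ in range(10_000))
print(counts)  # roughly a 50/50 split between the two Pods
```

With two endpoints this reduces to the single 0.5 probability seen in the rule dump: the first rule takes half the traffic and the fall-through rule takes the rest.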
Let's suppose KUBE-SEP-EMBGBZCSK4JZLXFE is selected. If we go inside the KUBE-SEP-EMBGBZCSK4JZLXFE chain we will see another masquerade chain, KUBE-MARK-MASQ.
- This time KUBE-MARK-MASQ checks whether the request comes from the same Pod IP that the Service has selected as its destination; if yes, the packet is marked for a later masquerade. This is done to prevent a Pod from bypassing the Service request flow.
- Ultimately the packet is DNAT'ed to the Pod IP endpoint (10.128.3.240:8080 in this case).
After the destination NAT to a specific Pod IP, the request follows the CNI path through vxlan to reach the worker node where the Pod is hosted, passes through the bridge to the veth endpoint, and ultimately reaches the Pod.
Use Case 2: Host node tries to connect with the Service
Let's now see the scenario where the request to the Service Cluster IP is sent from a host node.
In this scenario the OUTPUT chain is traversed first, before moving to the POSTROUTING chain.
Again we first go to the KUBE-SERVICES chain, where we look for the test-keycloak Service.
This time the request is not coming from the pod network, and hence KUBE-MARK-MASQ is traversed to mark the packet for a later masquerade in the POSTROUTING chain.
Next, as before, the packet will be DNAT'ed to a specific Pod IP, depending on the chain selected. Let's suppose this time the packet is DNAT'ed to 10.129.3.57 (the test-keycloak-1-zfk2g Pod).
The packet will then move to POSTROUTING chain.
Since the packet was marked with 0x1 in the previous step, the packet will return from OPENSHIFT-MASQUERADE.
In KUBE-POSTROUTING the packet would be returned if the mark did not match 0x1. Since the mark is present, the packet reaches MASQUERADE, where the host node IP is SNAT'ed to the veth interface IP. A conntrack entry is also kept for this request so that the SNAT can be reversed in the reply.
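The host-node path can be summarized as a toy model: the OUTPUT chain DNATs to a Pod and marks the packet, then POSTROUTING SNATs only marked packets. This is an illustration of the mark-then-masquerade logic described above, not real netfilter code, and the host and node-side IP addresses are hypothetical.

```python
# Toy model of Use Case 2 (host node -> Service Cluster IP).
MASQ_MARK = 0x1
POD_CIDRS = ("10.128.", "10.129.")   # pod-network prefixes from this article

def output_chain(pkt: dict, endpoints: list) -> dict:
    # KUBE-MARK-MASQ: traffic not originating from the pod network
    # is marked for a later masquerade.
    if not pkt["src"].startswith(POD_CIDRS):
        pkt["mark"] = MASQ_MARK
    pkt["dst"] = endpoints[1]        # suppose the second Pod is chosen (DNAT)
    return pkt

def postrouting_chain(pkt: dict, node_if_ip: str) -> dict:
    # KUBE-POSTROUTING returns unmarked packets; marked ones hit MASQUERADE.
    if pkt.get("mark") == MASQ_MARK:
        pkt["src"] = node_if_ip      # SNAT; conntrack reverses it on the reply
    return pkt

pkt = {"src": "192.168.10.20", "dst": "172.30.133.86"}  # host -> ClusterIP
pkt = output_chain(pkt, ["10.128.3.240", "10.129.3.57"])
pkt = postrouting_chain(pkt, "10.129.0.1")              # hypothetical node-side IP
print(pkt)
```

Note that a packet arriving from a Pod would skip the mark in the OUTPUT step and therefore leave POSTROUTING without being SNAT'ed, which is exactly the difference between the two use cases.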
Conclusion:
This article completes my attempt to understand the functions involved in creating a Pod and then making it accessible within the Openshift cluster.
Previous articles:
Spawning Container with runc: https://medium.com/@rishabhsvats/plumbing-of-spawning-container-with-runc-ed409ac02ae3
Behind the Scene : Applying DeploymentConfig to Openshift Cluster : https://medium.com/@rishabhsvats/behind-the-scene-applying-deploymentconfig-to-openshift-cluster-efe1d081c432
