Journey to 1,000 nodes for IBM Cloud Private
--
Kubernetes is becoming more mature and the size of a single Kubernetes cluster is growing from hundreds of nodes to thousands of nodes.
This article documents the enablement of IBM Cloud Private, which is based on Kubernetes, to support 1,000 nodes in one single cluster.
Test Environment
IBM Cloud Private 2.1.0.2 (Kubernetes 1.9.1), released in March 2018, was used to test 500 nodes in one single cluster.
Four types of nodes were used in the IBM Cloud Private scalability cluster:
- Master Node: This type of node controls the worker nodes in a cluster, handling tasks such as resource allocation and state maintenance. Master nodes primarily run Kubernetes core services such as the apiserver, controller manager, and scheduler. They also run lightweight services such as the auth service and catalog service.
- Management Node: This type of node is optional. It hosts management services such as monitoring, metering, and logging. When you implement management nodes, you help prevent the master node from becoming overloaded.
- Proxy Node: This type of node is primarily used to run the ingress controller. Use of a proxy node enables you to access services inside IBM Cloud Private from outside of the cluster.
- Worker Node: This type of node runs the Kubernetes agent (kubelet) and provides the environment for running user applications in containers.
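Each role is ultimately just a set of Kubernetes nodes. As a hedged illustration, the distribution of nodes across these roles can be checked from their labels; the exact label keys used by IBM Cloud Private vary by release, so the command below is illustrative rather than definitive.

```
# List nodes with their labels; in IBM Cloud Private the master, management,
# proxy, and worker roles are reflected in node labels (exact keys vary by release).
kubectl get nodes --show-labels
```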
The environment was configured as follows:
- Container networks managed by Calico
- Calico used node-to-node mesh to manage Border Gateway Protocol (BGP) peering between nodes
- Calico Version 2.6.6 was installed and used the etcd Version 2 API
- One etcd cluster was shared by both Kubernetes and Calico.
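As a rough sketch, the node-to-node mesh can be observed directly on any host that has calicoctl installed; with the mesh enabled, every other node in the cluster appears as a BGP peer of type "node-to-node mesh".

```
# Show the BGP peers of this Calico node (run on the node itself).
# With node-to-node mesh, the peer list grows linearly with cluster size.
sudo calicoctl node status
```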
The following topology provides a visual representation of the various components of the 500-node cluster:
All functions in IBM Cloud Private worked well with 500 nodes in one IBM Cloud Private cluster.
Issues related to supporting a 1,000-node cluster
The following issues arose when attempting to scale the cluster to 1,000 nodes:
- Node-to-node mesh stopped working when there were more than 700 nodes in the cluster.
- In a cluster with 1,000 nodes, each Calico node must maintain BGP sessions with every other node, which is roughly 500,000 sessions across the full mesh. This is too large and results in failures to start the Calico node.
- etcd load became very high when the cluster was scaled up to 1,000 nodes, and the Kubernetes APIServer stopped responding.
- 1,000 Calico nodes resulted in a large volume of reads and writes against the shared etcd.
- After deleting Calico, etcd load returned to normal.
Solutions for scaling up to 1,000 nodes
Do not use Calico node-to-node mesh in large clusters
Test results from the 1,000-node scenario prompted investigation of node-to-node mesh. The Calico community suggests that node-to-node mesh can be used for clusters of fewer than 200 nodes; clusters containing more than 200 nodes should use Route Reflector mode. Each Route Reflector manages a group of Calico nodes, so there is no full mesh of connections between individual Calico nodes. Based on testing, one Route Reflector can serve 1,000 nodes. Testing is ongoing to determine whether one Route Reflector can support 2,000 or more nodes.
For many Calico deployments, the use of a Route Reflector is not required. However, for large-scale deployments a full mesh of BGP peerings between each of your Calico nodes becomes untenable. In this case, a Route Reflector allows you to remove the full mesh and scale up the size of the cluster.
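As a minimal sketch of what Route Reflector mode looks like with the Calico Version 3 resource model: the node-to-node mesh is disabled globally and every node is instead peered with the Route Reflector. The peer address 10.10.10.10 and AS number 64512 below are placeholders, not values from this cluster, and deploying the Route Reflector itself is outside the scope of this sketch.

```
# Disable the full BGP node-to-node mesh (Calico v3 resource model).
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 64512
EOF

# Peer every Calico node with the Route Reflector (placeholder address).
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflector
spec:
  peerIP: 10.10.10.10
  asNumber: 64512
EOF
```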
The screenshot below is a discussion with @projectcalico on Twitter:
You may be concerned about the relative performance of node-to-node mesh and Route Reflectors. In our testing, there was no performance difference between the two modes; the major difference is in management effort. The only drawback of using Route Reflectors is that you must run and manage them, which is why the majority of users, especially those with small clusters, benefit from the simplicity of node-to-node mesh.
Upgrading Calico from Version 2.6.6 to Version 3.0.4
The Calico Version 3 changelog indicates that Calico Version 3 supports the etcd Version 3 API. When you use the etcd Version 3 API, applications use the new gRPC API Version 3 to access the Multi-Version Concurrency Control (MVCC) store which provides more features and improved performance. See the etcd documentation for more information about the etcd Version 3 API.
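To make the difference concrete, here is a hedged illustration of the two APIs using etcdctl; the endpoint address is a placeholder, and /calico is simply the prefix under which Calico keeps its data in etcd.

```
# etcd Version 2 API (hierarchical keys; used by Calico v2.6.x):
ETCDCTL_API=2 etcdctl --endpoints=http://127.0.0.1:2379 ls /calico --recursive | head

# etcd Version 3 API (flat keyspace over gRPC and the MVCC store; used by Calico v3.x):
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get /calico --prefix --keys-only | head
```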
Testing was done to compare the performance of Calico Version 2.6.6 with etcd Version 2 API (Table 1) and Calico Version 3.0.4 with etcd Version 3 API (Table 2).
The test results showed improved performance with Calico Version 3.0.4: latency was less than half of that measured with Calico Version 2.6.6, and queries per second (QPS) more than doubled.
Separate etcd for Calico and Kubernetes (optional)
Test results showed that the etcd shared by Kubernetes and Calico came under very high load in a large cluster. The intention was therefore to run Calico against a separate etcd, which is also a best practice proposed by the Calico community.
However, after the upgrade from Calico Version 2.6.6 to Version 3.0.4, the etcd dedicated to Calico did not experience high load from the Route Reflector, indicating that a separate etcd for Calico is not strictly necessary.
During the test, 20,000 Liberty pods were created in the 1,000-node cluster, which means that each worker node was running 20 pods.
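The article does not include the exact deployment spec used; a hypothetical sketch of generating a comparable workload, assuming the public websphere-liberty image, might look like this:

```
# Create a Liberty deployment and scale it out to 20,000 replicas,
# roughly 20 pods per worker node in a 1,000-node cluster.
kubectl create deployment liberty --image=websphere-liberty
kubectl scale deployment liberty --replicas=20000
kubectl get deployment liberty    # check progress toward 20,000 available replicas
```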
The following test results are based on Calico Version 3.0.4 with 1,000 nodes, a Route Reflector, and a separate etcd for Calico.
top output for the Kubernetes APIServer (on the leader master) without workload
top - 02:41:23 up 6 days, 23:27, 12 users, load average: 8.03, 7.26, 8.21
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.2 us, 4.5 sy, 0.0 ni, 76.4 id, 0.4 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 13203342+total, 67974048 free, 13182128 used, 50877248 buff/cache
KiB Swap: 13421568+total, 13421446+free, 1208 used. 11707584+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3716 root 20 0 21.661g 6.045g 73296 S 108.3 4.8 3117:20 hyperkube
top output for the Kubernetes APIServer (on the leader master) with workload
top - 04:47:52 up 7 days, 1:34, 12 users, load average: 64.92, 68.68, 66.98
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 62.9 us, 21.3 sy, 0.0 ni, 11.2 id, 0.2 wa, 0.0 hi, 4.4 si, 0.0 st
KiB Mem : 13203342+total, 36378248 free, 42160692 used, 53494488 buff/cache
KiB Swap: 13421568+total, 13421446+free, 1208 used. 88023200 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3716 root 20 0 31.356g 0.026t 73552 S 1923 21.3 5165:31 hyperkube
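For reference, here is a hedged sketch of how such a snapshot can be taken for just the APIServer process. In this environment the APIServer runs as a hyperkube process (PID 3716 above); the pgrep pattern may need adjusting for other deployments.

```
# One-shot batch-mode top limited to the APIServer process.
top -b -n 1 -p "$(pgrep -f 'hyperkube apiserver' | head -n 1)"
```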
etcd memory load change for Kubernetes
The memory graph above indicates that when the workload is submitted to IBM Cloud Private, etcd memory use increases because of frequent read and write operations. After all workloads are running, memory use returns to a stable value.
etcd DB size change for Kubernetes
When 20,000 pods were started in IBM Cloud Private, the DB size of the Kubernetes etcd increased from 1.3 GB to approximately 1.45 GB.
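The figures in these graphs can be sampled with standard tools; a hedged sketch follows, where the endpoint address and TLS file paths are placeholders for your etcd setup.

```
# DB size is reported in the "DB SIZE" column of endpoint status (etcd v3 API).
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=ca.pem --cert=client.pem --key=client-key.pem \
  endpoint status --write-out=table

# Resident memory of the etcd process, in KiB.
ps -o rss= -p "$(pgrep -x etcd | head -n 1)"
```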
etcd memory load change for Calico Route Reflector
The chart above shows that even with 20,000 pods submitted in the IBM Cloud Private cluster, the etcd used by Calico (peered through the Route Reflector) consumed only about 500 MB of memory.
etcd DB size change for Calico Route Reflector
The Calico etcd DB size increased by approximately 40 MB with 20,000 pods.
These tests indicate that Calico with a Route Reflector does not add much load to etcd, so a shared etcd for Kubernetes and Calico is workable in this mode. However, using a separate etcd for Calico is still recommended so that Calico cannot impact the Kubernetes APIServer through a shared etcd.
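As a hedged sketch of what a separate Calico datastore looks like from the client side, calicoctl (and, analogously, the calico/node containers through their etcd-related environment variables) is pointed at the dedicated etcd cluster rather than the Kubernetes one. The endpoints and certificate paths below are placeholders.

```
# Point calicoctl at the etcd cluster dedicated to Calico.
cat > /etc/calico/calicoctl.cfg <<EOF
apiVersion: projectcalico.org/v3
kind: CalicoAPIConfig
spec:
  datastoreType: etcdv3
  etcdEndpoints: https://calico-etcd-1:2379,https://calico-etcd-2:2379,https://calico-etcd-3:2379
  etcdCACertFile: /etc/calico/certs/ca.pem
  etcdCertFile: /etc/calico/certs/cert.pem
  etcdKeyFile: /etc/calico/certs/key.pem
EOF

# Sanity check: list Calico node resources from the dedicated datastore.
calicoctl get nodes
```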
Topology Change
You can easily extend IBM Cloud Private to support 1,000+ nodes in one cluster.
The following two diagrams illustrate the major topology changes needed to support 1,000+ nodes with IBM Cloud Private.
The changes are:
- Use a Calico Route Reflector for large-scale clusters.
- Upgrade Calico from Version 2 to Version 3 to leverage the etcd Version 3 API for better performance.
- Use a separate etcd for Calico; this is recommended for production.
Old Topology
New Topology
Summary
Kubernetes currently claims support for 5,000 nodes in one cluster. However, the configuration of a cluster, such as its network management technology and its deployment topology, can limit the practical cluster size. More performance and deployment topology tuning remains to be done for large-scale clusters.
IBM Cloud Private 2.1.0.3 (Kubernetes 1.10.0) supports 1,000+ nodes in one cluster with the configurations discussed in this article. A new Medium blog with details on setting up an IBM Cloud Private cluster with 1,000+ nodes is coming soon.