GKE Operations on Google Cloud
Many people are overwhelmed by how to manage their GKE environments. Should we run a single cluster or multiple clusters? Which VPC should the cluster be placed in, service or host? How should billing be handled? What about permissions and privileges? And security! I will address most of these questions, but this article looks specifically at general operations practices for managing GKE in the cloud, with a focus on GCP.
When customers build out their GKE clusters, they have to decide on sizing, which affects IP address management, scaling and overall management. This eventually affects billing and possible chargebacks within the organization: who is paying for what, and how much should they be paying? Answering these questions is particularly helpful when customers are moving to the cloud and want to estimate their expected spend.
Default values for Kubernetes settings, including the number of pods per node, the number of services in an environment and the number of nodes per cluster, can be so large that every new cluster wastes a lot of IP addresses. There are ways to expand the IP ranges that are usable in GKE, including the RFC 6598 and RFC 5735 ranges, using the Kubernetes IP masquerade feature.
T-shirt sizing of GKE clusters can be a great solution for some of these concerns. By creating multiple T-shirt sized GKE clusters, organizations can flexibly manage their environments and conserve IP address space. Organizations can also enjoy predictable cost, avoid knee-jerk optimization efforts and enhance their automation process.
Using the following factors as metrics, administrators can define small, medium and large environments. No one size fits all customers, so administrators should feel free to add XL, XXL and other sizes as needed, taking care not to overcomplicate the environment. Machine types should also be part of the sizing considerations.
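As a rough illustration of what a T-shirt size catalogue might look like, the sketch below records each size as a small data structure. The field names, node counts, pod densities and machine types are all hypothetical placeholders chosen for the example, not recommendations; each organization would fill in its own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSize:
    """One T-shirt size. Every value used below is an illustrative placeholder."""
    name: str
    machine_type: str
    min_nodes: int
    max_nodes: int
    max_pods_per_node: int


# Hypothetical catalogue: adjust the numbers to your own environment.
SIZES = {
    "small": ClusterSize("small", "e2-standard-2", 1, 3, 16),
    "medium": ClusterSize("medium", "e2-standard-4", 3, 10, 32),
    "large": ClusterSize("large", "e2-standard-8", 5, 50, 64),
}


def describe(size_name: str) -> str:
    """Return a one-line, human-readable summary of a size."""
    s = SIZES[size_name]
    return (f"{s.name}: {s.min_nodes}-{s.max_nodes} x {s.machine_type}, "
            f"up to {s.max_pods_per_node} pods per node")


if __name__ == "__main__":
    for key in SIZES:
        print(describe(key))
```

Keeping the catalogue in one place means an XL or XXL size can be added later as a single new entry.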
Understand the Business Needs
Machine types matter depending on the workload type. Video and ML workloads that require GPUs should be sized differently from general compute workloads. If your environment does not yet need this level of compute power, that’s ok. Move on to the next step.
Understand the Application
Workload size depends mostly on the application. How much compute power and memory does your deployment need? Does it use Deployments or StatefulSets, with or without persistent disks? The workload type helps determine what size of GKE cluster to define. If an environment requires a database application, for example MongoDB, it could qualify for a medium-sized GKE cluster. Since persistent volumes can be created dynamically, you do not need to change the GKE cluster size for them; instead, treat the persistent volume as an add-on when needed.
Traffic profile is another factor that can be used for sizing. Where traffic is expected to be large, for example for an internet-facing application, it can be used to estimate the number of nodes that should be in a GKE cluster. GCP nodes can reach up to 100 Gbps of throughput to internal IP address destinations. However, the maximum possible egress from a single VM to external IP addresses cannot exceed 7 Gbps in total across all egress flows. If you estimate average traffic flow at 10 Gbps with a bandwidth tolerance threshold of 50%, then it is best to plan for a GKE cluster of about 4 nodes minimum, based on internet-facing traffic flow.
More GKE-focused attributes guide the next sizing factors. Namespaces are used for resource isolation in GKE. If you have a large number of namespaces, review how many of them belong in a single cluster. This is important where there are many teams and/or projects. Managing namespaces also has ramifications for DNS names. Namespaces can help identify which environment is using the most addresses and can reveal optimization opportunities.
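The traffic estimate above can be sketched as a small calculation. This is one possible reading of the arithmetic, not the only one: it assumes the 50% tolerance means each node should carry no more than half of the 7 Gbps per-VM egress cap, and it treats any extra node as optional headroom. The function name and the `spare_nodes` parameter are hypothetical.

```python
import math


def min_nodes_for_egress(avg_gbps: float,
                         per_vm_cap_gbps: float = 7.0,
                         tolerance: float = 0.5,
                         spare_nodes: int = 0) -> int:
    """Estimate the minimum node count needed to carry avg_gbps of egress.

    Assumption: `tolerance` is the fraction of the per-VM cap that each
    node is allowed to use (0.5 means run each node at half the cap).
    """
    usable_per_node = per_vm_cap_gbps * tolerance
    return math.ceil(avg_gbps / usable_per_node) + spare_nodes


if __name__ == "__main__":
    # 10 Gbps over nodes limited to 3.5 Gbps each.
    print(min_nodes_for_egress(10))
    # Same estimate with one extra node kept as headroom.
    print(min_nodes_for_egress(10, spare_nodes=1))
```

Under these assumptions, 10 Gbps needs three nodes at 3.5 Gbps each, and one spare node brings the total to four.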
Number of Services
The number of services to be deployed might not be obvious for some deployments, but it can be estimated, so ranges can be used here. Since services are assigned IP addresses, it is a good idea to identify the expected number of services in a namespace and then separate them into external-facing and internal-facing services. External-facing services would be exposed as LoadBalancer services, while internal-facing services can be set up as ClusterIP or NodePort services. Where several services all use HTTP, consider using Ingress to expose them behind the same IP address.
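To make the link between a service estimate and an address range concrete, the sketch below picks the smallest IPv4 prefix that covers an expected number of services. The function name and the `headroom` multiplier are assumptions made for this example, and any minimum or maximum range size that GKE enforces is not modelled here.

```python
import math


def service_range_prefix(expected_services: int, headroom: float = 2.0) -> int:
    """Return the longest IPv4 prefix length (smallest range) that covers
    the expected number of services multiplied by a headroom factor.

    `headroom` is a hypothetical growth margin, not a GKE setting.
    """
    needed = max(1, math.ceil(expected_services * headroom))
    # Number of host bits required to address `needed` services.
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


if __name__ == "__main__":
    for count in (20, 100, 500):
        print(f"{count} services -> /{service_range_prefix(count)}")
```

With the default headroom, an estimate of 100 services maps to a /24 range and 500 services to a /22.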
Number of Pods
According to a 2020 Google article, “more than 95% of GKE clusters are created with no more than 30 pods per node”! The density of pods on nodes can be used to determine the address range for the pods on each node. Using this approach, a small environment can use a smaller subnet, and larger environments can use larger subnets.
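The effect of pod density on address consumption can be sketched as follows. The calculation assumes that each node receives a pod range holding at least twice as many addresses as its maximum pod count, which is how GKE's per-node ranges are commonly described (for example a /24 for 110 pods); verify this against the current GKE documentation before relying on it. The function names are hypothetical.

```python
def per_node_pod_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the pod range given to each node, assuming the
    range must hold at least twice the maximum number of pods."""
    needed = 2 * max_pods_per_node
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


def cluster_pod_addresses(nodes: int, max_pods_per_node: int) -> int:
    """Total pod addresses consumed by a cluster with `nodes` nodes."""
    prefix = per_node_pod_prefix(max_pods_per_node)
    return nodes * 2 ** (32 - prefix)


if __name__ == "__main__":
    for pods in (8, 30, 110):
        print(f"{pods} pods per node -> /{per_node_pod_prefix(pods)} per node, "
              f"{cluster_pod_addresses(100, pods)} addresses for 100 nodes")
```

Under this assumption, a 100-node cluster limited to 30 pods per node consumes 6,400 pod addresses, compared with 25,600 at 110 pods per node.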
All of the above factors can be used to identify the number of nodes that should be in a GKE cluster. With autoscaling, only the base number of nodes needs to be defined, and the cluster can scale up or down depending on need. Where the scaling profile for a cluster needs to change, for example when moving a GKE cluster from small to medium, the cluster update command can be used to change the profile.
By using node pools, a GKE cluster can be expanded with additional compute, memory and storage capacity in order to support new workload requirements. New node pools can also accommodate new services, new pods and new IP address ranges.
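As a sketch of how a profile change might be scripted, the helper below only assembles the argument list for `gcloud container clusters update` with its autoscaling flags; it does not run anything. The cluster, node pool and region names in the example are placeholders, and the node counts are illustrative values for a hypothetical medium size.

```python
import shlex


def autoscaling_update_command(cluster: str, node_pool: str,
                               min_nodes: int, max_nodes: int,
                               region: str) -> list[str]:
    """Build (but do not execute) a gcloud command that changes the
    autoscaling bounds of one node pool."""
    return [
        "gcloud", "container", "clusters", "update", cluster,
        "--enable-autoscaling",
        f"--node-pool={node_pool}",
        f"--min-nodes={min_nodes}",
        f"--max-nodes={max_nodes}",
        f"--region={region}",
    ]


if __name__ == "__main__":
    # Placeholder names: replace with your own cluster, pool and region.
    cmd = autoscaling_update_command("demo-cluster", "default-pool", 3, 10,
                                     "us-central1")
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` later without extra shell quoting.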
With GKE Autopilot, you still have the option to define the network subnets for the nodes, services and pods. Some of the above requirements can be ignored, at the cost of giving up some control over the environment.
Putting in the early work helps prepare your environment for everyday operational challenges. T-shirt sizing reduces how much your team has to react to changes in the environment, and changes will happen. The result is no more knee-jerk reactions and, more importantly, conservation of scarce IP address resources.
Another benefit is cost predictability, since workloads can land on a pre-defined set of upgradable compute sizes. For example, a large temporary marketing initiative that is only needed for a few weeks can be accounted for during the budgeting cycle instead of being handled with hand-to-mouth assumptions.
This approach also enables automation through forms, which can accelerate your team’s delivery of IT resources to the business. When a form captures the traffic profile, application requirements and security level, faster decisions can be made on what the right compute sizing should be.
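A form-driven decision could be as simple as the sketch below, which maps a few form answers to a size label. The field names, thresholds and the `-gpu` suffix are all hypothetical and exist only to show the shape of such a rule; real thresholds would come from your own size definitions.

```python
from dataclasses import dataclass


@dataclass
class SizingRequest:
    """Answers collected from a hypothetical intake form."""
    avg_traffic_gbps: float
    expected_pods: int
    needs_gpu: bool = False


def recommend_size(req: SizingRequest) -> str:
    """Map form answers to a T-shirt size using illustrative thresholds."""
    if req.avg_traffic_gbps > 10 or req.expected_pods > 1000:
        size = "large"
    elif req.avg_traffic_gbps > 2 or req.expected_pods > 200:
        size = "medium"
    else:
        size = "small"
    # GPU workloads are flagged separately so they can be sized differently.
    return f"{size}-gpu" if req.needs_gpu else size


if __name__ == "__main__":
    print(recommend_size(SizingRequest(avg_traffic_gbps=1, expected_pods=50)))
    print(recommend_size(SizingRequest(avg_traffic_gbps=5, expected_pods=50)))
    print(recommend_size(SizingRequest(1, 5000, needs_gpu=True)))
```

Keeping the rule in code means the same logic can back a web form, a ticket template or a pipeline step.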