How does EKS select subnets for Service LoadBalancers?

Hallblazzar · Hallblazzar :Developer Journal
5 min read · Sep 12, 2020

About this topic

What information is included?

  • The behavior of the LoadBalancer type Service under EKS.
  • How I analyzed it via two different approaches, i.e., black-box testing and source code tracing.

What information is not included?

  • Basics of AWS EKS and ELB. If you'd like to learn about them, please refer to the AWS official documentation and samples. Otherwise, if you have an AWS Support plan, you could ask an AWS BD/TAM/SA for assistance, or consider creating a support case for guidance (maybe I'll be the one assisting you in the case 🤣).
  • Basics of K8s. If you'd like to learn about it, please refer to the Kubernetes official documentation.

Question definition

Recently, I handled an interesting case about the behavior of the LoadBalancer type Service under EKS:

Why can EKS place the Network Load Balancer (NLB)/Classic Load Balancer (CLB) created by a LoadBalancer type Service on subnets without tags?

What is the question asking? To understand it, we first need to know the purpose of a Kubernetes (K8s) Service. Under the Kubernetes design, a Service is one of the most common ways users expose their containerized applications running in Pods to the internet. Two types of Service can be defined for this purpose:

  • NodePort: Exposes the Service on each Node's IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You'll be able to contact the NodePort Service, from outside the cluster, by requesting <NodeIP>:<NodePort>.
  • LoadBalancer: Exposes the Service externally using a cloud provider's load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created.

Typically, for users of cloud-provider-hosted Kubernetes solutions (e.g., AWS EKS, GCP GKE or Azure AKS), the LoadBalancer type Service is the most commonly used one. In general, on a cloud-provider-hosted environment, a LoadBalancer type Service provisions a load balancer hosted by the cloud provider, such as AWS ELB, GCP Network Load Balancing or Azure Load Balancer.

Back to the question in the case: apparently, the user was confused by how the load balancers (AWS NLB/CLB) are created. As the AWS official documentation mentions, ELBs are created on subnets based on subnet tags (an example tagging command follows this list):

  • For an internet-facing ELB, subnets with the tag kubernetes.io/role/elb are required. The internet-facing ELB will be placed on these subnets.
  • For an internal ELB, subnets with the tag kubernetes.io/role/internal-elb are required.
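
For example, tagging a subnet is a single AWS CLI call; the subnet IDs below are placeholders, and the tag value can be 1 (or left empty):

```sh
# Mark a subnet as a candidate for internet-facing load balancers
aws ec2 create-tags \
  --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/elb,Value=1

# Mark a subnet as a candidate for internal load balancers
aws ec2 create-tags \
  --resources subnet-0fedcba9876543210 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1
```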

From the description above, it looks like subnets with the correct tag are required to create an ELB. However, after testing, I could confirm the situation the customer mentioned: without subnet tagging, subnets can still be used to create an ELB. Though I found the following description in a small note in an AWS Knowledge Center topic, which shows the route table is also an important factor in the subnet selection process, I was still unable to verify how the whole mechanism works:

Note: If you don’t use the preceding tags, then Cloud Controller Manager determines if a subnet is public or private by examining the route table associated with that subnet. Unlike private subnets, public subnets use an internet gateway to get a direct route to the internet.
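
In other words, whether a subnet counts as public can be checked from its route table. One quick way to do that with the AWS CLI (the subnet ID is a placeholder; an empty result means the subnet has no explicit association and the VPC's main route table applies) is to look for a route that points at an internet gateway, i.e. an igw-* gateway ID:

```sh
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
  --query 'RouteTables[].Routes[].GatewayId'
```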

But how could I verify that?

Analysis Approach 1: Black-Box Testing

To verify the behavior of an unknown system, the most straightforward way is to deploy it. Besides, to prevent unexpected side effects, doing the verification on a newly created EKS cluster is the safest approach. My test steps are as follows:

  1. Deploy a brand new EKS cluster with a cluster definition like the one below (deploy it with eksctl create cluster -f, and replace default-key with your own EC2 key pair).
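
A minimal ClusterConfig sketch along these lines is enough for the test; the name, region and node group settings are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: subnet-test          # illustrative cluster name
  region: us-west-2          # illustrative region

vpc:
  cidr: "192.168.0.0/16"     # eksctl creates public and private subnets inside this VPC

nodeGroups:
  - name: ng-1
    instanceType: t3.medium
    desiredCapacity: 2
    ssh:
      publicKeyName: default-key   # replace with your EC2 key pair
```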

2. Create an NLB/CLB from a LoadBalancer type Service definition like the one below:
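
A Service sketch along these lines works; the selector and ports are illustrative, the aws-load-balancer-type annotation switches between a CLB (the default) and an NLB, and the aws-load-balancer-internal annotation switches between internet-facing and internal:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: test-lb
  annotations:
    # Remove this annotation to get a Classic Load Balancer (the default);
    # keep it set to "nlb" to get a Network Load Balancer.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Uncomment to request an internal load balancer instead of an internet-facing one.
    # service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: test-app            # illustrative selector
  ports:
    - port: 80
      targetPort: 80
```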

3. Repeat the test with different combinations of the variables, i.e., subnets with/without the role tags above, internal vs. internet-facing load balancers, and NLB vs. CLB.

If you do the same tests, then after verification you will reach the same finding as me: even without the role tags, the load balancers are still created, with internet-facing ones placed on public subnets (the ones with a route to an internet gateway) and internal ones placed on private subnets.

Actually, for each condition, you can also verify the result yourself by checking which subnets the created load balancer was attached to:
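
One way to do that (a sketch; my-service is a placeholder for the Service name, and the DNS name ties the Service to the load balancer AWS created):

```sh
# DNS name of the load balancer created for the Service
LB_DNS=$(kubectl get svc my-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# For a CLB: list the subnets it is attached to
aws elb describe-load-balancers \
  --query "LoadBalancerDescriptions[?DNSName=='${LB_DNS}'].Subnets"

# For an NLB: list the subnets behind its availability zones
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?DNSName=='${LB_DNS}'].AvailabilityZones[].SubnetId"
```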

However, the question is, is the summary I made reliable? Is it really the truth?

Analysis Approach 2: Tracing Source Code

As EKS is a hosted K8s solution, if I'd like to track the behavior of the EKS control plane and the Service controller, then in addition to black-box testing, tracing their source code is the better choice: after all, code won't lie. Thanks to the power of open source today, if I'd like to trace the code of an open source project, everything I need can be found on Google. For the K8s control plane and Service controller, the source code is stored in the K8s GitHub repository.

But it still took me a bunch of time to find what I needed, because I'm not familiar with the structure of the project. For that reason, I spent most of the time searching the repository with different keywords, such as "AWS", "ELB" or "NLB". Finally, I figured out where to find what I needed in the project:

  • kubernetes/staging/src/k8s.io/legacy-cloud-providers/ - Cloud-provider-specific component implementations, e.g. controllers, can be found in this directory.
  • cloud-provider-aws repository - Though I don't know when the decision was made, it looks like cloud-provider-specific component implementations are now being split out of the Kubernetes repository. Under this arrangement, each cloud provider maintains its component implementations in its own repository under the Kubernetes project, and the AWS-specific implementations are stored in this repository. Besides, from its go.mod file, I could confirm that it still refers to the code stored in kubernetes/staging/src/k8s.io/legacy-cloud-providers/, just like the README.md of kubernetes/staging/src/k8s.io/legacy-cloud-providers/ mentions: "Out-of-tree cloud providers can consume packages in this repo to support legacy implementations of their Kubernetes cloud provider."

After clarifying that, I could easily trace the behavior of the LoadBalancer type Service in kubernetes/staging/src/k8s.io/legacy-cloud-providers/, and I summarize it as the pseudocode below:
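
The sketch below is a condensed, Go-style reconstruction of that logic (error handling, logging and unrelated branches are omitted, and the structs are simplified stand-ins for the EC2 API objects the real code works with):

```go
// Condensed from staging/src/k8s.io/legacy-cloud-providers/aws/aws.go
// (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go).
// EnsureLoadBalancer is the entrypoint for LoadBalancer Services; it calls
// findELBSubnets to pick the subnets before creating the CLB/NLB.
package subnetselection

import (
	"sort"
	"strings"
)

// Stand-ins for the EC2 API objects used by the real code.
type Subnet struct {
	ID   string
	AZ   string
	Tags map[string]string
}

type RouteTable struct {
	SubnetIDs  []string // subnets explicitly associated with this table
	Main       bool     // true for the VPC main route table
	GatewayIDs []string // gateway IDs of the table's routes
}

// isSubnetPublic: a subnet is public when its route table (or the VPC main
// route table, if the subnet has no explicit association) contains a route
// whose gateway is an internet gateway ("igw-...").
func isSubnetPublic(tables []RouteTable, subnetID string) bool {
	var table *RouteTable
	for i := range tables {
		for _, id := range tables[i].SubnetIDs {
			if id == subnetID {
				table = &tables[i]
			}
		}
	}
	if table == nil {
		for i := range tables {
			if tables[i].Main {
				table = &tables[i]
			}
		}
	}
	if table == nil {
		return false
	}
	for _, gw := range table.GatewayIDs {
		if strings.HasPrefix(gw, "igw") {
			return true
		}
	}
	return false
}

// findELBSubnets: the candidates are the VPC subnets carrying the cluster tag
// kubernetes.io/cluster/<cluster-name>; at most one subnet is kept per AZ.
func findELBSubnets(candidates []Subnet, tables []RouteTable, internalELB bool) []string {
	roleTag := "kubernetes.io/role/elb"
	if internalELB {
		roleTag = "kubernetes.io/role/internal-elb"
	}

	subnetsByAZ := map[string]Subnet{}
	for _, subnet := range candidates {
		// An internet-facing ELB only considers public subnets.
		if !internalELB && !isSubnetPublic(tables, subnet.ID) {
			continue
		}
		existing, seen := subnetsByAZ[subnet.AZ]
		if !seen {
			subnetsByAZ[subnet.AZ] = subnet
			continue
		}
		// Tie-break within one AZ: a subnet carrying the matching role tag wins...
		_, existingTagged := existing.Tags[roleTag]
		_, subnetTagged := subnet.Tags[roleTag]
		if existingTagged != subnetTagged {
			if subnetTagged {
				subnetsByAZ[subnet.AZ] = subnet
			}
			continue
		}
		// ...otherwise the subnet whose ID comes first lexicographically is kept.
		if strings.Compare(existing.ID, subnet.ID) > 0 {
			subnetsByAZ[subnet.AZ] = subnet
		}
	}

	ids := make([]string, 0, len(subnetsByAZ))
	for _, s := range subnetsByAZ {
		ids = append(ids, s.ID)
	}
	sort.Strings(ids)
	return ids
}
```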

The ELB creation entry point is the function EnsureLoadBalancer. It searches for subnets and performs the ELB creation, and the main logic of the subnet selection can be found in the function findELBSubnets. The rest of the details can be seen in the pseudocode above. Besides, because the function/variable names above follow the real function/variable names in the source code, if you'd like to trace the actual code in the repository, you can use those names and the code link in the comments to locate them.

In addition, the main differences between my black-box findings and what the code shows are:

  • All subnets carrying the EKS cluster tag, kubernetes.io/cluster/<cluster-name>, are candidates.
  • Subnets satisfying the route table and tag requirements are selected, but the ones satisfying the tag requirement have higher priority.
  • If multiple subnets qualify for a single AZ, the one whose subnet ID comes first in lexicographic order is selected.

It's interesting, isn't it? Though it was really tough work, I still love the process of searching, testing, verifying and reading code!

I hope this article is helpful if you're interested in this subject. If you have questions about it, or if you'd like to discuss EKS and ELB further, please feel free to let me know. I'd love to talk about it with you. 😎
