DA Kube — Selenium Grid using Kubernetes, Docker, Helm and Traefik

Next Gen Distributed Automation for microservices and shift-left use cases

Ambighananthan Ragavan
Expedia Group Technology
10 min read · Nov 22, 2019



When your applications have a reasonable number of UI (User Interface) tests, it is inevitable that you’ll want a solution to execute them concurrently in order to achieve a faster feedback cycle. I wrote in this post how we achieved Distributed Automation (DA) using the SeleniumGridScaler open source project running on AWS, and it has been running fine for the past 5 years. With 100+ hubs running around 140k+ tests a day, it has met our needs so far.

But we are seeing shortfalls in the system as we move to a microservice world and towards a shift-left approach for testing. (Testing earlier in the development cycle, even before committing the code, to find and fix problems early.)

New challenges bring new use cases and new use cases require new technologies.

Problems with the current EC2-based SeleniumGridScaler solution

We see four main problems with the current solution:

  1. The current solution is EC2 based as opposed to container based. The Selenium hub instance has to be created manually, and it takes around 2–4 minutes for the hub to scale up nodes. The hub is static, and to save on cost we have to stop it when not in use, then start it before we start our automation. This adds time to our feedback cycle.
  2. We want to shift-left, speed up the feedback cycle, and have dynamic hub creation.
  3. Multiple AWS accounts share CI/CD infrastructure, and private IP addresses can overlap across accounts.
  4. With EC2, CPU/memory can be fine-tuned only at the instance level, hence some parameters are over-allocated and we’re paying for resources we don’t use.

We’re addressing all four problems by moving to a container-based design. Kubernetes (EKS), Docker, Helm and Traefik combined provide a comprehensive solution that addresses all of these problems and more.

I’m assuming here that you know the basics of Docker and Kubernetes (k8s) concepts, and I’ll skip going into that. (You can consult the Docker or Kubernetes documentation for an overview.)

Why do we need Traefik?

Before going into hub and node creation, let me explain why we need Traefik.

There are different ways to get external traffic into your EKS cluster, including ClusterIP, NodePort, LoadBalancer, and Ingress. Of these, we identified Ingress as the standout way to address our needs. Traefik is an Ingress controller that acts as a smart router by sitting in front of multiple services.

In Kubernetes, a Service routes traffic to a group of Pods.

As you can see in the diagram below, the Ingress rule (hard coded here) directs that any traffic hitting the Traefik endpoint hubdemo3.hub.test.expedia.com/ should be routed to the Service named hubdemo3-selenium-hub. Traefik, being a reverse proxy, does this very well. You can have any number of hubs, and Traefik makes sure your automation runs against the correct Selenium Grid instance.

How Traefik routes the incoming traffic
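To make this concrete, here is a minimal, hand-written sketch of the kind of Ingress rule described above. The host and Service names mirror the example in the diagram; the apiVersion, annotation, and hub port (4444 is the Selenium default) are assumptions that may differ in your cluster and Traefik version.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: hubdemo3-selenium-hub
  annotations:
    kubernetes.io/ingress.class: traefik     # hand this rule to Traefik
spec:
  rules:
    - host: hubdemo3.hub.test.expedia.com    # the Traefik endpoint for this grid
      http:
        paths:
          - path: /
            backend:
              serviceName: hubdemo3-selenium-hub   # the hub Service
              servicePort: 4444                    # assumed Selenium hub port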

Restricting traffic to internal

The NLB (Network Load Balancer) used by Traefik might be public facing, and you don’t want your grid exposed to the outside world. Use loadBalancerSourceRanges to restrict traffic to internal ranges only.

metadata:
  name: traefik-ingress-service
  namespace: kube-system
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: {{ .Values.loadbalancer.tags }}
spec:
  loadBalancerSourceRanges:
    - "x.x.x.x/32"
    - "y.y.y.y/28"

Using Helm to deploy

Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of Kubernetes resources. A single chart might be used to deploy something simple, like a Selenium Grid, i.e., a hub and nodes.

I use the Selenium Helm chart to deploy the hub and nodes.

Creating a DA-Kube Selenium Grid

helm install --set chrome.enable=true selenium --name hubdemo3 --kubeconfig=config/test/kubeconfig

  • This command creates a grid with a hub and one Pod (which runs one Chrome browser)

Scaling your DA-Kube Selenium Grid

helm upgrade --set chrome.replicas=50 hubdemo3 selenium --kubeconfig=config/test/kubeconfig

  • This command scales the number of Chrome Pods attached to your hub up to 50, counting the one you already created in the previous command

Purging your DA-Kube Selenium Grid

helm delete hubdemo3 --purge --kubeconfig=config/test/kubeconfig

  • This will destroy your grid

Creating DA-Kube grid with a particular browser version

helm install --set chrome.enable=true --set chrome.tag=3.141.59-vanadium selenium --name hubdemo3 --kubeconfig=config/test/kubeconfig

Here, passing 3.141.59-vanadium will give you Chrome 77.0.3865.75. You can get the Chrome Docker image tags from the public Docker Hub.

The idea here is that you can create as many grid setups as you need, on demand, with any browser version of Chrome or Firefox, use them, and destroy them.

Imagine the CI/CD pipeline pictured below for a microservice app, where many developers are working on multiple branches and they want to deploy their branches and run automation concurrently and independently of each other.

CI/CD pipeline of a sample project and how DA-Kube will fit in the picture for different use cases

It is not practical to use one instance of Selenium Grid to meet this use case. DA-Kube can meet all the use cases here because each deployed branch can run its automation on its own grid and each grid can run different browsers or browser versions as well.

Horizontal Pod Level Scaling

This gif shows how the overall mechanism of scaling worker nodes and Pods up and down works in k8s.

How EKS cluster worker nodes and Pods scale up and down!

Warm Worker Node Pool

Just because this is a container-based solution, grid creation is not guaranteed to be faster unless the worker node pool (the plain EC2 instances on which the Kube Pods run) is “warm”. When many teams are using the EKS cluster, it will be warm most of the time. You can also increase the minimum number of nodes that are always running and optimize it, as sketched below.
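How you raise that minimum depends on how the worker node pool is provisioned. As a purely illustrative sketch, if the node group were created with eksctl, minSize is what keeps a warm pool of instances running (the names, region, and sizes below are hypothetical, not the values we use):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: da-kube              # hypothetical cluster name
  region: us-west-2          # hypothetical region
nodeGroups:
  - name: selenium-workers
    instanceType: c5.2xlarge
    minSize: 3               # always-warm capacity, even with no grids deployed
    maxSize: 30              # ceiling for scale-up
    desiredCapacity: 3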

HTTPS based Selenium Grid

One thing you may have noticed here is that the Selenium Grid now uses https instead of http. You may need to change your automation framework capabilities to support this.

NightwatchJS Example

"chrome_dakube_grid" : {
"launch_url" : “http://www.expedia.com”,
"selenium_host" : "hubdemo3.hub.test.yourcompany.com",
"selenium_port" : 443,
"use_ssl": true,
"desiredCapabilities": {
"browserName": "chrome",
"javascriptEnabled": true,
"acceptSslCerts": true,
"chromeOptions": {
"args": ["--no-sandbox", "disable-web-security"]
}
}
}

and run these commands as well:

export NODE_TLS_REJECT_UNAUTHORIZED=0
npm config set strict-ssl false

You can see that selenium_port is no longer 4444; we are using 443, and use_ssl is true.

Reliable Selenium Grid

To make sure your Selenium Grid is reliable and always available, make use of the QoS feature of k8s. K8s has three different classes of Quality of Service.

  1. Guaranteed
  2. Burstable
  3. Best-Effort

Only the Guaranteed QoS class can reduce the chance of your Pods being evicted in times of resource shortage. To give your Pods Guaranteed QoS, you must follow these rules:

  1. Every Container in the Pod must have a memory limit and a memory request, and they must be the same.
  2. Every Container in the Pod must have a CPU limit and a CPU request, and they must be the same.

If your Pod spec follows those two rules, k8s automatically sets the QoS class to Guaranteed.

Example: The following configuration sets the limits and requests of a Pod to exactly the same values, so that criteria 1 and 2 above are met. This Pod will get the Guaranteed QoS class.

## Configure resource requests and limits
## ref: http://kubernetes.io/docs/user-guide/compute-resources/
resources:
  requests:
    memory: "750Mi"
    cpu: "225m"
  limits:
    memory: "750Mi"
    cpu: "225m"

You can see all the other values here

Running one browser per Pod

In k8s, it is possible to run many containers in one Pod, which means you could run multiple Chrome containers within one Pod. But running just one Chrome container per Pod adds a lot to the reliability of your setup, and there is no difference in cost or performance.
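For illustration, this is roughly what one-browser-per-Pod looks like as a Deployment. This is a hand-written sketch, not the exact manifest the Selenium Helm chart renders; the names, image tag, and HUB_HOST/HUB_PORT wiring are assumptions based on the standard selenium/node-chrome 3.x images.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hubdemo3-selenium-node-chrome   # hypothetical name
spec:
  replicas: 50                          # one Pod == one Chrome browser
  selector:
    matchLabels:
      app: selenium-node-chrome
  template:
    metadata:
      labels:
        app: selenium-node-chrome
    spec:
      containers:
        - name: chrome                  # a single browser container per Pod
          image: selenium/node-chrome:3.141.59-vanadium
          env:
            - name: HUB_HOST            # point the node at the hub Service
              value: hubdemo3-selenium-hub
            - name: HUB_PORT
              value: "4444"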

Hub and nodes in different subnet

EKS cluster creation requires a minimum of two subnets. This makes it plausible that your hub ends up in one subnet and its nodes in another. This shouldn’t be an issue, but if you get too many FORWARDING_TO_NODE_FAILED errors, this may be one of the reasons. I have seen this in the old pure-EC2 DA implementation but not with DA-Kube.

Choosing the right instance type

A plain EC2 instance of type c5.xlarge can run 10 Firefox or Chrome instances, but in Kubernetes it can only run 8, because every k8s worker node also runs system components (at a minimum the kube-proxy daemon Pod and the kubelet process), so the total resources available for browser containers are reduced.

A c5.2xlarge instance can run 19 browser Pods.
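A back-of-envelope check, assuming the Guaranteed QoS requests shown earlier (750Mi memory and 225m CPU per browser Pod); the numbers are illustrative, but they show why memory, not CPU, is what caps a c5.2xlarge at roughly 19 browser Pods:

# c5.2xlarge: 8 vCPU (8000m) and 16 GiB (~16384Mi) of memory
# CPU:    19 Pods x 225m  = 4275m   -> plenty of CPU headroom
# Memory: 19 Pods x 750Mi = 14250Mi -> roughly 2 GiB left for kube-proxy,
#         the CNI plugin, kubelet and OS overhead, hence ~19 Pods per node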

Headless vs Regular

There is no advantage in running a browser in headless mode in Selenium Grid. There is a common misunderstanding that running “headless” is something special. The only advantage of “headless” is when you run automation locally in your laptop — you won’t be distracted by browser instances opening and closing, and you won’t need an xvfb process.

Private IP address bleeding

K8s reserves a certain number of private IP addresses per worker node instance type, per interface. There is a challenge associated with that reservation policy. For example, on a c5.2xlarge instance, only 19 Chrome or Firefox browser Pods can be run. But according to the spec, a c5.2xlarge instance can have 15 IPs per interface, and with two interfaces, 30 private IP addresses will be reserved. That is around 9 IP addresses more than I need. When we scale to 300–500 Selenium node Pods, this can easily result in IP address bleeding: not enough private IP addresses are left, and Pod creation fails.

500 browser Pods require 500/19 ≈ 26 EC2 instances of c5.2xlarge. With around 9 extra IPs reserved on every node, we would waste roughly 26 × 9 = 234 private IP addresses. This is just a rough calculation.

The workaround is to use amazon-vpc-cni-k8s to limit the IP addresses assigned to each instance:

- name: WARM_IP_TARGET
  value: "21"

kubectl apply -f amazon-vpc-cni-k8s/aws-k8s-cni.yaml --kubeconfig=config/test/kubeconfig
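For context, WARM_IP_TARGET is an environment variable on the aws-node DaemonSet that the amazon-vpc-cni-k8s manifest installs into kube-system. This trimmed fragment (most fields omitted, and details vary by CNI version) shows where the setting lives:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: aws-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: aws-node               # the VPC CNI plugin container
          env:
            - name: WARM_IP_TARGET     # cap the spare private IPs kept per node
              value: "21"              # 19 browser Pods + daemon Pods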

Why not use third party cross browser vendors to run all automation?

Third party cross browser vendors charge based on the number of parallel connections. It could cost around $350k for 120 parallel connections, depending on the vendor, type of service, real vs. simulated devices, etc. But if your company has lots of UI-based microservice apps, each with its own CI/CD pipeline, then 120 parallel connections will not be enough to run all the tests of all the pipelines if and when they happen to run at the same time.

You might then need 1000 parallel connections from cross browser vendors, which would cost around $2.41 million. With our Distributed Automation system, we only pay for what we use, at a cost of about $80,000 annually. The chart below shows the benefit of using an internal solution like DA-Kube for all automation runs and using a third party cross browser license only for running a subset of tests for cross browser coverage. This means you need to pay for only a few parallel connections; how many depends on your organization’s size as well.

Third party cross browser providers vs. DA (cost of boxes in the data centre)

Scale the test infrastructure

One might have the best distributed automation solution in the world, but if the test infrastructure is not scaled to meet the throughput of the concurrent automation run, its utility is reduced because you’re not meeting the goal of minimizing testing time. You must scale up the test infrastructure size appropriately.

Reduce cost using spot instance in EKS

This is something I have not tried, but it is possible to reduce the cost of an EKS cluster using a mixture of on-demand and spot instances. A spot instance can be terminated by AWS with only a 2-minute warning. K8s has features like Node Affinity, Taints, and Tolerations, along with NodeSelector, to help run your grid on both on-demand and spot instances in a more reliable and cost-effective way.
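As a purely hypothetical sketch of that idea: taint and label the spot node group, let the disposable browser node Pods tolerate the taint and prefer those nodes, and leave the hub on on-demand capacity. The label and taint names below are illustrative, not something DA-Kube ships with.

# Pod spec fragment for the Selenium browser node Pods
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: lifecycle          # hypothetical label on spot workers
                operator: In
                values: ["spot"]
  tolerations:
    - key: "lifecycle"                  # hypothetical taint on spot workers
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"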

To create a solution, it is important to identify all the problems at hand and link them to your vision. The DA-Kube design has so far met all the challenges posed by the next generation use cases!

I presented this at the Selenium Camp conference in Kiev in February 2019.
