Cost optimised DevOps cluster using Kubernetes and ASGs

Srinika Gunawardane
Published in dtlpub
4 min read · Sep 19, 2019

Why we went with Kubernetes for DTL Ops

The DTL Ops journey started a few years back, with a single EC2 server running Jenkins 2 and a few Jenkins pipeline projects. Within a year, the number of pipelines increased, and we wanted to find a way to avoid being bottlenecked at CI/CD by the static number of Jenkins executors.

Our initial thought was to add another EC2 server or two as Jenkins slaves, but that raised some questions: will it be enough? Will the number of pipelines keep increasing? Will we need to scale up manually? The answer was a Kubernetes (k8s) cluster, along with Jenkins X, which supports running on Kubernetes (as well as other cool features I will not focus on in this post).

Kubernetes allows us to run Ops tasks and monitoring with Prometheus and Grafana all in one cluster. It also lets us run more development support utility services, such as an Apicurio API documentation generator, with resources to spare for future app deployments.

What is Kubernetes and why did we choose it?

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation.

As stated above, k8s orchestrates and manages our containerised apps. It is a generic, cloud vendor-agnostic framework. It was originally developed by Google for its container orchestration in production, and Google is well known for running production at large scale. Those two reasons were enough for us to choose k8s over the alternatives.
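To make the declarative side concrete, here is a minimal Deployment manifest of the kind k8s manages for us; the app name and image are hypothetical, used only for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app            # hypothetical name, for illustration only
  labels:
    app: sample-app
spec:
  replicas: 2                 # desired state: k8s keeps two pods running
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: sample-app
        image: nginx:1.17     # placeholder image
        resources:
          requests:           # requests drive the scheduler's bin-packing
            cpu: 100m
            memory: 128Mi
```

We declare the desired state (two replicas) and k8s continuously reconciles the cluster towards it; that is what "declarative configuration and automation" means in practice.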

For more information on Kubernetes, G2 has a list of k8s alternatives and competitors.

Initial OPS k8s cluster

Our first k8s cluster consisted of one master node (a t2.medium EC2 instance) and two worker nodes (t2.large instances). This cluster was more than enough to cater to our Ops and other service deployment needs at the time. Then we realised AWS offers EC2 spot instances, which can save up to 90% of the cost compared to on-demand EC2 instances. We also wanted the ability to use the multiple pricing options and advanced options that AWS ASGs provide.

```yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-07-17T12:23:53Z
  labels:
    kops.k8s.io/cluster: ops.domain
  name: nodes
spec:
  cloudLabels:
    name: ops.k8s.nodes.ec2
    project: internal
    client: dtl
    environment: ops
  image: kope.io/k8s-1.12-debian-stretch-amd64-hvm-ebs-2019-05-13
  machineType: t2.large
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - ap-southeast-2a
  - ap-southeast-2b
```

Use of Mixed ASGs

Tools like kops (which we use) map instance groups to EC2 Auto Scaling Groups (ASGs) in AWS. With mixed ASG support out of the box, ASGs were perfect for deploying our k8s worker node group with 100% spot instances. Knowing that mixing in on-demand instances was just a configuration change away, should we want them in the future, was reassuring.

```yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-07-17T12:23:53Z
  labels:
    kops.k8s.io/cluster: ops.domain
  name: nodes
spec:
  image: kope.io/k8s-1.12-debian-stretch-amd64-hvm-ebs-2019-05-13
  cloudLabels:
    name: ops.k8s.nodes.ec2
    environment: ops
    client: dtl
    project: internal
  machineType: m4.large
  maxPrice: "0.05"
  maxSize: 2
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - m5.large
    - c5.large
    - t3.large
    - t2.large
    onDemandAboveBase: 0
    spotInstancePools: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - ap-southeast-2a
  - ap-southeast-2b
```

A setting worth highlighting is the combination of the instance type array and the spot instance pool count. We went with 4 spot instance pools, together with an instance array of the most commonly available EC2 instance types, to get the most diversity. Of course, everyone can set the on-demand base and on-demand-above-base percentage based on their own needs.

What next?

We have not yet added a cluster-autoscaler to scale the nodes based on pod resource requirements, but it is planned as our next task. The cluster-autoscaler will make our Ops k8s cluster more robust and resilient.
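As a sketch of that next step: the cluster-autoscaler discovers ASGs via tags, so in kops the instance group would need a min/max range plus the autoscaler auto-discovery tags as cloudLabels. This is an assumption of how we would configure it, not something we run yet (the cluster name ops.domain is taken from the manifests above, and the maxSize of 6 is a hypothetical value):

```yaml
spec:
  minSize: 2
  maxSize: 6                                     # headroom to scale out; hypothetical value
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""        # ASG tag used by autoscaler auto-discovery
    k8s.io/cluster-autoscaler/ops.domain: ""     # scopes discovery to our cluster
```

With these tags in place, the autoscaler can find the ASG on its own and grow or shrink it between minSize and maxSize as pending pods demand.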

There is a limitation worth mentioning when running a 100% spot instance fleet: you need to handle the instance lifecycle event that fires when AWS reclaims instances to serve its on-demand requests. Fortunately, there are a few ways of handling this, and we are moving to a spot instance termination handler.
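One such tool is AWS's open-source aws-node-termination-handler, which runs as a DaemonSet, watches the instance metadata endpoint for a spot interruption notice, and cordons and drains the node before the two-minute warning expires. A minimal sketch of such a DaemonSet follows; the image tag and the service account name are assumptions, and the project's Helm chart is the supported install path:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler     # needs RBAC to cordon/drain nodes
      containers:
      - name: handler
        image: amazon/aws-node-termination-handler:v1.13.0 # version tag is an assumption
        env:
        - name: NODE_NAME                                  # tells the handler which node to drain
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
```

Running it on every node means each worker can react to its own interruption notice, so pods are rescheduled gracefully instead of being killed mid-request.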

Conclusion

Containerized applications are great for utilising most of a VM's resources, and k8s is the leading framework for managing those containers. AWS offers EC2 spot instances at a fraction of the cost of on-demand instances, making them a natural fit for worker nodes. There are a few management overheads, especially around the spot instance termination notice, but fortunately there are tools out there to handle these situations seamlessly.

Reference: The definitive guide to running EC2 Spot Instances as Kubernetes worker nodes
