Running EKS in production for 2 years: the Kubernetes journey at HK01
Dick Tang, Director of DevOps at HK01
Moving to the Kubernetes platform is one of the key milestones achieved by the HK01 Site Reliability Engineering team (and the entire HK01 tech team). It allows us to ship our products more rapidly and reduce time-to-market by leveraging modern building blocks from the cloud-native ecosystem.
Of course, it was not a free lunch: it required architectural changes to both our codebase and our cloud infrastructure design, training for the whole tech team (both developers and system engineers), and some troubleshooting in the production environment. But we overcame those challenges; we did it.
As a result, we have 50+ microservices running on our production Kubernetes (AWS EKS) cluster, serving an audience of 8M+ in the Greater China region, while keeping the cluster up and running with 99.99%+ availability in 2019.
Pre-Kubernetes Age
HK01.com was launched in 2016. As a media website, it sees traffic spikes driven by breaking news. As the website grew, the tech team spent more and more time fighting server load or debugging the monolithic codebase in the production environment, instead of shipping new features and products.
In response to these issues, the Site Reliability Engineering (SRE) team was formed in 2017. The major goal of the team is to keep hk01.com up and running reliably. Over the past three years, the SRE team has introduced several DevOps tools and related technologies: a CI/CD pipeline, various infrastructure-as-code tools such as Ansible and Terraform, and of course Kubernetes, the lead character of this story.
Why Kubernetes?
Kubernetes is a container orchestration platform, and it benefits us in the following ways:
- Consistent but dynamic environment, from development to production
Developers sometimes complained about inconsistencies between the production environment and the SIT environment, let alone the local development environment, which was usually completely different from the others. Thanks to container technology, we can use the same container image from our laptops all the way to our production servers. The technology also allows developers to experiment with new tech stacks with less effort.
- Supporting flexible deployment in the fine-grained microservice world
Our previous tech stack (EC2 instances and AWS CodeDeploy) was not designed to manage deployments of hundreds of applications. Kubernetes is not only designed to deploy hundreds of microservices, but also provides advanced deployment strategies such as blue-green deployment. Kubernetes also allows denser deployments with fewer nodes, which brings further cost savings.
Kubernetes at the pilot stage
Back in 2017, Kubernetes had already shown its potential and was popular among tech companies, and we were one of the companies considering adopting it. While we were revamping our system into a microservice architecture with our backend engineers, we also started to evaluate the deployment options available on AWS at that time, including kops, kubespray, and Tectonic (the commercial offering from CoreOS).
All of them are self-hosted solutions that require system engineers to maintain the Kubernetes cluster themselves, including the core of Kubernetes, etcd, the consistent and highly available key-value store used as Kubernetes' backing store for all cluster data.
As nearly the only stateful component in Kubernetes, the etcd cluster requires a highly available setup: online cluster version upgrades, online resizing, and node failure and replacement with no downtime. That is not easy to set up, and I could foresee it becoming one of the SRE team's main challenges in keeping downtime short.
But it would soon no longer be my concern, because there was a rumor that AWS was going to offer a managed solution for Kubernetes, and that solution became the AWS EKS we know today.
Thanks to AWS EKS, we can have a managed Kubernetes control plane (including etcd)
At AWS re:Invent 2017, AWS announced its managed Kubernetes offering, Amazon Elastic Container Service for Kubernetes (aka EKS). I think it was exciting news for every SRE team running a production environment on AWS, not least the HK01 SRE team. We read every piece of its documentation and applied for its beta test. We knew EKS was not perfect (e.g. compute nodes were unmanaged on day one, and the installation required manually linking the compute nodes to the master nodes). But with a managed Kubernetes control plane (including the etcd cluster), we were prepared to take on that challenge and ready to embrace the new age.
AWS announced the general availability of EKS in summer 2018. Unfortunately, the first round of the launch did not cover the AWS Asia region where the HK01 infrastructure is located. However, the SRE team was eager to run a pilot on Kubernetes, so we picked one of our web services, our work-in-progress Single Sign-On component together with its related components, and deployed it to the Oregon region (us-west-2), where AWS EKS was available.
That summer, we read documentation from both AWS and the community; we built the Kubernetes cluster and tested various failure scenarios; we learned and shared that knowledge within the SRE team; and finally we pushed the application containers to the Kubernetes cluster.
In late July 2018, we launched the production EKS cluster with our Single Sign-On component. We delivered the service online on time while building it on Kubernetes, a technology none of us had been experts in three months earlier.
Polishing Kubernetes platform
At this point, we knew that Kubernetes worked great for us, but it was not yet polished enough to be our daily operations platform. Kubernetes itself is just a container orchestration platform, nothing more.
We had to re-create our foundations and existing workflows on Kubernetes, including (but not limited to):
- CI/CD flow (we picked TravisCI, Keel.sh, and later fluxcd),
- monitoring stack (we picked EFK and Datadog),
- L7 load balancer gateway (we picked Traefik),
- container image registry (we picked ECR).
The team created an internal Helm chart as a generic template for web applications, encapsulating the complexity of Kubernetes manifests behind a simple configuration file (namely, Helm's values.yaml), so that every backend developer and every junior system engineer can pick up Kubernetes skills.
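To give a rough idea of what that looks like, here is a minimal sketch of such a values.yaml; the key names below are illustrative only and are not the actual schema of our internal chart:

# Hypothetical values.yaml for a web application using the internal chart
# (key names are made up for illustration)
image:
  repository: example-service      # container image stored in ECR
  tag: "1.2.3"
replicaCount: 3                    # number of pods to run
ingress:
  host: api.example.com            # routed by Traefik
env:
  NODE_ENV: production

A chart like this would then render those values into the underlying Deployment, Service, and Ingress manifests, so developers rarely need to touch raw Kubernetes YAML for a standard web service.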
The team also built an internal Kubernetes extension, the SecretRelease Operator. It allows us to manage Kubernetes Secrets in a GitOps way while storing them in encrypted form in Git. Since the secrets are encrypted, they are safe to host in a third-party Git repository such as GitHub. The secrets are decrypted only by the controller running in the target Kubernetes cluster, via Mozilla SOPS and AWS KMS.
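The following is only a rough sketch of the idea; the apiVersion, kind, and field names are invented for illustration and do not reflect the actual schema of our internal Operator:

# Hypothetical SecretRelease resource committed to Git
# (schema invented for illustration)
apiVersion: example.com/v1alpha1
kind: SecretRelease
metadata:
  name: payment-service
  namespace: production
spec:
  # Ciphertext produced by Mozilla SOPS with an AWS KMS key;
  # only the in-cluster controller with access to that KMS key can
  # decrypt it and create the corresponding Kubernetes Secret.
  encryptedData:
    DATABASE_URL: ENC[AES256_GCM,data:...,type:str]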
Kubernetes as our major platform
In 2019, we launched various Internet products, including 01心意 (a charity donation platform), Letzgoal (a mobile app for runners), 開講 (a UGC platform), and 一網打盡 (an e-commerce platform). We put every new application and microservice on the well-polished Kubernetes platform, and it works well: it saves the SRE team more than 50% of the setup time compared with our previous EC2 instance-based setup. We can now deploy a new release of a microservice within 5 minutes, and the system supports an average of 10+ releases per day.
While deploying new microservices on Kubernetes, we also kept migrating existing services and websites to Kubernetes without downtime. I would say migrating is even more difficult than building from scratch. It took us nearly another year to migrate most of the existing services to the Kubernetes cluster.
Finally, in the summer of 2020, we migrated www.hk01.com, the flagship product of the HK01 tech team, to the Kubernetes platform. The planning took a few months, and we tested the site several times before launching it into production.
Troubleshooting Kubernetes at production
AWS EKS is awesome, except when we are facing issues in the production environment :)
As we migrated more services to Kubernetes, our cluster reached a critical mass and we started to see various issues caused by traffic volume and cluster size. Among those issues, I would like to share two here:
- Keep-alive issue between Ingress and Node.js backend
- Connection timeout issue due to conntrack
Keep-alive issue between Ingress and Node.js backend
Ingress is one of the nice abstractions offered by Kubernetes. It serves as a virtual layer-7 (HTTP) load balancer and allows us to route traffic to the individual microservices hosted on the Kubernetes cluster.
There are quite a few Ingress implementations (aka Ingress Controllers) on the market, and as I mentioned before, we picked Traefik, one of the popular solutions in the Kubernetes community.
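For context, an Ingress rule is just a piece of YAML mapping hostnames and paths to backend Services. The sketch below uses made-up hostnames and service names, and the apiVersion shown is the current networking.k8s.io/v1 one, which may differ from what your cluster version expects:

# Minimal Ingress sketch: route api.example.com to the example-service Service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
  annotations:
    kubernetes.io/ingress.class: traefik   # handled by the Traefik controller
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80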
One day, we noticed abnormal HTTP 502 Bad Gateway errors in our Traefik access log. They occurred several times a day and seemed to happen randomly and rarely: those 502 responses accounted for less than 0.1% of the daily HTTP request volume.
We looked into the Node.js backend and found no evidence that those 502 responses came from the backend itself.
We cross-checked the backend's deployment configuration from the pre-Kubernetes stage. One of the major changes during the container migration was that we removed the nginx reverse proxy in front of the Node.js backend. That nginx instance acted as a simple reverse proxy via proxy_pass, similar to the following example:
location / {
    # forward all requests to the local Node.js backend
    proxy_pass http://127.0.0.1:3000;
}
We know that the nginx reverse proxy downgrades backend requests to HTTP/1.0 by default, so it seemed the issue was related to how Node.js handles HTTP/1.1 traffic directly. After some research, it turned out that Node.js ships with a (poor) default HTTP/1.1 keep-alive timeout of 2 seconds (ref: [1], [2]). The only way to override the value is explicitly in the code:
const express = require("express");
const app = express();

const server = app.listen(8080);
// raise the HTTP keep-alive timeout (in milliseconds) above the default
server.keepAliveTimeout = 61 * 1000;
At the same time, Traefik 1 has a hardcoded backend connection HTTP keep-alive timeout of 90 seconds; it only became a configurable parameter in Traefik 2.0 (ref: [3]).
Combining this information, we had a complete picture of what happened: the Node.js backend dropped the previous connection due to its keep-alive timeout, while at the same time Traefik planned to reuse that connection to make another request. The mismatched timeouts on the two sides led to the error. The following chart illustrates the details more clearly:
Hence, as a workaround, we disabled backend connection keep-alive in Traefik until we had updated the timeout setting in every Node.js service. In theory, this workaround costs more computational power and introduces more TCP connections; in practice, the overhead incurred was still acceptable for a temporary workaround.
Connection timeout issue due to conntrack
On another day, we noticed a small number of errors when our backend code initiated connections to other microservices. There were various symptoms, the most common being an ETIMEDOUT error thrown from the client. The errors happened more frequently as we migrated more services to the Kubernetes cluster, so we knew it was not a rare edge case but a systemic issue impacting our platform and services.
After some investigation, we found that when an application makes HTTP requests to a microservice that is not located in the same subnet as the application, there is a chance that an ETIMEDOUT error occurs.
This time, the cause was an issue in the underlying container networking: the Linux kernel, specifically the conntrack module, has a known race condition on port allocation when performing source network address translation (SNAT).
The Xing tech team wrote a very comprehensive story on the details of this kernel issue and the workarounds. In short, they recommended switching the port allocation algorithm to fully random (based on a PRNG) to reduce the probability of the race condition to nearly zero.
In AWS EKS, container networking is managed by the networking plugin amazon-vpc-cni-k8s. We found that amazon-vpc-cni-k8s has an option called AWS_VPC_K8S_CNI_RANDOMIZESNAT to configure which port allocation algorithm is used. However, its default value sets the algorithm to hash-random instead of fully random.
A bugfix that switches the default to the fully random algorithm was recently accepted and will be shipped with the upcoming v1.7.0 (rc1 as of this writing).
Since the bugfix was still in the pre-release stage, we had to apply the environment variable manually to override the default value as the fix for the issue. The ETIMEDOUT error disappeared after we gradually rolled this fix out to the production environment.
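For reference, overriding that default boils down to setting one environment variable on the CNI plugin. The sketch below assumes the standard aws-node DaemonSet that EKS installs in the kube-system namespace; the value prng selects the fully random algorithm:

# Excerpt of the aws-node DaemonSet spec (kube-system namespace),
# overriding the SNAT port allocation algorithm
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: AWS_VPC_K8S_CNI_RANDOMIZESNAT
              value: "prng"   # fully random port allocation for SNAT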
Luckily, with a great team and great tools (including the internal Helm chart we created), the issues mentioned above did not cause any major outage on our platform. We had less than 30 minutes of Kubernetes cluster downtime in 2019.
Lessons learnt
The lessons learnt over the last two years deserve a write-up of their own, but I still want to take this chance to highlight two of them here:
- Never use kubectl directly
It would be tempting to prepare Kubernetes manifests and apply them via the kubectl command directly. DON'T. Never. Please do it in a managed way. For example, managing the Kubernetes cluster in a GitOps way (via fluxcd) is highly recommended.
“kubectl is the new ssh. Limit access and only use it for deployments when better tooling is not available.”
- Explicitly specify proper CPU and memory requests and limits
We encountered resource allocation issues several times, including a single node being overloaded and strange out-of-memory kills of unrelated pods. Setting resource values properly not only helps the Kubernetes scheduler do its job, but also lets the cluster auto-scale to a proper size. It reduces the chance that you chase a strange issue that finally turns out to be a resource allocation problem; a minimal sketch follows below.
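As a sketch only (the numbers are illustrative; pick values based on your actual usage), the container section of a Deployment would carry something like:

# Illustrative requests/limits in a Deployment's container spec
containers:
  - name: example-service
    image: example-service:1.2.3
    resources:
      requests:        # what the scheduler reserves for the pod
        cpu: 250m
        memory: 256Mi
      limits:          # hard caps enforced at runtime
        cpu: "1"
        memory: 512Mi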
The next steps
We now have a reliable Kubernetes platform running, but this is not the end of our journey. With Kubernetes, we can build automation that we had never thought about before. The following are the directions the team is going to look at:
- More GitOps
As users of fluxcd, we are looking to further extend and customize the workflow, for example by adding manual approval for code releases of some important backend services. We are exploring ArgoCD, the GitOps Toolkit, and similar projects to see how we can implement those workflows under the GitOps framework.
- Further leveraging Operators
We are already using extensions (aka Operators in Kubernetes) such as the Helm Operator and the KubeDB Operator. We are looking at other Operators such as crossplane.io to put more infrastructure configuration into Kubernetes, which we believe will be the next-generation infrastructure configuration hub.
- Developing more extensions
We have developed our own Operator for Secret management in GitOps. We are planning to open-source it, to help the community solve problems similar to the one we faced. We also look forward to creating more Operators to automate some of our routine work.
Last words
This great work could not have been delivered without the entire Site Reliability Engineering (SRE) team. Many thanks also to the rest of the HK01 tech team, who helped with this transition.
Thanks for reading. If you enjoyed this article, feel free to support us by hitting the clap button 👏 and to help others find it. We are hiring, including Site Reliability Engineers; job descriptions can be found HERE. You are welcome to drop me a message if you have further questions.