Reducing the chance of multiple preemptible nodes being terminated at once in a GKE cluster

Published in

One Mount | Tech

8 min read · Sep 27, 2021

This post describes a way to reduce the chance that many nodes are terminated at the same time by GCP's preemptible instance mechanism, by deploying an open source tool into a GKE cluster.

1. The problem

One fine day, the project's QC environment was reported down, one service after another 😔. Before we could figure out what was going on, word came in: “it's back up …”. What made so many services die at once and then “suddenly” come back to life?

2. Root cause and analysis

After some digging, the cause turned out to be node preemption. Specifically, 4 nodes were preempted within about 30 minutes, from 14:25 to 14:56, and that is what interrupted the services.

Let's reread a bit of the documentation on preemptible instances to explain this. By default, in non-production environments, VMs and GKE clusters are all created with preemptible machines. Preemptible instances are a way to cut the cost of running on GCP: they can save you up to 80% compared with regular instances. Using them in non-production makes sense, given how much they save 😎

But it's not all rosy 😂. The biggest limitation of a preemptible instance is that GCP can reclaim it at any time to serve other purposes. Specifically:

It can be terminated at any time.
It is terminated at least once every 24 hours, possibly more often.

So why do the services come back on their own some time after a node is terminated? In a GKE cluster, instances do not stand alone; they are managed by a node pool and a managed instance group.

Thanks to the autoscaler feature of the managed instance group, after an instance is terminated the instance group recreates another one to keep the specified target size. Once the new instance is up, the pods are rescheduled onto it.
👉 This explains why the services came back on their own after being unavailable for a while.

Basically, before an instance is terminated by the preemption mechanism, it receives a notice 30 seconds in advance. That window is too short to drain the pods and bring up a new node to reschedule them onto; see the preemption process documentation for more.

3. Approach

Since this comes from how preemptible instances work, using them means partly accepting it; it can hardly be solved 100%. The sensible direction is to reduce the chance of many nodes becoming unavailable while everyone is hard at work. It would be much better if preemption happened at night and did not hit many nodes at once. Right??? 😎

After searching GitHub and Medium for ways to improve preemption frequency, we found an open source tool: https://github.com/estafette/estafette-gke-preemptible-killer

Purpose

To accomplish the “mission” above, the tool, once deployed into a GKE cluster, proactively terminates preemptible nodes at a different time for each node, and makes sure all of this happens outside office hours.

How does that work

On each node, the tool sets an annotation whose value is the time at which it will proactively terminate that node. The time is computed from the node's creation time, chosen at random between the 12th and the 24th hour after creation. For example, for a node created at 00:00, the termination time is picked at random between 12:00 and 23:59. Because it is random, each node gets a different termination time.
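The time selection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's own code: the function name `pick_kill_time` and the uniform distribution are assumptions made for the example.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_kill_time(created_at: datetime, rng: random.Random) -> datetime:
    """Pick a kill time between 12h and 24h after node creation.

    Hypothetical helper: a uniform draw is assumed for illustration.
    """
    offset_seconds = rng.uniform(12 * 3600, 24 * 3600)
    return created_at + timedelta(seconds=offset_seconds)


# A node created at 00:00 UTC, as in the example above.
created = datetime(2021, 9, 1, 0, 0, tzinfo=timezone.utc)
rng = random.Random(42)  # seeded only so the sketch is repeatable

# Three nodes created at the same moment still get different kill times.
kill_times = [pick_kill_time(created, rng) for _ in range(3)]
for kill_at in kill_times:
    print(kill_at.isoformat())
```

Each printed time falls between 12:00 and 24:00 on the creation day, and the three values differ, which is what keeps the nodes from being killed together.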

In addition, every 600 seconds the tool scans all preemptible nodes and terminates a node if the time in its annotation is earlier than the current time.
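A single scan pass could look roughly like the sketch below. The node names, the dictionary of annotation values and the function `expired_nodes` are all hypothetical; the real tool reads the annotations from the Kubernetes API.

```python
from datetime import datetime, timezone


def expired_nodes(kill_times: dict[str, datetime], now: datetime) -> list[str]:
    """Return the nodes whose annotated kill time is earlier than `now`."""
    return sorted(name for name, kill_at in kill_times.items() if kill_at < now)


# Hypothetical node names and annotation values, for illustration only.
annotations = {
    "node-a": datetime(2021, 9, 1, 13, 5, tzinfo=timezone.utc),
    "node-b": datetime(2021, 9, 1, 21, 40, tzinfo=timezone.utc),
}

now = datetime(2021, 9, 1, 14, 0, tzinfo=timezone.utc)
to_kill = expired_nodes(annotations, now)
print(to_kill)  # ['node-a']
```

In a long-running process this check would be repeated after each sleep of the scan interval; only nodes whose time has passed are picked, so the rest stay up until a later pass.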

4. Deployment

4.1. Install Helm

Guideline: Helm | Installing Helm

4.2. Use Terraform to create the custom role and Google service account, and grant permissions

For the tool to be able to terminate nodes, create a Google service account for it to use. This service account is granted the custom role compute.instancesDelete and the compute.viewer role.

4.3. Connect to the GKE cluster

Use gcloud to connect to the GKE cluster where you want to deploy the tool

gcloud container clusters get-credentials main --zone asia-east1-b --project example-playground --internal-ip

4.4. Create namespace

Create a namespace to hold the tool's resources

kubectl create namespace estafette-preemptible-killer

4.5. Install with Helm

Use Helm to deploy the tool into the GKE cluster and the namespace just created

Add repo to your local helm client

# Add repo to your local helm client
helm repo add estafette https://helm.estafette.io
# Verify the repository was added to your local helm client
helm search repo estafette/estafette-gke-preemptible-killer

If the repo was added successfully, the output looks like this

NAME                                           CHART VERSION   APP VERSION
estafette/estafette-gke-preemptible-killer     1.2.7           1.2.7

After adding the repo, we use Helm to install the tool into the GKE cluster. This step differs from the deployment guideline in the README: we replace the service account key with workload identity, which is more secure and avoids the key leaking. To set up workload identity, we first pass extra custom values during the Helm install to specify the Google service account (GSA) and Kubernetes service account (KSA) to use. Specifically:

helm upgrade --install estafette-gke-preemptible-killer \
--namespace estafette-preemptible-killer estafette/estafette-gke-preemptible-killer \
--set secret.workloadIdentityServiceAccount=wi-preemptible-killer \
--set serviceAccount.name=wi-preemptible-killer \
--set extraEnv.BLACKLIST_HOURS="01:30 - 12:00"

Where:

secret.workloadIdentityServiceAccount: the Google service account (GSA) that the KSA is mapped to.
serviceAccount.name: the K8S service account (KSA) the tool runs as.
[Optional] extraEnv.BLACKLIST_HOURS: (value is in UTC) the time window in which the tool is not allowed to kill nodes. BLACKLIST_HOURS is usually set to office hours, so nodes are not killed during the working day.
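Because BLACKLIST_HOURS is expressed in UTC, it is easy to get the window wrong when thinking in local time. The sketch below checks whether a local moment falls inside the window used in the install command. It is a hedged illustration, not the tool's parser: the helper names, the single non-wrapping interval, the exclusive end bound and the UTC+7 office timezone are all assumptions.

```python
from datetime import datetime, time, timedelta, timezone


def parse_interval(spec: str) -> tuple[time, time]:
    """Parse a single 'HH:MM - HH:MM' interval into (start, end) times."""
    start_text, end_text = (part.strip() for part in spec.split("-"))
    start = datetime.strptime(start_text, "%H:%M").time()
    end = datetime.strptime(end_text, "%H:%M").time()
    return start, end


def is_blacklisted(moment: datetime, spec: str) -> bool:
    """True if `moment`, converted to UTC, falls inside the interval.

    Assumes the interval does not wrap past midnight.
    """
    start, end = parse_interval(spec)
    now_utc = moment.astimezone(timezone.utc).time()
    return start <= now_utc < end


SPEC = "01:30 - 12:00"  # value from the helm command above
local_tz = timezone(timedelta(hours=7))  # assumed office timezone (UTC+7)

morning = datetime(2021, 9, 1, 9, 0, tzinfo=local_tz)   # 02:00 UTC
evening = datetime(2021, 9, 1, 20, 0, tzinfo=local_tz)  # 13:00 UTC
print(is_blacklisted(morning, SPEC))  # True
print(is_blacklisted(evening, SPEC))  # False
```

Under the UTC+7 assumption, 01:30 - 12:00 UTC corresponds to 08:30 - 19:00 local time, which covers a typical working day.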

4.6. Use Terraform to grant the KSA permission to impersonate the GSA

To use workload identity, in other words to let the K8S service account (KSA) impersonate the Google service account (GSA), we need to grant the KSA the iam.workloadIdentityUser role on the GSA.
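The part of this binding that is easiest to get wrong is the member string that identifies the KSA. As a small sketch, the snippet below composes it from the project, namespace and KSA name used in this post. The function name and the dictionary keys are hypothetical and only for illustration; the actual grant is still made through Terraform or gcloud.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Build the IAM member string that refers to a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


# Values taken from this post; the dict layout itself is illustrative.
binding = {
    "service_account": "wi-preemptible-killer@example-playground.iam.gserviceaccount.com",
    "role": "roles/iam.workloadIdentityUser",
    "member": workload_identity_member(
        "example-playground",
        "estafette-preemptible-killer",
        "wi-preemptible-killer",
    ),
}
print(binding["member"])
```

The member combines the project's workload identity pool with the namespace and the KSA name, so a typo in any of the three leaves the pod unable to act as the GSA.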

4.7. Verify the resources Helm created in the GKE cluster

K8S Cluster role

$ kubectl get ClusterRole estafette-gke-preemptible-killer
NAME                               CREATED AT
estafette-gke-preemptible-killer   2021-09-01T05:48:50Z

$ kubectl describe ClusterRole estafette-gke-preemptible-killer
Name:         estafette-gke-preemptible-killer
Annotations:  meta.helm.sh/release-name: estafette-gke-preemptible-killer
              meta.helm.sh/release-namespace: estafette-preemptible-killer
PolicyRule:
  Resources  Non-Resource URLs  Resource Names  Verbs
  ---------  -----------------  --------------  -----
  pods       []                 []              [delete get list]
  nodes      []                 []              [get list patch update delete]

K8S Cluster role binding

$ kubectl describe ClusterRoleBinding estafette-gke-preemptible-killer
Name:         estafette-gke-preemptible-killer
Labels:       app.kubernetes.io/instance=estafette-gke-preemptible-killer
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=estafette-gke-preemptible-killer
              app.kubernetes.io/version=1.2.7
              helm.sh/chart=estafette-gke-preemptible-killer-1.2.7
Annotations:  meta.helm.sh/release-name: estafette-gke-preemptible-killer
              meta.helm.sh/release-namespace: estafette-preemptible-killer
Role:
  Kind:  ClusterRole
  Name:  estafette-gke-preemptible-killer
Subjects:
  Kind            Name                   Namespace
  ----            ----                   ---------
  ServiceAccount  wi-preemptible-killer  estafette-preemptible-killer

K8S service account

$ kubectl get serviceaccount -n estafette-preemptible-killer
NAME                    SECRETS   AGE
default                 1         21d
wi-preemptible-killer   1         21d

$ kubectl describe serviceaccount wi-preemptible-killer -n estafette-preemptible-killer
Name:                wi-preemptible-killer
Namespace:           estafette-preemptible-killer
Annotations:         iam.gke.io/gcp-service-account: wi-preemptible-killer
                     meta.helm.sh/release-name: estafette-gke-preemptible-killer
                     meta.helm.sh/release-namespace: estafette-preemptible-killer
Image pull secrets:  <none>
Mountable secrets:   wi-preemptible-killer-token-586bx
Tokens:              wi-preemptible-killer-token-586bx
Events:              <none>

K8s deployment and pod

$ kubectl get deployments -n estafette-preemptible-killer
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
estafette-gke-preemptible-killer   1/1     1            1           21d

$ kubectl get pod -n estafette-preemptible-killer
NAME                                                READY   STATUS    AGE
estafette-gke-preemptible-killer-6749595bdf-dph2m   1/1     Running   0

4.8. Verify that the tool is working

Check the pod's logs in GKE

$ kubectl logs estafette-gke-preemptible-killer-6749595bdf-dph2m -n estafette-preemptible-killer
<nil> INF Cluster has 9 preemptible nodes
<nil> INF 259 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-0ea4cdd0-36g0
<nil> INF 249 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-0ea4cdd0-5kqn
<nil> INF 711 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-20f804d3-9150
<nil> INF 279 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-20f804d3-jcp9
<nil> INF 310 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-20f804d3-mh7d
<nil> INF 677 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-20f804d3-xzqd
<nil> INF 567 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-ab8da797-j9cp
<nil> INF 404 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-ab8da797-l1pc
<nil> INF 717 minute(s) to go before kill, keeping node host=gke-main-n2d-standard-pool-ab8da797-v8ds
<nil> INF Sleeping for 312 seconds...

Check the activity log in the GCP Console Logs Explorer for the GSA wi-preemptible-killer@example-playground.iam.gserviceaccount.com. Whenever the tool proactively terminates a node, an activity log entry is written. The query is:

wi-preemptible-killer@example-playground.iam.gserviceaccount.com
logName="projects/example-playground/logs/cloudaudit.googleapis.com%2Factivity"

Conclusion

We have now deployed the open source tool estafette-gke-preemptible-killer into a GKE cluster to reduce the number and proportion of nodes terminated by preemption during working hours. After running and monitoring it for a while, feedback has been positive: daytime preemptions have dropped markedly.

Hopefully this post helps you if you run into a similar problem with preemptible instances in a GKE cluster.

Thanks to Nguyen Chi Cong for suggesting this tool