Chef in Practice part 5: Running chef-client automatically on a schedule

Luyo, published in verybuy-dev, Aug 29, 2017

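Continuing from the previous post, Chef in Practice part 4: Setting up an Elasticsearch cluster, it is finally time to set up a role and the chef-client cookbook so that nodes update themselves automatically.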

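1. Verify the Berkshelf configuration

I already installed the chef-client cookbook back in Learn Chef Rally Study Notes part 7: Debugging and running chef-client periodically, so here I only need to double-check it.

Change to the ~/learn-chef/ directory and open the Berksfile: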

source 'https://supermarket.chef.io'
cookbook 'chef-client'

Once this configuration looks right, run berks install:

$ berks install
Resolving cookbook dependencies...
Using chef-client (8.1.8)
Using cron (4.1.3)
Using logrotate (2.2.0)
Using ohai (5.2.0)
Using windows (3.1.2)
Using compat_resource (12.19.0)

Nothing needed updating, so it finished right away. Next comes berks upload:

$ berks upload
Skipping chef-client (8.1.8) (frozen)
Skipping compat_resource (12.19.0) (frozen)
Skipping cron (4.1.3) (frozen)
Skipping logrotate (2.2.0) (frozen)
Skipping ohai (5.2.0) (frozen)
Skipping windows (3.1.2) (frozen)

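Nothing needed uploading either, which confirms that the chef-client cookbook is already standing by on the Chef server.

2. Set up the role

Edit the JSON file

Next, set up the role. I already created a ~/learn-chef/roles/ directory in Learn Chef Rally Study Notes part 7, so switch to that directory first: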

$ cd ~/learn-chef/roles/

Then, using web.json as a reference, create an es.json with the following content:

{
  "name": "es",
  "description": "es server role.",
  "json_class": "Chef::Role",
  "default_attributes": {
    "chef_client": {
      "interval": 7200,
      "splay": 1800
    }
  },
  "override_attributes": {},
  "chef_type": "role",
  "run_list": [
    "recipe[chef-client::default]",
    "recipe[chef-client::delete_validation]",
    "recipe[elasticsearch_ik::default]"
  ],
  "env_run_lists": {}
}

interval = 7200 and splay = 1800 mean that chef-client runs roughly every 2 hours, with each run delayed by a random amount of up to 1800 seconds.
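As a rough illustration of how these two attributes interact, here is a minimal Ruby sketch. It assumes the wait before each run is interval plus a uniformly random number of seconds between 0 and splay. next_run_delay is a hypothetical helper written for this post, not part of Chef.

```ruby
# Hypothetical helper (not part of Chef): seconds to wait before the next run,
# assuming delay = interval + a random value in 0..splay.
def next_run_delay(interval, splay, rng = Random.new)
  interval + rng.rand(0..splay)
end

delay = next_run_delay(7200, 1800)
puts "next chef-client run in #{delay} seconds (#{(delay / 60.0).round(1)} minutes)"
```

Under this assumption, the gap between two runs on one node falls somewhere between 120 and 150 minutes.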

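Choosing interval and splay

Let's estimate the expected number of seconds of downtime caused by two nodes restarting at the same time, and then decide whether these settings need to change.

On my instance, restarting the elasticsearch service causes about 12 seconds of downtime; call it 15 seconds to be safe. Also assume that every chef-client run triggers a service restart.

Then, for any given second in which node 1 is down, the probability that node 2 is also down is roughly 15/1800. Since node 1's downtime lasts 15 seconds, the expected overlap per run is 15x15/1800 seconds. chef-client runs 12 times a day, so assuming a 30-day month, the expected time per month with both nodes down is (15x15/1800)x12x30 = 45 seconds. Converted to an SLA:

SLA = 1 - 15x(15/1800)x12x30/(86400x30) = 99.9983%

This is not ideal, but it is the worst case. In practice it is unlikely that every chef-client run triggers a service restart, so I consider it acceptable. Moreover, once one more node is added, the expected time with all three nodes down in the worst case becomes 15x(15/1800)²x12x30 = 0.375 seconds, and the SLA becomes:

SLA = 1 - 15x(15/1800)²x12x30/(86400x30) = 99.999986%

That level of reliability seems quite acceptable, so I will keep these settings. Note that I came up with these formulas myself, so I can't guarantee they are correct.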
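The arithmetic above can be reproduced with a short Ruby sketch. It simply encodes the same worst-case assumptions as the text (15 seconds of downtime per restart, an 1800-second splay window, 12 runs a day, 30 days a month); the constant and method names are made up for illustration.

```ruby
# Worst-case estimate from the text: every chef-client run restarts the service.
DOWNTIME_SEC   = 15.0   # downtime per restart (measured about 12 s, rounded up)
SPLAY_SEC      = 1800.0 # window over which each node's run is spread
RUNS_PER_DAY   = 12
DAYS_PER_MONTH = 30

# Expected seconds per month during which `nodes` nodes are all down at once.
def overlap_seconds_per_month(nodes)
  p_other_down = DOWNTIME_SEC / SPLAY_SEC
  DOWNTIME_SEC * p_other_down**(nodes - 1) * RUNS_PER_DAY * DAYS_PER_MONTH
end

# Availability over one month, as a percentage.
def sla_percent(nodes)
  month_sec = 86_400.0 * DAYS_PER_MONTH
  (1 - overlap_seconds_per_month(nodes) / month_sec) * 100
end

[2, 3].each do |n|
  puts format('%d nodes: %.3f s/month overlap, SLA %.6f%%',
              n, overlap_seconds_per_month(n), sla_percent(n))
end
```

Changing DOWNTIME_SEC or SPLAY_SEC makes it easy to see how sensitive the estimate is to the restart time and the splay window.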

Create the role

Now we can tell the Chef server to create this role:

$ knife role from file es.json
Updated Role es

Check that the role's information is correct:

$ knife role show es
chef_type:           role
default_attributes:
  chef_client:
    interval: 7200
    splay:    1800
description:         es server role.
env_run_lists:
json_class:          Chef::Role
name:                es
override_attributes:
run_list:
  recipe[chef-client::default]
  recipe[chef-client::delete_validation]
  recipe[elasticsearch_ik::default]

Bind the role to the nodes

Now I can bind the role to my nodes:

$ knife node run_list set es-* "role[es]"
ERROR: The object you are looking for could not be found
Response: node 'es-*' not found

Hmm, this command does not accept a wildcard, so I have to bind the nodes one at a time:

$ knife node run_list set es-1 "role[es]"
es-1:
  run_list: role[es]

$ knife node run_list set es-2 "role[es]"
es-2:
  run_list: role[es]
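With more than a couple of nodes, typing the command once per node gets tedious. Below is a minimal Ruby sketch that builds the same knife node run_list set command for each node name. run_list_set_command is a hypothetical helper, and the node list is hard-coded here as an assumption.

```ruby
# Node names from this post; adjust to your own environment.
NODES = %w[es-1 es-2].freeze

# Build the argument list for one `knife node run_list set` call.
def run_list_set_command(node_name, role)
  ['knife', 'node', 'run_list', 'set', node_name, "role[#{role}]"]
end

NODES.each do |name|
  cmd = run_list_set_command(name, 'es')
  puts cmd.join(' ')
  # Uncomment to actually run it on a workstation with knife configured:
  # system(*cmd) || abort("failed: #{cmd.join(' ')}")
end
```

Passing the arguments to system as a list avoids shell quoting problems with the brackets in role[es].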

Finally, do a trial run of chef-client:

$ knife ssh 'role:es' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)
172.31.21.70 Running handlers:
172.31.21.70 Running handlers complete
172.31.21.70 Chef Client finished, 11/82 resources updated in 24 seconds
172.31.21.50
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 19/83 resources updated in 26 seconds

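No problems, so we're done!

Update the recipe

Wait, I just remembered that one line in my recipe still uses the name:es-* condition, so let's update elasticsearch_ik/recipes/default.rb right away:

#elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq  # remove this line and use the one below instead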
elk_nodes = search(:node, 'role:es').map(&:ipaddress).sort.uniq
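To make clear what the map, sort, and uniq chain on that line produces, here is a small stand-alone Ruby sketch. FakeNode is a hypothetical stand-in for the node objects returned by search; in a real recipe those are Chef node objects, and the data below is purely illustrative.

```ruby
# Hypothetical stand-in for the node objects returned by Chef's search.
FakeNode = Struct.new(:ipaddress)

# Collect a sorted, de-duplicated list of IP addresses.
def cluster_ips(nodes)
  nodes.map(&:ipaddress).sort.uniq
end

nodes = [
  FakeNode.new('172.31.21.70'),
  FakeNode.new('172.31.21.50'),
  FakeNode.new('172.31.21.70') # the same address appearing twice
]
p cluster_ips(nodes)
```

Sorting keeps the rendered list stable from run to run, so the order in which search returns nodes does not by itself change the result.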

3. Add a third node

Finally, I want to bring up one more node to see whether the whole setup really works in a single step.

Go to the EC2 console, launch another CentOS t2.micro, copy its private IP, and bootstrap it:

$ knife bootstrap 172.31.18.83 --ssh-user centos --sudo --identity-file ~/.ssh/test.pem --node-name es-3 --run-list 'role[es]'
(...omitted)
172.31.18.83    11>> elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq
(...omitted)
172.31.18.83 [2017-08-29T11:15:20+00:00] ERROR: No attribute `node['ipaddress']' exists on
172.31.18.83 the current node. Specifically the `ipaddress' attribute is not
172.31.18.83 defined. Please make sure you have spelled everything correctly.
172.31.18.83
172.31.18.83 [2017-08-29T11:15:20+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

It still failed, because the node['ipaddress'] attribute used in the recipe could not be found.

I googled for a long time without finding a solution, then noticed that knife bootstrap has a --json-attributes option, which might let me set the node's attributes.

As a last resort, I ran the same command as before with --json-attributes added, passing the JSON '{"ipaddress":"172.31.18.83"}':

$ knife bootstrap 172.31.18.83 --ssh-user centos --sudo --identity-file ~/.ssh/test.pem --node-name es-3 --run-list 'role[es]' --json-attributes '{"ipaddress":"172.31.18.83"}'
(...omitted)

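I had no expectations at all, but the bootstrap actually succeeded!

I still need to run chef-client once more so that the other two nodes update their IP lists. After that, confirm that the cluster has 3 nodes: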

$ curl 172.31.21.70:9200/_cluster/health?pretty
{
  "cluster_name" : "development",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

This confirms that es-3 has joined the cluster.
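The same check can be scripted rather than read by eye. The following Ruby sketch parses a cluster health response and verifies the status and node count; cluster_healthy? is a hypothetical helper, and the sample string is a shortened version of the response above.

```ruby
require 'json'

# Return true when the health response reports a green cluster with the
# expected number of nodes. `body` is the JSON text returned by
# GET /_cluster/health.
def cluster_healthy?(body, expected_nodes)
  health = JSON.parse(body)
  health['status'] == 'green' && health['number_of_nodes'] == expected_nodes
end

sample = '{"cluster_name":"development","status":"green","number_of_nodes":3}'
puts cluster_healthy?(sample, 3)
```

To run it against a live cluster, the body could be fetched with Net::HTTP.get(URI('http://172.31.21.70:9200/_cluster/health')) after require 'net/http'.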

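I'm thrilled: I finally have a chef repository that can set up an elasticsearch node with a single command. What a sense of accomplishment!

4. Remove a node from the Chef server

Next, I'll shut down the extra machine and keep just two, which is also a chance to practice the commands for removing a node from the Chef server.

Removal takes two steps. Step one: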

$ knife node delete es-3 --yes
Deleted node[es-3]

Step two:

$ knife client delete es-3 --yes
Deleted client[es-3]

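Now the machine can be shut down.

5. Confirm that the nodes run chef-client automatically

After a good while, come back and run a command to confirm that the nodes' automatic update mechanism is working: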

$ knife status 'role:es' --run-list
46 minutes ago, es-2, ["role[es]"], centos 7.3.1611.
1 minute  ago, es-1, ["role[es]"], centos 7.3.1611.

es-1 happened to update just one minute ago, which confirms that the update mechanism works.
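This check can also be automated. The Ruby sketch below scans knife status output and lists the nodes whose last check-in is older than a threshold. stale_nodes is a hypothetical helper, and it assumes the age is printed as "N minute(s) ago" or "N hour(s) ago", as in the output above.

```ruby
# Parse lines such as "46 minutes ago, es-2, [...]" from `knife status` and
# return the names of nodes whose last check-in is older than max_minutes.
# Assumption: the age is printed as "<n> minute(s) ago" or "<n> hour(s) ago".
def stale_nodes(status_output, max_minutes)
  status_output.each_line.map do |line|
    m = line.match(/\A\s*(\d+)\s+(minute|hour)s?\s+ago,\s*([^,]+),/)
    next unless m

    minutes = m[1].to_i * (m[2] == 'hour' ? 60 : 1)
    m[3].strip if minutes > max_minutes
  end.compact
end

output = <<~STATUS
  46 minutes ago, es-2, ["role[es]"], centos 7.3.1611.
  1 minute  ago, es-1, ["role[es]"], centos 7.3.1611.
STATUS

# interval + splay = 7200 + 1800 seconds = 150 minutes
p stale_nodes(output, (7200 + 1800) / 60)
```

A node that has been silent for longer than interval plus splay is a reasonable candidate for a closer look, though the exact threshold is a judgment call.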

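You might wonder why the two nodes' update times differ by more than half an hour. Isn't splay only 30 minutes?

That is because my es-2 was rebooted, so chef-client was set up on it at a different time than on es-1, hence the gap. It also shows something else: each node's schedule is independent! Seen this way, the chance of simultaneous downtime in the SLA calculation above becomes even smaller, and reliability is higher.

Closing thoughts

At this point I have a working chef repository, but the further I go, the more I notice what I still lack. For example, next I would like to apply environment-specific settings and write tests to confirm that the cookbook works correctly. I have not learned either of these yet, and much of what I do know came from banging my head against the wall, so I badly need to go back to Learn Chef Rally and recharge.

Previous: Chef in Practice part 4: Setting up an Elasticsearch cluster

Next: Chef in Practice part 6: Launching EC2 machines with Chef commands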