Chef in Practice part 4: Setting up an Elasticsearch cluster

Published in verybuy-dev, Aug 28, 2017

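In the previous post I finished installing IK and built a mechanism for updating its dictionary, but the dictionary only updates when chef-client is run by hand. I still need the chef-client cookbook and a role to make the update automatic; I learned how to use both in "Learn Chef Rally study notes part 7: Debugging and running chef-client periodically".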

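The catch is that an updated dictionary only takes effect after the elasticsearch service is restarted, and a restart causes tens of seconds of downtime. I don't want that to happen, at least not during peak hours.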

I can think of two ways to deal with this:

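Have Chef run chef-client during off-peak hours, so that restarting elasticsearch has less impact
Launch a few more machines, set up a cluster, and use a longer splay so that the nodes are less likely to restart at the same time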

The first option is simpler, but I will have to face clustering sooner or later, so I'll go with the second.
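To get a rough feel for how much a longer splay helps, here is a minimal sketch under a deliberately simplified model that is my own assumption, not Chef's actual scheduler: each of two nodes waits a uniformly random delay between 0 and the splay, and a restart takes 30 seconds (a stand-in for the "tens of seconds" mentioned above). The function names are hypothetical.

```ruby
# Toy model (an assumption, not Chef's scheduler): each node waits a uniformly
# random delay in [0, splay] seconds and is then down for `downtime` seconds.
# Two restarts overlap when the two delays differ by less than `downtime`.
def overlap_probability(splay, downtime)
  return 1.0 if downtime >= splay

  1.0 - (1.0 - downtime.to_f / splay)**2
end

# Monte Carlo check of the same model, seeded so the result is reproducible.
def simulated_overlap(splay, downtime, trials: 100_000, seed: 42)
  rng = Random.new(seed)
  hits = trials.times.count do
    (rng.rand * splay - rng.rand * splay).abs < downtime
  end
  hits.to_f / trials
end

# 30 seconds of downtime is an assumed figure for illustration.
[300, 1800].each do |splay|
  exact = overlap_probability(splay, 30)
  approx = simulated_overlap(splay, 30)
  puts format('splay=%4ds  exact=%.3f  simulated=%.3f', splay, exact, approx)
end
```

Under this model a 300-second splay gives roughly a 19% chance that the two restarts overlap, and an 1800-second splay brings it down to about 3%. Real nodes do not necessarily pick up a dictionary change in the same cycle, so treat these numbers as an upper-bound intuition rather than a measurement.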

1. Bootstrap node

First I launch another CentOS 7 instance from the EC2 console, then initialize it with knife bootstrap and name the new node es-2:

$ cd ~/learn-chef/cookbooks/elasticsearch_ik/
$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
(...omitted)
172.31.21.50     ================================================================================
172.31.21.50     Error executing action `install` on resource 'elasticsearch_plugin[analysis-ik]'
172.31.21.50     ================================================================================
172.31.21.50
172.31.21.50     Mixlib::ShellOut::ShellCommandFailed
172.31.21.50     ------------------------------------
172.31.21.50     Expected process to exit with [0], but received '1'
172.31.21.50     ---- Begin output of ["/usr/share/elasticsearch/bin/elasticsearch-plugin", "install", "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip"] ----
172.31.21.50     STDOUT: Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
172.31.21.50     STDERR: which: no java in (/opt/chef/embedded/bin:/opt/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sfw/bin:/sbin:/bin:/usr/sbin:/usr/bin)

The run failed while installing the plugin, because java is not installed.

So why did chef-client run fine in Chef in Practice part 2, where the recipe did not install java either? Probably because part 2 continued from Chef in Practice part 1 on the same node, so java was already installed.

2. Fixing the package bug

A brand-new machine immediately exposed a flaw in my cookbook. To fix it, I edit recipes/default.rb and, following the setup from part 1, add a package resource at the very top:

package 'java-1.8.0-openjdk'

include_recipe 'elasticsearch'

(...omitted)

Upload it to the Chef server:

$ knife cookbook upload elasticsearch_ik
Uploading elasticsearch_ik [0.2.0]
Uploaded 1 cookbook.

Try bootstrapping the new node again:

$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
Node es-2 exists, overwrite it? (Y/N) Y
Client es-2 exists, overwrite it? (Y/N) Y
(...omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 5/49 resources updated in 48 seconds

It asks two questions at the start; answer Y to both. This time the bootstrap runs to completion.

3. Updating network.host

Check whether curl gets a response:

$ curl 172.31.21.50:9200/_cluster/health
curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

Hmm, it failed. It looks like the elasticsearch service on node es-2 did not come up.

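After some digging, the culprit turned out to be network.host. It is currently hard-coded to node es-1's IP, and node es-2 obviously cannot bind to another machine's IP, so this setting has to change.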

After checking the Elastic documentation, I changed network.host in elasticsearch_configure so that it resolves to the machine's private IP:

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration ({
    'cluster.name' => 'production',
    'network.host' => '_site_',
  })
end
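As I understand the Elastic documentation, the special value _site_ tells Elasticsearch to bind to the machine's site-local (private) addresses instead of a fixed IP. The sketch below is only an illustration of which IPv4 addresses count as private, assuming the RFC 1918 ranges are what _site_ matches; the constant and helper names are hypothetical, and 8.8.8.8 is just an example of a public address.

```ruby
require 'ipaddr'

# RFC 1918 private ranges; this sketch assumes they are what `_site_`
# matches for IPv4.
SITE_LOCAL_RANGES = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16]
                    .map { |cidr| IPAddr.new(cidr) }
                    .freeze

# True when the address falls inside one of the private ranges above.
def site_local?(address)
  ip = IPAddr.new(address)
  SITE_LOCAL_RANGES.any? { |range| range.include?(ip) }
end

%w[172.31.21.50 172.31.21.70 8.8.8.8].each do |address|
  puts "#{address}: #{site_local?(address) ? 'site-local' : 'not site-local'}"
end
```

Both node addresses used in this post fall inside 172.16.0.0/12, which is why the same recipe can be applied to every node without naming any IP.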

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)
172.31.21.50   * elasticsearch_service[elasticsearch] action configure
172.31.21.50     * directory[/var/run/elasticsearch-elasticsearch] action create (up to date)
172.31.21.50     * template[/etc/init.d/elasticsearch] action create (up to date)
172.31.21.50     * directory[/usr/lib/systemd/system-elasticsearch] action create (up to date)
172.31.21.50     * template[/usr/lib/systemd/system/elasticsearch.service] action create (up to date)
172.31.21.50   Recipe: elasticsearch_ik::default
172.31.21.50     * service[elasticsearch] action enable (skipped due to only_if)
172.31.21.50     * service[elasticsearch] action start (skipped due to only_if)
172.31.21.50      (up to date)
(...omitted)
172.31.21.50   * service[elasticsearch] action restart (skipped due to only_if)

Check the result with curl:

$ curl 172.31.21.50:9200/_cluster/health
curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

The service still isn't running. Looking back through the log, the start action was skipped. Strange: could my own service resource be interfering with the elasticsearch_service resource?

First, comment out the service resource I wrote:

#service 'elasticsearch' do
#  action [:restart]
#  only_if { dict.updated_by_last_action? }
#end

After uploading and updating node es-2, the service came up. So my service resource really does interfere with elasticsearch_service, and this bug has to be fixed.

4. Using resource notifications

After googling for a while I found that Chef resources also support notifications. Following the documentation, I modified the remote_file resource:

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end

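This means that whenever this remote_file resource changes state, it notifies the elasticsearch_service[elasticsearch] resource to perform a restart; with the :delayed timer, the restart is queued until the end of the chef-client run. Had I known about this earlier, I would not have used the only_if approach in part 3.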
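The behaviour described above can be mimicked outside Chef with a toy model. This is only a sketch of the semantics as I understand them (a notification is queued only when the notifying resource actually changed, and delayed notifications run once at the end of the run); it is not how Chef is implemented, and ToyRun, toy_chef_run and the "later" resource are hypothetical names.

```ruby
# Toy model of delayed notifications; NOT Chef's real implementation.
class ToyRun
  attr_reader :log

  def initialize
    @log = []
    @delayed = []
  end

  # Run one resource. `changed` says whether it had to modify the system.
  # Only a changed resource queues its notification.
  def converge(name, changed:, notifies: nil)
    @log << "#{name}: #{changed ? 'updated' : 'up to date'}"
    @delayed << notifies if changed && notifies
  end

  # Delayed notifications fire once each, after every resource has run.
  def finish
    @delayed.uniq.each do |action, target|
      @log << "#{target}: #{action} (delayed)"
    end
    @log
  end
end

def toy_chef_run(dict_changed)
  run = ToyRun.new
  run.converge('remote_file[verybuy.dic]',
               changed: dict_changed,
               notifies: [:restart, 'elasticsearch_service[elasticsearch]'])
  # Hypothetical resource that runs after the notifier but before the restart.
  run.converge('file[hypothetical-later-resource]', changed: false)
  run.finish
end

puts toy_chef_run(true)
puts '---'
puts toy_chef_run(false)
```

In the first run the restart line appears last, after the later resource; in the second run nothing changed, so no restart is logged at all.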

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)
172.31.21.50   * remote_file[/etc/elasticsearch/analysis-ik/verybuy.dic] action create
172.31.21.50     - update content in file /etc/elasticsearch/analysis-ik/verybuy.dic from e56924 to eb541d
172.31.21.50     (current file is binary, diff output suppressed)
172.31.21.50     - restore selinux security context
172.31.21.50 Recipe: elasticsearch::default
172.31.21.50   * elasticsearch_service[elasticsearch] action restart
172.31.21.50     * service[elasticsearch] action restart
172.31.21.50       - restart service service[elasticsearch]
172.31.21.50
172.31.21.50
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 3/51 resources updated in 13 seconds

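My dictionary file happened to have changed at this point, so the restart shows up in the log, which confirms that the notification works. Updating node es-2 once more should not trigger another restart: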

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)
172.31.21.50   * service[elasticsearch] action nothing (skipped due (...omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 0/49 resources updated in 11 seconds

Right: the notification was not triggered, so the service was not restarted.

There is one more thing to confirm: if the service is not running, will chef-client start it for me?

Log in to node es-2, run sudo service elasticsearch stop, then run chef-client again:

$ knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)
172.31.21.50     * service[elasticsearch] action start
172.31.21.50       - start service service[elasticsearch]
(...omitted)

Finally, confirm with curl:

$ curl 172.31.21.50:9200/_cluster/health?pretty
{
  "cluster_name" : "production",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Phew, node es-2 is finally sorted out.

5. Setting discovery.zen.ping.unicast.hosts

But now there is another problem: "number_of_nodes" : 1 in the health response means the two nodes have not discovered each other. That is because I have not set discovery.zen.ping.unicast.hosts yet.

Edit elasticsearch_configure in recipes/default.rb:

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration ({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => ['172.31.21.70', '172.31.21.50']
  })
end

Upload and update the nodes. This time I changed the knife ssh query to 'name:es-*' so both nodes are updated in one go:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-*' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(...omitted)

The configuration file changed, so restart the service manually:

$ knife ssh 'name:es-*' 'sudo service elasticsearch restart' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
172.31.21.50 Restarting elasticsearch (via systemctl):     [  OK  ]
172.31.21.70 Restarting elasticsearch (via systemctl):     [  OK  ]

Check the curl output:

$ curl 172.31.21.70:9200/_cluster/health?pretty
{
  "cluster_name" : "production",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Good, they found each other.
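Reading number_of_nodes by eye works for two nodes, but the same check is easy to script. The following is a minimal sketch, not part of the cookbook: cluster_formed? is a hypothetical helper, and the sample document is a trimmed copy of the response shown above.

```ruby
require 'json'

# Returns true when the health document reports the expected node count
# and the cluster status is not red.
def cluster_formed?(health_json, expected_nodes)
  health = JSON.parse(health_json)
  health['number_of_nodes'] == expected_nodes && health['status'] != 'red'
end

# Trimmed copy of the _cluster/health response shown above.
sample = <<~HEALTH
  {
    "cluster_name" : "production",
    "status" : "green",
    "number_of_nodes" : 2,
    "number_of_data_nodes" : 2
  }
HEALTH

puts cluster_formed?(sample, 2)
puts cluster_formed?(sample, 3)
```

To run it against a live cluster, the JSON string could be fetched with Net::HTTP.get(URI('http://172.31.21.70:9200/_cluster/health')) from Ruby's standard library and passed in place of the sample.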

6. Updating discovery.zen.ping.unicast.hosts

This still bothers me a little, because the IPs are hard-coded. It would be better if the list of IPs could be generated dynamically.

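Googling turned up an issue with an answer that looked like exactly what I wanted. When I tried it, though, I found that I had to set this parameter as an array rather than a string, possibly because of a version difference.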

Edit elasticsearch_configure in recipes/default.rb once more:

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq
elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration ({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

Then upload and update the nodes, and that's it. The resulting /etc/elasticsearch/elasticsearch.yml on the nodes looks something like this:

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# THIS FILE IS MANAGED BY CHEF, DO NOT EDIT MANUALLY, YOUR CHANGES WILL BE OVERWRITTEN!
#
# Please see the documentation for further information on configuration options:
# <https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html>
#
---
cluster.name: production
node.name: es-1
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
network.host: _site_
discovery.zen.ping.unicast.hosts:
- 172.31.21.50
- 172.31.21.70
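The search(...).map(&:ipaddress).sort.uniq chain can be tried outside Chef with stand-in objects. In this sketch FakeNode and unicast_hosts are hypothetical names, and appending the current node's own IP is my own addition rather than something the recipe does: as far as I know, a node that is being bootstrapped may not appear in search results until its first run has been saved and indexed by the Chef server.

```ruby
# Stand-in for the node objects that Chef's `search(:node, ...)` returns;
# only the `ipaddress` reader is modelled here.
FakeNode = Struct.new(:name, :ipaddress)

# Same chain as the recipe, plus the current node's own IP (an addition in
# this sketch, not in the original recipe).
def unicast_hosts(found_nodes, own_ip)
  (found_nodes.map(&:ipaddress) + [own_ip]).sort.uniq
end

# Pretend the search only knows about es-1 while es-2 is being bootstrapped.
found = [FakeNode.new('es-1', '172.31.21.70')]
p unicast_hosts(found, '172.31.21.50')
# => ["172.31.21.50", "172.31.21.70"]
```

Sorting keeps the rendered list in a stable order no matter how the search results come back, so the configuration file does not change between runs for no reason, and uniq drops the duplicate once the node shows up in search as well. Note that this is a plain string sort, not a numeric one.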

7. Summary

That wraps things up: the cluster is configured. The elasticsearch_ik cookbook now looks like this:

File recipes/default.rb:

package 'java-1.8.0-openjdk'

include_recipe 'elasticsearch'

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq
elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration ({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

elasticsearch_plugin 'analysis-ik' do
  url 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip'
  action :install
end

template '/etc/elasticsearch/analysis-ik/IKAnalyzer.cfg.xml' do
  source 'IKAnalyzer.cfg.xml.erb'
end

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end

File attributes/default.rb:

default['elasticsearch']['install']['version'] = '5.5.1'

File metadata.rb:

name 'elasticsearch_ik'
maintainer 'The Authors'
maintainer_email 'you@example.com'
license 'All Rights Reserved'
description 'Installs/Configures elasticsearch_ik'
long_description 'Installs/Configures elasticsearch_ik'
version '0.2.0'
chef_version '>= 12.1' if respond_to?(:chef_version)

depends 'elasticsearch'

In this post I learned the following:

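How to configure a cluster with elasticsearch_configure
Resource notifications are quite handy
How to use search to generate an array of node IPs dynamically; it should also be useful later for pulling more node information into the code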

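This post was originally meant to cover setting up automatic updates with a role and the chef-client cookbook, but configuring the cluster alone took a whole day and a whole post. I clearly still have a lot to learn. The next post should finally get to automatic updates.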

Previous: Chef in Practice part 3: Installing IK Analyzer and updating the dictionary

Next: Chef in Practice part 5: Running chef-client automatically on a schedule