Chef in Practice, Part 4: Setting Up an Elasticsearch Cluster

Luyo
Published in verybuy-dev
22 min read · Aug 28, 2017

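In the previous post I finished installing IK and building the mechanism that updates its dictionary, but the dictionary is refreshed only when I run chef-client by hand. I still need the chef-client cookbook and a role to automate that, both of which I covered in Learn Chef Rally Study Notes, Part 7: Debugging and Running chef-client Periodically.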

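The catch is that an updated dictionary takes effect only after the elasticsearch service restarts, and a restart causes several tens of seconds of downtime. I don't want that to happen, at least not during peak hours.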

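I can think of two solutions so far:

  1. Have Chef run chef-client during off-peak hours, so that restarting elasticsearch has less impact
  2. Launch a few more machines, set up a cluster, and use a longer splay so that the nodes are less likely to restart at the same time

Option 1 is simpler, but I will have to deal with clustering sooner or later, so I'll go with option 2.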
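To get a feel for why a longer splay helps, here is a rough back-of-the-envelope sketch in Ruby. It assumes a simplified model that is not taken from Chef itself: two nodes each restart at an independent, uniformly random moment inside the splay window, and each restart lasts a fixed downtime. The function name and the sample numbers are illustrative only.

```ruby
# Simplified model (an assumption, not Chef's scheduler): each of two nodes
# restarts at an independent, uniformly random offset inside the splay window,
# and stays down for `downtime_s` seconds. The restarts overlap when the two
# offsets are less than `downtime_s` apart.
def overlap_probability(downtime_s, splay_s)
  return 1.0 if downtime_s >= splay_s

  1.0 - (1.0 - downtime_s.to_f / splay_s)**2
end

# Illustrative numbers: a 30-second restart and three candidate splay windows.
[60, 300, 1800].each do |splay|
  probability = overlap_probability(30, splay)
  puts format('splay %4ds -> overlap probability %.3f', splay, probability)
end
```

Under this model a longer splay shrinks the chance of both nodes being down together, but never to zero, so it reduces the risk rather than removing it.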

1. Bootstrap node

First, launch a new CentOS 7 instance from the EC2 console, then initialize it with knife bootstrap and name the new node es-2:

$ cd ~/learn-chef/cookbooks/elasticsearch_ik/
$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
(... omitted)
172.31.21.50 ================================================================================
172.31.21.50 Error executing action `install` on resource 'elasticsearch_plugin[analysis-ik]'
172.31.21.50 ================================================================================
172.31.21.50
172.31.21.50 Mixlib::ShellOut::ShellCommandFailed
172.31.21.50 ------------------------------------
172.31.21.50 Expected process to exit with [0], but received '1'
172.31.21.50 ---- Begin output of ["/usr/share/elasticsearch/bin/elasticsearch-plugin", "install", "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip"] ----
172.31.21.50 STDOUT: Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
172.31.21.50 STDERR: which: no java in (/opt/chef/embedded/bin:/opt/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sfw/bin:/sbin:/bin:/usr/sbin:/usr/bin)

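The run failed while installing the plugin because Java is not installed.

So why did chef-client run fine back in Chef in Practice, Part 2, when the recipe had no Java installation step? I suspect it is because Part 2 continued from Chef in Practice, Part 1 on the same node, so Java was already installed.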

2. Fix the package bug

A fresh machine immediately exposed a flaw in my cookbook. To fix recipes/default.rb, I followed the Part 1 setup and added a package resource at the very top:

package 'java-1.8.0-openjdk'
include_recipe 'elasticsearch'
(... omitted)

Upload to the Chef server:

$ knife cookbook upload elasticsearch_ik
Uploading elasticsearch_ik [0.2.0]
Uploaded 1 cookbook.

Try bootstrapping the new node again:

$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
Node es-2 exists, overwrite it? (Y/N) Y
Client es-2 exists, overwrite it? (Y/N) Y
(... omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 5/49 resources updated in 48 seconds

It asks two questions at the start; answer Y to both. This time the bootstrap ran to completion.

3. Update network.host

Check whether curl gets a response:

$ curl 172.31.21.50:9200/_cluster/health
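curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

Hmm, it failed. It looks like the elasticsearch service on node es-2 did not come up.

After some digging, I found the cause was the network.host setting. It is currently hard-coded to node es-1's IP, but node es-2 cannot use another machine's IP as its host, so the setting has to change.

After a look through the Elastic documentation, I changed network.host in elasticsearch_configure to the machine's private network IP: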

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'development',
    'network.host' => '_site_',
  })
end
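As I understand the Elasticsearch documentation, the special value _site_ binds to the machine's site-local (private) addresses, which is why the same recipe can be shared by every node. The plain-Ruby sketch below only illustrates that idea for IPv4; it is not how Elasticsearch resolves the value, and the site_local? helper, the constant, and the sample address list are made up for the example.

```ruby
require 'ipaddr'

# The RFC 1918 private IPv4 ranges, used here as a stand-in for "site-local".
SITE_LOCAL_RANGES = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16]
                    .map { |cidr| IPAddr.new(cidr) }
                    .freeze

# Hypothetical helper: true when the IPv4 address falls in a private range.
def site_local?(address)
  ip = IPAddr.new(address)
  ip.ipv4? && SITE_LOCAL_RANGES.any? { |range| range.include?(ip) }
end

# Illustrative addresses on one machine: loopback, a private IP, a public IP.
addresses = ['127.0.0.1', '172.31.21.50', '54.0.0.10']
puts addresses.select { |address| site_local?(address) }.inspect
```

Only the private address survives the filter, which matches the intent of letting each node pick up its own internal IP instead of a hard-coded one.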

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * elasticsearch_service[elasticsearch] action configure
172.31.21.50 * directory[/var/run/elasticsearch-elasticsearch] action create (up to date)
172.31.21.50 * template[/etc/init.d/elasticsearch] action create (up to date)
172.31.21.50 * directory[/usr/lib/systemd/system-elasticsearch] action create (up to date)
172.31.21.50 * template[/usr/lib/systemd/system/elasticsearch.service] action create (up to date)
172.31.21.50 Recipe: elasticsearch_ik::default
172.31.21.50 * service[elasticsearch] action enable (skipped due to only_if)
172.31.21.50 * service[elasticsearch] action start (skipped due to only_if)
172.31.21.50 (up to date)
(... omitted)
172.31.21.50 * service[elasticsearch] action restart (skipped due to only_if)

Check with curl:

$ curl 172.31.21.50:9200/_cluster/health
curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

The service still isn't up. Looking back through the log, the start action was skipped. Odd. Could my own service resource be interfering with the elasticsearch_service resource?

First, comment out the service resource I wrote:

#service 'elasticsearch' do
# action [:restart]
# only_if { dict.updated_by_last_action? }
#end

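After uploading and updating node es-2, the service came up. So my service resource really does interfere with elasticsearch_service, and this bug has to be fixed.

4. Use resource notifications

After googling for a while, I found that Chef resources also support notifications. Following this document, I modified the remote_file resource: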

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end

This means that when the remote_file resource changes state, it notifies the elasticsearch_service[elasticsearch] resource to restart. Had I known about this earlier, I would not have used the only_if approach in Part 3.
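To make the :delayed behaviour concrete, here is a toy model in plain Ruby. It is not Chef's implementation; the ToyRun class and the simulate helper are hypothetical and only mimic two properties: a notification is queued only when the notifying resource actually changed, and queued actions run once at the end of the run.

```ruby
# Toy model of a Chef run (an illustration, NOT Chef's real implementation).
class ToyRun
  attr_reader :log

  def initialize
    @delayed = [] # queued [action, target] pairs, without duplicates
    @log = []
  end

  # Converge one resource; `changed` says whether it had to modify the system.
  def converge(name, changed:, notifies: nil)
    @log << "#{name}: #{changed ? 'updated' : 'up to date'}"
    return unless changed && notifies

    @delayed << notifies unless @delayed.include?(notifies)
  end

  # End of the run: fire every queued delayed notification once.
  def finish
    @delayed.each { |action, target| @log << "#{target}: #{action}" }
    @delayed.clear
  end
end

# Hypothetical helper: one run in which the dictionary did or did not change.
def simulate(dict_changed)
  run = ToyRun.new
  run.converge('remote_file[verybuy.dic]', changed: dict_changed,
               notifies: [:restart, 'elasticsearch_service[elasticsearch]'])
  run.converge('template[IKAnalyzer.cfg.xml]', changed: false)
  run.finish
  run.log
end

puts simulate(true)
puts '---'
puts simulate(false)
```

In the first run the restart appears as the last log entry, after every other resource has converged; in the second run no restart is logged at all.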

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * remote_file[/etc/elasticsearch/analysis-ik/verybuy.dic] action create
172.31.21.50 - update content in file /etc/elasticsearch/analysis-ik/verybuy.dic from e56924 to eb541d
172.31.21.50 (current file is binary, diff output suppressed)
172.31.21.50 - restore selinux security context
172.31.21.50 Recipe: elasticsearch::default
172.31.21.50 * elasticsearch_service[elasticsearch] action restart
172.31.21.50 * service[elasticsearch] action restart
172.31.21.50 - restart service service[elasticsearch]
172.31.21.50
172.31.21.50
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 3/51 resources updated in 13 seconds

My dictionary file happened to have changed at that moment, so the restart service step shows up, confirming that the notification works. Updating node es-2 once more should not trigger another restart:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * service[elasticsearch] action nothing (skipped due (... omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 0/49 resources updated in 11 seconds

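Right: the notification condition was not met, so the service was not restarted.

I still want to confirm one more thing: if the service is not running, will chef-client start it for me?

Log in to node es-2, run sudo service elasticsearch stop, then run chef-client again: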

$ knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * service[elasticsearch] action start
172.31.21.50 - start service service[elasticsearch]
(... omitted)

Finally, confirm with curl:

$ curl 172.31.21.50:9200/_cluster/health?pretty
{
"cluster_name" : "production",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

Phew, node es-2 is finally sorted out.

5. Set discovery.zen.ping.unicast.hosts

But there is another problem: "number_of_nodes" : 1 in the health response means the two nodes have not discovered each other. That is because I have not set the discovery.zen.ping.unicast.hosts parameter yet.

Modify elasticsearch_configure in recipes/default.rb:

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'development',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => ['172.31.21.70', '172.31.21.50']
  })
end

Upload and update the nodes. This time I changed the knife ssh query to 'name:es-*' so that both nodes are updated in one go:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-*' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)

Since the configuration file changed, restart the service manually:

$ knife ssh 'name:es-*' 'sudo service elasticsearch restart' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
172.31.21.50 Restarting elasticsearch (via systemctl): [ OK ]
172.31.21.70 Restarting elasticsearch (via systemctl): [ OK ]

Check the curl output:

$ curl 172.31.21.70:9200/_cluster/health?pretty
{
"cluster_name" : "production",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

Good, they found each other.
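Reading number_of_nodes by eye works for two machines, but the same check is easy to script. The sketch below is one possible way to do it in Ruby, assuming the health response has already been fetched as a JSON string; the cluster_formed? helper and the trimmed sample payload are illustrative, not part of the cookbook.

```ruby
require 'json'

# Hypothetical helper: true when the health response reports at least
# `expected_nodes` nodes and the cluster status is not red.
def cluster_formed?(health_json, expected_nodes)
  health = JSON.parse(health_json)
  health['number_of_nodes'] >= expected_nodes && health['status'] != 'red'
end

# Trimmed sample modelled on the /_cluster/health responses shown above.
two_nodes = '{"cluster_name":"production","status":"green","number_of_nodes":2}'
one_node  = '{"cluster_name":"production","status":"green","number_of_nodes":1}'

puts cluster_formed?(two_nodes, 2)
puts cluster_formed?(one_node, 2)
```

A green status alone is not enough for this check: the single-node response earlier was also green, so the node count is what shows that discovery worked.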

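6. Update discovery.zen.ping.unicast.hosts

This still bothers me a little, because the IPs are hard-coded. It would be even better if the IP list were generated dynamically.

Some googling turned up this issue, and one of the answers looked like exactly what I wanted. After trying it, though, I found that, possibly because of a version difference, I have to set this parameter as an array rather than a string.

Modify elasticsearch_configure in recipes/default.rb once more: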

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

Then upload and update the nodes and that's it. In the end, /etc/elasticsearch/elasticsearch.yml on the nodes looks something like this:

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# THIS FILE IS MANAGED BY CHEF, DO NOT EDIT MANUALLY, YOUR CHANGES WILL BE OVERWRITTEN!
#
# Please see the documentation for further information on configuration options:
# <https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html>
#
---
cluster.name: production
node.name: es-1
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
network.host: _site_
discovery.zen.ping.unicast.hosts:
- 172.31.21.50
- 172.31.21.70
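The search-driven host list can be tried outside Chef with a small stand-in. In the sketch below, FakeNode and unicast_hosts are hypothetical names, the glob match only imitates the 'name:es-*' query, and the web-1 node is invented to show that non-matching nodes are filtered out. It is a rough illustration of the map, sort, and uniq chain, not of how the Chef server evaluates a search.

```ruby
require 'yaml'

# Stand-in for the node objects a Chef search would return (an assumption).
FakeNode = Struct.new(:name, :ipaddress)

# Mimics search(:node, 'name:es-*').map(&:ipaddress).sort.uniq using a glob.
def unicast_hosts(nodes, pattern)
  nodes.select { |node| File.fnmatch(pattern, node.name) }
       .map(&:ipaddress)
       .sort
       .uniq
end

nodes = [
  FakeNode.new('es-2', '172.31.21.50'),
  FakeNode.new('es-1', '172.31.21.70'),
  FakeNode.new('web-1', '172.31.21.90'), # hypothetical non-Elasticsearch node
  FakeNode.new('es-2', '172.31.21.50')   # duplicate entry, removed by uniq
]

hosts = unicast_hosts(nodes, 'es-*')
puts({ 'discovery.zen.ping.unicast.hosts' => hosts }.to_yaml)
```

Sorting and de-duplicating matters for more than tidiness: if the list came back in a different order on each run, the rendered configuration would differ from the previous one even though nothing about the cluster had changed.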

7. Summary

That wraps things up: the cluster is configured. The elasticsearch_ik cookbook now looks like this:

  • File recipes/default.rb
package 'java-1.8.0-openjdk'

include_recipe 'elasticsearch'

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

elasticsearch_plugin 'analysis-ik' do
  url 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip'
  action :install
end

template '/etc/elasticsearch/analysis-ik/IKAnalyzer.cfg.xml' do
  source 'IKAnalyzer.cfg.xml.erb'
end

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end

  • File attributes/default.rb
default['elasticsearch']['install']['version'] = '5.5.1'

  • File metadata.rb
name 'elasticsearch_ik'
maintainer 'The Authors'
maintainer_email 'you@example.com'
license 'All Rights Reserved'
description 'Installs/Configures elasticsearch_ik'
long_description 'Installs/Configures elasticsearch_ik'
version '0.2.0'
chef_version '>= 12.1' if respond_to?(:chef_version)
depends 'elasticsearch'

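In this post I learned the following:

  1. How to configure a cluster with elasticsearch_configure
  2. Resource notifications are quite handy
  3. How to use search to build an array of node IPs dynamically; I should be able to use it later to pull more node information into my code

I had originally planned for this post to cover automatic updates using a role and the chef-client cookbook, but configuring the cluster alone took a whole day and a whole post. I clearly still have a lot to learn. The next post should finally get to automatic updates.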
