Chef in Practice, Part 4: Setting Up an Elasticsearch Cluster

Luyo
Aug 28, 2017 · 22 min read

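In the previous post I finished installing IK and built the mechanism for updating its dictionary, but the dictionary is only refreshed when chef-client is run by hand. I still need the chef-client cookbook and a role to make the updates automatic, both of which I covered in Learn Chef Rally Study Notes, Part 7: Debugging and Running chef-client Periodically.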

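The catch is that a dictionary update only takes effect after the elasticsearch service is restarted, and a restart causes tens of seconds of downtime. I don't want that to happen, or at least not during peak hours.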

So far I can think of two solutions:

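  1. Have Chef run chef-client during off-peak hours, so that restarting elasticsearch has less impact
  2. Launch a few more machines, set up a cluster, and use a longer splay so that simultaneous restarts become less likely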

Option 1 is simpler, but I will have to deal with clustering sooner or later, so I am going with option 2.
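
To get a feel for how much a longer splay helps, here is a rough back-of-the-envelope sketch. It assumes two nodes that both pick up the dictionary change in the same cycle, each starting its run at an independent, uniformly random offset between 0 and the splay, and a restart that takes about 30 seconds (my stand-in for "tens of seconds"). The function name overlap_probability is made up for this illustration, and real chef-client scheduling is more involved than this.

```ruby
# Chance that two nodes restart within `downtime` seconds of each other,
# assuming each run starts at an independent, uniformly random offset
# in [0, splay] seconds. This is a simplification, not chef-client's
# actual scheduling logic.
def overlap_probability(downtime, splay)
  return 1.0 if splay <= downtime

  # P(|a - b| >= d) for a, b uniform on [0, S] is ((S - d) / S)^2.
  1.0 - (1.0 - downtime.to_f / splay)**2
end

# 30 seconds of downtime is an assumed value for illustration.
[60, 300, 1800].each do |splay|
  chance = overlap_probability(30, splay) * 100
  printf("splay %4ds -> overlap chance %.1f%%\n", splay, chance)
end
```

Under these assumptions the overlap chance falls quickly as the splay grows, but it never reaches zero, so a longer splay lowers the risk rather than removing it.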

1. Bootstrap node

First, launch a new CentOS 7 instance from the EC2 console, then initialize it with knife bootstrap, naming the new node es-2:

$ cd ~/learn-chef/cookbooks/elasticsearch_ik/
$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
(... omitted)
172.31.21.50 ================================================================================
172.31.21.50 Error executing action `install` on resource 'elasticsearch_plugin[analysis-ik]'
172.31.21.50 ================================================================================
172.31.21.50
172.31.21.50 Mixlib::ShellOut::ShellCommandFailed
172.31.21.50 ------------------------------------
172.31.21.50 Expected process to exit with [0], but received '1'
172.31.21.50 ---- Begin output of ["/usr/share/elasticsearch/bin/elasticsearch-plugin", "install", "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip"] ----
172.31.21.50 STDOUT: Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
172.31.21.50 STDERR: which: no java in (/opt/chef/embedded/bin:/opt/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sfw/bin:/sbin:/bin:/usr/sbin:/usr/bin)

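The run failed while installing the plugin, because java was not installed.

So why did chef-client run fine in Chef in Practice, Part 2, when the recipe had no java installation step either? I think it is because Part 2 continued from Chef in Practice, Part 1 on the same node, so java was already installed.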

2. Fixing the package bug

A brand-new machine immediately exposed a flaw in my cookbook. To fix recipes/default.rb, I followed the setup from Part 1 and added a package resource at the very top:

package 'java-1.8.0-openjdk'
include_recipe 'elasticsearch'
(... omitted)

Upload to the Chef server:

$ knife cookbook upload elasticsearch_ik
Uploading elasticsearch_ik [0.2.0]
Uploaded 1 cookbook.

Try bootstrapping the new node again:

$ knife bootstrap 172.31.21.50 --ssh-user centos --sudo --identity-file=~/.ssh/test.pem --node-name es-2 --run-list 'recipe[elasticsearch_ik]'
Node es-2 exists, overwrite it? (Y/N) Y
Client es-2 exists, overwrite it? (Y/N) Y
(... omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 5/49 resources updated in 48 seconds

It asks two questions at the start; answer Y to both. This time the bootstrap ran to completion.

3. Updating network.host

Test whether curl can reach it:

$ curl 172.31.21.50:9200/_cluster/health
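curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

Well, that failed. It looks like the elasticsearch service on node es-2 did not come up.

After some digging, it turned out to be the network.host setting. It is currently hard-coded to node es-1's IP, but node es-2 can't use another machine's IP as its host, so the setting needs adjusting.

After going through the Elastic docs, I changed network.host in elasticsearch_configure so that it resolves to the machine's private IP: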

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_'
  })
end
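
For context, Elasticsearch documents `_site_` as a special value that binds to the machine's site-local addresses. The sketch below is only a way to preview which IPv4 addresses on a box fall into the private ranges, using Ruby's standard socket library. The helper names site_local? and local_site_addresses are made up here, and the check covers IPv4 only, so it is an approximation of what Elasticsearch will actually pick.

```ruby
require 'socket'

# True when `ip` is an IPv4 address in one of the private ranges
# (10/8, 172.16/12, 192.168/16). Hypothetical helper for illustration.
def site_local?(ip)
  Addrinfo.ip(ip).ipv4_private?
end

# Private IPv4 addresses bound on the machine running this script.
def local_site_addresses
  Socket.ip_address_list.select(&:ipv4_private?).map(&:ip_address)
end

puts site_local?('172.31.21.50') # the es-2 address used in this post
puts local_site_addresses.inspect
```

Running this on each node before changing network.host is a quick way to confirm that the address you expect is one that `_site_` could plausibly match.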

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * elasticsearch_service[elasticsearch] action configure
172.31.21.50 * directory[/var/run/elasticsearch-elasticsearch] action create (up to date)
172.31.21.50 * template[/etc/init.d/elasticsearch] action create (up to date)
172.31.21.50 * directory[/usr/lib/systemd/system-elasticsearch] action create (up to date)
172.31.21.50 * template[/usr/lib/systemd/system/elasticsearch.service] action create (up to date)
172.31.21.50 Recipe: elasticsearch_ik::default
172.31.21.50 * service[elasticsearch] action enable (skipped due to only_if)
172.31.21.50 * service[elasticsearch] action start (skipped due to only_if)
172.31.21.50 (up to date)
(... omitted)
172.31.21.50 * service[elasticsearch] action restart (skipped due to only_if)

Check the result with curl:

$ curl 172.31.21.50:9200/_cluster/health
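curl: (7) Failed connect to 172.31.21.50:9200; Connection refused

The service still isn't running. Looking back up the log, the start action was skipped. Strange. Could my own service resource be affecting the elasticsearch_service resource?

First, comment out the service resource I wrote: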

#service 'elasticsearch' do
# action [:restart]
# only_if { dict.updated_by_last_action? }
#end

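After uploading and updating node es-2, the service started normally. So my service resource really does interfere with elasticsearch_service, and this bug has to be fixed.

4. Using resource notifications

After googling for a while, I found that Chef resources also support notifications through notifies. Following this document, I modified the remote_file resource: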

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end

This means that when the state of this remote_file resource changes, it notifies the elasticsearch_service[elasticsearch] resource to perform a restart. Had I known about something this handy, I would not have used the only_if approach back in Part 3.
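
To make the :delayed timing concrete, here is a toy model in plain Ruby. It is not Chef's implementation; the ToyRun class and its methods are invented for this sketch, and the second resource that also notifies the service is hypothetical. It only mimics two behaviours: a notification is queued only when the resource actually changed, and each queued notification fires once at the end of the run.

```ruby
# Toy model of :delayed notifications. ToyRun is a made-up class for
# illustration and does not reflect how Chef is implemented internally.
class ToyRun
  attr_reader :log

  def initialize
    @log = []
    @delayed = []
  end

  # Converge one resource. The block returns true if the resource changed.
  def converge(name, notifies: nil)
    changed = yield
    @log << "#{name}: #{changed ? 'updated' : 'up to date'}"
    # Queue the notification only on change, and only once per target.
    @delayed << notifies if changed && notifies && !@delayed.include?(notifies)
  end

  # End of the run: fire each queued notification exactly once.
  def finish
    @delayed.each { |target| @log << "#{target}: restart" }
    @log
  end
end

run = ToyRun.new
service = 'elasticsearch_service[elasticsearch]'
run.converge('remote_file[verybuy.dic]', notifies: service) { true }
# Hypothetical second notifier, to show that the restart is not repeated.
run.converge('template[IKAnalyzer.cfg.xml]', notifies: service) { true }
puts run.finish
```

Even with two changed resources pointing at the same service, the restart line is logged once, after both resources have converged. That is the property that makes :delayed a good fit for restarts.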

Upload and update node es-2:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * remote_file[/etc/elasticsearch/analysis-ik/verybuy.dic] action create
172.31.21.50 - update content in file /etc/elasticsearch/analysis-ik/verybuy.dic from e56924 to eb541d
172.31.21.50 (current file is binary, diff output suppressed)
172.31.21.50 - restore selinux security context
172.31.21.50 Recipe: elasticsearch::default
172.31.21.50 * elasticsearch_service[elasticsearch] action restart
172.31.21.50 * service[elasticsearch] action restart
172.31.21.50 - restart service service[elasticsearch]
172.31.21.50
172.31.21.50
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 3/51 resources updated in 13 seconds

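My dictionary file happened to have been updated at this point, so the restart service step shows up in the log, which confirms that the notification works. Updating node es-2 once more should not trigger another restart: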

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * service[elasticsearch] action nothing (skipped due (... omitted)
172.31.21.50 Running handlers:
172.31.21.50 Running handlers complete
172.31.21.50 Chef Client finished, 0/49 resources updated in 11 seconds

Right: the notification condition was not met, so the service was not restarted.
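
The reason the second run stays quiet is that the file content did not change. The short hex strings in the earlier log line ("from e56924 to eb541d") appear to be truncated content checksums. Below is a simplified sketch of that idea in plain Ruby. The helper names are made up, and the real remote_file resource does more than this, so treat it as an illustration of the principle only.

```ruby
require 'digest'

# SHA-256 checksum of a string, as a hex digest.
def checksum(content)
  Digest::SHA256.hexdigest(content)
end

# Hypothetical helper: the file is rewritten (and a restart notification
# sent) only when the fetched content differs from what is on disk.
def update_needed?(current_content, fetched_content)
  checksum(current_content) != checksum(fetched_content)
end

current = "alpha\nbeta\n"
puts update_needed?(current, current)             # prints false
puts update_needed?(current, current + "gamma\n") # prints true
```

Because the decision depends only on content, running chef-client repeatedly is safe: the service restarts when the dictionary really changes and not otherwise.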

There is one more thing to confirm: if the service is not running, will chef-client start it for me?

SSH into node es-2, run sudo service elasticsearch stop, then run chef-client once more:

$ knife ssh 'name:es-2' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)
172.31.21.50 * service[elasticsearch] action start
172.31.21.50 - start service service[elasticsearch]
(... omitted)

Finally, confirm with curl:

$ curl 172.31.21.50:9200/_cluster/health?pretty
{
"cluster_name" : "production",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

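Phew, node es-2 is finally sorted out.

5. Setting discovery.zen.ping.unicast.hosts

But there is another problem: the health response shows "number_of_nodes" : 1, which means the two nodes have not discovered each other. That is because I have not set the discovery.zen.ping.unicast.hosts parameter yet.

Modify elasticsearch_configure in recipes/default.rb: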

elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => ['172.31.21.70', '172.31.21.50']
  })
end

Upload and update the nodes. This time I changed the knife ssh query to 'name:es-*' so that both nodes are updated at once:

$ knife cookbook upload elasticsearch_ik; knife ssh 'name:es-*' 'sudo chef-client' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
(... omitted)

Since the config file changed, restart the service manually:

$ knife ssh 'name:es-*' 'sudo service elasticsearch restart' --ssh-user centos --identity-file ~/.ssh/test.pem --attribute ipaddress
172.31.21.50 Restarting elasticsearch (via systemctl): [ OK ]
172.31.21.70 Restarting elasticsearch (via systemctl): [ OK ]

Check the curl output:

$ curl 172.31.21.70:9200/_cluster/health?pretty
{
"cluster_name" : "production",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

Good, they found each other.
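
Reading number_of_nodes by eye works for two machines, but the same check is easy to script. The sketch below parses a _cluster/health response body with Ruby's standard JSON library. The function name cluster_formed? is made up, and the sample string is a trimmed version of the output above rather than a live response.

```ruby
require 'json'

# Returns true when the health document reports the expected node count
# and a green status. `health_json` is the body of GET /_cluster/health.
# Hypothetical helper for illustration.
def cluster_formed?(health_json, expected_nodes)
  health = JSON.parse(health_json)
  health['number_of_nodes'] == expected_nodes && health['status'] == 'green'
end

# Trimmed sample based on the curl output in this post.
sample = '{"cluster_name":"production","status":"green","number_of_nodes":2}'
puts cluster_formed?(sample, 2)
```

In practice the body would come from the same curl call used above, piped into a small script like this one.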

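6. Updating discovery.zen.ping.unicast.hosts

This still bothers me a little, because the IPs are hard-coded. It would be better if the IP list could be generated dynamically.

Googling turned up this issue, where one of the answers looked like exactly what I wanted. After trying it, though, I found that I had to set this parameter as an array rather than a string, possibly because of a version difference.

Modify elasticsearch_configure in recipes/default.rb once more: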

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq
elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

Then upload and update the nodes, and that's it. The /etc/elasticsearch/elasticsearch.yml on the nodes ends up looking something like this:

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# THIS FILE IS MANAGED BY CHEF, DO NOT EDIT MANUALLY, YOUR CHANGES WILL BE OVERWRITTEN!
#
# Please see the documentation for further information on configuration options:
# <https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html>
#
---
cluster.name: production
node.name: es-1
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
network.host: _site_
discovery.zen.ping.unicast.hosts:
- 172.31.21.50
- 172.31.21.70
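
The map, sort, uniq chain in the search line can be tried outside Chef with stand-in objects. In the sketch below, FakeNode and unicast_hosts are invented for illustration, and the optional own_ip argument reflects an assumption of mine: a node that is still on its first run may not show up in search results yet, so adding its own address could be a useful fallback. Note that sort here is a plain string sort, which is enough to keep the list stable between runs.

```ruby
# Stand-in for the node objects returned by Chef's search. Only the
# ipaddress attribute matters here. FakeNode is a made-up name.
FakeNode = Struct.new(:name, :ipaddress)

# Build a stable, de-duplicated host list. `own_ip` is an optional,
# assumed fallback for a node that search does not return yet.
def unicast_hosts(nodes, own_ip = nil)
  ips = nodes.map(&:ipaddress)
  ips << own_ip if own_ip
  ips.compact.sort.uniq
end

found = [
  FakeNode.new('es-1', '172.31.21.70'),
  FakeNode.new('es-2', '172.31.21.50')
]
puts unicast_hosts(found).inspect
puts unicast_hosts(found, '172.31.21.50').inspect
```

Sorting matters because the order of the input can vary between runs; a stable order keeps the rendered elasticsearch.yml from changing when nothing really changed.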

7. Summary

That wraps things up: the cluster is configured. The elasticsearch_ik cookbook now looks like this:

  • File recipes/default.rb
package 'java-1.8.0-openjdk'
include_recipe 'elasticsearch'

elk_nodes = search(:node, 'name:es-*').map(&:ipaddress).sort.uniq
elasticsearch_configure 'elasticsearch' do
  allocated_memory '256m'
  configuration({
    'cluster.name' => 'production',
    'network.host' => '_site_',
    'discovery.zen.ping.unicast.hosts' => elk_nodes
  })
end

elasticsearch_plugin 'analysis-ik' do
  url 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip'
  action :install
end

template '/etc/elasticsearch/analysis-ik/IKAnalyzer.cfg.xml' do
  source 'IKAnalyzer.cfg.xml.erb'
end

remote_file '/etc/elasticsearch/analysis-ik/verybuy.dic' do
  source 'https://api.xxx.ooo/get-my-dict'
  owner node['elasticsearch']['user']['username']
  group node['elasticsearch']['user']['groupname']
  mode '0660'
  action :create
  notifies :restart, 'elasticsearch_service[elasticsearch]', :delayed
end
  • File attributes/default.rb
default['elasticsearch']['install']['version'] = '5.5.1'
  • File metadata.rb
name 'elasticsearch_ik'
maintainer 'The Authors'
maintainer_email 'you@example.com'
license 'All Rights Reserved'
description 'Installs/Configures elasticsearch_ik'
long_description 'Installs/Configures elasticsearch_ik'
version '0.2.0'
chef_version '>= 12.1' if respond_to?(:chef_version)
depends 'elasticsearch'

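In this post I learned the following:

  1. How to configure a cluster with elasticsearch_configure
  2. Resource notifications are quite handy
  3. How to use search to generate an array of node IPs dynamically, which should also be useful for pulling other node information into recipes later

I originally planned for this post to cover setting up automatic updates with a role and the chef-client cookbook, but configuring the cluster alone took a whole day and a whole post. I clearly still have a lot to learn. The next post should finally get to automatic updates.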



Written by

Luyo

Founder, Developer of VeryBuy — https://www.verybuy.cc

verybuy-dev

VeryBuy R&D Notes
