Why Fluentd stopped to send logs to ElasticSearch on Kubernetes (Related To SSL)

Published in

ukinau

17 min readMar 22, 2020

Log Forwarding/Aggregating on Kubernetes

When We use Kubernetes in the production, we need to consider how we collect Container’s logs into one place since when Container is deleted, the logs are also deleted, which make it difficult for us to investigate later when something happened

There are many possible solutions like

Log Data Store: Elastic Search, Splunk…
Log Collecting Agent: Fluentd, Fluentbit, Logstash…

Maybe for Log Data Store, Many Users are using Elastic Search since there are not so many open source text search engine.

But for Log Collecting Agent, there are many options about implementations, that’s why depending on the combinations you may experience different issue.

Mysterious Problem of Fluentd with Elastic Search Plugin Related to SSL

In this post, I would like to share very mysterious issue happened to Elastic Search, Fluentd combination. This problem will be happened to following users

All Users who use fluentd-kubernetes-daemonset to access Elastic Search over SSL
All Users who use fluent-plugin-elasticsearch with excon http_backend (default) to access over SSL
=> Even if you are not using k8s, as long as you use fluent-plugin-elasticsearch with default setting, you may experience the issue “logs are not sent to Elastic Search somehow”

If you are using fluent-plugin-elasticsearch with excon http_backend (default) and use SSL to access Elastic Search, Welcome to this post :) There is very important behavior/issue you must be aware if you don’t want to make fluentd stop sending Logs.

What’s happening

When we are talking about the issue, which is not sending Logs to Elastic Search, There may be different reason, but usually we may be able to know something happened by checking error logs in Fluentd side.

But in this time, we faced one mysterious issue which stopped to send logs but there is no error logs at all, that’s why I started to investigate/dive into this issue, and finally found very interesting behavior.

Environment / Deployment

Let me share the environment hitting issue first.

Elastic Search is deployed on just bunch of VMs and it’s out of our control. We just get API Endpoint.
Fluentd is deployed on Kubernetes with following configuration, and it’s under our control.

$ kubectl get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
fluentd   3        3        3      3           3          <none>          30d$ kubectl get pod
NAME                               READY   STATUS      RESTARTS   AGE
fluentd-2mpq7                      1/1     Running     0          18h
fluentd-47jch                      1/1     Running     0          18h
fluentd-4dfpb                      1/1     Running     0          18h$ kubectl describe ds fluentd
Name:           fluentd
Selector:       app=fluentd
Node-Selector:  <none>
Labels:         app=fluentd
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 3
Number of Nodes Scheduled with Available Pods: 3
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=fluentd
  Containers:
   fluentd:
    Image:      fluent/fluentd-kubernetes-daemonset:v1.9.2-debian-elasticsearch6-1.0
    Environment:
      FLUENTD_CONF:                      fluentd.conf
      NODE_NAME:                          (v1:spec.nodeName)
      MY_POD_NAME:                        (v1:metadata.name)

Here is versions list of software used inside fluentd-kubernetes-daemonset:v1.9.2

Ruby: v2.6.0
Fluentd: v1.9.2
Fluent-Plugin-ElasticSearch: v4.0.4
Faraday: v0.17.3
Excon: v0.73.0
OpenSSL: v1.1.1

Observations from Outside of Fluentd

Actually We faced this situation several times, and at first we just prioritized “recovering” rather than investigating. After we faced more than twice, we decided to investigate seriously, here let me describe what we found/observed by 2 sections “observation when we just did workaround”, “observation when we seriously investigated”

Day1 (Just for fixes)

1.Buffer of fluentd keep growing although there is no error logs

This is first symptom of problem. We noticed our fluentd’s buffer size keep growing, and this indicate somehow fluentd is not succeeded in flushing the logs to Elastic Search.

2. If I restart fluentd, it resume sending logs to Elastic Search

As I previously mentioned, when we face this situation first, We just restart fluentd and confirm fluentd resume sending logs.

$ docker ps | grep fluentd
df7072cbc860        fluent/fluentd-kubernetes-daemonset@sha256:a43ae47f1cadbe4e08f64345d03498a9bd83b4d38a173a97e4fc951c33e988b9              "tini -- /fluentd/..."   18 hours ago        Up 18 hours                             k8s_fluentd_fluentd-0bd6efc0-69ef-11ea-8f36-246e96d036dc_0$ docker restart df7072cbc860
=> Fluentd resume sending logs after restart

Day 2 (More detail investigation)

After 1–2 weeks since we recovered, We again observed same issue. We could recover it by just restarting, but if problem happened very often, we’d better to understand what’s root cause, otherwise we may miss something important. So this time we deeply investigated what’s happening. Let me list up what we observed additionally.

3. TCP Connection seems be established between Fluentd and Elastic Search

First thing we checked is TCP connection, and seems TCP connection has been correctly established.

$ netstat -ntp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.0.218.8:60316        192.168.0.1:9200     ESTABLISHED 19/ruby
tcp        0      0 10.0.218.8:33176        192.168.0.1:9200    ESTABLISHED 19/ruby

4. No Packet is observed over Established TCP Connection

We tried to understand what kind of packet has been emitted by fluentd, but we couldn’t observe any packet against Elastic Search

$ tcpdump -i <containers veth> port 9200
...
 => Nothing packet

5. Process Structure inside fluentd Container

That’s why our interest got moved to behavior of fluentd process and container provided by fluentd-kubernetes-daemonset, so let us check what kind of process is running inside fluentd container.

# Find Container of fluentd
$ docker ps | grep fluentd
df7072cbc860        fluent/fluentd-kubernetes-daemonset@sha256:a43ae47f1cadbe4e08f64345d03498a9bd83b4d38a173a97e4fc951c33e988b9              "tini -- /fluentd/..."   18 hours ago        Up 18 hours                             k8s_fluentd_fluentd-0bd6efc0-69ef-11ea-8f36-246e96d036dc_0# Identify Process ID of fluentd
$ docker inspect df7072cbc860 | grep \"Pid\"
 "Pid": 125153,# Check How many process are running under fluentd container 
$ ps aux | egrep "125153|125164|125185"
root     125153  0.0  0.0   8288   164 ?        Ss    3月19   0:01 tini -- /fluentd/entrypoint.sh
root     125164  0.0  0.0 149148 59364 ?        Sl    3月19   0:23 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluentd.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-4.0.4/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root     125185  0.3  0.1 310060 81340 ?        Sl    3月19   4:18 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluentd.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-4.0.4/lib/fluent/plugin/elasticsearch_simple_sniffer.rb --under-supervisor

There are 3 processes running inside fluentd container

125153: tiny init process which take similar role to init process(Pid: 1)
125164: fluentd main process which spawn actual “input plugin” “output plugin”
125185: fluentd child process which actually evaluate each “input plugin” “output plugin”

We want to confirm the logic for flushing logs to Elastic Search, since at least increasing size of buffer indicating that inputing logs and putting inputed logs into buffer works fine.

if inputing logic, putting buffer logic also had the problem, size of buffer will not be increased. based on this assumption, I started to dive into the flush related logic.

6. Identify flush_thread’s thread id (flush_thread is in charge of getting logs from buffer & sending logs to Elastic Search)

To check current behavior of flush_thread, I wanted to use strace but to do that, need to identify thread id, so tried to check thread id by following way

# Identify each thread name via lsof
# lsof give us all of opened files for each thread as bellow
# <thread_name> <pid> <thread_id> .... <file related info>
$ lsof 2>/dev/null | grep 125185
ruby_thre 125185 125186     root  cwd       DIR              0,129         6    9404149 /home/fluent
ruby_thre 125185 125186     root  rtd       DIR              0,129        39  318783328 /
ruby_thre 125185 125186     root  txt       REG              0,129    167008    1040605 /usr/local/bin/ruby
....# Get list of thread, output format is <thread_name>-<thread id>
$ lsof 2>/dev/null | grep 125185 | awk '{print $1"-"$3}' | sort | uniq
enqueue_t-125199
event_loo-125200
event_loo-125201
event_loo-125202
event_loo-125203
event_loo-125204
event_loo-125205
fluent_lo-125207
flush_thr-125197 => flush_thread thread id => 125197
flush_thr-125198 => flush_thread thread id => 125198
in_promet-125206
ruby_thre-125186
utils.rb:-125472

It turns out we have 2 flush_tread and thread id is 125197 and 125198.
This 2 is came from configuration parameter https://github.com/fluent/fluentd/blob/v1.9.2/lib/fluent/plugin/output.rb#L71, that’s why if you specify 10 in flush_thread_count, you will see 10 flush_thread.

7. All of flush_thread in Fluentd stuck with ppoll system call without TimeOut

Now let’s check what’s happening in flush_thread, We expect this thread periodically perform read system call with FD refer to buffer file, and write system call with socket which is bound for one TCP connection.

But what i observed was different from what i expected.

$ strace -p 125197
strace: Process 125197 attached
ppoll([{fd=37, events=POLLIN}], 1, NULL, NULL, 8
... # just stuck$ strace -p 125198
strace: Process 125198 attached
ppoll([{fd=21, events=POLLIN}], 1, NULL, NULL, 8
... # just stuck

Basically both of threads are just stuck with ppoll, and ppoll system call is performed with Null Timeout, even if I wait for longer than 10 min, nothing changed and just stuck.

This behavior is weird to me, since flush_thread can not do anything while waiting for ppoll finish.

8. Fd which stuck is Socket? Which TCP Connection is mapped?

We got kind of clue to reach to the root cause, so let us check more information by this clue. First question I had was where ppoll wait for IO event from, Is this waiting from File or Socket? If it’s socket, which TCP connection?

$ cd /proc/125198/fd
$ ls -l
lr-x------ 1 root root 64  3月 21 00:30 0 -> pipe:[156855933]
l-wx------ 1 root root 64  3月 21 00:30 1 -> pipe:[156855934]
l-wx------ 1 root root 64  3月 21 00:30 18 -> pipe:[156857253]
lr-x------ 1 root root 64  3月 21 00:30 19 -> anon_inode:inotify
l-wx------ 1 root root 64  3月 21 00:30 2 -> pipe:[156855935]
lrwx------ 1 root root 64  3月 21 00:31 21 -> socket:[160078904]
lrwx------ 1 root root 64  3月 21 00:30 24 -> /var/log/fluentd-k8s.pos
lr-x------ 1 root root 64  3月 21 00:30 26 -> anon_inode:inotify
lr-x------ 1 root root 64  3月 21 00:30 27 -> /var/lib/docker/containers/fce375511fc2ea3f66fb4cb3b643014d298cc44d8bba57355dc27a1bc5fbcc3a/fce375511fc2ea3f66fb4cb3b643014d298cc44d8bba57355dc27a1bc5fbcc3a-json.log
lrwx------ 1 root root 64  3月 21 00:30 3 -> anon_inode:[eventfd]
lrwx------ 1 root root 64  3月 21 00:30 37 -> socket:[156862722]
lr-x------ 1 root root 64  3月 21 00:30 38 -> pipe:[156860653]
l-wx------ 1 root root 64  3月 21 00:30 39 -> pipe:[156860653]
lrwx------ 1 root root 64  3月 21 00:30 4 -> anon_inode:[eventfd]

We can find fd:21 and fd:37 are indicating to socket(socket:[160078904], socket:[156862722]), that’s why we now understood both flush_thread wait for io event from socket, so in this case, which connection is mapping to these?

$ cd /proc/125198/net
$ cat tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: <container ip>:<container port> <elastic search ip>:<elastic search port> 01 00000000:00000000 00:00000000 00000000     0        0 156862722 1 ffffa0578faba6c0 20 4 3 17 16
   1: <container ip>:<container port> <elastic search ip>:<elastic search port> 01 00000000:00000000 00:00000000 00000000     0        0 160078904 1 ffffa0578faba6c0 20 4 3 17 16

You can find mapped tcp connection in /proc/125198/net/tcp by inode like socket:[156862722] => connection number 0, socket:[160078904] => connection number 1.

here actually ip address is expressed in hex, that’s why it’s easy to use simple decode script like https://gist.github.com/jkstill/5095725 .

It turns out both of FD are the socket of TCP connection against Elastic Search, which means, both of flush_threads are waiting IO event from TCP connection against Elastic Search for ever.

Summary of Observation

I checked all of the states related to fluentd process which stuck. There are 3 key points.

Restarting Fluentd solved this problem
There is no packet confirmed with tcpdump although TCP connection has been established
All flush_threads stuck with ppoll system call waiting for IO event from TCP connection against Elastic Search forever (No timeout)

This time unfortunately I couldn’t login/touch to Elastic Search (Let’s suppose we use Managed Service), that’s why all of observations are from Fluentd Host perspective.

Hypothesis

From above observation, I built up following hypothesis

Elastic Search probably close/drop Established TCP connection against our Fluentd Host without sending TCP RST or TCP FIN like Server Rebooting
Fluentd Host have still TCP Connection with Established State since we didn’t get TCP RST or TCP FIN from Elastic Search
Fluentd Host didn’t try to send something after Elastic Search drop the connection, that’s why we have no chance to notice that TCP Connection has somehow gone in remote-side
There may be the code to call ppoll with No Timeout inside Fluent-Plugin-ElasticSearch. The reason we could say it’s in Fluent-Plugin-ElasticSearch is that both threads wait for IO event from connection of ElasticSearch.
flush_thread wait for IO event from already closed/un-exist connection, which cause flush_thread stopped to work.

If this hypothesis is true, root cause of problem is the logic to call ppoll with No Timeout since remote host may take a time to process, may hung during request processing, may reboot during SSL handshake…that’s why we should not call ppoll with No Timeout.

Analyze Observation / Understand Observations

To prove hypothesis and find exact root cause, we need 2 things to clarify additionally.

Who/How ppoll is called (Is this really by Fluent-Plugin-ElasticSearch?)
What If Elastic Search got crush in the middle of SSL Handshake? (Is Fluentd really stuck with ppoll?)

So let’s dive into them.

Who/How ppoll is called

To know Who/How ppoll is called, I explored the implementation and actual behavior to send logs to Elastic Search by strace, systemtap, GDB. I will introduce how I investigated in next post, here let me describe result of investigation.

This is whole picture of how Fluent-Plugin-ElasticSearch send to the logs

This diagram is designed to explain details around logic to send logs Elastic Search by Fluentd, that’s why other parts is not clearly mentioned here.

Let me explain key points starting from Fluent-Plugin-ElasticSearch in above diagram.

Fluent-Plugin-ElasticSearch use Faraday Library as a HTTP Client
Faraday is kind of abstracted HTTP Client, which allow us to use different implementation of HTTP Client (Each implementation is called as a adapter)

Fluent-Plugin-ElasticSearch use Excon as a default adapter for Faraday
https://github.com/uken/fluent-plugin-elasticsearch/blob/v4.0.4/lib/fluent/plugin/out_elasticsearch.rb#L144

143 config_param :include_index_in_url, :bool, :default => false    144 config_param :http_backend, :enum, list: [:excon, :typhoeus], :default => :excon
145 config_param :validate_client_version, :bool, :default => false    146 config_param :prefer_oj_serializer, :bool, :default => false

Excon adapter in Faraday force us to use blocking mode due to known issue of Excon which is related to jruby when we use SSL https://github.com/lostisland/faraday/blob/v0.17.3/lib/faraday/adapter/excon.rb#L22-L24

Excon HTTP client support different transport implementation like Unix Socket, TCP Socket, SSL Socket, they are automatically chosen based on URI schema

Excon SSLSocket is kind of wrapper for SSLSocket provided by Ruby Standard Library which internally rely on openssl shared library

Ruby SSLSocket use openssl library as a protocol implementation

Ruby SSLSocket provide completely different method for non-blocking connect, read, write from blocking methods. SSLSocket is not deciding whether blocking or non-blocking by attribute of SSLSocket.

Ruby SSLSocket set non-blocking flag to TCPSocket for Transport when SSLSocket object is created, which means even for SSLSocket.connect(blocking) method, Underlying TCP Socket is working as a non-blocking mode
That’s why openssl layer’s socket operation like SSL_connect, SSL_read has been performed as a non-blocking operation always

In the case of SSLSocket.connect_nonblocking method in Ruby, SSLSocket will return processing to caller immediately with translated IO Error(:wait_readable, :wait_writable) from openssl’s non-blocking error code like( SSL_ERROR_WANT_WRITE, SSL_ERROR_WANT_READ)
In the case of SSLSocket.connect (blocking) method in Ruby, still openssl return processing to Ruby’s SSLSocket immediately with openssl’s non-blocking error code, but Ruby’s SSLSocket will not return processing to caller, instead it will wait for IO event to achieve blocking behavior
Only after/when openssl’s operation has been completed, SSLSocket will return processing to caller. For example, SSLSocket.connect will return processing to caller only when SSL handshake has been finished or SSL handshake has been recognized as a fail in openssl layer

When SSLSocket wait for IO event, it use rb_io_wait_readable, rb_io_wait_writable in io.c, they eventually call ppoll with Null Timeout always

Fluent-Plugin-ElasticSearch use Faraday and Faraday instantiate Excon::SSLSocket with blocking option, that’s why Excon::SSLSocket will call SSL::SSLSocket’s blocking method(connect, read, write). SSL::SSLSocket.connet would call ppoll with Null Timeout as above described

Now we know who/how ppoll is executed with Null Timeout inside Fluent-Plugin-ElasticSearch, It’s actually called by SSLSocket.read, SSLSocket.write, SSLSocket.connect in Ruby.

Although there is configuration about timeout, read_timeout, write_timeout in Fluent-Plugin-ElasticSearch, Faraday, Excon, they will not be correctly evaluated in the case of “blocking” since from outside of SSL::SSLSocket, it’s out of control how long it will wait as you can see above diagram, but in other case “non-blocking”, these timeout will be correctly evaluated, and that’s implemented in the layer of Excon.

This is possible because Ruby’s SSLSocket always return processing to Excon immediately, that’s why how long we will wait is under Excon control.

What If Elastic Search got crush in the middle of SSL Handshake?

Now that we understand “who/how ppoll is evaluated”, so next question is that is it really trigger “ppoll stuck” when Elastic Search got crush in the middle of SSL Handshake.

To simply emulate this situation, I used simple SSL Client and TCP Server instead of actual Fluentd and Elastic Search.

We just wanted to confirm HTTPS(SSL::SSLClient) will stuck or not, when HTTPS Server got down during SSL Handshake, that’s why this emulation environment must be enough for this purpose

SSLClient: Suppose Fluentd

require 'socket'
require 'openssl'
include OpenSSLsoc = TCPSocket.new('192.168.1.1', 5000)
ssl = SSL::SSLSocket.new(soc)
ssl.connect
print “SSL Connected”
ssl.close
soc.close

TCPServer: Suppose Elastic Search, the reason why we need to use TCPServer is to emulate the situation that SSL Server doesn’t respond to SSL Handshake

require 'socket'server = TCPServer.open("0.0.0.0", 5000)
begin
    while connection = server.accept
        while line = connection.gets
            puts line
        end
    end
rescue Errno::ECONNRESET, Errno::EPIPE => e
    puts e.message
    retry
end

Procedure of Experiment is followings

Run TCP Server on Server2
Run SSL Client on Server1
Check SSL State on Server1
Check1: ssl.connect is blocking
Check2: ppoll is stuck
Check3: TCP session is active
Hard Reboot Server2
Check SSL State on Server1
Expect same result as №3 step

So let’s check result

Run TCP Server on Server2(Fake Elastic Search)

$ ruby tcp-server.rb

2. Run SSL Client on Server1(Fake Fluentd)

$ ruby ssl-client.rb

3. Check SSL State on Server1(Fake Fluentd)

# Check 1. ssl.connect is blocking
$ ruby ssl-client.rb# => it didn't print anything. if ssl.connect is finished, we will be able to see "SSL Connected" message, that's why we could say ssl.connect is blocking# Check 2. ppoll is stuck
$ ps ax | grep ssl-client.rb
106454 pts/2    S+     0:00 ruby ssl-client.rb$ strace -p 106454
strace: Process 106454 attached
ppoll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, NULL, NULL, 8 
# => ppoll is stuck with No Timeout# Check 3. TCP Connection is established 
$ netstat -aln | grep 5000
tcp        0      0 10.1.6.2:31678    192.168.1.1:5000     ESTABLISHED
# => It's established

4. Hard Reboot Server2(Fake Elastic Search)

5. Check SSL State on Server1(Fake Fluentd)

still same as №3 check result.

This experiment proved that Ruby’s SSL::SSLSocket would stuck with ppoll if remote host didn’t reply to SSLHandshake. Even worse if remote host got rebooted or down, The server running SSL::SSLSocket still recognized TCP Connection as Established although it’s gone already, and keep ppoll system call stuck.

Root Cause

By exploring code, and experiment, We finally found the root cause of problem

Fluent-Plugin-ElasticSearch call ppoll with No Timeout via Faraday->Excon->Excon::SSLSocket-> SSL::SSLSocket when we use SSL
If Elastic Search Server got rebooted in the middle of SSL Handshake, Fluent-Plugin-ElasticSearch(SSL::SSLSocket) got stuck with ppoll system call while TCP Connection recognized as a Establish (But obviously Elastic Search Server doesn’t have TCP connection at that time due to reboot)
Due to process hang with ppoll, flush_thread(Fluent-Plugin-ElasticSearch) stopped to send logs to Elastic Search.

Solutions for this problem

So finally we can discuss fundamental approach to fix this.
It’s impossible to make sure Elastic Search will not drop any TCP Connection since there are many variety of case to trigger reboot or TCP packet loss.

That’s why direction of fix is to avoid ppoll system call with No Timeout.
Let me list up possible solutions. before check all of solutions, just note that All of solution imply “Don’t use SSL::SSLSocket’s blocking method in Ruby Standard Library Since It may hang”

Change Ruby: Introduce SSL::SSLSocket Timeout, and pass it into ppoll system call in Ruby Standard Library, and then re-compile ruby
Change Faraday: Use SSL::SSLSocket non-blocking method by modifying Faraday Excon Adapter, but it may have compatibility problem with jRuby mentioned inside comment (faraday/adapter/excon.rb#L22-L23)
Change Fluent-Plugin-ElasticSearch: Customize Faraday Adapter’s Logic to create Excon::SSLSocket with nonblock=true by making use of feature of adapter customization(issues/810#issuecomment-412569974)
Change Excon: Don’t use SSL::SSLSocket(Ruby Standard Library) blocking method(connect, write, read) at all. When nonblock=false is specified, instead of let SSL::SSLSocket block, Excon is better to do ppoll or select to achieve blocking with timeout. Actually this type of modification is required for HTTP as well to respect to timeout configuration since Excon doesn’t respect to timeout in the case of nonblock=false, because it just use blocking method of underlying socket provided by Ruby Standard Library. although latter part is kind of different problem, but better to modify in that way.
Change Config of Fluent-Plugin-ElasticSearch: Use typhoeus as an adapter of Faraday by specifying it in config of Fluent-Plugin-ElasticSearch instead of Excon. Since typhoeus is not using SSL::SSLSocket (Ruby Standard Library) and use libcurl which implement SSL inside it, this problem is not happened

Solution No1 is very expensive, since we need to re-compile ruby and need to use. No4 is not that expensive but it will not be few line of code, and kind of big change that’s why it will still take a time to merge it into upstream master.

Comparing with them. No2, No3, No5 is very easy to change No2, No3 is just few line of change, and even for No5, that’s just config change. but No5, we may face new problem which is not happened to Excon backend.

So I’m thinking to overcome this problem with No3 solution, and make a very simple Pull Request at the moment, but for No1, No4 I will prepare solution when I have a time, but at least I’m thinking to participate in discussion.

Here is my PR of No3 solution, If you are interested in.

3. Change Fluent-Plugin-ElasticSearch: https://github.com/uken/fluent-plugin-elasticsearch/pull/733

Although this post is much big, Thanks for reading all.
I hope this post will help you somehow.

Appendixes

Related Issues

After I understood the whole picture and root cause, I also got idea of keyword to search related issue. I was interested in how others are responding to this problem, and what is the state of community response for this (known issue? fix in progress?…)

Ruby

https://redmine.ruby-lang.org/issues/15729

Excon

https://github.com/excon/excon/issues/682
https://github.com/excon/excon/issues/364
Timeout is ignored when nonblock=false: https://github.com/excon/excon/issues/682

Faraday

Fluent-Plugin-ElasticSearch

Interesting result to me, problem of SSL::SSLSocket blocking method invocation in Ruby Standard Library affect many different part of software, like Excon, Faraday, Fluent-Plugin-ElasticSearch…

May be for opened issue, I can leave a comment about what I found here.

Reference

Although I didn’t explain “How to debug/clarify big picture of the logic to send logs to Elastic Search” in this post, I was very helped by many different references to proceed troubleshooting. So let me list up all materials I refer.

Thanks for all of smart developers

Decode /proc/XXX/net/tcp: https://gist.github.com/jkstill/5095725
SSL Programming with OpenSSL: http://h30266.www3.hpe.com/odl/axpos/opsys/vmsos84/BA554_90007/ch04s03.html
SSL_connect in OpenSSL oficial doc: https://www.openssl.org/docs/man1.0.2/man3/SSL_connect.html
Japanese: SSL Programming with OpenSSL: https://qiita.com/yoshida-jk/items/fc5f8357adcbcbf6044a
Japanese: SSL Handshake Programming of SSL_connect: http://home.att.ne.jp/theta/diatom/SSL_connect%283%29.html
Japanese: Installing Ruby: https://www.ruby-lang.org/en/documentation/installation/
Ruby C-extension Tutorial: https://docs.ruby-lang.org/en/2.6.0/extension_rdoc.html
Japanese: Ruby C-Extension Example(Exception): http://blog.livedoor.jp/sonots/archives/42168818.html
Ruby SSLSocket Official Doc: https://docs.ruby-lang.org/ja/2.6.0/class/OpenSSL=3a=3aSSL=3a=3aSSLSocket.html
Ruby Socket Official Doc: https://docs.ruby-lang.org/en/2.6.0/Socket.html
Non-block Socket Sample: https://jameshfisher.com/2017/04/05/set_socket_nonblocking/
Japanese: Socket Timeout Configuration SO_SNDTIMEO, SO_RECVTIMEO:https://sleepy-yoshi.hatenablog.com/entry/20111113/p1
man ip(7): http://man7.org/linux/man-pages/man7/ip.7.html
man ppoll(2): https://linux.die.net/man/2/ppoll
Japanese: GDB Command Cheat Sheet: https://watson.hatenadiary.org/entry/20100318/1268887029
GDB Threads: https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html
Japanase: System Tap:
https://qiita.com/hana_shin/items/9b265b4f9a51f98d0f4d
https://qiita.com/hana_shin/items/6bca693206bf5f887cb3
https://qiita.com/hana_shin/items/0f82ea9d8bc8ed5fd70b
Japanese: Install New Shared Object(ld, ldd, ldconfig) https://qiita.com/Esfahan/items/0064d845ca6faf7f3d47
Japanese: How to install OpenSSL 1.1.1 https://qiita.com/Esfahan/items/4e2002f4a24589b2d0bf
Japanese: How to check OpenSSL version of libssl.so https://masutaka.net/chalow/2014-04-08-1.html
SSL Handshake Procedure: https://qiita.com/n-i-e/items/41673fd16d7bd1189a29
OpenSSL Related: https://stackoverflow.com/questions/3952104/how-to-handle-openssl-ssl-error-want-read-want-write-on-non-blocking-sockets
SSL_connect doc: https://www.openssl.org/docs/man1.0.2/man3/SSL_connect.html