<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Stef Nestor on Medium]]></title>
        <description><![CDATA[Stories by Stef Nestor on Medium]]></description>
        <link>https://medium.com/@stefnestor?source=rss-3884b0aa8da5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*HM0ZrePu-eloekDjdZwMNw.png</url>
            <title>Stories by Stef Nestor on Medium</title>
            <link>https://medium.com/@stefnestor?source=rss-3884b0aa8da5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 30 May 2026 16:47:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@stefnestor/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Elastic Node Hot Threads]]></title>
            <link>https://medium.com/@stefnestor/elastic-node-hot-threads-370111fab4e7?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/370111fab4e7</guid>
            <category><![CDATA[tasks]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <category><![CDATA[opensearch]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Wed, 11 Jun 2025 19:16:04 GMT</pubDate>
            <atom:updated>2025-06-12T16:50:35.070Z</atom:updated>
            <content:encoded><![CDATA[<p>Troubleshooting with and interpreting Elasticsearch <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads">Node Hot Threads</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ixsjv-9qMz51V-Zaz8vmJQ.png" /><figcaption>Node Hot Threads API response</figcaption></figure><p>I previously covered investigating <a href="https://medium.com/@stefnestor/use-method-for-elasticsearch-d976802d8ba6">Elasticsearch via SRE’s USE Method</a>. From there, I fleshed out Elastic’s <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/fix-common-cluster-issues">Common Issues</a>, which systematically covers the top resource concerns, in sequence, for administering Elastic: <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/fix-watermark-errors">disk watermark</a> (not covered here), then <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage">CPU usage</a>, then <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/high-jvm-memory-pressure">JVM heap</a>, and then <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/task-queue-backlog">task throughput</a>. <em>TL;DR of prior</em>: hardware CPU+JVM resource usage influences software task throughput (and vice-versa), but these API outputs can only be correlated, not associated one-to-one.</p><p>Today we’ll dive into interpreting <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads">Node Hot Threads</a> (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.18/cluster-nodes-hot-threads.html">old link</a>) as part of investigating any of these three common issues.</p><p>Elasticsearch runs open source Lucene, and both run on a <a href="https://www.elastic.co/docs/reference/elasticsearch/jvm-settings">JDK</a>. The JDK’s JVM allows for polling Java (not direct CPU) threads and taking heap dumps. 
(<em>Side</em>: See <a href="https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html?source=post_page-----ab95b6a68653---------------------------------------">Oracle’s GC explanation</a> for why we check CPU/threads first and heap after.) These outputs frequently point to high-volume/expensive code paths as referenced by <a href="https://www.elastic.co/docs/deploy-manage/monitor/logging-configuration/update-elasticsearch-logging-levels">Elasticsearch loggers</a> (<a href="https://gist.github.com/stefnestor/c508c23f305258723d49b915d684456d">example list</a>, usually prefixed with org.elasticsearch).</p><h3>API Response</h3><p>The Elasticsearch service can run multiple Java threads per single hardware CPU thread within an Elastic-defined <a href="https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/thread-pool-settings">thread pool</a> or the default transport_worker pool (see previous <a href="https://medium.com/@stefnestor/elasticsearch-tasks-a77f6b0cb558">Elasticsearch Tasks</a>). To enable responsiveness on even the most struggling nodes, Elastic returns a pretty unsophisticated response of just an unsorted snapshot list of stacktraces noting repeat counts. 
So the API response example and template look like …</p><pre># GET _nodes/hot_threads<br><br># example<br>::: {instance-0000000001}{9fVI1XoXQJCgHwsOPlVEig}{RrJGwEaESRmNs75Gjs1SOg}{instance-0000000001}{10.42.9.84}{10.42.9.84:19058}{himrst}{8.18.2}{7000099-8525000}{region=unknown-region, server_name=instance-0000000001.b84ab96b481f43d791a1a73477a10d40, xpack.installed=true, transform.config_version=10.0.0, ml.config_version=12.0.0, data=hot, logical_availability_zone=zone-1, availability_zone=us-central1-a, instance_configuration=gcp.es.datahot.n2.68x10x45}<br>   Hot threads at 2025-05-14T17:59:30.199Z, interval=500ms, busiestThreads=10000, ignoreIdleThreads=true:<br>   <br>   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread &#39;[write]&#39;<br>     8/10 snapshots sharing following 29 elements<br>       com.fasterxml.jackson.dataformat.smile@2.17.2/com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:434)<br>       org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doAdd(LocalBulk.java:69)<br>       # ... <br>     2/10 snapshots sharing following 37 elements<br>       app/org.elasticsearch.xcontent@8.16.1/org.elasticsearch.xcontent.support.filtering.FilterPath$FilterPathBuilder.insertNode(FilterPath.java:172)<br>       # ... <br><br><br># template<br>::: {NAME}{ID}{UNK}{HOST_NAME}{ADDRESS}{UNK}{ROLES}{VERSION}{UNK}{ATTRIBUTES}<br>   Hot threads at TIMESTAMP, interval=INTERVAL_FROM_API, busiestThreads=THREADS_FROM_API, ignoreIdleThreads=IDLE_FROM_API:<br>   <br>   TOTAL_CPU% [cpu=ELASTIC_CPU%, other=OTHER_CPU%] (Xms out of INTERVAL_FROM_API) cpu usage by thread &#39;THREAD&#39;<br>     X/SNAPSHOTS_FROM_API snapshots sharing following X elements<br>       STACKTRACE_SAMPLE<br>       # ... 
<br>     X/SNAPSHOTS_FROM_API snapshots sharing following X elements<br>       STACKTRACE_SAMPLE<br>       # ...</pre><p>… where most of this output reports the API’s inputs/defaults and then literal stacktrace samples, which we’ll ignore here. What we care to note is:</p><ul><li>the first row, starting with :::, reports the node’s name and roles</li><li>the first thread reports the thread info as write thread pool related</li><li>CPU shows up three times (unofficial jargon) as total CPU, Elasticsearch-used CPU, and “other” CPU (for disk/network IO and/or GC)</li><li>the logger ended up being org.elasticsearch.xpack.monitoring.exporter.local so someone’s doing <a href="https://www.elastic.co/docs/deploy-manage/monitor/stack-monitoring/es-legacy-collection-methods">legacy local monitoring</a></li></ul><p>Keep in mind that within the requested time frame a Java thread may have responded to multiple tasks (<a href="https://github.com/elastic/elasticsearch/issues/74580#issuecomment-915155933">reference</a>), so this output cannot report a direct task ID (<a href="https://github.com/elastic/elasticsearch/issues/127847">reference</a>), but it is still helpful for generally knowing where you spend the majority of your CPU time.</p><h3>While Troubleshooting</h3><p>Some commentary on integrating this output while troubleshooting.</p><h4><strong>CPU</strong></h4><p>If Elastic-used CPU remains above 95% (<a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/high-cpu-usage">high CPU usage</a>), you expect to see correlating threads (even if they rotate), as CPU can’t be high without active code being run. 
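</p><p>To check for those correlating threads across polls, the per-thread header rows can be pulled out of the raw response text. The following is a minimal Python sketch, not an official parser: the regular expression is inferred from the sample response earlier in this post, the function name parse_thread_headers is hypothetical, and any response lacking the [cpu=…, other=…] breakdown would simply not match.</p>

<pre>
```python
import re

# Matches per-thread header rows such as:
#   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
# Groups: total %, cpu %, other %, thread name. The pattern is an assumption
# inferred from the sample output in this post, not an official grammar.
HEADER = re.compile(
    r"^\s*([\d.]+)% \[cpu=([\d.]+)%, other=([\d.]+)%\] "
    r"\([\d.]+\w+ out of [\d.]+\w+\) cpu usage by thread '(.*)'\s*$"
)


def parse_thread_headers(text):
    """Return one dict per thread header row found in raw hot threads text."""
    rows = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            total, cpu, other, thread = match.groups()
            rows.append({
                "thread": thread,
                "total": float(total),
                "cpu": float(cpu),
                "other": float(other),
            })
    return rows


# Two header rows copied from the samples in this post, plus non-header rows
# that the parser is expected to skip.
sample = """
   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
     8/10 snapshots sharing following 29 elements
   100.0% [cpu=0.6%, other=99.4%] (500ms out of 500ms) cpu usage by thread 'oneagentautosensor'
     unique snapshot
"""

rows = parse_thread_headers(sample)
for row in rows:
    print(row)
```
</pre><p>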
I find it’s either an expensive task (so it shows up on the first poll, see later example) or a spattering of high-intensity but fast tasks (so you may have to poll a couple of times in quick succession to catch them).</p><h4><strong>CPU-to-GC</strong></h4><p>As elevated CPU may trigger GC (if only via backup of the JVM hitting 95%), you’ll expect “other” to be elevated along with Elasticsearch logs reporting multiple garbage collection cycles. Note that for descending threads-per-node, the time in “other” will usually read higher. <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/high-jvm-memory-pressure">Elevated JVM heap percent</a> (<a href="https://github.com/elastic/elasticsearch/pull/112715">ignore ram percent</a>) won’t fall until GC successfully reaps the heap.</p><p>If CPU usage is consistently low from both expensive and high-intensity fast tasks but <a href="https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory#time-to-scale">JVM Memory Pressure</a> remains high, this is when you’d suspect an unreaped task (for example <a href="https://github.com/elastic/elasticsearch/issues/106543">≤v8.13 searches not always being reaped</a>) or a potential memory leak. In my experience, I’m usually wrong that CPU symptoms don’t show, but in the edge case where they really don’t show, this is when I pull+analyze a JVM heap dump.</p><h4><strong>CPU-not-GC</strong></h4><p>If Elasticsearch CPU is low but “other” and total CPU are high, then you’ll be looking for a disk/network IO issue. 
In my experience, this usually shows outside of Elasticsearch as the entire host struggle-bussing with an obvious disk/network issue.</p><p>The only main exception to obvious disk/network IO that I’m aware of is <a href="https://docs.dynatrace.com/docs/discover-dynatrace/platform/oneagent">Dynatrace’s</a> oneagentautosensor, which can eat up all available CPU (usually during performance issues, as it doesn’t always back off polling, which is ironic) and whose thread samples look like …</p><pre>   100.0% [cpu=0.6%, other=99.4%] (500ms out of 500ms) cpu usage by thread &#39;oneagentautosensor&#39;<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>     unique snapshot<br>   <br>   100.0% [cpu=0.1%, other=99.9%] (500ms out of 500ms) cpu usage by thread &#39;oneagentsubpathsender REDACTED&#39;<br>     # same unique snapshot repeat before<br>   <br>   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread &#39;oneagentperiodicrequests&#39;<br>     # same unique snapshot repeat before<br>   <br>   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread &#39;oneagentallocationprofiling&#39;<br>     # same unique snapshot repeat before</pre><p>… at which point your only answer is to disable Dynatrace monitoring until you get the Elasticsearch node stable.</p><h4>Logger-to-Tasks</h4><p>For high CPU tasks, sometimes it’s helpful to compare these against the <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/task-queue-backlog#diagnose-task-queue-long-running-node-tasks">Long-running Node Tasks</a>. 
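</p><p>As one rough way to make that comparison, assuming a detailed tasks API response (GET _tasks?detailed=true) has been saved as JSON, task descriptions can be filtered for a keyword taken from the hot logger (composite is used here purely as an illustration). This is a sketch only: find_tasks and the trimmed example response are hypothetical, and the fields it reads (nodes, name, tasks, description, running_time_in_nanos) are the ones I’d expect from that API, so verify them against your own output.</p>

<pre>
```python
def find_tasks(tasks_response, keyword):
    """Return (node name, task id, running nanos, description) for matching tasks."""
    hits = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            description = task.get("description", "")
            if keyword in description:
                hits.append((
                    node.get("name"),
                    task_id,
                    task.get("running_time_in_nanos", 0),
                    description,
                ))
    # Longest-running first, since those are the likeliest expensive tasks.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical, trimmed response; real output carries more fields per task.
example = {
    "nodes": {
        "9fVI1XoXQJCgHwsOPlVEig": {
            "name": "instance-0000000001",
            "tasks": {
                "9fVI1XoXQJCgHwsOPlVEig:101": {
                    "action": "indices:data/read/search",
                    "description": "indices[logs-*], source[{\"aggs\":{\"by_host\":{\"composite\":{}}}}]",
                    "running_time_in_nanos": 9000000000,
                },
                "9fVI1XoXQJCgHwsOPlVEig:102": {
                    "action": "indices:data/read/search",
                    "description": "indices[logs-*], source[{\"query\":{\"match_all\":{}}}]",
                    "running_time_in_nanos": 1000000,
                },
            },
        }
    }
}

for hit in find_tasks(example, "composite"):
    print(hit)
```
</pre><p>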
The analysis is correlative and cannot be lined up one-to-one, but it is usually quite good at finding expensive code path usage.</p><p>The most common built-in one in my experience is org.elasticsearch.search.aggregations.bucket.composite, indicating a Composite Aggregation within a Search task, even though <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-composite-aggregation">its documentation</a> has a full performance “you must seriously load test this” disclaimer warning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fVbBi6jb1Z_7Icfz_Azj0Q.png" /></figure><p>If this logger is flagged, then per <a href="https://www.elastic.co/docs/troubleshoot/elasticsearch/task-queue-backlog#diagnose-task-queue-long-running-node-tasks">Long-running Node Tasks</a> you would cross-compare it to current searches …</p><pre>GET _tasks?human=true&amp;detailed=true&amp;actions=indices:data/read/search</pre><p>… to find out whose output description includes composite in its JSON.</p><p>The most common not-built-in one AFAIK is Runtime custom code (from <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/define-runtime-fields-in-search-request">search</a> or <a href="https://www.elastic.co/docs/manage-data/data-store/mapping/map-runtime-field">mapping</a>) flagged via logger org.elasticsearch.search.runtime.</p><h3>Analysis Automations</h3><p>The following is offered as is with no guarantees and no upkeep. It’s been used against v7.10-v9.0. 
This is an extraction of my current Python object <strong>[^A]</strong> that tags common features via a combined substring search across the thread and its stacktraces.</p><p>If you were inclined to put this into a <a href="https://docs.streamlit.io/library/api-reference">Streamlit UI</a> (for ease of filtering), then an example of a frozen tier having future dates inducing heavy searches from Kibana Rules would appear like …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wu7N2ytZ75ty7glPp3TjMw.png" /></figure><p>… where frozen nodes are doing 99–100% CPU for searches (and some aggregation searches) while hot nodes are doing far less.</p><p><strong>[^A] </strong>A wrapping function would say: if <em>every</em> substring in strs is found in either the thread or its stacktrace, then tag the thread as relating to said feature. Feature names are close to official Elastic documentation but are mostly just figured out based on need/frequency.</p><pre>TAG_ANALYSIS = [<br>    {&quot;tag&quot;: &quot;alias&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.indices.alias&quot;]},<br>    {&quot;tag&quot;: &quot;alias&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.aliases&quot;]},<br>    {&quot;tag&quot;: &quot;alias&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.cluster.metadata.Metadata.findAliases&quot;], },<br>    {&quot;tag&quot;: &quot;alias&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.alias&quot;]},<br>    {&quot;tag&quot;: &quot;allocation&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.cluster.routing.allocation.allocator&quot;], },<br>    {&quot;tag&quot;: &quot;allocation&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.cluster.routing.allocation.decider&quot;], },<br>    {&quot;tag&quot;: &quot;allocation.desired&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator&quot; ], },<br>    {&quot;tag&quot;: &quot;analysis&quot;, &quot;strs&quot;: 
[&quot;org.apache.lucene.analysis&quot;]},<br>    {&quot;tag&quot;: &quot;apm&quot;, &quot;strs&quot;: [&quot;elastic-apm-server-reporter&quot;]},<br>    {&quot;tag&quot;: &quot;ccr&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.ccr&quot;]},<br>    {&quot;tag&quot;: &quot;dlm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.datastreams.lifecycle&quot;]},<br>    {&quot;tag&quot;: &quot;dlm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.datastreams.lifecycle&quot;]},<br>    {&quot;tag&quot;: &quot;downsample&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.downsample&quot;]},<br>    {&quot;tag&quot;: &quot;downsample&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.downsample&quot;]},<br>    {&quot;tag&quot;: &quot;downsample&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.downsample&quot;]},<br>    {&quot;tag&quot;: &quot;enrich&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.enrich&quot;]},<br>    {&quot;tag&quot;: &quot;enrich&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.enrich&quot;]},<br>    {&quot;tag&quot;: &quot;evictor&quot;, &quot;strs&quot;: [&quot;Connection evictor&quot;]},<br>    {&quot;tag&quot;: &quot;fields&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.fieldcaps&quot;]},<br>    {&quot;tag&quot;: &quot;fields&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.fielddata&quot;]},<br>    {&quot;tag&quot;: &quot;fields&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.indices.fielddata&quot;]},<br>    {&quot;tag&quot;: &quot;fields&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.fieldcaps&quot;]},<br>    {&quot;tag&quot;: &quot;flush&quot;, &quot;strs&quot;: [&quot;[flush]&quot;]},<br>    {&quot;tag&quot;: &quot;flush&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.indices.flush&quot;]},<br>    {&quot;tag&quot;: &quot;flush&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.engine.Engine.flush&quot;]},<br>    {&quot;tag&quot;: 
&quot;flush&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.flush&quot;]},<br>    {&quot;tag&quot;: &quot;flush&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.shard.IndexShard.flush&quot;]},<br>    {&quot;tag&quot;: &quot;flush&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.indices.flush&quot;]},<br>    {&quot;tag&quot;: &quot;forcemerge&quot;, &quot;strs&quot;: [&quot;[force_merge]&quot;]},<br>    {&quot;tag&quot;: &quot;forcemerge&quot;, &quot;strs&quot;: [&quot;org.apache.lucene.index.IndexWriter.forceMerge&quot;]},<br>    {&quot;tag&quot;: &quot;geoip&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.geoip&quot;]},<br>    {&quot;tag&quot;: &quot;geoip&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.ingest.geoip&quot;]},<br>    {&quot;tag&quot;: &quot;geoip&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.geoip&quot;]},<br>    {&quot;tag&quot;: &quot;get&quot;, &quot;strs&quot;: [&quot;[get]&quot;]},<br>    {&quot;tag&quot;: &quot;get&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.engine.InternalEngine.get&quot;]},<br>    {&quot;tag&quot;: &quot;get&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.get.ShardGetService.get&quot;]},<br>    {&quot;tag&quot;: &quot;grok&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.grok&quot;]},<br>    {&quot;tag&quot;: &quot;ilm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.indexlifecycle&quot;]},<br>    {&quot;tag&quot;: &quot;ilm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.ilm&quot;]},<br>    {&quot;tag&quot;: &quot;ilm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.ilm&quot;]},<br>    {&quot;tag&quot;: &quot;ingest&quot;, &quot;strs&quot;: [&quot;[write]&quot;]},<br>    {&quot;tag&quot;: &quot;ingest.delete&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.engine.InternalEngine.delete&quot;], },<br>    {&quot;tag&quot;: &quot;ingest.delete&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.shard.IndexShard.applyDeleteOperation&quot;], },<br> 
   {&quot;tag&quot;: &quot;ingest.mapping&quot;, &quot;strs&quot;: [&quot;[write]&quot;, &quot;org.elasticsearch.index.mapper&quot;, &quot;parseCreateField&quot;], },<br>    {&quot;tag&quot;: &quot;ingest.mapping&quot;, &quot;strs&quot;: [&quot;[write]&quot;, &quot;org.elasticsearch.index.mapper.ObjectMapper$Builder.build&quot;], },<br>    {&quot;tag&quot;: &quot;keepalive&quot;, &quot;strs&quot;: [&quot;keepAlive&quot;]},<br>    {&quot;tag&quot;: &quot;logdb&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.logsdb&quot;]},<br>    {&quot;tag&quot;: &quot;logdb&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.logsdb&quot;]},<br>    {&quot;tag&quot;: &quot;logging&quot;, &quot;strs&quot;: [&quot;Log4j2&quot;]},<br>    {&quot;tag&quot;: &quot;logging&quot;, &quot;strs&quot;: [&quot;org.apache.logging&quot;]},<br>    {&quot;tag&quot;: &quot;management&quot;, &quot;strs&quot;: [&quot;[management]&quot;]},<br>    {&quot;tag&quot;: &quot;merge&quot;, &quot;strs&quot;: [&quot;Lucene Merge Thread&quot;]},<br>    {&quot;tag&quot;: &quot;merge&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.indices.forcemerge&quot;]},<br>    {&quot;tag&quot;: &quot;merge&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.merge&quot;]},<br>    {&quot;tag&quot;: &quot;ml&quot;, &quot;strs&quot;: [&quot;ml-cpp&quot;]},<br>    {&quot;tag&quot;: &quot;ml&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.ml&quot;]},<br>    {&quot;tag&quot;: &quot;ml&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.ml&quot;]},<br>    {&quot;tag&quot;: &quot;ml&quot;, &quot;strs&quot;: [&quot;x-pack-ml&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;inference_utility&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.inference&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.inference&quot;]},<br>    {&quot;tag&quot;: 
&quot;ml.inference&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.ml.inference&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.inference&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.ml.inference&quot;]},<br>    {&quot;tag&quot;: &quot;ml.inference&quot;, &quot;strs&quot;: [&quot;xpack.inference&quot;]},<br>    {&quot;tag&quot;: &quot;monitoring&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.cluster.stats&quot;]},<br>    {&quot;tag&quot;: &quot;monitoring.cluster&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.cluster.stats&quot;], },<br>    {&quot;tag&quot;: &quot;pending_task&quot;, &quot;strs&quot;: [&quot;clusterApplierService#updateTask&quot;]},<br>    {&quot;tag&quot;: &quot;pending_task&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.cluster.service.ClusterApplierService.applyChanges&quot; ], },<br>    {&quot;tag&quot;: &quot;pending_task&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.ilm.IndexLifecycleTransition.newClusterStateWithLifecycleState&quot; ], },<br>    {&quot;tag&quot;: &quot;pipeline&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.ingest.CompoundProcessor&quot;]},<br>    {&quot;tag&quot;: &quot;pipeline&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.ingest.Pipeline&quot;]},<br>    {&quot;tag&quot;: &quot;pipeline.if&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.ingest.ConditionalProcessor&quot;]},<br>    {&quot;tag&quot;: &quot;pipeline.script&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.ingest.common.ScriptProcessor&quot;], },<br>    {&quot;tag&quot;: &quot;reaper&quot;, &quot;strs&quot;: [&quot;process reaper&quot;]},<br>    {&quot;tag&quot;: &quot;refresh&quot;, &quot;strs&quot;: [&quot;[refresh]&quot;]},<br>    {&quot;tag&quot;: &quot;refresh&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.indices.refresh&quot;]},<br>    
{&quot;tag&quot;: &quot;refresh&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.engine.InternalEngine.refresh&quot;], },<br>    {&quot;tag&quot;: &quot;refresh&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.refresh&quot;]},<br>    {&quot;tag&quot;: &quot;reindex&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.reindex&quot;]},<br>    {&quot;tag&quot;: &quot;rollup&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.rollup&quot;]},<br>    {&quot;tag&quot;: &quot;rollup&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.rollup&quot;]},<br>    {&quot;tag&quot;: &quot;runtime&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.runtimefields&quot;]},<br>    {&quot;tag&quot;: &quot;script&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.script&quot;]},<br>    {&quot;tag&quot;: &quot;script.mustache&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.script.mustache&quot;]},<br>    {&quot;tag&quot;: &quot;script.painless&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.painless.PainlessScript&quot;]},<br>    {&quot;tag&quot;: &quot;script.painless.date&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.script.DateFieldScript&quot;], },<br>    {&quot;tag&quot;: &quot;script.regex&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.common.regex&quot;]},<br>    {&quot;tag&quot;: &quot;scroll&quot;, &quot;strs&quot;: [&quot;.scroll&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.search.&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.common.lucene.search&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.query&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.search.&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.query.&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, 
&quot;strs&quot;: [&quot;org.elasticsearch.search&quot;], },<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.search.&quot;]},<br>    {&quot;tag&quot;: &quot;search&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.search.&quot;]},<br>    {&quot;tag&quot;: &quot;search.agg&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.aggregations&quot;]},<br>    {&quot;tag&quot;: &quot;search.agg&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.aggregations&quot;]},<br>    {&quot;tag&quot;: &quot;search.agg.composite&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.aggregations.bucket.composite&quot;], },  # https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html<br>    {&quot;tag&quot;: &quot;search.agg.nested&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.aggregations.bucket.nested&quot;], },<br>    {&quot;tag&quot;: &quot;search.agg.topHits&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.aggregations.metrics.TopHitsAggregator&quot;], },  # https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html<br>    {&quot;tag&quot;: &quot;search.eql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.eql&quot;]},<br>    {&quot;tag&quot;: &quot;search.eql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.eql&quot;]},<br>    {&quot;tag&quot;: &quot;search.esql&quot;, &quot;strs&quot;: [&quot;[esql_worker]&quot;]},<br>    {&quot;tag&quot;: &quot;search.esql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.esql&quot;]},<br>    {&quot;tag&quot;: &quot;search.esql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.esql&quot;]},<br>    {&quot;tag&quot;: &quot;search.globalOrdinals&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build&quot;, &quot;org.elasticsearch.search&quot;, ], },<br>    
{&quot;tag&quot;: &quot;search.kql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.kql&quot;]},<br>    {&quot;tag&quot;: &quot;search.kql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.kql&quot;]},<br>    {&quot;tag&quot;: &quot;search.mustache&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.script.mustache&quot;, &quot;org.elasticsearch.search&quot;], },<br>    {&quot;tag&quot;: &quot;search.prefilter&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.search.CanMatchPreFilterSearchPhase&quot;], },<br>    {&quot;tag&quot;: &quot;search.runtime&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.runtime&quot;, &quot;org.elasticsearch.search&quot;], },<br>    {&quot;tag&quot;: &quot;search.script&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.search.aggregations.pipeline.BucketScriptPipelineAggregationBuilder&quot; ], },<br>    {&quot;tag&quot;: &quot;searchable&quot;, &quot;strs&quot;: [&quot;[searchable_snapshots_cache_fetch_async]&quot;]},<br>    {&quot;tag&quot;: &quot;searchable&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.searchablesnapshots&quot;]},<br>    {&quot;tag&quot;: &quot;searchable&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.searchablesnapshots&quot;]},<br>    {&quot;tag&quot;: &quot;searchable.prewarm&quot;, &quot;strs&quot;: [&quot;[searchable_snapshots_cache_prewarming]&quot;]},<br>    {&quot;tag&quot;: &quot;shrink&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.indices.shrink&quot;]},<br>    {&quot;tag&quot;: &quot;slm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.slm&quot;]},<br>    {&quot;tag&quot;: &quot;slm&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.slm&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;[snapshot]&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;com.amazonaws&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: 
[&quot;org.elasticsearch.action.admin.cluster.repositories&quot;], },<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.action.admin.cluster.snapshots&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.common.blobstore&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.snapshots&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.plugin.repository&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.repositories.blobstore&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.repository.azure&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.snapshots&quot;]},<br>    {&quot;tag&quot;: &quot;snapshot&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.repositories&quot;]},<br>    {&quot;tag&quot;: &quot;sql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.sql&quot;]},<br>    {&quot;tag&quot;: &quot;sql&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.sql&quot;]},<br>    {&quot;tag&quot;: &quot;transform&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.transform&quot;]},<br>    {&quot;tag&quot;: &quot;transform&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.transform&quot;]},<br>    {&quot;tag&quot;: &quot;transform&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.transform&quot;]},<br>    {&quot;tag&quot;: &quot;translog&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.translog&quot;]},<br>    {&quot;tag&quot;: &quot;transport&quot;, &quot;strs&quot;: [&quot;transport_worker&quot;]},<br>    {&quot;tag&quot;: &quot;vectors&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.index.engine.Engine.getSparseVectorValueCount&quot;]},<br>    {&quot;tag&quot;: &quot;vectors&quot;, &quot;strs&quot;: 
[&quot;org.elasticsearch.index.engine.Engine.sparseVectorStats&quot;]},<br>    {&quot;tag&quot;: &quot;watcher&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.core.watcher&quot;]},<br>    {&quot;tag&quot;: &quot;watcher&quot;, &quot;strs&quot;: [&quot;org.elasticsearch.xpack.watcher&quot;]}<br>]</pre><p><em>Disclaimer: </em>My understanding is my own and my views do not reflect Elastic’s; while the core information has been verified with Elastic Dev, I recommend always referring to official sources. I am working on integrating the above into official documentation and welcome feedback.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Streamlit + Local LLM + PDFs]]></title>
            <link>https://medium.com/@stefnestor/streamlit-local-llm-pdfs-0a7243883a12?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/0a7243883a12</guid>
            <category><![CDATA[ollama]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[streamlit]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Mon, 22 Apr 2024 21:07:38 GMT</pubDate>
            <atom:updated>2024-04-22T21:07:38.549Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Building off </em><a href="https://medium.com/@stefnestor/python-streamlit-local-llm-2aaa75961d03"><em>earlier outline</em></a><em>, this TLDR’s loading PDFs into your (Python) Streamlit with local LLM (Ollama) setup. Another Github-Gist-like post with limited commentary.</em></p><p>This plays forward this <a href="https://www.analyticsvidhya.com/blog/2023/10/a-step-by-step-guide-to-pdf-chatbots-with-langchain-and-ollama/">Google-result</a> and <a href="https://github.com/srang992/Ollama-Chatbot">its code</a>, found when searching “local llm pdfs”. My use case is to load <em>all</em> Apple iCloud iBooks into an “oracle”-GPT for private discussions. A side curiosity is to have two GPTs each respond as their author would (potentially across that author’s multiple books). The first building block, covered here, is loading PDFs into a local LLM and confirming its PDF-trained results are more desirable (aka. spot-checked accurate) than the generic model’s. (“PDF-trained” is shorthand here: the model itself is not retrained; the code below retrieves relevant PDF chunks from a vector store and passes them to the LLM as context.)</p><h3>Results</h3><p>Personal test caveats</p><ul><li>I’ll only load a single, random PDF from my iBook storage: <em>Reinventing Your Life</em> by Jeffrey E. Young &amp; Janet S. Klosko. On Apple Macs, these iCloud PDFs are stored under ~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents . 
My test runs from ~/Downloads; while I could easily reference the PDF from the iBooks folder instead of my test folder, that’s step two.</li><li>I know llama3 <a href="https://ollama.com/library/llama3">came out last week</a>, but so far it hasn’t shown sufficient improvement for me to move off llama2-uncensored and accept the response censoring.</li></ul><p>I compared the generic LLM (🦙) against the PDF-trained LLM (📓) across various questions, e.g.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L7ZHpxUPh7zThbGZu9GxXA.png" /></figure><p>This image shows the generic LLM hallucinating but the PDF-trained LLM correctly identifying the book’s authors. 👏</p><h3>Code</h3><p>The following comes with no expectations/warranties, but it “works on my machine” (though as a proof-of-concept, its code is ugly, I agree).</p><pre>from langchain import PromptTemplate<br>from langchain.chains import RetrievalQA<br>from langchain.document_loaders import PyMuPDFLoader<br>from langchain.embeddings import HuggingFaceEmbeddings<br>from langchain.llms import Ollama<br>from langchain.text_splitter import RecursiveCharacterTextSplitter<br>from langchain.vectorstores import FAISS<br>import streamlit as st<br><br>llm = Ollama(model=&quot;llama2-uncensored&quot;)<br><br>@st.cache_resource<br>class PdfGpt():<br>    def __init__(self, file_path):<br>        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)<br>        chunks = text_splitter.split_documents(documents=PyMuPDFLoader(file_path=file_path).load())<br>        <br>        embedding_model = HuggingFaceEmbeddings(<br>            model_name=&quot;all-MiniLM-L6-v2&quot;,<br>            model_kwargs={&#39;device&#39;:&#39;cpu&#39;},<br>            encode_kwargs = { &#39;normalize_embeddings&#39;: True }<br>        )<br>        vectorstore = FAISS.from_documents(chunks, embedding_model)<br>        vectorstore.save_local(&quot;vectorstore&quot;)<br>        <br>        
template = &quot;&quot;&quot;<br>        ### System:<br>        You are a respectful and honest assistant. You have to answer the user&#39;s questions using only the context \<br>        provided to you. If you don&#39;t know the answer, just say you don&#39;t know. Don&#39;t try to make up an answer.<br><br>        ### Context:<br>        {context}<br><br>        ### User:<br>        {question}<br><br>        ### Response:<br>        &quot;&quot;&quot;<br><br>        self.hey = RetrievalQA.from_chain_type(<br>            llm=llm,<br>            retriever=vectorstore.as_retriever(),<br>            chain_type=&quot;stuff&quot;,<br>            return_source_documents=True, <br>            chain_type_kwargs={&#39;prompt&#39;: PromptTemplate.from_template(template) } <br>        )<br><br>oracle = PdfGpt(&quot;reinventing_your_life.pdf&quot;) # PDF file name<br>ask = st.text_input(&quot;What&#39;s up?&quot;, key=&quot;ask&quot;, label_visibility=&#39;hidden&#39;)<br><br>A,B = st.columns([.05, .95])<br>C,D = st.columns([.05, .95])<br>with A:<br>    st.caption(&quot;🦙&quot;)<br>with C:<br>    st.caption(&quot;📓&quot;)<br><br>if ask not in [None, &quot;&quot;, []]:  <br>    with B:<br>        st.markdown( llm.predict(ask) )<br>    with D:<br>        response = oracle.hey({&#39;query&#39;: ask})<br>        st.markdown( response[&#39;result&#39;] )</pre><p>Say you call this file test.py . To run it (in a test where you’re okay with test data caching), update the PDF file name reference reinventing_your_life.pdf to your own test PDF, then start Streamlit via streamlit run test.py .</p><p>👋</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0a7243883a12" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[USE Method for Elasticsearch]]></title>
            <link>https://medium.com/@stefnestor/use-method-for-elasticsearch-d976802d8ba6?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/d976802d8ba6</guid>
            <category><![CDATA[triage]]></category>
            <category><![CDATA[tasks]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[opensearch]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Fri, 01 Dec 2023 17:49:55 GMT</pubDate>
            <atom:updated>2023-12-29T17:56:27.893Z</atom:updated>
            <content:encoded><![CDATA[<p>Applying the <a href="https://www.brendangregg.com/usemethod.html">USE Method</a> for troubleshooting down/impaired Elasticsearch clusters.</p><p>We’re going to outline SRE incident triaging for Elasticsearch via the famous <a href="https://www.brendangregg.com/usemethod.html">USE Method</a>, with a metaphorical parallel to <a href="https://www.resus.org.uk/library/abcde-approach">medical ABCDE triaging</a>. (If the metaphor doesn’t work for you, kindly ignore. I highly recommend <a href="https://www.brendangregg.com/usemethod.html">USE Method</a> familiarity for general technical troubleshooting, but it’s not a prerequisite to apply below.)</p><h3>EUS Method</h3><p>“USE”, the trio “Utilization, Saturation, and Errors”, has SREs sequentially check in “EUS” order (though “USE” is easier to remember/reference):</p><ol><li><strong>E</strong>rrors (for literal API/log/stat errors)</li><li><strong>U</strong>tilization (of resources, mainly cpu, heap, disk, network)</li><li><strong>S</strong>aturation (of queues, e.g. 
<a href="https://en.wikipedia.org/wiki/Thread_pool">thread pools</a>, tasks)</li></ol><p>The <a href="https://www.brendangregg.com/usemethod.html">USE Method</a> blog calls out that to meaningfully investigate these three, investigators first have to determine which system processes (or metaphorically “life blood”) to track.</p><h3>Applied</h3><p>For Elasticsearch, thankfully these are (mostly) explicitly tracked as “tasks” (as <a href="https://medium.com/@stefnestor/elasticsearch-tasks-a77f6b0cb558">we’ve historically covered</a>), but note that tasks can originate from, and correspondingly surface under, the following:</p><ul><li>cluster state/discovery: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.11/cluster-pending.html">Pending Tasks</a></li><li>plugins (varies but top two): <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-explain-lifecycle.html">ILM Explain</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html">Allocation Health</a></li><li>external traffic: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html">Node Tasks</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-thread-pool.html">Threadpools</a></li></ul><p>From this outline, most performance issues can be discovered from only a handful of APIs: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html">Allocation Health</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.11/cluster-pending.html">CAT Pending Tasks</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-tasks.html">CAT Tasks</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-explain-lifecycle.html">ILM Explain</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-thread-pool.html">CAT Threadpools</a>, <a 
href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html">CAT Nodes</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-allocation.html">CAT Allocation</a>. (If this seems like too many, Elasticsearch Devs agree and have been working on an undocumented <a href="https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/elastic-rest.yml#L227-L234">Internal Health</a>, not covered here.)</p><h3>CheatSheets</h3><p>I wish Medium allowed clickable-PDF-iFrames, but I expect their screenshot versions are at least sufficient breadcrumbs. Therefore, kindly note each underlined term is an Elasticsearch doc (usually the API page returned when literally searching that string). You can access the <a href="https://stefnestor.github.io/use_method_elasticsearch.pdf">PDF version here</a>.</p><h4>Triage</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MDmQhgIcee4JlAoCz-6WgQ.png" /></figure><h4>Fleshed Out</h4><p>(Non-exhaustive obviously, but most common in my experience.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_BXure7jLn6jUxm-CsB2ZA.png" /></figure><h3>Examples</h3><p>From <a href="https://github.com/elastic/support-diagnostics/">Elasticsearch diagnostics</a>, which mass poll the above and other API endpoints, we can distill a Python <a href="https://docs.streamlit.io/library/api-reference">Streamlit</a> UI to sequentially investigate. (I do recommend alerts as the primary triage mechanism instead of manually checking reports, but reports do add needed color while fleshing out alerting, and for those of us with “trust but verify” trust issues.)</p><p><em>Disclaimer</em>s: The following examples were force-derived so won’t fully reflect real triaging scenarios. 
The screenshots don’t necessarily imply triaging recommendations; they reflect my automation’s current status.</p><h4>Errors+Saturation</h4><p>A large cluster tripping over ingest requests arriving faster than it can process them, partially due to ingest <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html">Hot Spotting</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*64U_VH0yeLNEbIHTDR1jRw.png" /></figure><p>We earlier covered troubleshooting these: <a href="https://medium.com/@stefnestor/elasticsearch-ingest-rejections-ce97e2e9da00">Breakers+Pressure</a>, <a href="https://medium.com/@stefnestor/elasticsearch-tasks-a77f6b0cb558">Threadpools</a>.</p><h4>Utilization</h4><p>A small cluster hitting <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation">disk watermark</a> and historical ingest <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html">hot spotting</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NIDzhDbqRtEezhhY0gkJRA.png" /></figure><h3>Reference</h3><p>For those who use Elastic’s <a href="https://github.com/elastic/support-diagnostics">Elasticsearch diagnostic</a>, here’s a spatial diagram aligned to this discussion (so color-coded to match earlier diagrams) mapping each pulled file’s category to its backing API (as <a href="https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/elastic-rest.yml">pulled from this code</a>):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UzIiM7hNNrDzKhEDwVzaqg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d976802d8ba6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[(Python) Streamlit + Local LLM]]></title>
            <link>https://medium.com/@stefnestor/python-streamlit-local-llm-2aaa75961d03?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/2aaa75961d03</guid>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[langchain]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ollama]]></category>
            <category><![CDATA[llama-2]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Thu, 30 Nov 2023 00:29:47 GMT</pubDate>
            <atom:updated>2023-11-30T00:29:47.143Z</atom:updated>
            <content:encoded><![CDATA[<p>Yet-Another-Code-Example for ChatGPT-like localhost LLM</p><p>👋 Howdy, y’all. I’m skipping most context/commentary and treating Medium like a Github Gist for this post.</p><p><strong>Goal</strong>: My friends have been excited by ChatGPT and want to run offline, uncensored models but have been experiencing start-up friction. I want to outline the 5min (including download time) way to get running and the 5min after to get a UI on top. AFAICT nothing like this write-up was previously on the internet, but only someone’s better Google-fu will tell.</p><h3>1) Install Ollama</h3><p>For the last 9 months, the internet has been figuring out the preferred way to run LLMs locally: <a href="https://www.reddit.com/r/LocalLLaMA/comments/12zsjhf/what_is_the_best_current_local_llm_to_run/">Reddit</a>, <a href="https://www.infoworld.com/article/3705035/5-easy-ways-to-run-an-llm-locally.html">top 5 blog</a>, <a href="https://python.langchain.com/docs/guides/local_llms">LangChain</a>. Dealer’s choice, but we’re just going to go with <a href="https://github.com/jmorganca/ollama">Ollama</a> to get llama2-uncensored (meaning it won’t say “I shouldn’t tell you that” (<a href="https://www.youtube.com/watch?v=Lh7V2_uJhkY">lol</a>) and it will also emit the swear words nobody should say). So: <a href="https://ollama.ai/download/Ollama-darwin.zip">Mac download link</a> and then in Terminal initialize models:</p><pre>$ ollama run llama2 # default<br>$ ollama run llama2-uncensored # 👈 stef default<br>$ ollama list<br>NAME                     ID           SIZE   MODIFIED<br>llama2:latest            a808fc133004 3.8 GB 3 months ago<br>llama2-uncensored:latest 5823fb1154c5 3.8 GB 3 months ago</pre><p>That’s it, that’s your command to run ChatGPT-like LLMs locally. 
(LLMs have various training data and therefore you’ll notice OpenAI’s is still currently shinier than what you can run locally, but let’s run both to vote for open source and open internet.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7dww62L0X50DD-e8V8bHwQ.png" /></figure><h3>2) Streamlit UI</h3><p>Using LangChain, there are two kinds of AI interfaces you could set up (<a href="https://python.langchain.com/docs/integrations/callbacks/streamlit">doc</a>, related: <a href="https://streamlit.io/generative-ai">Streamlit Chatbot</a> (<a href="https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps">tutorial</a>)) on top of your running Ollama. First install Python libraries:</p><pre>$ pip install langchain duckduckgo-search streamlit</pre><h4>2A) Ask Local Only</h4><p>For company-private data, you can set up a UI which <em>only</em> uses the local LLM …</p><pre>import streamlit as st <br>from langchain.llms import Ollama<br>llm = Ollama(model=&quot;llama2-uncensored:latest&quot;) # 👈 stef default<br><br>colA, colB = st.columns([.90, .10])<br>with colA:<br>    prompt = st.text_input(&quot;prompt&quot;, value=&quot;&quot;, key=&quot;prompt&quot;)<br>response = &quot;&quot;<br>with colB:<br>    st.markdown(&quot;&quot;)<br>    st.markdown(&quot;&quot;)<br>    if st.button(&quot;🙋‍♀️&quot;, key=&quot;button&quot;):<br>        response = llm.predict(prompt)<br>st.markdown(response)</pre><h4>2B) Search the Internet and Answer</h4><p>… But if you’re allowed to use your data/question’s context to search the internet, you can have your LLM Google/DuckDuckGo (example with DDG) …</p><pre>from langchain.llms import Ollama<br>from langchain.agents import AgentType, initialize_agent, load_tools<br>from langchain.callbacks.manager import CallbackManager<br>from langchain.callbacks import StreamlitCallbackHandler<br>from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler<br>import streamlit as 
st<br><br>llm = Ollama(<br>    model=&quot;llama2-uncensored:latest&quot;, <br>    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])<br>)<br>tools = load_tools([&quot;ddg-search&quot;])<br>agent = initialize_agent(<br>    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True, handle_parsing_errors=True<br>)<br><br>if prompt := st.chat_input():<br>    st.chat_message(&quot;user&quot;).write(prompt)<br>    with st.chat_message(&quot;assistant&quot;):<br>        st_callback = StreamlitCallbackHandler(st.container())<br>        response = agent.run(prompt, callbacks=[st_callback])<br>        # BUG 2023Nov05 can spiral Q&amp;A: https://github.com/langchain-ai/langchain/issues/12892<br>        # to get out, refresh browser page<br>        st.write(response)</pre><h4>2A+B) Combined</h4><p>… And putting those together into just one UI (not pretty but done) …</p><pre>import streamlit as st<br>from langchain.llms import Ollama<br>from langchain.agents import AgentType, initialize_agent, load_tools<br>from langchain.callbacks.manager import CallbackManager<br>from langchain.callbacks import StreamlitCallbackHandler<br>from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler<br><br>search_internet = st.checkbox(&quot;check internet?&quot;, value=False, key=&quot;internet&quot;)<br>prompt = st.text_input(&quot;prompt&quot;, value=&quot;&quot;, key=&quot;prompt&quot;)<br><br>if prompt!=&quot;&quot;:<br>    response = &quot;&quot;<br>    if not search_internet:<br>        llm = Ollama(model=&quot;llama2-uncensored:latest&quot;) # 👈 stef default<br>        response = llm.predict(prompt)<br>    else:<br>        llm = Ollama(<br>            model=&quot;llama2-uncensored:latest&quot;, <br>            callback_manager=CallbackManager([FinalStreamingStdOutCallbackHandler()])<br>        )<br>        agent = initialize_agent(<br>            load_tools([&quot;ddg-search&quot;])<br>            ,llm <br>            
,agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION<br>            ,verbose=True<br>            ,handle_parsing_errors=True<br>        )<br>        response = agent.run(prompt, callbacks=[StreamlitCallbackHandler(st.container())])<br>        # BUG 2023Nov05 can spiral Q&amp;A: https://github.com/langchain-ai/langchain/issues/12892<br>        # to get out, refresh browser page<br>        <br>    st.markdown(response)</pre><h3>Examples</h3><p>To run these code snippets saved as home.py , in that folder’s Terminal run …</p><pre>$ streamlit run home.py</pre><p>… which will auto-open the browser UI for you. Now you’re ready to start googling <a href="https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api">Prompt Engineering</a> to get answers formatted how you’d like …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R7DJ7LD478BlyI_qWGsAdw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AzoQAPrZh50lai8gIhgVJA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vEQIi4CWTkYSa1oWG7Np-Q.png" /></figure><p>Lastly, the above can be easily ported back into the <a href="https://doc-chat-llm.streamlit.app">Streamlit Chatbot type</a> of fancy UI (I’m unwilling to call it better, but that’s probably personality in play). I personally want customer data/email summarization which doesn’t need this level of UI, but here’s the shiny:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uF1sq2JkoNWxjfra1vfiQA.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2aaa75961d03" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch Ingest Rejections]]></title>
            <link>https://medium.com/@stefnestor/elasticsearch-ingest-rejections-ce97e2e9da00?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/ce97e2e9da00</guid>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <category><![CDATA[circuit-breaker]]></category>
            <category><![CDATA[thread-pool]]></category>
            <category><![CDATA[data-ingestion]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Fri, 24 Nov 2023 23:18:38 GMT</pubDate>
            <atom:updated>2023-11-24T23:18:38.738Z</atom:updated>
            <content:encoded><![CDATA[<p>Protections inducing <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429">HTTP 429</a> rejections and common resolutions during Elasticsearch ingest.</p><p>For Elasticsearch to protect its <a href="https://medium.com/@stefnestor/tldr-elasticsearch-memory-86707167c688">JVM heap resources</a> during ingest <a href="https://medium.com/@stefnestor/elasticsearch-tasks-a77f6b0cb558">task execution</a>, its Dev team has coded three layers of protection that if tripped will induce HTTP 429 errors: A) Circuit Breakers, B) Thread Pools, and C) Indexing Pressure.</p><h3>Protections</h3><h4>A) Circuit Breakers</h4><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html">Circuit Breakers</a> protect the JVM from OutOfMemoryError across various operation types and will induce API response body errors circuit_breaking_exception and log errors CircuitBreakingException and Data too large.</p><p>The most frequent breakers during ingest are [ parent , inflight_requests, request ]. In my experience, Circuit Breaker errors are usually more “straw that broke the camel’s back” quantity related rather than latest request’s quality related. From the API error or logs, you can check if this qualifies as “the final straw” via the new bytes reserved section.</p><pre>Caused by: org.elasticsearch.common.breaker.<strong>CircuitBreakingException</strong>: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [1045167624/996.7mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1045165504/996.7mb], <strong>new bytes reserved</strong>: [2120/2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=10631254/10.1mb, model_inference=0/0b, eql_sequence=0/0b, accounting=35724020/34mb]</pre><p>Also, where you find parent breakers, you can check its child statistics for top-offender child breakers. 
For this example just above, the stand-outs are [ in_flight_requests , accounting ] where the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#in-flight-circuit-breaker">former is related</a> to literal HTTP/API bytes and the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#accounting-circuit-breaker">latter is related</a> to Lucene shard overhead.</p><p>This specific log’s situation was root-caused to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Bulk API</a> task backups, which also surfaced under section (B) below. See also <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.11/circuit-breaker-errors.html">Elastic’s troubleshooting doc</a>.</p><p>To check your cluster for tripped Circuit Breakers (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html">Node Stats</a> counts are cumulative since <a href="https://medium.com/@stefnestor/es-node-uptime-755a76acb5e9">Node Uptime</a>), you can run (using <a href="https://jqlang.github.io/jq/">JQ for JSON-parsing</a>):</p><pre>&gt; GET _nodes/stats?human=true&amp;filter_path=nodes.*.breakers<br><br># filtered tripped at least once<br>$ cat nodes_stats.json | jq -c &#39;.nodes[]|.name as $node|.breakers|to_entries[]|{node:$node, circuitBreaker:.key, tripped_count:.value.tripped}|select(.tripped_count&gt;0)&#39;</pre><p>If you ever end up with too much time on your hands, like me, you might create a UI (via Python <a href="https://docs.streamlit.io">Streamlit</a>) to calculate hourly tripped breakers and allow quick filtering in/out (in this case the elected master was tripping the parent circuit breaker):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SvH-o849D6agps1QSHbjfA.png" /></figure><h4>B) Thread Pools</h4><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html">Thread Pools</a> 
allow Elasticsearch to allocate memory consumption and queue tasks across topics. (Elasticsearch is just re-using the <a href="https://en.wikipedia.org/wiki/Thread_pool">industry thread pool</a> term.) The pool most associated with ingest is write; however, a) ingest pipeline asynchronous processes may induce tasks under other thread pools and b) writes to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/api-conventions.html#system-indices">system indices</a> pool under [ system_write , system_critical_write ], but since users don’t write to these they rarely come up.</p><p>Each thread pool has its own queue and processing limits which, if surpassed, will induce EsRejectedExecutionException with either QueueResizingEsThreadPoolExecutor or queue capacity in its API error / log. Maxing queues is most commonly associated with <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html">Hot Spotting</a> or Circuit Breakers (section (A) above). (I wrote this Elastic-official Hot Spotting doc, with Elasticsearch Dev sign-off, so do highly recommend it and am always open to feedback to improve it further.)</p><p>You can inspect current write Thread Pool queues via <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-thread-pool.html">CAT Threadpool</a> or via <a href="https://medium.com/@stefnestor/elasticsearch-cat-alternatives-315f72ea6d5e">CAT API alternatives</a> using the following (again using <a href="https://jqlang.github.io/jq/">JQ for JSON-parsing</a>):</p><pre>&gt; GET _cat/thread_pool/write?v=true&amp;s=n,nn&amp;h=n,nn,q,a,r,c<br>&gt; GET _nodes/stats?human=true&amp;filter_path=nodes.*.thread_pool.write<br><br># filtered rejected at least once<br>$ cat nodes_stats.json | jq -rc &#39;.nodes[]|.name as $n|.thread_pool.write|{name:$n, queue: .queue, active:.active, completed:.completed, rejected:.rejected}|select(.rejected&gt;0)&#39;</pre><p>Following the same UI mockup conversation as above, this was a recent 
output I produced which ended up showing historical but not current write <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html">hot spotting</a> (you’ll note the third data node has a minimal hourly_completed ):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wxoY7cNUvNg4AKZC2kksLw.png" /></figure><h4>C) Indexing Pressure</h4><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-indexing-pressure.html">Indexing Pressure</a> was introduced in v7.9 but still seems to be the least understood Elasticsearch feature (though fairly well documented). This allows Elasticsearch to protect data integrity during write operations (e.g. indexing, shard recoveries, CCR) by reserving heap during [ coordinating, primary, replica ] write phases per write operation.</p><p>Here’s a quick diagram of the write phases in Elastic’s <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.7/docs-replication.html#basic-write-model">write model</a>, as <a href="https://www.elastic.co/pdf/elasticsearch-sizing-and-capacity-planning.pdf">diagrammed</a> (2019):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9kbU1t3BwT2A5LCtVp4VRA.png" /><figcaption>Elasticsearch’s write model</figcaption></figure><p>(Side note: I like <a href="https://luis-sena.medium.com/the-complete-guide-to-increase-your-elasticsearch-write-throughput-e3da4c1f9e92">this Medium article</a> by Luis Sena about an alternative perspective on Elasticsearch’s write model.)</p><p>Surpassing its limits raises EsRejectedExecutionException mentioning coordinating_and_primary_bytes. In my experience, this usually surfaces under section (B) circumstances above and it’s a coin-toss if (A) or this flags first. 
If this flags when there’s no evidence for (A) or (B), Elastic historically recommends reducing bulk max size for ingest (which’d allow the Thread Pool to queue and handle throughput on its layer instead, as the preferred queue mechanism).</p><p>To inspect current and/or historical values via <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html">Node Stats</a> (again using <a href="https://jqlang.github.io/jq/">JQ for JSON-parsing</a>):</p><pre>&gt; GET _nodes/stats?human=true&amp;filter_path=nodes.*.indexing_pressure<br><br># manually check against limit, noting replica max is 1.5*limit<br>$ cat nodes_stats.json | jq  -c &#39;.nodes[]|select(.thread_pool.write.queue&gt;0)|{node:.name, limit:.indexing_pressure.memory.limit, all:.indexing_pressure.memory.current.all, c_and_p:.indexing_pressure.memory.current.combined_coordinating_and_primary, c:.indexing_pressure.memory.current.coordinating, p:.indexing_pressure.memory.current.primary, r:.indexing_pressure.memory.current.replica }&#39;</pre><h3>Root Cause</h3><p>Assuming one of these three flagged, to resolve we need to first make sure (copied-forward from (B) above) that we’re <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html">not Hot Spotting</a> (which means software’s unevenly using available hardware resources). 
I can’t stress this enough: it’s the most common reason for issues on a previously right-sized hardware-vs-software cluster.</p><p>After that, there are various Elastic-official and unofficial online blogs which circle the same ballpark of actions covered in <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.11/tune-for-indexing-speed.html">Elasticsearch’s troubleshooting docs</a>: <a href="https://innovation.ebayinc.com/tech/engineering/elasticsearch-performance-tuning-practice-at-ebay/">eBay</a>, <a href="https://www.datadoghq.com/blog/elasticsearch-performance-scaling-problems/">DataDogHQ</a>, various Opster articles.</p><h4>Settings</h4><p>Where I usually end up recommending folks consider starting within Elasticsearch settings …</p><ul><li>(temporarily) undo any ingest/recovery cluster setting overrides (e.g. I’m looking at everybody who leaves cluster.routing.allocation.node_concurrent_recoveries (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#cluster-shard-allocation-settings">doc</a>) overridden and then one day it blows up in their face)</li><li>increase target index(/ices) refresh_interval (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings">doc</a>) which defaults to 1s; even 5s is fairly unnoticeable to humans but helps the database a lot</li><li>right-size number_of_shards (aka. primaries, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#_static_index_settings">doc</a>) or target as multiple of (applicable) nodes</li></ul><p>… then client-side you’ll look to a) verify you have upstream queueing and b) right-size its workers and bulk sizes to Elasticsearch. 
For Elastic’s products, the most common related docs for (b) are:</p><ul><li>Elastic Agents’ <a href="https://www.elastic.co/guide/en/fleet/current/elasticsearch-output.html#output-elasticsearch-performance-tuning-settings">tuning settings</a> under <a href="https://www.elastic.co/guide/en/fleet/current/fleet-settings.html#output-settings">Fleet Settings UI</a></li><li>Logstash <a href="https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html">pipeline.batch.size</a> and <a href="https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html">pipeline.workers</a></li><li>Filebeat <a href="https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#_configuration_options_17">output to Elasticsearch</a> via <a href="https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#_configuration_options_17">worker</a>, <a href="https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#_configuration_options_17">bulk_max_size</a></li></ul><h4>Context</h4><p>The background on “why these settings?” relates to Lucene (which Elasticsearch sits on top of): its performance is weighted more heavily by its internal merging task than by its ingesting task. 
See the <a href="https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html">infamous 2011 video/blog</a> for context:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FYW0bOvLp72E&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DYW0bOvLp72E&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2FYW0bOvLp72E%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9ccb827c618be083fd176d63ef99e118/href">https://medium.com/media/9ccb827c618be083fd176d63ef99e118/href</a></iframe><p>The takeaway is that Lucene/Elasticsearch prefers large initial segment sizes, which reduce overall segment merging needs, and admins can encourage that via the settings outlined earlier. This is also why Elasticsearch <em>emphatically</em> recommends <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Bulk ingest</a> rather than <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html">Doc ingest</a>.</p><p>(Y’all probably don’t need this last graphic, but I use it a lot in my conversations and want somewhere on the public internet to paste it. Realizing now I went right-to-left when western media usually does left-to-right, sorry.) As a final graphic on how Lucene merges to its eventual happiest segment size of 5GB: most of the win comes from the initial segment sizing, which admins mainly encourage via refresh_interval and bulk-sizing settings:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tYnzds5UhbjHAjgKF8cEoA.png" /></figure><p>👋</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ce97e2e9da00" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Streamlit + iTerm2 (Python)]]></title>
            <link>https://medium.com/@stefnestor/streamlit-iterm2-python-4d7aa41bc4ab?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d7aa41bc4ab</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[iterm2]]></category>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[bash]]></category>
            <category><![CDATA[automation]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Tue, 26 Sep 2023 18:19:45 GMT</pubDate>
            <atom:updated>2023-09-26T18:19:45.832Z</atom:updated>
            <content:encoded><![CDATA[<p><em>How to spin up an iTerm2 session from the Python3 Streamlit library’s UI</em></p><p>Hello! This one’ll be short, but I’m documenting this automation building-block to show my team.</p><p><strong>Use Case</strong>: I use Salesforce’s sfdx <a href="https://developer.salesforce.com/tools/sfdxcli">CLI</a> to pull Case feeds to my local disk. Annoyingly, it doesn’t have a Python library, so I surfaced its data into a <a href="https://docs.streamlit.io/library/api-reference">Streamlit</a> UI via Python’s subprocess. Separately, I’d written a Bash automation to minimize <a href="https://github.com/elastic/support-diagnostics/">Elastic’s Stack Diagnostics</a> polling impact. Naturally, at some point my <a href="https://www.alfredapp.com/">Alfred</a> automation connecting these two became burdensome to version and share (since it costs $). So I turned towards <a href="https://iterm2.com">iTerm2</a> (my default Mac terminal) for a better, sharable, and free automation.</p><h3><strong>Write-up</strong></h3><h4>Technical Context</h4><p>iTerm2 allows <a href="https://iterm2.com/python-api/tutorial/index.html">Python script automations</a>. You can script against iTerm from Python via pip’s iterm2 library <a href="https://iterm2.com/python-api-auth.html">after enabling it</a>. There are some pretty good beginner <a href="https://iterm2.com/python-api/tutorial/example.html">examples</a>. (See also <a href="https://github.com/search?q=%22import+iterm2%22&amp;type=code">GitHub code</a> for more examples.)</p><p>It appears there are some restrictions on the iterm2 library’s abilities: it can kick off requests but can’t hear the response the way subprocess can. 
(Workaround examples: <a href="https://jongsma.wordpress.com/2020/02/19/exploring-the-iterm2-python-api/">this</a>, <a href="https://github.com/paulomcnally/iterm2-scripts/blob/b8851e6fde98e59b231d00bb8fa77c89faf5e073/ruby.py#L4">this</a>.)</p><p>I’m going to skip outlining troubleshooting gotchas and just mention:</p><ul><li>Python opening iTerm needs to be done as an asynchronous task via asyncio and requires working around <a href="https://github.com/streamlit/streamlit/issues/744#issuecomment-1491780114">this Streamlit bug</a>.</li><li>We’re automating via iterm2 and not subprocess because iTerm loads your Mac’s .bash_profile (which is a common entry point for loading git-versioned <a href="https://www.freecodecamp.org/news/dotfiles-what-is-a-dot-file-and-how-to-create-it-in-mac-and-linux/">Dotfiles</a>), so we don’t have to recreate in our Python code the Bash functions we’re already manually running in iTerm.</li></ul><h4>Design</h4><p>I’m going to outline the automation’s <a href="https://en.wikipedia.org/wiki/Minimum_viable_product">MVP</a> since it looks like there’s not that much Google content previously written in this ballpark. The design flow we’d hope for is:</p><ol><li>Open iTerm and start the Streamlit UI via streamlit run home.py</li><li>In Streamlit’s UI (default localhost:8501) have a button to open a new iTerm tab</li><li>Once the new iTerm tab is open, change directory to the Salesforce Case ID and start running Elastic’s diagnostic via the pre-built Bash automation</li></ol><p>Steps 1 and 2 will be done in the Python code; step 3 will trigger from Python but run off the Bash/Dotfiles’ code. 
(The Bash/Dotfiles’ code will be explained but not outlined here.)</p><h4>Code</h4><p>So we’ll write the Python code under home.py which can be run via python3 home.py (just in iTerm) and/or streamlit run home.py (displays UI).</p><p>(<em>Note</em>: I’ll leave a comment-block in the code below marking iterm2 code that works unless it’s run under Streamlit, to highlight where you may need to pivot from online examples when building out your own automations. Bug results are explained at the bottom.)</p><pre>import asyncio<br>import sys<br><br>import iterm2<br>import streamlit as st<br><br>### streamlit bug: https://github.com/streamlit/streamlit/issues/744#issuecomment-1491780114<br>def get_or_create_eventloop():<br>    try:<br>        return asyncio.get_event_loop()<br>    except RuntimeError as ex:<br>        if &quot;There is no current event loop in thread&quot; in str(ex):<br>            loop = asyncio.new_event_loop()<br>            asyncio.set_event_loop(loop)<br>            return loop<br>        raise  # any other RuntimeError is unexpected, so surface it<br>asyncio.set_event_loop(get_or_create_eventloop())<br>###<br><br>async def async_iTerm(connection, number):<br>    app = await iterm2.async_get_app(connection)<br>    window = app.current_window<br>    if window is None:<br>        sys.exit(&quot;👻 No current iTerm window&quot;)<br><br>    ### BLOCKING ERROR:: websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received<br>    ## does not work when run via Streamlit<br>    # tab = await window.async_create_tab(profile=&quot;🥷&quot;, command=f&quot;/bin/bash goto {number}&quot;)<br>    ###<br><br>    # my iTerm profile is called &quot;🥷&quot;<br>    tab = await window.async_create_tab(profile=&quot;🥷&quot;)<br>    await tab.async_set_title(number)<br><br>    session = app.current_terminal_window.current_tab.current_session<br><br>    # 1. this kicks off Bash commands where &quot;goto&quot; and &quot;diagme&quot; are my custom Dotfile functions<br>    # 2. adding &quot;\n&quot; on the end submits the command in iTerm so it also executes rather than just populating the text<br>    await session.async_send_text(&#39;echo hello\n&#39;)<br>    await session.async_send_text(f&#39;goto {number}\n&#39;)<br>    await session.async_send_text(&#39;diagme\n&#39;)<br>    print(&quot;👋&quot;)<br><br>def open_case_iterm(number):<br>    # run_until_complete only passes `connection` to the coroutine (its second argument is `retry`),<br>    # so bind `number` through a small wrapper<br>    async def main(connection):<br>        await async_iTerm(connection, number)<br>    iterm2.run_until_complete(main)<br><br># ---<br># usually, above code would be under a controller and below under a view of Python MVC model code<br># ---<br><br># example Salesforce Case Number ID, would be set by user or dynamically in non-MVP code<br>number = &quot;01486356&quot;<br><br>if st.button(&quot;Open iTerm&quot;, key=&quot;iterm&quot;):<br>    open_case_iterm(number)</pre><h4><strong>Demo</strong></h4><p>This MVP is quite minimal on use-case details but proves sufficient technical viability for us to consider it a working automation building block. We’ll start streamlit …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6uXtCefRY_Rjei8YSd880A.png" /></figure><p>… which automatically opens its UI showing our button. 
Once we click our “Open iTerm” button …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*34DMzGHXdAicQCBZkNcPuA.png" /></figure><p>… iTerm will open a new tab, run echo hello , run my change-directory Dotfile automation goto, and finish by starting my diagnostic automation diagme …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mVC6aDSd135QdxK-ul0CHg.png" /></figure><p><em>MVP bug</em>: The original iTerm tab may end up reporting connection errors …</p><pre>Task exception was never retrieved<br>future: &lt;Task finished name=&#39;Task-386&#39; coro=&lt;Connection._async_dispatch_to_helper() done, defined at /opt/homebrew/lib/python3.11/site-packages/iterm2/connection.py:301&gt; exception=ConnectionClosedError(None, Close(code=1000, reason=&#39;&#39;), None)&gt;<br>Traceback (most recent call last):<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 959, in transfer_data<br>    message = await self.read_message()<br>              ^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 1029, in read_message<br>    frame = await self.read_data_frame(max_size=self.max_size)<br>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 1104, in read_data_frame<br>    frame = await self.read_frame(max_size)<br>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 1161, in read_frame<br>    frame = await Frame.read(<br>            ^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/framing.py&quot;, line 68, in read<br>    data = await reader(2)<br>           ^^^^^^^^^^^^^^^<br>  File 
&quot;/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/streams.py&quot;, line 731, in readexactly<br>    raise exceptions.IncompleteReadError(incomplete, n)<br>asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of 2 expected bytes<br><br>The above exception was the direct cause of the following exception:<br><br>Traceback (most recent call last):<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/connection.py&quot;, line 309, in _async_dispatch_to_helper<br>    if await helper(self, message):<br>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/notifications.py&quot;, line 550, in _async_dispatch_helper<br>    await handler(connection, sub_notification)<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/app.py&quot;, line 380, in _async_focus_change<br>    await self.async_refresh()<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/app.py&quot;, line 274, in async_refresh<br>    layout = await iterm2.rpc.async_list_sessions(self.connection)<br>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/rpc.py&quot;, line 33, in async_list_sessions<br>    return await _async_call(connection, request)<br>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/rpc.py&quot;, line 884, in _async_call<br>    await connection.async_send_message(request)<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/iterm2/connection.py&quot;, line 254, in async_send_message<br>    await self.websocket.send(message.SerializeToString())<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 635, in send<br>    await self.ensure_open()<br>  File &quot;/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py&quot;, line 935, 
in ensure_open<br>    raise self.connection_closed_exc()<br>websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received<br>👋</pre><p>… which appear to be non-blocking when commands are sent via session.async_send_text instead of under the window.async_create_tab command so ignoring for MVP purposes …</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HqtNl-Ozu65ZfGABDG_mMg.png" /></figure><p><em>🦖 happy coding</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d7aa41bc4ab" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Elasticsearch Data Health]]></title>
            <link>https://medium.com/@stefnestor/elasticsearch-data-health-5e760d303b49?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/5e760d303b49</guid>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[allocation]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <category><![CDATA[monitoring]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Mon, 04 Sep 2023 21:56:32 GMT</pubDate>
            <atom:updated>2023-09-04T21:56:32.141Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://jqlang.github.io/jq/">JQ</a> commands to troubleshoot Cluster Health yellow/red.</p><h3>Theory</h3><p>Elasticsearch reports a single status under <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html">Cluster Health</a> to represent the roll-up of the data’s health via its shards’ status and indices’ health . <a href="https://cloud.elastic.co/">Elastic Cloud</a> elaborates this summary for a Deployment under its Health menu:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fyHfk4t6b110wyJPKKzYrQ.png" /><figcaption>Elastic Cloud &gt; Deployment &gt; Health</figcaption></figure><p>The Elastic Cloud UI correlates warnings/errors to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html#cluster-health-api-response-body">Elasticsearch’s yellow/red definitions</a>:</p><blockquote>Health status of the cluster, based on the state of its primary and replica shards. Statuses are:</blockquote><blockquote>green : All shards are assigned.</blockquote><blockquote>yellow : All primary shards are assigned, but one or more replica shards are unassigned. If a node in the cluster fails, some data could be unavailable until that node is repaired.</blockquote><blockquote>red : One or more primary shards are unassigned, so some data is unavailable. This can occur briefly during cluster startup as primary shards are assigned.</blockquote><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html">Elasticsearch docs then state</a> you next introspect why the shard(s) aren’t allocating via <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html">Allocation Explain</a> and from there data recovery becomes situationally unique and potentially on a shard-by-shard basis. 
(Most common situations are outlined <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html#fix-red-yellow-cluster-status">here</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-allocation-explain.html#cluster-allocation-explain-api-examples">here</a>.)</p><p>After finding Mincong Huang’s <a href="https://mincong.io/en/shard-allocation-deciders/">Shard Allocation Deciders</a> article, I became intrigued to figure out a way to pull an aggregated view of problematic shards with their causes/solutions.</p><p>This <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#shards-rebalancing-settings">Allocation</a> summary has been an ongoing discussion, for example <a href="https://github.com/elastic/elasticsearch/issues/80892">here</a> and <a href="https://github.com/elastic/elasticsearch/issues/80787">here</a>. Since I haven’t distilled it as fast as I projected I might, I wanted to share the stop-gaps I’ve learned so far to deepen insight and outline my investigation flow to speed up recovery.</p><h3>Summarize</h3><p>Since Cluster Health is low-level, effectively just reporting status:UNASSIGNED shards (which isn’t 100% true if shards are recovering from snapshot, but leaving that aside), my first question was where to best pull this data. Hence my <a href="https://medium.com/@stefnestor/elasticsearch-cat-alternatives-315f72ea6d5e">ES CAT Alternatives</a> investigation. From this I learned there’s a “routing table” inside the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html">Cluster State</a>:</p><pre>&gt; GET _cluster/state/routing_table?filter_path=routing_table.indices.*.shards</pre><p>This actually has more (though still brief) data than <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html">CAT Shards</a> (e.g. the time the shard became unassigned under unassigned_info). 
Once you have this Cluster State stored locally as cluster_state.json, you can use the <a href="https://jqlang.github.io/jq/">third-party tool JQ</a> (or your favorite tool) to JSON-parse this response into meaningful aggregations. The two I prefer to summarize are:</p><h4>Cause</h4><p>This is the system-stored reason for the shard becoming unassigned:</p><pre>$ cat cluster_state.json | jq -rc &#39;.routing_table.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[] as $v|select($v.state==&quot;UNASSIGNED&quot;)|[$v.unassigned_info.reason]|@tsv&#39; | sort | uniq -c | sort -rn | column -t | head</pre><p>This outputs a table of causes with the shard count for each bucket. Example causes:</p><ul><li>NODE_LEFT : Most common in my experience <em>by far</em>. This is when a node <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-fault-detection.html">ungracefully leaves the cluster</a> (e.g. either from network issues or resource overwhelm).</li><li>CLUSTER_RECOVERED : In my experience, this is a second step to the bullet above when only replicas remain to allocate.</li><li>PRIMARY_FAILED : In my experience, this happens when you have disk corruption. It is <em>very</em> uncommon, but happens e.g. when you have <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html#path-settings">security scanners running against</a> Elasticsearch’s path.data .</li></ul><p>Knowing the cause is usually most of the battle towards determining the resolution. At the very least, it’s helpful to know how many resolution paths you’ll need to work through to recover all data (rather than the way Elasticsearch recovery has historically, and unnecessarily, been framed: resolve x and see if there’s still an x+1).</p><h4>Status</h4><p>Allocation “status” is as close as we’ll get to a database-recommended recovery path, and it’s fairly informative. 
There are three data points (two more than the one above) from <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html">Allocation Explain</a> which are kept in the Cluster State and not just dynamically determined when the API is executed.</p><pre>$ cat cluster_state.json | jq -rc &#39;.routing_table.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[] as $v|select($v.state==&quot;UNASSIGNED&quot;)|[$v.primary, $s, $i, $v.unassigned_info.allocation_status, $v.unassigned_info.reason, $v.recovery_source.type]|@tsv&#39; | sort -r | column -t | head</pre><p>Please note, there’s no official documentation that I can find on interpreting these, so these are my working definitions:</p><ul><li>allocation_status : designates whether a shard has attempted recovery and what was determined. E.g. no_attempt indicates you should hit <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html">Cluster Reroute</a> with retry_failed=true, after which it either resolves or updates to something else. E.g. alternatively no_valid_shard_copy indicates data corruption requiring re-ingesting or <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html">restoring from snapshot</a>.</li><li>reason : This is the “cause” described above, which is why I said it was “most of the battle” for understanding recovery. This is also how I determine root-cause analysis (RCA) for my clusters.</li><li>recovery_source : designates where the correct data is expected to be. E.g. PEER is when a replica copies off a primary. E.g. SNAPSHOT is when a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/searchable-snapshots.html">Searchable Snapshot</a> or restore from snapshot is processing. E.g. 
EXISTING_STORE corresponds to no_valid_shard_copy frequently.</li></ul><h3>Automate</h3><p>At this point, you’ll now have a high-level overview of all status:UNASSIGNED shards in your cluster. In my personal setup, I have Python automations to react to different data points. The following examples for the mentioned data points <em>may</em> potentially apply to your use case as well but should be analyzed first to guarantee cluster integrity first-and-foremost:</p><ul><li>Receiving no_attempt kicks off <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html">Cluster Reroute</a> with retry_failed=true as commented above. This is quite safe.</li><li>As long as <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html">CAT Nodes</a> reports healthy cpu and heap , then for CLUSTER_RECOVERED , THROTTLED , or PEER I’ll temporarily <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-update-settings.html">override Cluster Settings</a> for recovery rates: cluster.routing.allocation.node_concurrent_recoveries (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#cluster-shard-allocation-settings">doc</a>), indices.recovery.max_bytes_per_sec (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/recovery.html#recovery-settings">doc</a>).</li><li>I wrote an automation which does a full investigation and recovery for when it finds no_valid_shard_copy . Thankfully it’s infrequent, but it’s always a heart wrenching notification to receive.</li></ul><h3>Manual</h3><p>Allocation issues come in wide and varied flavors. The handful of most frequent are +80% of the problems I work on, so automations are meaningful time investments which maintain your uptime SLO’s during database outages. 
However, at some point you will encounter the other 20% and need to manually recover.</p><p>At this point, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html">Elasticsearch’s guide</a> appropriately tells you to load <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html">Allocation Explain</a> to execute the live report of why the shard won’t currently allocate. But then the training wheels come off and you’re left to determine which deciders matter to you and how to bypass them.</p><p>I’m not going to have a case-by-case answer, but let me outline the files I usually cross-compare when it’s a software-based issue:</p><ul><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-get-settings.html">Cluster Settings</a> : Mostly *.routing.allocation.* . Comes up when settings are literally called out in the output, or when no shards (or no shards of a type, e.g. replicas) can allocate. Also, the most frequent settings-based errors, too many shards or maximum shards open, relate to cluster.max_shards_per_node .</li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-settings.html">Index Settings</a> : Specifically .routing.allocation to determine any conflicting settings. Usually it will be <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html">ILM</a>’s .include._tier_preference conflicting with not fully deprecated <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html">Node Attribute</a> settings. 
Which brings up …</li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-get-lifecycle.html">ILM Policies</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-explain-lifecycle.html">ILM Explain</a> : To understand where the index thinks it wants to be.</li><li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html">CAT Recovery</a> : When flagged active_only=true this lets you watch the shard’s data recovery pace. (For those like me who trust but default verify.)</li></ul><p>I still think there’s a lot of unorganized low-hanging fruit which’ll allow this conversation to standardize and automate further; but for now, this is the stop-gap I’ve used to build my automations.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5e760d303b49" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Logstash Copy Elasticsearch Doc ID]]></title>
            <link>https://medium.com/@stefnestor/logstash-copy-elasticsearch-doc-id-a1089e605f85?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/a1089e605f85</guid>
            <category><![CDATA[logstash]]></category>
            <category><![CDATA[integration]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Fri, 14 Apr 2023 22:03:55 GMT</pubDate>
            <atom:updated>2023-04-14T22:03:55.283Z</atom:updated>
            <content:encoded><![CDATA[<p>TLDR on copying the Index Document’s _id into another field in Logstash.</p><p>When copying data between <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html">Elasticsearch</a> clusters, sometimes we’ll want to copy-save the original _id for long-term referencing. As a quick <a href="https://www.elastic.co/guide/en/logstash/current/introduction.html">Logstash</a> export example from an <a href="https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html">Elastic Cloud</a> Elasticsearch cluster, we can set up a test pipeline</p><pre>$ pwd<br>~/downloads<br><br>$ cat test.conf<br>input {<br>  elasticsearch {<br>    cloud_id =&gt; &quot;REDACTED&quot;<br>    cloud_auth =&gt; &quot;elastic:changeme&quot;<br>    index =&gt; &quot;.internal.alerts-security.alerts-default-000001&quot;<br>    size =&gt; 1<br>    docinfo =&gt; true<br>    docinfo_target =&gt; &quot;[@metadata]&quot;<br>  }<br>}<br><br>filter {<br>  mutate {<br>    add_field =&gt; { &quot;doc_id&quot; =&gt;  &quot;%{[@metadata][_id]}&quot; }<br>  }<br><br>  prune {<br>    whitelist_names =&gt; [ &quot;doc_id&quot; ]<br>  }<br>}<br><br>output {<br>  stdout {<br>    codec =&gt; rubydebug { metadata =&gt; true }<br>  }<br>}</pre><p>Where <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html#plugins-filters-mutate-add_field">Mutate</a> is the core Logstash <a href="https://www.elastic.co/guide/en/logstash/current/filter-plugins.html">Filter</a> we’re going for and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-prune.html">Prune</a> just simplifies our example. 
I’ll highlight that we needed to enable Logstash <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elasticsearch.html">Elasticsearch Input</a>’s <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elasticsearch.html#plugins-inputs-elasticsearch-docinfo">docinfo</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elasticsearch.html#plugins-inputs-elasticsearch-docinfo_target">docinfo_target</a> to make this work.</p><p>Once set up, we can use Docker to run our example pipeline</p><pre>$ docker run --name logstash --rm -it -p 5044:5044 -p 9600:9600 -v ~/downloads/:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:8.6.2</pre><p>Which will emit, for our test, the original _id and our copied doc_id field.</p><pre>{ <br>  &quot;@metadata&quot; =&gt; {<br>    &quot;_id&quot; =&gt; &quot;fb68ad73cd755f8f8d7e638cd77acde1e03ab057581a479508987c661ad1de69&quot;,<br>    &quot;_index&quot; =&gt; &quot;.internal.alerts-security.alerts-default-000001&quot;,<br>    &quot;_type&quot; =&gt; nil <br>  },<br>  &quot;doc_id&quot; =&gt; &quot;fb68ad73cd755f8f8d7e638cd77acde1e03ab057581a479508987c661ad1de69&quot;<br>}<br></pre><p>🎉</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a1089e605f85" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snippet: Air gap Elastic]]></title>
            <link>https://medium.com/@stefnestor/snippet-air-gap-elastic-7615f9f36be4?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/7615f9f36be4</guid>
            <category><![CDATA[airgapped]]></category>
            <category><![CDATA[elastic]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <category><![CDATA[airgap]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Mon, 03 Apr 2023 15:47:10 GMT</pubDate>
            <atom:updated>2023-04-03T15:47:10.431Z</atom:updated>
            <content:encoded><![CDATA[<p>Summary of setup considerations for air-gapped environments.</p><h3>Need</h3><p>Some organizations require their computer network to be <a href="https://en.wikipedia.org/wiki/Air_gap_(networking)">air-gapped</a>. Elastic’s core products can be easily downloaded (via <a href="https://www.elastic.co/downloads/">direct</a> or <a href="https://www.docker.elastic.co/">docker</a>) and transferred to satisfy this requirement. Afterward, users still sometimes encounter start-up errors due to misconfigured sub-features still attempting to reach out across the internet for supplementary data, especially to sub-domains of <em>elastic.co</em>.</p><h3>Response</h3><p>The Elastic ecosystem will not randomly reach out to <em>[epr,artifacts].elastic.co</em> but may when, e.g., <a href="https://www.elastic.co/guide/en/cloud-enterprise/current/ece-install-offline.html">installing ECE</a>, using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.4/geoip-processor.html#manage-geoip-database-updates">Elasticsearch’s GeoIP</a> (<a href="https://www.elastic.co/guide/en/logstash/7.17/plugins-filters-geoip.html#plugins-filters-geoip-manage_update">same for Logstash</a>), <a href="https://www.elastic.co/guide/en/fleet/current/air-gapped.html#host-artifact-registry">installing/upgrading Agents</a>, or running <a href="https://www.elastic.co/guide/en/fleet/current/air-gapped.html">Fleet setups</a>. (As applicable to your use case, the linked docs show how to pivot for air-gapped environments.) For exhaustiveness, other Elastic endpoints/IPs your setup could attempt to reach would relate to <a href="https://www.elastic.co/guide/en/kibana/current/maps-connect-to-ems.html#elastic-maps-server">Maps</a> or <a href="https://www.elastic.co/guide/en/kibana/8.4/telemetry-settings-kbn.html">Telemetry</a>, but not <a href="https://www.elastic.co/guide/en/security/8.4/prebuilt-rules-api.html">Prebuilt Rules</a>. 
Potentially surfacing from third-party dependencies, non-Elastic endpoints would be <a href="https://www.elastic.co/guide/en/logstash/current/offline-plugins.html">Logstash plugin installs</a> (to <em>rubygems.org</em>).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7615f9f36be4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Diagnose Kibana Discover]]></title>
            <link>https://medium.com/@stefnestor/diagnosing-kibana-discover-ui-8f6b4fd7df9f?source=rss-3884b0aa8da5------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f6b4fd7df9f</guid>
            <category><![CDATA[discover]]></category>
            <category><![CDATA[kibana]]></category>
            <category><![CDATA[page-load-time]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[elasticsearch]]></category>
            <dc:creator><![CDATA[Stef Nestor]]></dc:creator>
            <pubDate>Sat, 21 Jan 2023 19:32:21 GMT</pubDate>
            <atom:updated>2023-12-01T17:57:19.093Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><a href="https://www.elastic.co/blog/troubleshooting-guide-common-issues-kibana-discover-load">Now an official Elastic blog</a> (with some Dev corrections).</blockquote><p>Troubleshooting guide for the Elastic company’s Kibana product’s <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> UI view related to long loads, timeouts, and errors. (Not reviewing data ingest lag or data quality after full page load.)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7U4avUs0fMism-dIXYA1_g.png" /><figcaption>Kibana’s Discover page v8.6.0, loaded on defaults</figcaption></figure><h3>Summary</h3><p><a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> is Elastic’s core Kibana UI to search, filter, and inspect (time series) data. <a href="https://www.elastic.co/guide/en/kibana/8.6/get-started.html#view-and-analyze-the-data">Visualizations</a> are used for data aggregations/summaries. The Discover UI is resilient to large Elasticsearch responses, but can sometimes experience issues due to (uncompressed) response size, <a href="https://www.elastic.co/blog/found-crash-elasticsearch#mapping-explosion">mapping explosion</a>, and browser limits. Below we’ll summarize the most common historical issues and a sequential troubleshooting walkthrough.</p><h3>Walkthrough</h3><p>After establishing and loading a user session, Kibana will load Discover via base URI /app/discover (or its related <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> specific URI). 
To load this page, the browser page will sequentially request three APIs from the Kibana server (and through Kibana to the underlying Elasticsearch server as needed).</p><p>💡If the Kibana page errors on load, you’ll want to open your <a href="https://support.happyfox.com/kb/article/882-accessing-the-browser-console-and-network-logs/">browser’s network tab</a> to confirm which sequential request ends up failing. You can share your findings by exporting a <a href="https://support.zendesk.com/hc/en-us/articles/4408828867098">HAR log</a>.</p><h4>1. Load Index Pattern</h4><p>The browser page will request Kibana’s <a href="https://www.elastic.co/guide/en/kibana/current/saved-objects-api-get.html">Saved Objects</a> endpoint for the currently selected <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">Data View</a> (code still targets type:index-pattern which was its name &lt;v8.0).</p><pre>POST /api/saved_objects/_bulk_get <br>[{&quot;id&quot;:&quot;${INDEX_PATTERN_ID}&quot;,&quot;type&quot;:&quot;index-pattern&quot;}]</pre><p>This Kibana API search-forwards to the Elasticsearch API under the Saved Object’s backing <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/alias.html">Alias</a> .kibana. I’m not certain on the query translation, but it’d be something like:</p><pre>GET .kibana/_search<br>{&quot;query&quot;: {&quot;bool&quot;: {&quot;filter&quot;: [{&quot;bool&quot;: {&quot;should&quot;: [{<br>  &quot;match_phrase&quot;: {&quot;_id&quot;: &quot;index-pattern:INDEX_PATTERN_ID&quot;}<br>}]}}]}}}</pre><p>Note, Saved Objects are looked up by the Data View’s id and not its title or name. 
If you <a href="https://www.elastic.co/guide/en/kibana/8.6/managing-saved-objects.html#managing-saved-objects-export-objects">export/import</a> or <a href="https://www.elastic.co/guide/en/kibana/current/spaces-api-copy-saved-objects.html">copy</a> Saved Objects between <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Spaces</a> or Elasticsearch clusters, you may see Visualization/Dashboard/Discover errors about your underlying id having changed during import (see the Saved Object’s import module to avoid this). To demonstrate these fields’ difference:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WE9LjCopWYh-dGD0YCf8uA.png" /></figure><p>👻 If this affects you, during page load, expect a bottom-right warning/error module similar to:</p><pre>&quot;INDEX_PATTERN_ID&quot; is not a configured data view ID</pre><p>This error is reported in the context of the current <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> and does not indicate whether the Data View exists in a different Space.</p><h4>2. Load Fields</h4><p>Next the Kibana UI will load a compilation of backing indices’ related fields.</p><p><strong>API</strong>. First, it will API request:</p><pre>GET /api/index_patterns/_fields_for_wildcard?pattern=INDEX_PATTERN&amp;meta_fields=_source&amp;meta_fields=_id&amp;meta_fields=_index&amp;meta_fields=_score</pre><p>This API will re-trigger every time the user selects a Data View in the top-left. On the back end, Kibana is returning indices from Elasticsearch’s <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-resolve-index-api.html">Resolve Index</a> API and then compiling said indices’ Mappings. 
I don’t (yet) know of an equivalent request to Elasticsearch to compare to.</p><p>👻 This API’s response time is drastically impacted by <a href="https://www.elastic.co/blog/found-crash-elasticsearch#mapping-explosion">mapping explosion</a> which can partially be diagnosed by this API’s un/compressed response size. Usually this will relate to how many varying <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">Index Mappings</a> are loaded, but can also result from overriding <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html">Mapping limits</a>. This usually returns (far) below 3s but you should definitely consider ≥10s slow.</p><p>👻 Errors have historically occurred from field name conflicts between indices. You’d want to fix the underlying Index Mappings, but can also <a href="https://gist.github.com/stefnestor/3e428f621cbcf43e7b4ada2b3e855f82">apply a Runtime field</a> as a temporary override to fix the stray indices’ mapping type.</p><p><strong>JS</strong>. Once the API results return, <em>if</em> the left drawer (showing “Selected Fields” and “Available Fields”) is open, then the browser JavaScript will do summary analytics on these fields. If slow, this will appear in the browser Network tab as the API request having ended but the following request (3) not starting for multiple seconds. Users normally only notice ≥10s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DrDqXVV-5yVd_S_0Zjq01Q.png" /></figure><p>💡This JavaScript compilation time is diagnosed via the browser DevTool’s Performance tab (e.g. 
<a href="https://developer.chrome.com/docs/devtools/performance/">Chrome</a>, <a href="https://profiler.firefox.com/docs/#/">Firefox</a>; can also export a HAR-like equivalent for sharing).</p><p>👻 Long durations have previously occurred from attempting to compile cardinality on individual fields having <em>extremely </em>long strings with high (or no) ignore_above in their <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">Index Mappings</a>.</p><h4>3. Load Search</h4><p>Lastly, the browser page will make an API search request. This API search request passes through the Kibana server but (should) take nearly the same amount of time as making the Elasticsearch API request directly.</p><p><strong>API</strong>. This URI defaults to</p><pre>POST /internal/bsearch {REQUEST_BODY_HERE}</pre><p>But if <a href="https://www.elastic.co/guide/en/kibana/current/advanced-options.html">Advanced Setting</a> courier:batchSearches: false (&lt;v8.0) then this will instead API request</p><pre>POST /internal/_msearch {REQUEST_BODY_HERE}</pre><p>🤔 (To assist quick page searching: <em>#inspectViaDevTools</em>.) If this search takes a while to process, usually we’ll drop the look-back time as low as possible (e.g. 1-5mins). Then we’ll navigate Discover &gt; Inspect &gt; Request &gt; Open in Console (aka. <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">DevTools</a>). Visually:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dHPzwpz5FdeTzkrEm6Bzsg.png" /></figure><p>We’ll then run this API search request both in <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">DevTools</a> and separately via Elasticsearch API curl, noting the response time difference between Discover, DevTools, and the Elasticsearch API.</p><p>👻 If Discover to DevTools: TBH, I’m not sure how to interpret that and would have expected the duration padding around but not on the literal API request. 
If DevTools to Elasticsearch: the Kibana server throughput is backed up which can next be introspected via its <a href="https://www.elastic.co/guide/en/kibana/current/task-manager-health-monitoring.html">Task Manager Health API</a> (≥7.11).</p><p>👻 If Elasticsearch is also just as slow as the other two: We may suspect an un-optimized search/filter in our original Discover view. If no filters/searches are applied (or reproducing with none applied), we’ll confirm general Elasticsearch performance via <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html">CAT Nodes</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-thread-pool.html">CAT Threadpools</a> (esp. search threads), and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-tasks.html">CAT Tasks</a> (for <a href="https://medium.com/@stefnestor/elasticsearch-tasks-a77f6b0cb558">long running tasks</a>). If no cluster-wide issue is found, we’ll compare search response durations between Data Views and then compare these searches’ related <a href="https://www.elastic.co/guide/en/kibana/current/xpack-profiler.html">Query Profiling</a> (after injecting profile: true in our search request body).</p><p><strong>JS</strong>. After the API results return, the browser’s JavaScript kicks in to load the 1) display table summary (the middle-bottom “Documents” table where you can toggle column view on/off) and 2) “Field Statistics” (in <em>beta</em>, toggle in <a href="https://www.elastic.co/guide/en/kibana/current/advanced-options.html">Advanced Settings</a> via discover:showFieldStatistics).</p><p>👻 Both, again, will be affected by <a href="https://www.elastic.co/blog/found-crash-elasticsearch#mapping-explosion">Mapping Explosion</a>, but its impact was drastically reduced in v7.17.8/v8.5.2 via <a href="https://github.com/elastic/kibana/issues/144673">kibana#144673</a>. (Thanks Kibana Dev for optimizing that with/for me!) 
Mapping Explosion may surface browser-specific errors, such as Chrome’s Error: maximum call stack size exceeded which reproduces incognito, does not occur in Firefox/Safari, and sometimes only resolves via upgrading Chrome.</p><h3>Closing</h3><p>👻 (To assist quick page searching: <em>#devToolsAuto</em>.) While troubleshooting potential Mapping Explosions, <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">DevTools</a> may respond more slowly than <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>, and its top-left icon may show loading when no requests are expected, due to the URI</p><pre>GET /api/console/autocomplete_entities?fields=true&amp;indices=true&amp;templates=true&amp;dataStreams=true</pre><p>This is controlled via DevTools &gt; Settings &gt; “Autocomplete” by disabling (at least) <em>Fields</em> and increasing the <em>Refresh Frequency</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eInUC55rQDt89JsTU_PCSA.png" /></figure><p>These requests can bog down the 1) local browser causing page crashes or “<em>wait for page</em>?” banners and 2) Kibana Server depending on frequency and expensiveness. (I believe that) this change is specific to the logged-in user.</p><p>👋 Welcome to the end! I tried to keep it brief but really only needed a public place to link my screenshot for <em>#inspectViaDevTools</em>. 😂</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f6b4fd7df9f" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>