Stories by Stef Nestor on Medium

Elastic Node Hot Threads

Stef Nestor — Wed, 11 Jun 2025 19:16:04 GMT

Troubleshooting with and interpreting Elasticsearch Node Hot Threads.

Node Hot Threads API response

I previously covered investigating Elasticsearch via SRE’s USE Method. From there, I fleshed out Elastic’s Common Issues, which systematically covers the sequential top resource concerns for administering Elastic: disk watermark (not covered here), then CPU usage, then JVM heap, and then task throughput. TL;DR of prior: hardware CPU+JVM resource usage influences software task throughput (and vice versa), but these API outputs can only be correlated, not associated one-to-one.

Today we’ll dive into interpreting Node Hot Threads (old link) as part of investigating any of these three common issues.

Elasticsearch is built on open source Lucene, and both run on a JVM provided by the JDK. The JVM allows for polling Java (not direct CPU) threads and heap dumps. (Side: See Oracle’s GC explanation for why we check CPU/threads first then heap after.) These outputs frequently point to high-volume/expensive code paths as referenced by Elasticsearch loggers (example list, usually prefixed with org.elasticsearch).

API Response

The Elasticsearch service can run multiple Java threads per single hardware CPU thread within an Elastic-defined thread pool or the default transport_worker pool (see previous Elasticsearch Tasks). To enable responsiveness on even the most struggling nodes, Elastic returns a pretty unsophisticated response: just an unsorted snapshot list of stacktraces noting repeat counts. So the API response example and template look like …

# GET _nodes/hot_threads

# example
::: {instance-0000000001}{9fVI1XoXQJCgHwsOPlVEig}{RrJGwEaESRmNs75Gjs1SOg}{instance-0000000001}{10.42.9.84}{10.42.9.84:19058}{himrst}{8.18.2}{7000099-8525000}{region=unknown-region, server_name=instance-0000000001.b84ab96b481f43d791a1a73477a10d40, xpack.installed=true, transform.config_version=10.0.0, ml.config_version=12.0.0, data=hot, logical_availability_zone=zone-1, availability_zone=us-central1-a, instance_configuration=gcp.es.datahot.n2.68x10x45}
   Hot threads at 2025-05-14T17:59:30.199Z, interval=500ms, busiestThreads=10000, ignoreIdleThreads=true:
   
   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
     8/10 snapshots sharing following 29 elements
       com.fasterxml.jackson.dataformat.smile@2.17.2/com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:434)
       org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doAdd(LocalBulk.java:69)
       # ... 
     2/10 snapshots sharing following 37 elements
       app/org.elasticsearch.xcontent@8.16.1/org.elasticsearch.xcontent.support.filtering.FilterPath$FilterPathBuilder.insertNode(FilterPath.java:172)
       # ... 


# template
::: {NAME}{ID}{UNK}{HOST_NAME}{ADDRESS}{UNK}{ROLES}{VERSION}{UNK}{ATTRIBUTES}
   Hot threads at TIMESTAMP, interval=INTERVAL_FROM_API, busiestThreads=THREADS_FROM_API, ignoreIdleThreads=IDLE_FROM_API:
   
   TOTAL_CPU% [cpu=ELASTIC_CPU%, other=OTHER_CPU%] (Xms out of INTERVAL_FROM_API) cpu usage by thread 'THREAD'
     X/SNAPSHOTS_FROM_API snapshots sharing following X elements
       STACKTRACE_SAMPLE
       # ... 
     X/SNAPSHOTS_FROM_API snapshots sharing following X elements
       STACKTRACE_SAMPLE
       # ...

… where most of this output reports the API’s inputs/defaults and then literal stacktrace samples which we’ll ignore here. What we care to note is

the first row of ::: reports the node’s name and roles
the first thread reports the thread info as write thread pool related
CPU shows up three times (unofficial jargon:) as total CPU, Elasticsearch-used CPU, and “other” CPU (for disk/network IO and/or GC)
the logger ended up being org.elasticsearch.xpack.monitoring.exporter.local so someone’s doing legacy local monitoring

This output also confirms that, within the requested time frame, the Java thread may have responded to multiple tasks (reference), so it cannot report the direct task ID (reference). It is still helpful for generally knowing where you spend the majority of your CPU time.
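For those who’d rather not eyeball this, here’s a minimal Python sketch of pulling those noted fields out of the text response. It assumes the row layouts shown in the example and template above (other versions may format rows differently), and the names THREAD_ROW and parse_hot_threads are my own for illustration, not anything Elastic ships. The ::: row in the sample has its attributes trimmed for brevity.

```python
import re

# Matches the per-thread summary row, e.g.
#   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
THREAD_ROW = re.compile(
    r"^\s*(?P<total>[\d.]+)% \[cpu=(?P<cpu>[\d.]+)%, other=(?P<other>[\d.]+)%\] "
    r"\((?P<busy>[\d.]+\w+) out of (?P<interval>[\d.]+\w+)\) "
    r"cpu usage by thread '(?P<thread>.*)'\s*$"
)


def parse_hot_threads(text):
    """Return one dict per thread row, labeled with the node name and roles of its ::: row."""
    rows, node, roles = [], None, None
    for line in text.splitlines():
        if line.startswith(":::"):
            # per the template: {NAME}{ID}{UNK}{HOST_NAME}{ADDRESS}{UNK}{ROLES}...
            groups = re.findall(r"\{([^}]*)\}", line)
            node = groups[0] if groups else None
            roles = groups[6] if len(groups) > 6 else None
            continue
        match = THREAD_ROW.match(line)
        if match:
            rows.append({
                "node": node,
                "roles": roles,
                "thread": match["thread"],
                "total": float(match["total"]),
                "cpu": float(match["cpu"]),
                "other": float(match["other"]),
            })
    return rows


sample = """\
::: {instance-0000000001}{9fVI1XoXQJCgHwsOPlVEig}{RrJGwEaESRmNs75Gjs1SOg}{instance-0000000001}{10.42.9.84}{10.42.9.84:19058}{himrst}{8.18.2}{7000099-8525000}{data=hot}
   Hot threads at 2025-05-14T17:59:30.199Z, interval=500ms, busiestThreads=10000, ignoreIdleThreads=true:

   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
     8/10 snapshots sharing following 29 elements
   100.0% [cpu=0.6%, other=99.4%] (500ms out of 500ms) cpu usage by thread 'oneagentautosensor'
     unique snapshot
"""

parsed_rows = parse_hot_threads(sample)
for row in parsed_rows:
    print(row)
```

Keeping cpu and other as separate numbers is the point: it lets you sort or filter on which of the two is actually driving the total.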

While Troubleshooting

Some commentary on integrating this output while troubleshooting

CPU

If Elastic-used CPU remains at 95%+ (high CPU usage), you expect to see correlating threads (even if they rotate), as CPU can’t be high without active code being run. I find it’s either an expensive task (so it returns on the first poll, see later example) or a spattering of high-intensity but fast tasks (so you may have to poll a couple of times in quick succession to catch them).
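Polling in quick succession can be sketched as below. This is a rough outline, not a tool: fetch is a hypothetical callable you’d supply that returns the text body of GET _nodes/hot_threads using whatever HTTP client and auth you already have, and here it’s replaced by canned rows so the snippet runs offline.

```python
import time
from collections import Counter


def poll_hot_threads(fetch, polls=3, pause_seconds=1.0):
    """Call fetch() several times and count how often each thread name shows up."""
    seen = Counter()
    for attempt in range(polls):
        for line in fetch().splitlines():
            if "cpu usage by thread" in line:
                # the thread name is the single-quoted tail of the summary row
                name = line.split("cpu usage by thread", 1)[1].strip().strip("'")
                seen[name] += 1
        if attempt < polls - 1:
            time.sleep(pause_seconds)
    return seen


# Stand-in for a real HTTP call: replays three made-up responses in order.
canned = iter([
    "   97.1% [cpu=97.1%, other=0.0%] (485.5ms out of 500ms) cpu usage by thread '[search]'",
    "   12.0% [cpu=12.0%, other=0.0%] (60ms out of 500ms) cpu usage by thread '[write]'",
    "   95.3% [cpu=95.3%, other=0.0%] (476.5ms out of 500ms) cpu usage by thread '[search]'",
])
counts = poll_hot_threads(lambda: next(canned), polls=3, pause_seconds=0)
print(counts.most_common())
```

A thread that keeps reappearing across polls is the expensive-task case; names that rotate poll to poll point more toward the fast-but-many case.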

CPU-to-GC

As elevated CPU may trigger GC (if only via backup of JVM hitting 95%), you’ll expect other to be elevated along with Elasticsearch logs reporting multiple garbage collection cycles. Note that as you descend each node’s thread list, the time in other will usually read higher. Elevated JVM heap percent (ignore RAM percent) won’t fall until GC successfully reaps the heap.

If CPU usage is consistently low for both expensive and high-intensity fast tasks but JVM Memory Pressure remains high, this is when you’d suspect an unreaped task (for example, on ≤v8.13 searches were not always reaped) or a potential memory leak. In my experience, I’m usually wrong that CPU symptoms don’t show, but in the edge case where they really don’t, this is when I pull and analyze a JVM heap dump.

CPU-not-GC

If Elasticsearch CPU is low but other and total CPU are high then you’ll be looking for a disk/network IO issue. In my experience, usually this will reflect outside of Elasticsearch as the entire host struggle-bussing an obvious disk/network issue.

The only main exception to obvious disk/network IO that I’m aware of is Dynatrace’s oneagentautosensor, which can eat up all available CPU (usually during performance issues, as it doesn’t always back off polling, which is ironic) and whose thread samples look like …

   100.0% [cpu=0.6%, other=99.4%] (500ms out of 500ms) cpu usage by thread 'oneagentautosensor'
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
   
   100.0% [cpu=0.1%, other=99.9%] (500ms out of 500ms) cpu usage by thread 'oneagentsubpathsender REDACTED'
     # same unique snapshot repeat before
   
   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'oneagentperiodicrequests'
     # same unique snapshot repeat before
   
   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'oneagentallocationprofiling'
     # same unique snapshot repeat before

… at which point your only answer is to disable Dynatrace monitoring until you get the Elasticsearch node stable.

Logger-to-Tasks

For high CPU tasks, it’s sometimes helpful to compare these against the Long-running Node Tasks. The analysis is correlative and cannot be lined up one-to-one, but it is usually quite good at finding expensive code path usage.

The most common built-in one in my experience is org.elasticsearch.search.aggregations.bucket.composite, indicating a Composite Aggregation within a Search task, even though its documentation carries a full “you must seriously load test this” performance disclaimer.

If this logger is flagged, per Long-running Node Tasks you would cross-compare this to current searches …

GET _tasks?human=true&detailed=true&actions=indices:data/read/search

… to find out whose description includes composite in its JSON.
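As a rough Python sketch of that cross-compare, assuming you’ve already parsed the JSON body of the request above into a dict (tasks grouped under nodes, then tasks, with a description per task when detailed=true). The helper name tasks_mentioning and the sample payload are made up for illustration; the sample is hand-written and heavily trimmed, not real output.

```python
def tasks_mentioning(tasks_response, needle="composite"):
    """Return (task_id, running_time_in_nanos, description) for matching tasks, longest-running first."""
    hits = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            description = task.get("description", "")
            if needle in description:
                hits.append((task_id, task.get("running_time_in_nanos", 0), description))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hand-written, trimmed stand-in for a parsed GET _tasks response.
sample = {
    "nodes": {
        "node-a": {
            "tasks": {
                "node-a:101": {
                    "action": "indices:data/read/search",
                    "description": 'indices[logs-*], source[{"aggs":{"by_host":{"composite":{"sources":[]}}}}]',
                    "running_time_in_nanos": 42_000_000_000,
                },
                "node-a:102": {
                    "action": "indices:data/read/search",
                    "description": 'indices[logs-*], source[{"query":{"match_all":{}}}]',
                    "running_time_in_nanos": 3_000_000,
                },
            }
        }
    }
}

composite_hits = tasks_mentioning(sample)
for task_id, nanos, description in composite_hits:
    print(task_id, f"{nanos / 1e9:.1f}s", description)
```

This is a plain substring check, so it stays correlative just like the hot threads themselves: a match tells you which searches to look at first, not which one the thread was serving.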

The most common not-built-in one AFAIK is Runtime custom code (from search or mapping) flagged via logger org.elasticsearch.search.runtime.

Analysis Automations

The following is offered as is with no guarantees and no upkeep. It’s been used against v7.10-v9.0. This is an extraction of my current Python object [^A] to tag common features from a combination substring search across the thread and stacktraces.

If you were inclined to put this into a Streamlit UI (for filtering ease), then, for example, a frozen tier with future dates inducing heavy searches from Kibana Rules would appear like …

… where Frozen nodes are doing 99–100% CPU for searches (and some aggregation searches) while hot nodes are doing far less.

[^A] A wrapping function would say if every substring in strs is found in either the thread or its stacktrace, then tag the thread as relating to said feature. Feature names are close to official Elastic documentation but are kind of just figured out based on need/frequency.

TAG_ANALYSIS = [
    {"tag": "alias", "strs": ["org.elasticsearch.action.admin.indices.alias"]},
    {"tag": "alias", "strs": ["org.elasticsearch.aliases"]},
    {"tag": "alias", "strs": ["org.elasticsearch.cluster.metadata.Metadata.findAliases"], },
    {"tag": "alias", "strs": ["org.elasticsearch.index.alias"]},
    {"tag": "allocation", "strs": ["org.elasticsearch.cluster.routing.allocation.allocator"], },
    {"tag": "allocation", "strs": ["org.elasticsearch.cluster.routing.allocation.decider"], },
    {"tag": "allocation.desired", "strs": ["org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator" ], },
    {"tag": "analysis", "strs": ["org.apache.lucene.analysis"]},
    {"tag": "apm", "strs": ["elastic-apm-server-reporter"]},
    {"tag": "ccr", "strs": ["org.elasticsearch.xpack.ccr"]},
    {"tag": "dlm", "strs": ["org.elasticsearch.action.datastreams.lifecycle"]},
    {"tag": "dlm", "strs": ["org.elasticsearch.datastreams.lifecycle"]},
    {"tag": "downsample", "strs": ["org.elasticsearch.action.downsample"]},
    {"tag": "downsample", "strs": ["org.elasticsearch.xpack.core.downsample"]},
    {"tag": "downsample", "strs": ["org.elasticsearch.xpack.downsample"]},
    {"tag": "enrich", "strs": ["org.elasticsearch.xpack.core.enrich"]},
    {"tag": "enrich", "strs": ["org.elasticsearch.xpack.enrich"]},
    {"tag": "evictor", "strs": ["Connection evictor"]},
    {"tag": "fields", "strs": ["org.elasticsearch.action.fieldcaps"]},
    {"tag": "fields", "strs": ["org.elasticsearch.index.fielddata"]},
    {"tag": "fields", "strs": ["org.elasticsearch.indices.fielddata"]},
    {"tag": "fields", "strs": ["org.elasticsearch.search.fieldcaps"]},
    {"tag": "flush", "strs": ["[flush]"]},
    {"tag": "flush", "strs": ["org.elasticsearch.action.admin.indices.flush"]},
    {"tag": "flush", "strs": ["org.elasticsearch.index.engine.Engine.flush"]},
    {"tag": "flush", "strs": ["org.elasticsearch.index.flush"]},
    {"tag": "flush", "strs": ["org.elasticsearch.index.shard.IndexShard.flush"]},
    {"tag": "flush", "strs": ["org.elasticsearch.indices.flush"]},
    {"tag": "forcemerge", "strs": ["[force_merge]"]},
    {"tag": "forcemerge", "strs": ["org.apache.lucene.index.IndexWriter.forceMerge"]},
    {"tag": "geoip", "strs": ["org.elasticsearch.geoip"]},
    {"tag": "geoip", "strs": ["org.elasticsearch.ingest.geoip"]},
    {"tag": "geoip", "strs": ["org.elasticsearch.xpack.geoip"]},
    {"tag": "get", "strs": ["[get]"]},
    {"tag": "get", "strs": ["org.elasticsearch.index.engine.InternalEngine.get"]},
    {"tag": "get", "strs": ["org.elasticsearch.index.get.ShardGetService.get"]},
    {"tag": "grok", "strs": ["org.elasticsearch.grok"]},
    {"tag": "ilm", "strs": ["org.elasticsearch.indexlifecycle"]},
    {"tag": "ilm", "strs": ["org.elasticsearch.xpack.core.ilm"]},
    {"tag": "ilm", "strs": ["org.elasticsearch.xpack.ilm"]},
    {"tag": "ingest", "strs": ["[write]"]},
    {"tag": "ingest.delete", "strs": ["org.elasticsearch.index.engine.InternalEngine.delete"], },
    {"tag": "ingest.delete", "strs": ["org.elasticsearch.index.shard.IndexShard.applyDeleteOperation"], },
    {"tag": "ingest.mapping", "strs": ["[write]", "org.elasticsearch.index.mapper", "parseCreateField"], },
    {"tag": "ingest.mapping", "strs": ["[write]", "org.elasticsearch.index.mapper.ObjectMapper$Builder.build"], },
    {"tag": "keepalive", "strs": ["keepAlive"]},
    {"tag": "logdb", "strs": ["org.elasticsearch.logsdb"]},
    {"tag": "logdb", "strs": ["org.elasticsearch.xpack.logsdb"]},
    {"tag": "logging", "strs": ["Log4j2"]},
    {"tag": "logging", "strs": ["org.apache.logging"]},
    {"tag": "management", "strs": ["[management]"]},
    {"tag": "merge", "strs": ["Lucene Merge Thread"]},
    {"tag": "merge", "strs": ["org.elasticsearch.action.admin.indices.forcemerge"]},
    {"tag": "merge", "strs": ["org.elasticsearch.index.merge"]},
    {"tag": "ml", "strs": ["ml-cpp"]},
    {"tag": "ml", "strs": ["org.elasticsearch.xpack.core.ml"]},
    {"tag": "ml", "strs": ["org.elasticsearch.xpack.ml"]},
    {"tag": "ml", "strs": ["x-pack-ml"]},
    {"tag": "ml.inference", "strs": ["inference_utility"]},
    {"tag": "ml.inference", "strs": ["org.elasticsearch.inference"]},
    {"tag": "ml.inference", "strs": ["org.elasticsearch.xpack.core.inference"]},
    {"tag": "ml.inference", "strs": ["org.elasticsearch.xpack.core.ml.inference"]},
    {"tag": "ml.inference", "strs": ["org.elasticsearch.xpack.inference"]},
    {"tag": "ml.inference", "strs": ["org.elasticsearch.xpack.ml.inference"]},
    {"tag": "ml.inference", "strs": ["xpack.inference"]},
    {"tag": "monitoring", "strs": ["org.elasticsearch.action.admin.cluster.stats"]},
    {"tag": "monitoring.cluster", "strs": ["org.elasticsearch.action.admin.cluster.stats"], },
    {"tag": "pending_task", "strs": ["clusterApplierService#updateTask"]},
    {"tag": "pending_task", "strs": ["org.elasticsearch.cluster.service.ClusterApplierService.applyChanges" ], },
    {"tag": "pending_task", "strs": ["org.elasticsearch.xpack.ilm.IndexLifecycleTransition.newClusterStateWithLifecycleState" ], },
    {"tag": "pipeline", "strs": ["org.elasticsearch.ingest.CompoundProcessor"]},
    {"tag": "pipeline", "strs": ["org.elasticsearch.ingest.Pipeline"]},
    {"tag": "pipeline.if", "strs": ["org.elasticsearch.ingest.ConditionalProcessor"]},
    {"tag": "pipeline.script", "strs": ["org.elasticsearch.ingest.common.ScriptProcessor"], },
    {"tag": "reaper", "strs": ["process reaper"]},
    {"tag": "refresh", "strs": ["[refresh]"]},
    {"tag": "refresh", "strs": ["org.elasticsearch.action.admin.indices.refresh"]},
    {"tag": "refresh", "strs": ["org.elasticsearch.index.engine.InternalEngine.refresh"], },
    {"tag": "refresh", "strs": ["org.elasticsearch.index.refresh"]},
    {"tag": "reindex", "strs": ["org.elasticsearch.index.reindex"]},
    {"tag": "rollup", "strs": ["org.elasticsearch.xpack.core.rollup"]},
    {"tag": "rollup", "strs": ["org.elasticsearch.xpack.rollup"]},
    {"tag": "runtime", "strs": ["org.elasticsearch.runtimefields"]},
    {"tag": "script", "strs": ["org.elasticsearch.script"]},
    {"tag": "script.mustache", "strs": ["org.elasticsearch.script.mustache"]},
    {"tag": "script.painless", "strs": ["org.elasticsearch.painless.PainlessScript"]},
    {"tag": "script.painless.date", "strs": ["org.elasticsearch.script.DateFieldScript"], },
    {"tag": "script.regex", "strs": ["org.elasticsearch.common.regex"]},
    {"tag": "scroll", "strs": [".scroll"]},
    {"tag": "search", "strs": ["org.elasticsearch.action.search."]},
    {"tag": "search", "strs": ["org.elasticsearch.common.lucene.search"]},
    {"tag": "search", "strs": ["org.elasticsearch.index.query"]},
    {"tag": "search", "strs": ["org.elasticsearch.index.search."]},
    {"tag": "search", "strs": ["org.elasticsearch.query."]},
    {"tag": "search", "strs": ["org.elasticsearch.search"], },
    {"tag": "search", "strs": ["org.elasticsearch.xpack.core.search."]},
    {"tag": "search", "strs": ["org.elasticsearch.xpack.search."]},
    {"tag": "search.agg", "strs": ["org.elasticsearch.aggregations"]},
    {"tag": "search.agg", "strs": ["org.elasticsearch.search.aggregations"]},
    {"tag": "search.agg.composite", "strs": ["org.elasticsearch.search.aggregations.bucket.composite"], },  # https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
    {"tag": "search.agg.nested", "strs": ["org.elasticsearch.search.aggregations.bucket.nested"], },
    {"tag": "search.agg.topHits", "strs": ["org.elasticsearch.search.aggregations.metrics.TopHitsAggregator"], },  # https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
    {"tag": "search.eql", "strs": ["org.elasticsearch.xpack.core.eql"]},
    {"tag": "search.eql", "strs": ["org.elasticsearch.xpack.eql"]},
    {"tag": "search.esql", "strs": ["[esql_worker]"]},
    {"tag": "search.esql", "strs": ["org.elasticsearch.xpack.core.esql"]},
    {"tag": "search.esql", "strs": ["org.elasticsearch.xpack.esql"]},
    {"tag": "search.globalOrdinals", "strs": ["org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build", "org.elasticsearch.search", ], },
    {"tag": "search.kql", "strs": ["org.elasticsearch.xpack.core.kql"]},
    {"tag": "search.kql", "strs": ["org.elasticsearch.xpack.kql"]},
    {"tag": "search.mustache", "strs": ["org.elasticsearch.script.mustache", "org.elasticsearch.search"], },
    {"tag": "search.prefilter", "strs": ["org.elasticsearch.action.search.CanMatchPreFilterSearchPhase"], },
    {"tag": "search.runtime", "strs": ["org.elasticsearch.search.runtime", "org.elasticsearch.search"], },
    {"tag": "search.script", "strs": ["org.elasticsearch.search.aggregations.pipeline.BucketScriptPipelineAggregationBuilder" ], },
    {"tag": "searchable", "strs": ["[searchable_snapshots_cache_fetch_async]"]},
    {"tag": "searchable", "strs": ["org.elasticsearch.xpack.core.searchablesnapshots"]},
    {"tag": "searchable", "strs": ["org.elasticsearch.xpack.searchablesnapshots"]},
    {"tag": "searchable.prewarm", "strs": ["[searchable_snapshots_cache_prewarming]"]},
    {"tag": "shrink", "strs": ["org.elasticsearch.action.admin.indices.shrink"]},
    {"tag": "slm", "strs": ["org.elasticsearch.xpack.core.slm"]},
    {"tag": "slm", "strs": ["org.elasticsearch.xpack.slm"]},
    {"tag": "snapshot", "strs": ["[snapshot]"]},
    {"tag": "snapshot", "strs": ["com.amazonaws"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.action.admin.cluster.repositories"], },
    {"tag": "snapshot", "strs": ["org.elasticsearch.action.admin.cluster.snapshots"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.common.blobstore"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.index.snapshots"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.plugin.repository"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.repositories.blobstore"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.repository.azure"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.snapshots"]},
    {"tag": "snapshot", "strs": ["org.elasticsearch.xpack.repositories"]},
    {"tag": "sql", "strs": ["org.elasticsearch.xpack.core.sql"]},
    {"tag": "sql", "strs": ["org.elasticsearch.xpack.sql"]},
    {"tag": "transform", "strs": ["org.elasticsearch.transform"]},
    {"tag": "transform", "strs": ["org.elasticsearch.xpack.core.transform"]},
    {"tag": "transform", "strs": ["org.elasticsearch.xpack.transform"]},
    {"tag": "translog", "strs": ["org.elasticsearch.index.translog"]},
    {"tag": "transport", "strs": ["transport_worker"]},
    {"tag": "vectors", "strs": ["org.elasticsearch.index.engine.Engine.getSparseVectorValueCount"]},
    {"tag": "vectors", "strs": ["org.elasticsearch.index.engine.Engine.sparseVectorStats"]},
    {"tag": "watcher", "strs": ["org.elasticsearch.xpack.core.watcher"]},
    {"tag": "watcher", "strs": ["org.elasticsearch.xpack.watcher"]}
]
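For completeness, a minimal sketch of what the wrapping function from [^A] could look like. This is my assumption of its shape rather than the extracted original: tag_thread and RULES_SUBSET are illustrative names, the subset just copies four entries from the list above so the snippet runs on its own, and the second stacktrace line is made up.

```python
RULES_SUBSET = [
    {"tag": "ingest", "strs": ["[write]"]},
    {"tag": "ingest.mapping", "strs": ["[write]", "org.elasticsearch.index.mapper", "parseCreateField"]},
    {"tag": "search", "strs": ["org.elasticsearch.search"]},
    {"tag": "search.agg.composite", "strs": ["org.elasticsearch.search.aggregations.bucket.composite"]},
]


def tag_thread(thread, stacktrace, rules):
    """Tag a thread when every substring of a rule appears in its name or its stacktrace."""
    haystack = f"{thread}\n{stacktrace}"
    return sorted({rule["tag"] for rule in rules if all(s in haystack for s in rule["strs"])})


write_tags = tag_thread(
    "[write]",
    "org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doAdd(LocalBulk.java:69)",
    RULES_SUBSET,
)
search_tags = tag_thread(
    "[search]",
    "org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregator.collect(...)",  # made-up frame
    RULES_SUBSET,
)
print(write_tags)   # ['ingest']
print(search_tags)  # ['search', 'search.agg.composite']
```

Tags are collected into a set because several rules share a tag, and a thread can legitimately carry both a broad tag and a more specific one.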

Disclaimer: My understanding is my own and my views do not reflect Elastic’s; while the core information has been verified with Elastic Dev, I recommend always referring to official sources. I am working on integrating the above into official documentation and welcome feedback.

Streamlit + Local LLM + PDFs

Stef Nestor — Mon, 22 Apr 2024 21:07:38 GMT

Building off my earlier outline, this TL;DRs loading PDFs into your (Python) Streamlit with local LLM (Ollama) setup. Another Github-Gist-like post with limited commentary.

This plays forward this Google result and its code from searching “local llm pdfs”. My use case is to load all Apple iCloud iBooks into an “oracle”-GPT for private discussions. A sub-curiosity is to have two GPTs responding as their authors would (potentially across their multiple respective books). The first building block, covered here, is loading PDFs into a local LLM and confirming its PDF-trained results (strictly, retrieval over the PDF rather than retraining the model) are more desirable (aka spot-checked accurate) than the generic model’s.

Results

Personal test caveats

I’ll only load a single, random PDF from my iBook storage: Reinventing Your Life by Jeffrey E. Young & Janet S. Klosko. On Apple Macs, these iCloud PDFs are stored under ~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents . My test runs from ~/Downloads and while I could easily reference the PDF from the iBooks folder instead of my test folder, that’s step two.
I know llama3 came out last week, but so far it hasn’t shown sufficient improvement for me to move off llama2-uncensored and accept the response censoring.

Comparing the generic LLM (🦙) to the PDF-trained LLM (📓), I was able to compare their results to various questions, e.g.

This image shows the generic LLM hallucinating but the PDF-trained LLM correctly identifying the book’s authors. 👏

Code

The following has no expectations/warranties, but it “works on my machine” (though as proof-of-concept, its code is ugly, I agree).

from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyMuPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
import streamlit as st

llm = Ollama(model="llama2-uncensored")

@st.cache_resource
class PdfGpt():
    def __init__(self, file_path):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
        chunks = text_splitter.split_documents(documents=PyMuPDFLoader(file_path=file_path).load())
        
        embedding_model = HuggingFaceEmbeddings(
            model_name="all-MiniLM-L6-v2",
            model_kwargs={'device':'cpu'},
            encode_kwargs = { 'normalize_embeddings': True }
        )
        vectorstore = FAISS.from_documents(chunks, embedding_model)
        vectorstore.save_local("vectorstore")
        
        template = """
        ### System:
        You are a respectful and honest assistant. You have to answer the user's questions using only the context \
        provided to you. If you don't know the answer, just say you don't know. Don't try to make up an answer.

        ### Context:
        {context}

        ### User:
        {question}

        ### Response:
        """

        self.hey = RetrievalQA.from_chain_type(
            llm=llm,
            retriever=vectorstore.as_retriever(),
            chain_type="stuff",
            return_source_documents=True, 
            chain_type_kwargs={'prompt': PromptTemplate.from_template(template) } 
        )

oracle = PdfGpt("reinventing_your_life.pdf") # PDF file name
ask = st.text_input("What's up?", key="ask", label_visibility='hidden')

A,B = st.columns([.05, .95])
C,D = st.columns([.05, .95])
with A:
    st.caption("🦙")
with C:
    st.caption("📓")

if ask not in [None, "", []]:  
    with B:
        st.markdown( llm.predict(ask) )
    with D:
        response = oracle.hey({'query': ask})
        st.markdown( response['result'] )

Say you call this file test.py. After updating the PDF file name reference reinventing_your_life.pdf to your own test PDF, you’d run it (in a test where you’re okay with test data caching) by starting up Streamlit via streamlit run test.py .

👋

USE Method for Elasticsearch

Stef Nestor — Fri, 01 Dec 2023 17:49:55 GMT

Applying the USE Method for troubleshooting down/impaired.

We’re going to outline SRE incident triaging for Elasticsearch via the infamous USE Method, with a metaphorical parallel to medical ABCDE triaging. (If the metaphor doesn’t work for you, kindly ignore. I highly recommend USE Method familiarity for general technical troubleshooting, but it’s not a prerequisite to apply below.)

EUS Method

“USE”, or the trio “Utilization, Saturation, and Errors”, outlines for SREs to sequentially check “EUS” (though “USE” is easier to remember/reference):

Errors (for literal API/log/stat errors)
Utilization (of resources, mainly cpu, heap, disk, network)
Saturation (of queues, e.g. thread pools, tasks)

The USE Method blog calls out that to meaningfully investigate these three, investigators first have to determine which system processes (or metaphorically “life blood”) to track.

Applied

For Elasticsearch, thankfully these are (mostly) explicitly tracked as “tasks” (as we’ve historically covered), but note that tasks can originate from, and correspondingly surface under:

cluster state/discovery: Pending Tasks
plugins (varies but top two): ILM Explain, Allocation Health
external traffic: Node Tasks, Threadpools

From this outline, most performance issues can be discovered from only a handful of APIs: Allocation Health, CAT Pending Tasks, CAT Tasks, ILM Explain, CAT Threadpools, CAT Nodes, CAT Allocation. (If this seems like too many, Elasticsearch Devs agree and have been working on an undocumented Internal Health, not covered here.)

CheatSheets

I wish Medium allowed clickable-PDF-iFrames, but I expect their screenshot versions are at least sufficient breadcrumbs. Therefore, kindly note each underlined term is an Elasticsearch doc (usually the API page returned when you literally search for that string). You can access the PDF version here.

Triage

Fleshed Out

(Non-exhaustive obviously, but most common in my experience.)

Examples

From Elasticsearch diagnostics, which mass poll the above and other API endpoints, we can distill a Python Streamlit UI to sequentially investigate. (I do recommend alerts as the primary triage mechanism instead of manually checking reports, but reports do add needed coloring while fleshing out alerting and for those of us with “trust but verify” trust issues.)

Disclaimers: The following examples were force-derived so won’t fully reflect real triaging scenarios. The screenshots don’t necessarily imply triaging recommendations; they reflect my automation’s current status.

Errors+Saturation

A large cluster tripping over ingest requests arriving faster than it can process them, partially due to ingest Hot Spotting:

Where we earlier covered troubleshooting: Breakers+Pressure, Threadpools.

Utilization

A small cluster hitting disk watermark and historical ingest hot spotting:

Reference

For those who use Elastic’s Elasticsearch diagnostic, here’s a spatial diagram lined up to this discussion (so color-coded to earlier diagrams) of the pulled file categorization to its backing API (as pulled from this code):

(Python) Streamlit + Local LLM

Stef Nestor — Thu, 30 Nov 2023 00:29:47 GMT

Yet-Another-Code-Example for ChatGPT-like localhost LLM

👋 Howdy, y’all. I’m skipping most context/commentary and treating Medium like a Github Gist for this post.

Goal: My friends have been excited by ChatGPT and want to run offline, uncensored models but have been experiencing start-up friction. I want to outline the 5min (including download time) way to get running and the 5min after to get a UI on top. AFAICT this write-up is new to the internet, but only someone’s better google-fu will tell.

1) Install Ollama

Over the last 9 months the internet has been figuring out the preferred way to run LLMs locally: Reddit, top 5 blog, LangChain. Dealer’s choice, but we’re just going to go with Ollama to get llama2-uncensored (meaning it won’t say “I shouldn’t tell you that”, lol, and it will also emit the swear words nobody should say). So: Mac download link and then in Terminal initialize models

$ ollama run llama2 # default
$ ollama run llama2-uncensored # 👈 stef default
$ ollama list
NAME                     ID           SIZE   MODIFIED
llama2:latest            a808fc133004 3.8 GB 3 months ago
llama2-uncensored:latest 5823fb1154c5 3.8 GB 3 months ago

That’s it, that’s your command to run ChatGPT-like LLMs locally. (LLMs have various training data and therefore you’ll notice OpenAI’s is still currently shinier than what you can run locally, but let’s run both to vote for open source and open internet.)

2) Streamlit UI

Using LangChain, there are two kinds of AI interfaces you could set up (doc; related: Streamlit Chatbot tutorial) on top of your running Ollama. First install the Python libraries:

$ pip install langchain duckduckgo-search streamlit

2A) Ask Local Only

For company-private data, you can set up a UI which only uses the local LLM …

import streamlit as st 
from langchain.llms import Ollama
llm = Ollama(model="llama2-uncensored:latest") # 👈 stef default

colA, colB = st.columns([.90, .10])
with colA:
    prompt = st.text_input("prompt", value="", key="prompt")
response = ""
with colB:
    st.markdown("")
    st.markdown("")
    if st.button("🙋‍♀️", key="button"):
        response = llm.predict(prompt)
st.markdown(response)

2B) Search the Internet and Answer

… But if you’re allowed to use your data/question’s context to search the internet, you can have your LLM Google/DuckDuckGo (example with DDG) …

import streamlit as st
from langchain.llms import Ollama
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks import StreamlitCallbackHandler
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = Ollama(
    model="llama2-uncensored:latest", 
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)
tools = load_tools(["ddg-search"])
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True, handle_parsing_errors=True
)

if prompt := st.chat_input():
    st.chat_message("user").write(prompt)
    with st.chat_message("assistant"):
        st_callback = StreamlitCallbackHandler(st.container())
        response = agent.run(prompt, callbacks=[st_callback])
        # BUG 2023Nov05 can spiral Q&A: https://github.com/langchain-ai/langchain/issues/12892
        # to get out, refresh browser page
        st.write(response)

2A+B) Combined

… And putting those together into just one UI (not pretty but done) …

import streamlit as st
from langchain.llms import Ollama
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks import StreamlitCallbackHandler
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler

search_internet = st.checkbox("check internet?", value=False, key="internet")
prompt = st.text_input("prompt", value="", key="prompt")

if prompt!="":
    response = ""
    if not search_internet:
        llm = Ollama(model="llama2-uncensored:latest") # 👈 stef default
        response = llm.predict(prompt)
    else:
        llm = Ollama(
            model="llama2-uncensored:latest", 
            callback_manager=CallbackManager([FinalStreamingStdOutCallbackHandler()])
        )
        agent = initialize_agent(
            load_tools(["ddg-search"])
            ,llm 
            ,agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
            ,verbose=True
            ,handle_parsing_errors=True
        )
        response = agent.run(prompt, callbacks=[StreamlitCallbackHandler(st.container())])
        # BUG 2023Nov05 can spiral Q&A: https://github.com/langchain-ai/langchain/issues/12892
        # to get out, refresh browser page
        
    st.markdown(response)

Examples

To run these code snippets saved as home.py, run the following in that folder’s Terminal …

$ streamlit run home.py

… which will auto-open the browser UI for you. Now you’re ready to start googling Prompt Engineering to get answers formatted how you’d like …

Lastly, I’m unwilling to call it better (that’s probably personality in play), but the above can be easily ported back into the Streamlit Chatbot type of fancy UI. I personally want customer data/email summarization, which doesn’t need this level of UI, but here’s the shiny:

Elasticsearch Ingest Rejections

Stef Nestor — Fri, 24 Nov 2023 23:18:38 GMT

Protections inducing HTTP 429 rejections and common resolutions during Elasticsearch ingest.

For Elasticsearch to protect its JVM heap resources during ingest task execution, its Dev team has coded three layers of protection that if tripped will induce HTTP 429 errors: A) Circuit Breakers, B) Thread Pools, and C) Indexing Pressure.

Protections

A) Circuit Breakers

Circuit Breakers protect the JVM from OutOfMemoryError across various operation types and will induce API response body errors circuit_breaking_exception and log errors CircuitBreakingException and Data too large.

The most frequent breakers during ingest are [ parent , in_flight_requests , request ]. In my experience, Circuit Breaker errors are usually more “straw that broke the camel’s back” quantity-related than latest-request quality-related. From the API error or logs, you can check whether the latest request was merely “the final straw” via the new bytes reserved section.

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [1045167624/996.7mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1045165504/996.7mb], new bytes reserved: [2120/2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=10631254/10.1mb, model_inference=0/0b, eql_sequence=0/0b, accounting=35724020/34mb]

Also, where the parent breaker trips, you can check its child statistics for top-offender child breakers. For the example just above, the stand-outs are [ in_flight_requests , accounting ] where the former is related to literal HTTP/API bytes and the latter is related to Lucene shard overhead.

This specific log was root-caused to Bulk API task backups, which also surfaced under section (B) below. See also Elastic’s troubleshooting doc.

To check your cluster for tripped Circuit Breakers (noting Node Stats counts are cumulative since node start-up), you can run (using JQ for JSON-parsing):

> GET _nodes/stats?human=true&filter_path=nodes.*.breakers

# filtered tripped at least once
$ cat nodes_stats.json | jq -c '.nodes[]|.name as $node|.breakers|to_entries[]|{node:$node, circuitBreaker:.key, tripped_count:.value.tripped}|select(.tripped_count>0)'
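For folks who’d rather stay in Python than JQ, here’s a minimal sketch of the same filter. It assumes the Node Stats response was saved unfiltered (so each node’s name is present); the function name tripped_breakers and the sample payload are made up for illustration.

```python
import json


def tripped_breakers(nodes_stats: dict) -> list[dict]:
    """Return one row per (node, breaker) pair that has tripped at least once."""
    rows = []
    for node in nodes_stats.get("nodes", {}).values():
        for breaker, stats in node.get("breakers", {}).items():
            if stats.get("tripped", 0) > 0:
                rows.append({
                    "node": node.get("name"),
                    "circuitBreaker": breaker,
                    "tripped_count": stats["tripped"],
                })
    return rows


# hypothetical, trimmed-down payload shaped like a Node Stats response
sample = {"nodes": {"abc123": {"name": "instance-0000000001", "breakers": {
    "parent": {"tripped": 3},
    "request": {"tripped": 0},
}}}}

for row in tripped_breakers(sample):
    print(json.dumps(row))
```

Against a real export you’d swap the sample for something like json.load(open("nodes_stats.json")).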

If you ever end up with too much time on your hands, like me, you might create a UI (via Python Streamlit) to calculate hourly tripped breakers and allow quick filtering in/out (in this case elected master was parent circuit breaking):

B) Thread Pools

Thread Pools allow Elasticsearch to allocate memory consumption and queue tasks across topics. (Elasticsearch is just re-using the industry thread pool term.) The pool most associated with ingest is write; however, a) ingest pipeline asynchronous processes may induce tasks under other thread pools and b) writes to system indices pool under [ system_write , system_critical_write ], but since users don’t write to these they rarely come up.

Each thread pool has its own queue and processing limits which, if surpassed, will induce EsRejectedExecutionException with either QueueResizingEsThreadPoolExecutor or queue capacity in its API error / log. Maxing queues is most commonly associated with Hot Spotting or Circuit Breakers (section (A) above). (I wrote this Elastic-official Hot Spotting doc, with Elasticsearch Dev sign-off, so do highly recommend it and am always open to feedback to improve it further.)

You can inspect current write Thread Pool queues via CAT Threadpool or, as a CAT API alternative, via Node Stats (again using JQ for JSON-parsing):

> GET _cat/thread_pool/write?v=true&s=n,nn&h=n,nn,q,a,r,c
> GET _nodes/stats?human=true&filter_path=nodes.*.thread_pool.write

# filtered rejected at least once
$ cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.thread_pool.write|{name:$n, queue: .queue, active:.active, completed:.completed, rejected:.rejected}|select(.rejected>0)'
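Since these Node Stats counters are cumulative, a single snapshot can’t tell you whether rejections are current or historical. One rough way to tell is to diff two snapshots taken some time apart, in the spirit of the hourly figures mentioned in this post. A minimal sketch, assuming two unfiltered Node Stats responses; write_pool_deltas and the sample numbers are hypothetical:

```python
def write_pool_deltas(before: dict, after: dict) -> dict:
    """Per node, how much `completed` and `rejected` grew between two snapshots."""
    deltas = {}
    for node_id, node in after.get("nodes", {}).items():
        prev = before.get("nodes", {}).get(node_id)
        if prev is None:
            continue  # node was not present in the earlier snapshot
        now_w = node["thread_pool"]["write"]
        prev_w = prev["thread_pool"]["write"]
        deltas[node["name"]] = {
            "completed": now_w["completed"] - prev_w["completed"],
            "rejected": now_w["rejected"] - prev_w["rejected"],
        }
    return deltas


# made-up snapshots: rejections exist historically but did not grow
snap_1 = {"nodes": {"n1": {"name": "instance-0000000001",
                           "thread_pool": {"write": {"completed": 1000, "rejected": 5}}}}}
snap_2 = {"nodes": {"n1": {"name": "instance-0000000001",
                           "thread_pool": {"write": {"completed": 4000, "rejected": 5}}}}}

print(write_pool_deltas(snap_1, snap_2))
```

A negative delta would suggest the node restarted (resetting its counters) between the two snapshots.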

Continuing the UI mockup conversation above, this was a recent output I produced which ended up showing historical but not current write hot spotting (you’ll note the third data node has a minimal hourly_completed ):

C) Indexing Pressure

Indexing Pressure was introduced in v7.9 but still seems to be the least understood Elasticsearch feature (though fairly well documented). This allows Elasticsearch to protect data integrity during write operations (e.g. indexing, shard recoveries, CCR) by reserving heap during [ coordinating, primary, replica ] write phases per write operation.

Here’s a quick diagram of the write phases per Elastic’s write model (2019):

Elasticsearch’s write model

(Side note: I like this Medium article by Luiz Sena about an alternative perspective on Elasticsearch’s write model.)

Surpassing its limits induces EsRejectedExecutionException errors mentioning coordinating_and_primary_bytes. In my experience, this usually surfaces under section (B) circumstances above and it’s a coin-toss whether (A) or this flags first. If this flags when there’s no evidence for (A) or (B), Elastic historically recommends reducing the bulk max size for ingest (which’d allow the Thread Pool to queue and handle throughput on its layer instead, as the preferred queue mechanism).

To inspect current and/or historical usage via Node Stats (again using JQ for JSON-parsing):

> GET _nodes/stats?human=true&filter_path=nodes.*.indexing_pressure

# manually check against limit, noting replica max is 1.5*limit
$ cat nodes_stats.json | jq  -c '.nodes[]|select(.thread_pool.write.queue>0)|{node:.name, limit:.indexing_pressure.memory.limit, all:.indexing_pressure.memory.current.all, c_and_p:.indexing_pressure.memory.current.combined_coordinating_and_primary, c:.indexing_pressure.memory.current.coordinating, p:.indexing_pressure.memory.current.primary, r:.indexing_pressure.memory.current.replica }'
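To skip the “manually check against limit” step, here’s a minimal Python sketch which turns the current bytes into a percent of the limit, applying the 1.5*limit replica note from the comment above. It assumes an unfiltered Node Stats response carrying the *_in_bytes fields; the function name and sample values are made up.

```python
def indexing_pressure_usage(nodes_stats: dict) -> list[dict]:
    """Express current indexing pressure as a percent of each node's limit."""
    rows = []
    for node in nodes_stats.get("nodes", {}).values():
        memory = node["indexing_pressure"]["memory"]
        limit = memory["limit_in_bytes"]
        if not limit:
            continue  # avoid dividing by zero on an unexpected payload
        current = memory["current"]
        rows.append({
            "node": node.get("name"),
            "c_and_p_pct": round(
                100 * current["combined_coordinating_and_primary_in_bytes"] / limit, 1),
            # replica max is 1.5 * limit
            "replica_pct": round(100 * current["replica_in_bytes"] / (1.5 * limit), 1),
        })
    return rows


# hypothetical payload with tiny round numbers to keep the math obvious
sample_pressure = {"nodes": {"n1": {"name": "instance-0000000001", "indexing_pressure": {
    "memory": {"limit_in_bytes": 100, "current": {
        "combined_coordinating_and_primary_in_bytes": 50,
        "replica_in_bytes": 30,
    }}}}}}

print(indexing_pressure_usage(sample_pressure))
```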

Root Cause

Assuming one of these three flagged, to resolve we need to first make sure (copied-forward from (B) above) that we’re not Hot Spotting (which means software’s unevenly using available hardware resources). I can’t stress this enough, as it’s the most common reason for issues in a cluster whose hardware-vs-software was previously right-sized.

After, there’s various Elastic-official and unofficial online blogs which circle the same ballpark of actions covered in Elasticsearch’s troubleshooting docs: EBay, DataDogHQ, various Opster articles.

Settings

Where I usually end up recommending folks consider starting within Elasticsearch settings …

(temporarily) undo any ingest/recovery cluster setting overrides (e.g. I’m looking at everybody who leaves cluster.routing.allocation.node_concurrent_recoveries (doc) overridden and then one day it blows up in their face)
increase target index(/ices) refresh_interval (doc) which defaults to 1s; even 5s is fairly unnoticeable to humans but helps the database a lot
right-size number_of_shards (aka. primaries, doc) or target as multiple of (applicable) nodes

… then client-side you’ll look to a) verify you have upstream queue’ing and b) have right-sized its workers and bulk sizes to Elasticsearch. For Elastic’s products, the most common related docs for (b) are:

Elastic Agents’ tuning settings under Fleet Settings UI
Logstash pipeline.batch.size and workers
Filebeat output to Elasticsearch via worker, bulk_max_size

Context

The back-end on “why these settings?” relates to how Lucene’s (which Elasticsearch sits on top of) performance is weighed down more heavily by its internal merging task than by its ingesting task. See the infamous 2011 video/blog for context:

https://medium.com/media/9ccb827c618be083fd176d63ef99e118/href

The takeaway is Lucene/Elasticsearch prefers large initial segment sizes to reduce overall segment merging needs and admins can encourage that via the settings outlined earlier. This is also the root cause on why Elasticsearch emphatically recommends Bulk ingest rather than Doc ingest.
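To make the Bulk-vs-Doc point concrete, here’s a minimal sketch of packing many docs into one Bulk API request body (newline-delimited JSON: an action line followed by the doc’s source line, with a trailing newline). The index name and docs are placeholders and to_bulk_body is a hypothetical helper, not something from Elastic’s clients.

```python
import json


def to_bulk_body(index: str, docs: list[dict]) -> str:
    """Build an NDJSON body for the Bulk API from a list of documents."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # Bulk bodies must end with a newline


body = to_bulk_body("my-index", [{"message": "a"}, {"message": "b"}])
print(body, end="")
```

One request carrying many docs gives Lucene a better chance at larger initial segments than the same docs sent one request at a time.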

(Y’all probably don’t need this last graphic, but I use it a lot in my conversations and want somewhere public internet to paste it. Realizing now I went right-to-left when western media usually does left-to-right, sorry.) As a final graphic on how Lucene merges to its eventual happiest segment size of 5GB, the majority win is the initial segment sizing which admins majorly encourage via refresh_interval and bulk sizing type settings:

👋

Streamlit + iTerm2 (Python)

Stef Nestor — Tue, 26 Sep 2023 18:19:45 GMT

How-to spin-up iTerm2 session from Python3 Streamlit library’s UI

Hello! This one’ll be short, but documenting this automation building-block to show my team.

Use Case: I use Salesforce’s sfdx CLI to pull Case feeds to my local disk. Annoyingly, it doesn’t have a Python library, so I surfaced its data into a Streamlit UI via Python’s subprocess. Separately, I’d written a Bash automation to minimize Elastic’s Stack Diagnostics polling impact. Naturally, at some point my Alfred automation connecting these two became burdensome to version and share (since it costs $). So I turned towards iTerm2 (my default Mac terminal) for a better, sharable, and free automation.

Write-up

Technical Context

iTerm2 allows Python script automations. You can script against iTerm from Python via pip’s iterm2 library after enabling it. There’s some pretty good beginner examples. (See also Github code for more examples.)

It appears the iterm2 library is restricted in that it can kick off requests but not hear the response like subprocess would be able to do. (Workaround examples: this, this.)

I’m going to skip outlining troubleshooting gotchas and just mention:

Python opening iTerm needs to be done as an asynchronous task via asyncio and requires working around this Streamlit bug.
We’re automating via iterm2 and not subprocess because iTerm loads your Mac’s .bash_profile (which is common entry-point for loading git-versioned Dotfiles) so we don’t have to recreate Bash functions we’re already manually running in iTerm again in our Python code.

Design

I’m going to outline the automation’s MVP since it looks like there’s not that much Google content previously written in this ballpark. The design flow we’d hope for is:

Open iTerm and start streamlit UI streamlit run home.py
In Streamlit’s UI (default localhost:8501 ) have a button to open new iTerm tab
Once new iTerm tab is open, change directory to the Salesforce Case ID and start running Elastic’s diagnostic via pre-built Bash automation

Steps 1 and 2 will be done in the Python code; step 3 will trigger from Python but run off the Bash/Dotfiles’ code. (The Bash/Dotfiles’ code will be explained but not outlined here.)

Code

So we’ll write the Python code under home.py which can be run via python3 home.py (just in iTerm) and/or streamlit run home.py (displays UI).

(Note: I’ll leave a comment-block in the code below where iterm2 code works unless streamlit code is running to highlight where you may need to pivot from online examples when building out your own automations. Bug results explained at bottom.)

import asyncio
import sys  # needed for sys.exit below

import iterm2
import streamlit as st

### streamlit bug: https://github.com/streamlit/streamlit/issues/744#issuecomment-1491780114
def get_or_create_eventloop():
    try:
        return asyncio.get_event_loop()
    except RuntimeError as ex:
        if "There is no current event loop in thread" in str(ex):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            return asyncio.get_event_loop()
asyncio.set_event_loop(get_or_create_eventloop())
###

async def async_iTerm(connection):
    app = await iterm2.async_get_app(connection)
    window = app.current_window
    if window is None:
        sys.exit("👻 No current iTerm window")
    
    ### BLOCKING ERROR:: websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received
    ## does not work when ran via Streamlit
    # tab = await window.async_create_tab(profile="🥷", command=f"/bin/bash goto {number}")
    ###

    # my iTerm profile is called "🥷"
    tab = await window.async_create_tab(profile="🥷")
    await tab.async_set_title(number)
    
    session = app.current_terminal_window.current_tab.current_session

    # 1. this kicks off Bash commands where "goto" and "diagme" are my custom Dotfile functions
    # 2. adding "\n" on the end submits the command in iTerm so it also executes rather than just populating the text
    await session.async_send_text('echo hello\n')
    await session.async_send_text(f'goto {number}\n')
    await session.async_send_text(f'diagme\n')
    print("👋")

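def open_case_iterm(number):
    # NOTE: iterm2.run_until_complete(coro, retry=False, debug=False) does not forward
    # extra arguments to the coroutine; a second positional value is read as `retry`.
    # async_iTerm therefore reads the module-level `number` set below.
    iterm2.run_until_complete(async_iTerm, retry=True)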

# ---
# usually, above code would be under a controller and below under a view of Python MVC model code
# ---

# example Salesforce Case Number ID, would be set by user or dynamically in non-MVP code
number = "01486356"

if st.button("Open iTerm", key="iterm"):
    open_case_iterm(number)

Demo

This MVP is quite minimal on use-case details but proves sufficient technical viability for us to consider it a working automation building block. We’ll start streamlit …

… which automatically opens its UI showing our button. Once we click our “Open iTerm” button …

… iTerm will open a new tab, run echo hello , run my change-directory Dotfile automation goto, and finish by starting my diagnostic automation diagme …

MVP bug: The original iTerm tab may end up reporting connection errors …

Task exception was never retrieved
future:  exception=ConnectionClosedError(None, Close(code=1000, reason=''), None)>
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 959, in transfer_data
    message = await self.read_message()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1029, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1104, in read_data_frame
    frame = await self.read_frame(max_size)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1161, in read_frame
    frame = await Frame.read(
            ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/framing.py", line 68, in read
    data = await reader(2)
           ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.5/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/streams.py", line 731, in readexactly
    raise exceptions.IncompleteReadError(incomplete, n)
asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of 2 expected bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/connection.py", line 309, in _async_dispatch_to_helper
    if await helper(self, message):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/notifications.py", line 550, in _async_dispatch_helper
    await handler(connection, sub_notification)
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/app.py", line 380, in _async_focus_change
    await self.async_refresh()
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/app.py", line 274, in async_refresh
    layout = await iterm2.rpc.async_list_sessions(self.connection)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/rpc.py", line 33, in async_list_sessions
    return await _async_call(connection, request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/rpc.py", line 884, in _async_call
    await connection.async_send_message(request)
  File "/opt/homebrew/lib/python3.11/site-packages/iterm2/connection.py", line 254, in async_send_message
    await self.websocket.send(message.SerializeToString())
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 635, in send
    await self.ensure_open()
  File "/opt/homebrew/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 935, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received
👋

… which appear to be non-blocking when commands are sent via session.async_send_text instead of under the window.async_create_tab command so ignoring for MVP purposes …

🦖 happy coding

Elasticsearch Data Health

Stef Nestor — Mon, 04 Sep 2023 21:56:32 GMT

JQ commands to troubleshoot Cluster Health yellow/red.

Theory

Elasticsearch reports a single status under Cluster Health to represent the roll-up of the data’s health via its shards’ status and indices’ health . Elastic Cloud elaborates this summary for a Deployment under its Health menu:

Elastic Cloud > Deployment > Health

The Elastic Cloud UI correlates warnings/errors to Elasticsearch’s yellow/red definitions:

Health status of the cluster, based on the state of its primary and replica shards. Statuses are:

green : All shards are assigned.

yellow : All primary shards are assigned, but one or more replica shards are unassigned. If a node in the cluster fails, some data could be unavailable until that node is repaired.

red : One or more primary shards are unassigned, so some data is unavailable. This can occur briefly during cluster startup as primary shards are assigned.

Elasticsearch docs then state you next introspect why the shard(s) aren’t allocating via Allocation Explain and from there data recovery becomes situationally unique and potentially on a shard-by-shard basis. (Most common situations are outlined here and here.)

After finding Mincong Huang’s Shard Allocation Deciders article, I became intrigued to figure out a way to pull an aggregated view of problematic shards with their causes/solutions.

This Allocation summary has been an ongoing discussion, example here and here. Since I haven’t distilled it as fast as I projected I might, I wanted to share the stop-gaps I’ve learned so far to deepen insight and outline my investigation flow to speed up recovery.

Summarize

Since Cluster Health is, at a low level, effectively just reporting status:UNASSIGNED shards (which isn’t 100% true if shards are recovering from snapshot, but leaving that aside), my first question was where to best pull this data. Hence my ES CAT Alternatives investigation. From this I learned there’s a “routing table” inside the Cluster State:

> GET _cluster/state/routing_table?filter_path=routing_table.indices.*.shards.*.unassigned_info

This actually has more (though still brief) data than CAT Shards (e.g. the time the shard became unassigned). Once you have this Cluster State stored locally as cluster_state.json , you can use a third-party tool JQ (or your favorite tool) to JSON-parse this response into meaningful aggregations. The two I prefer to summarize are:

Cause

This is the system-stored reason for the shard becoming unassigned

$ cat cluster_state.json | jq -rc '.routing_table.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[] as $v|select($v.state=="UNASSIGNED")|[$v.unassigned_info.reason]|@tsv' | sort | uniq -c | sort -rn | head

This outputs a table of causes and shard counts for that bucket. Example causes:

NODE_LEFT : Most common in my experience by far. This is when a node ungracefully leaves the cluster (e.g. either from network issues or resource overwhelm).
CLUSTER_RECOVERED : In my experience, this is a second-step to the bullet above when only replicas remain to allocate.
PRIMARY_FAILED : In my experience, this happens when you have disk corruption. It is very uncommon, but can happen e.g. when you have security scanners running against Elasticsearch’s path.data .
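The same cause-and-count table can be pulled in Python. This is a minimal sketch which assumes cluster_state.json holds the routing table with each shard copy’s state present (i.e. not trimmed down to only unassigned_info); unassigned_reasons and the sample data are made up for illustration.

```python
from collections import Counter


def unassigned_reasons(cluster_state: dict) -> Counter:
    """Count UNASSIGNED shard copies per unassigned_info.reason."""
    counts = Counter()
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    for index in indices.values():
        for copies in index.get("shards", {}).values():
            for shard_copy in copies:
                if shard_copy.get("state") == "UNASSIGNED":
                    reason = shard_copy.get("unassigned_info", {}).get("reason")
                    counts[reason] += 1
    return counts


# hypothetical routing table: one index, two shards, mixed states
sample_state = {"routing_table": {"indices": {"my-index": {"shards": {
    "0": [
        {"state": "STARTED", "primary": True},
        {"state": "UNASSIGNED", "primary": False,
         "unassigned_info": {"reason": "NODE_LEFT"}},
    ],
    "1": [
        {"state": "UNASSIGNED", "primary": True,
         "unassigned_info": {"reason": "NODE_LEFT"}},
        {"state": "UNASSIGNED", "primary": False,
         "unassigned_info": {"reason": "CLUSTER_RECOVERED"}},
    ],
}}}}}

for reason, count in unassigned_reasons(sample_state).most_common():
    print(count, reason)
```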

Knowing the cause is usually most of the battle towards determining the resolution. At the very least, it’s helpful to know how many resolution paths you’ll need to work through to recover all data (rather than how Elasticsearch recovery has historically, and unnecessarily, been phrased as “resolve x and see if there’s still an x+1”).

Status

Allocation “status” is as close as we’ll get towards a database recommended recovery path, but it’s still fairly informative. There’s three data-points (two more from the one above) from Allocation Explain which are kept in the Cluster State and not just dynamically determined when the API is executed.

$ cat cluster_state.json | jq -rc '.routing_table.indices|to_entries[]|.key as $i|.value.shards|to_entries[]|.key as $s|.value[] as $v|select($v.state=="UNASSIGNED")|[$v.primary, $s, $i, $v.unassigned_info.allocation_status, $v.unassigned_info.reason, $v.recovery_source.type]|@tsv' | sort -r | column -t | head

Please note, there’s no official documentation that I can find on interpreting these, so these are my working definitions:

allocation_status : designates if a shard has attempted recovery or not and what was determined. E.g. no_attempt indicates you should hit Cluster Reroute with retry_failed=true, after which it either resolves or updates to something else. Alternatively, no_valid_shard_copy indicates data corruption requiring re-ingesting or restoring from snapshot.
reason : This is the “cause” described above, hence why I was saying it was “most of the battle” on understanding recovery. This is also how I determine root-cause analysis (RCA) for my clusters.
recovery_source : designates where the correct data is expected to be. E.g. PEER is when a replica copies off a primary. E.g. SNAPSHOT is when a Searchable Snapshot or restore from snapshot is processing. E.g. EXISTING_STORE corresponds to no_valid_shard_copy frequently.

Automate

At this point, you’ll now have a high-level overview of all status:UNASSIGNED shards in your cluster. In my personal setup, I have Python automations to react to different data points. The following examples for the mentioned data points may potentially apply to your use case as well, but should be analyzed first to guarantee cluster integrity:

Receiving no_attempt kicks off Cluster Reroute with retry_failed=true as commented above. This is quite safe.
As long as CAT Nodes reports healthy cpu and heap , then for CLUSTER_RECOVERED , THROTTLED , or PEER I’ll temporarily override Cluster Settings for recovery rates: cluster.routing.allocation.node_concurrent_recoveries (doc), indices.recovery.max_bytes_per_sec (doc).
I wrote an automation which does a full investigation and recovery for when it finds no_valid_shard_copy . Thankfully it’s infrequent, but it’s always a heart wrenching notification to receive.
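As a rough illustration of how those three reactions could be routed (this is not my actual automation, just a sketch), the function below maps one unassigned shard’s data points to a suggested next step and makes no API calls itself. The function name, the returned strings, and the case-insensitive “throttled” match are assumptions for illustration.

```python
def suggest_action(allocation_status: str, reason: str, recovery_source: str) -> str:
    """Map one unassigned shard's data points to a suggested next step."""
    status = (allocation_status or "").lower()
    if status == "no_attempt":
        # considered quite safe per the list above
        return "POST _cluster/reroute?retry_failed=true"
    if status == "no_valid_shard_copy":
        return "investigate: re-ingest or restore from snapshot"
    if "throttled" in status or reason == "CLUSTER_RECOVERED" or recovery_source == "PEER":
        return "if CAT Nodes cpu/heap look healthy, temporarily raise recovery settings"
    return "manual: check Allocation Explain"


# made-up rows shaped like (allocation_status, reason, recovery_source)
for row in [
    ("no_attempt", "NODE_LEFT", "EXISTING_STORE"),
    ("no_valid_shard_copy", "NODE_LEFT", "EXISTING_STORE"),
    ("throttled", "CLUSTER_RECOVERED", "PEER"),
]:
    print(row, "->", suggest_action(*row))
```

Keeping the decision as a pure function makes it easy to review (and unit test) before wiring it to anything that changes the cluster.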

Manual

Allocation issues come in wide and varied flavors. The handful of most frequent are +80% of the problems I work on, so automations are meaningful time investments which maintain your uptime SLO’s during database outages. However, at some point you will encounter the other -20% and need to manually recover.

At this point, Elasticsearch’s guide appropriately tells you to load Allocation Explain to execute the live report of why the shard won’t currently allocate. But then the training wheels come off and you’re left to determine which deciders matter to you and how to interpret how to bypass them.

I’m not going to have a case-by-case answer, but let me outline the files I usually cross-compare when it’s a software-based issue:

Cluster Settings : Mostly *.routing.allocation.* . Comes up when settings are literally called out in the output or when no shards, or no shards of a type (e.g. replicas), can allocate. Also, the most frequent settings-based errors, too many shards or maximum shards open, relate to cluster.max_shards_per_node .
Index Settings : Specifically .routing.allocation to determine any conflicting settings. Usually will be ILM’s .include._tier_preference conflicting with not fully deprecated Node Attribute settings. Which brings up …
ILM Policies and ILM Explain : To understand where the index thinks it wants to be.
CAT Recovery : When flagged active_only=true this lets you watch the shard’s data recovery pace. (For those like me who trust but default verify.)

I still think there’s a lot of unorganized low-hanging fruit which’ll allow this conversation to standardize and automate further; but for now, this is the stop-gap I’ve used to build my automations.

Logstash Copy Elasticsearch Doc ID

Stef Nestor — Fri, 14 Apr 2023 22:03:55 GMT

TLDR on copying the Index Document’s _id into another field in Logstash.

When copying data between Elasticsearch clusters, sometimes we’ll want to copy-save the original _id for long-term referencing. As a quick Logstash export example from an Elastic Cloud Elasticsearch cluster, we can set up a test pipeline

$ pwd
~/downloads

$ cat test.conf
input {
  elasticsearch {
    cloud_id => "REDACTED"
    cloud_auth => "elastic:changeme"
    index => ".internal.alerts-security.alerts-default-000001"
    size => 1
    docinfo => true
    docinfo_target => "[@metadata]"
  }
}

filter {
  mutate {
    add_field => { "doc_id" =>  "%{[@metadata][_id]}" }
  }

  prune {
    whitelist_names => [ "doc_id" ]
  }
}

output {
  stdout {
    codec => rubydebug { metadata => true }
  }
}

Where Mutate is the core Logstash Filter we’re going for and Prune just simplifies our example. I’ll highlight, we needed to enable Logstash Elasticsearch Input’s docinfo and docinfo_target to make this work.

Once set up, we can use Docker to run our example pipeline

$ docker run --name logstash --rm -it -p 5044:5044 -p 9600:9600 -v ~/downloads/:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:8.6.2

Which will emit, for our test, the original _id and our copied doc_id field.

{ 
  "@metadata" => {
    "_id" => "fb68ad73cd755f8f8d7e638cd77acde1e03ab057581a479508987c661ad1de69",
    "_index" => ".internal.alerts-security.alerts-default-000001",
    "_type" => nil 
  },
  "doc_id" => "fb68ad73cd755f8f8d7e638cd77acde1e03ab057581a479508987c661ad1de69"
}

🎉

Snippet: Air gap Elastic

Stef Nestor — Mon, 03 Apr 2023 15:47:10 GMT

Summary of setup considerations for air-gapped environments.

Need

Some organizations require their computer network to be air-gapped. Elastic’s core products can be easily downloaded (via direct or docker) and transferred to satisfy this requirement. After, users still sometimes encounter start-up errors due to misconfigured sub-features still attempting to reach out across the internet for supplementary data, especially sub-domains to elastic.co.

Response

The Elastic ecosystem will not randomly reach out to [epr,artifacts].elastic.co but may when e.g. installing ECE, Elasticsearch’s GeoIP (same for Logstash), installing/upgrading Agents or Fleet setups. (As applicable to your use case, linked docs show how to pivot for air-gapped environments.) For exhaustiveness, other Elastic endpoints/IPs your setup could attempt would relate to Maps or Telemetry. Not Prebuilt Rules. Potentially surfacing from third-party dependencies, non-Elastic endpoints would be Logstash plugin installs (to rubygems.org).

Diagnose Kibana Discover

Stef Nestor — Sat, 21 Jan 2023 19:32:21 GMT

Now an official Elastic blog (with some Dev corrections).

Troubleshooting guide for the Elastic company’s Kibana product’s Discover UI view related to long loads, time outs, and errors. (Not reviewing data ingest lag or data quality after full page load.)

Kibana’s Discover page v8.6.0, loaded on defaults

Summary

Discover is Elastic’s core Kibana UI to search, filter, and inspect (time series) data. Visualizations are used for data aggregations/summaries. The Discover UI is resilient to large data Elasticsearch responses, but can sometimes experience issues due to (uncompressed) response size, mapping explosion, and browser limits. Below we’ll summarize most common historical issues and the sequential troubleshooting walk through.

Walkthrough

After establishing and loading a user session, Kibana will load Discover via base URI /app/discover (or its related Kibana Space specific URI). To load this page, the browser page will sequentially request three API’s from the Kibana server (and through Kibana to the below Elasticsearch server as needed).

💡If the Kibana page errors on load, you’ll want to open your browser’s network tab to confirm which sequential request ends up failing. You can share your findings by exporting a HAR log.

1. Load Index Pattern

The browser page will request Kibana’s Saved Objects endpoint for the currently selected Data View (code still targets type:index-pattern , which was its previous name):

POST /api/saved_objects/_bulk_get 
[{"id":"${INDEX_PATTERN_ID}","type":"index-pattern"}]

This Kibana API search-forwards to Elasticsearch API under the Saved Object’s backing Alias .kibana . I’m not certain on the query translation, but it’d be something like:

GET .kibana/_search
{"query": {"bool": {"filter": [{"bool": {"should": [{
  "match_phrase": {"_id": "index-pattern:INDEX_PATTERN_ID"}
}]}}]}}}

Note, Saved Objects look up by the Data View’s id and not title or name . If you export/import or copy Saved Objects between Kibana Spaces or Elasticsearch clusters, you may see a Visualization/Dashboard/Discover error about your underlying id having changed during import (see the Saved Object’s import module to avoid). To demonstrate these fields’ difference:

👻 If this affects you, during page load, you’ll expect a bottom-right warning/error module similar to:

"INDEX_PATTERN_ID" is not a configured data view ID

This error is reported in context of the current Kibana Space and does not qualify if the Data View does/not exist in a different Space.

2. Load Fields

Next the Kibana UI will load a compilation of backing indices’ related fields.

API. First, it will API request:

GET /api/index_patterns/_fields_for_wildcard?pattern=INDEX_PATTERN&meta_fields=_source&meta_fields=_id&meta_fields=_index&meta_fields=_score

This API will re-trigger every time the user selects a Data View in the top-left. On the back end, Kibana is returning indices from Elasticsearch’s Resolve Index API and then compiling said indices’ Mappings. I don’t (yet) know of an equivalent request to Elasticsearch to compare to.

👻 This API’s response time is drastically impacted by mapping explosion which can partially be diagnosed by this API’s un/compressed response size. Usually this will relate to how many varying Index Mappings are loaded, but can also result from overriding Mapping limits. This usually returns (far) below 3s but you should definitely consider ≥10s slow.

👻 Errors have historically occurred from field name conflicts between indices. You’d want to fix the underlying Index Mappings, but can also apply a Runtime field as a temporary override to fix the stray indices’ mapping type.
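To put a number on the un/compressed response size mentioned for mapping explosion, here’s a minimal sketch which compares a JSON payload’s raw byte size against its gzip’d size. It assumes you’ve saved the fields API response body as JSON; the fake_response below is a made-up stand-in and not the real response schema.

```python
import gzip
import json


def response_sizes(payload) -> tuple[int, int]:
    """Return (uncompressed_bytes, gzip_compressed_bytes) for a JSON-serializable payload."""
    raw = json.dumps(payload).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# made-up stand-in for a saved response with many repetitive field entries
fake_response = {"fields": [{"name": f"field_{i}", "type": "keyword"} for i in range(5000)]}

raw_bytes, gz_bytes = response_sizes(fake_response)
print(f"uncompressed={raw_bytes}B compressed={gz_bytes}B")
```

A large gap between the two numbers mostly tells you the payload is repetitive; the uncompressed size is what the browser ultimately has to parse.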

JS. Once the API results return, if the left drawer (showing “Selected Fields” and “Available Fields”) is open, then the browser JavaScript will do summary analytics on these fields. If slow, this will appear in the browser Network tab as the API request having ended but the following request (3) not starting for multiple seconds. Users normally only notice ≥10s.

💡This JavaScript compilation time is diagnosed via the browser DevTool’s Performance tab (e.g. Chrome, Firefox; can also export HAR-like equivalent for sharing).

👻 Long durations have previously occurred from attempting to compile cardinality on individual fields having extremely long strings with high (or no) ignore_above in their Index Mappings.

3. Load Search

Lastly, the browser page will make an API search request. This API search request passes through the Kibana server but (should) take nearly the same amount of time as making the Elasticsearch API request directly.

API. This URI defaults to

POST /internal/bsearch {REQUEST_BODY_HERE}

But if Advanced Setting courier:batchSearches: false, it becomes

POST /internal/_msearch {REQUEST_BODY_HERE}

🤔 (To assist quick page searching: #inspectViaDevTools .) If this search takes a while to process, usually we’ll drop the look-back time to as small as possible (e.g. 1-5mins). Then we’ll navigate Discover > Inspect > Request > Open in Console (aka. DevTools). Visually:

We’ll then run this API search request both in DevTools and separately via Elasticsearch API curl, noting the response time difference between Discover, DevTools, and the Elasticsearch API.

👻 If Discover to DevTools: TBH, I’m not sure how to interpret that and would have expected any extra duration to pad around, but not on, the literal API request. If DevTools to Elasticsearch: the Kibana server throughput is backed up, which can next be inspected via its Task Manager Health API (≥7.11).

👻 If Elasticsearch is also just as slow as the other two: We may suspect an un-optimized search/filter in our original Discover view. If no filters/searches are applied (or reproducing with none applied), we’ll confirm general Elasticsearch performance via CAT Nodes, CAT Threadpools (esp. search threads), and CAT Tasks (for long running tasks). If no cluster-wide issue found, we’ll compare search response durations between Data Views and then compare these searches’ related Query Profiling (after injecting profile: true in our search request body).
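For that last step, a minimal sketch of injecting `profile: true` and then ranking the slowest top-level query nodes from the response. It assumes the search body copied from Inspect has been parsed into a dict, and that the response follows the Profile API layout (`profile.shards[].searches[].query[].time_in_nanos`). Both function names are hypothetical.

```python
import copy


def with_profile(search_body: dict) -> dict:
    """Return a copy of the search body with profiling switched on."""
    body = copy.deepcopy(search_body)
    body["profile"] = True
    return body


def slowest_query_nodes(response: dict, top: int = 3) -> list:
    """Rank top-level profiled query nodes as (millis, shard_id, type, description)."""
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for node in search.get("query", []):
                rows.append(
                    (
                        node.get("time_in_nanos", 0) / 1e6,  # nanos to millis
                        shard.get("id"),
                        node.get("type"),
                        node.get("description"),
                    )
                )
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]
```

Running the same comparison for two Data Views makes it easier to see whether one query type or one shard dominates the slower search.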

JS. After the API results return, the browser’s JavaScript kicks in to load the 1) display table summary (the middle-bottom “Documents” table where you can toggle column view on/off) and 2) “Field Statistics” (in beta, toggle in Advanced Settings via discover:showFieldStatistics).

👻 Both of these, again, will be affected by Mapping Explosion, but its impact was drastically reduced in v7.17.8/v8.5.2 via kibana#144673. (Thanks Kibana Dev for optimizing that with/for me!) Mapping Explosion may surface browser-specific errors, such as Chrome’s Error: maximum call stack size exceeded which reproduces incognito, does not occur in Firefox/Safari, and sometimes only resolves via upgrading Chrome.

Closing

👻 (To assist quick page searching: #devToolsAuto .) While troubleshooting potential Mapping Explosions, DevTools may respond slower than Discover, and the top-left icon may show loading when no requests are expected, due to the URI

GET /api/console/autocomplete_entities?fields=true&indices=true&templates=true&dataStreams=true

This is controlled via DevTools > Settings > “Autocomplete” by disabling (at least) Fields and increasing the Refresh Frequency.

These requests can bog down 1) the local browser, causing page crashes or “wait for page?” banners, and 2) the Kibana Server, depending on frequency and expensiveness. (I believe that) this change is specific to the logged-in user.

👋 Welcome to the end! I tried to keep it brief but really only needed a public place to link my screenshot for #inspectViaDevTools — 😂