The “Hello World” of LLMs is the Q&A knowledge base use case: the app creates embeddings for the source documents and feeds them into a vector store. For each query, it then computes the query embedding, fetches the top N documents from the vector store, and prompts the GPT-3/4 language models with the question plus the retrieved knowledge snippets. Some vector stores are 3rd party services (Pinecone, Weaviate), others live in memory (FAISS, GPTSimpleVectorIndex).
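The whole flow fits into a few lines. Here is a minimal sketch using the pre-1.0 openai client pinned in the dependencies below, with a plain in-memory list standing in for the vector store and text-davinci-003 as an assumed completion model — an illustration, not the demo’s actual code:

import numpy as np
import openai  # pre-1.0 client; assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# "Vector store": a plain in-memory list of (embedding, text) pairs
documents = ["OpenTelemetry is an observability framework.", "Jaeger visualizes traces."]
store = [(embed(doc), doc) for doc in documents]

def answer(question: str, top_n: int = 2) -> str:
    q = embed(question)
    # rank documents by cosine similarity and keep the top N as context
    ranked = sorted(store, key=lambda item: -float(np.dot(q, item[0]))
                    / (np.linalg.norm(q) * np.linalg.norm(item[0])))
    context = "\n".join(text for _, text in ranked[:top_n])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # the completions endpoint matches the /v1/completions calls seen later in the traces
    resp = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=200)
    return resp["choices"][0]["text"].strip()

print(answer("What does Jaeger do?"))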
Given the sheer number of these and other demos and applications built on popular libraries like langchain and llama-index, the question arises: how do we quickly analyze a new demo to understand which APIs are being called, in which order, how frequently, and with what latency? Reading the code is tedious and provides only half the answers. Proper telemetry tooling helps us discover the API endpoints an application actually calls.
Automatic Instrumentation for Python applications
OpenTelemetry provides Automatic Instrumentation for Python that comes to the rescue here. A Python agent attached to the application dynamically patches popular libraries and frameworks at runtime to capture telemetry. The Langchain, OpenAI, and LlamaIndex libraries use Python’s requests library under the hood, so we need to make sure opentelemetry-instrumentation-requests is installed explicitly.
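The agent enables any installed instrumentation packages automatically; if you ever run without the agent, the requests instrumentation can also be switched on directly in code, roughly like this:

from opentelemetry.instrumentation.requests import RequestsInstrumentor

# patches requests.Session so every outgoing HTTP call produces a span
RequestsInstrumentor().instrument()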
To visualize the telemetry, we first need to set up some tools. There is a convenient demo that ships with a docker-compose file to boot up the collector and a few more tools. We can ignore the two chatty demo apps that run in the background, though they are helpful for checking that the telemetry setup works correctly.
$ git clone git@github.com:open-telemetry/opentelemetry-collector-contrib.git
$ cd opentelemetry-collector-contrib/examples/demo
$ docker-compose up -d
[+] Running 7/7
⠿ Network demo_default Created 0.0s
⠿ Container prometheus Started 0.6s
⠿ Container demo-jaeger-all-in-one-1 Started 0.5s
⠿ Container demo-zipkin-all-in-one-1 Started 0.4s
⠿ Container demo-otel-collector-1 Started 0.7s
⠿ Container demo-demo-server-1 Started 0.9s
⠿ Container demo-demo-client-1 Started 1.1s
The collector listens on port 4317, so any error message other than a timeout or connection error means the collector is up and running:
$ curl localhost:4317
curl: (1) Received HTTP/0.9 when not allowed
With this setup, in addition to the OpenTelemetry Collector, we also get the Jaeger UI running at http://0.0.0.0:16686/, which will help visualize the calls.
The next step is to run the code with the auto-instrumentation agent. For this, a few packages need to be added to the project setup. In this post I use poetry to manage the Python project, but you can install the packages with pip as well. First, we add the instrumentation for the Python requests library and the telemetry exporter.
$ poetry add opentelemetry-instrumentation-requests
$ poetry add opentelemetry-exporter-otlp
Afterwards, we add the agent via:
$ poetry add opentelemetry-distro
and start the application via the agent (see the references below), keeping a text log file:
$ poetry run opentelemetry-instrument --traces_exporter console,otlp \
--metrics_exporter console \
--service_name llm-playground \
--exporter_otlp_endpoint 0.0.0.0:4317 \
python main.py | tee output.log
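For completeness, the same console and OTLP exporters that the agent wires up can also be configured in code via the OpenTelemetry SDK; a minimal sketch, assuming the collector’s default gRPC endpoint on port 4317:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "llm-playground"}))
# spans go both to stdout (console) and to the collector on port 4317
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="0.0.0.0:4317", insecure=True)))
trace.set_tracer_provider(provider)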
If the collector on port 4317 is not running correctly, the app logs an error:
WARNING:opentelemetry.exporter.otlp.proto.grpc.exporter:Transient error StatusCode.UNAVAILABLE encountered while exporting traces, retrying in 1s.
In case of SSL handshake issues (or similar ones), such as
E0423 17:04:25.197068000 6150713344 ssl_transport_security.cc:1420] Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
one can instruct the exporter with an environment variable to ignore SSL errors:
$ export OTEL_EXPORTER_OTLP_INSECURE=true
If this does not help to establish connectivity, try increasing the verbosity of gRPC logging to find the error.
$ export GRPC_VERBOSITY=debug
$ export GRPC_TRACE=http,call_error,connectivity_state
As configured via “traces_exporter”, the spans are written to the console in addition to the OTLP endpoint. In the demo I’m running, the code uses gradio to create a simple UI. I was surprised by the calls made by gradio even when launched without any public sharing via launch(share=False):
$ cat output.log | grep http.url
"http.url": "<https://checkip.amazonaws.com/>",
"http.url": "<https://api.gradio.app/gradio-messaging/en>",
"http.url": "<https://api.gradio.app/pkg-version>",
"http.url": "<http://127.0.0.1:7860/startup-events>",
"http.url": "<http://127.0.0.1:7860/>",
"http.url": "<https://api.gradio.app/gradio-initiated-analytics/>",
"http.url": "<https://api.gradio.app/gradio-launched-analytics/>",
"http.url": "<https://api.gradio.app/gradio-launched-telemetry/>",
According to the docs, there is an analytics_enabled parameter and a GRADIO_ANALYTICS_ENABLED environment variable. The description is rather convoluted, though: “default: None; If None, will use environment variable or default to True”...
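To be safe, one can set both before launching the UI; a minimal sketch, with a placeholder echo interface for illustration:

import os
import gradio as gr

# disable analytics globally via the environment variable ...
os.environ["GRADIO_ANALYTICS_ENABLED"] = "False"

# ... and/or per interface via the constructor parameter
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text", analytics_enabled=False)
demo.launch(share=False)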
Back in the telemetry data, we also find the expected OpenAI requests:
"http.url": "<https://api.openai.com/v1/engines/text-embedding-ada-002/embeddings>",
"http.url": "<https://api.openai.com/v1/completions>",
The collected telemetry helps us understand the number and latency of the requests over time. Let’s filter in the Jaeger UI (http://0.0.0.0:16686/) for one of the URLs using the tag http.url=https://api.openai.com/v1/engines/text-embedding-ada-002/embeddings:
We see 20 calls in the last hour, ranging from 280 to 883 ms. Each call can be expanded to show a bit more detail:
Sadly, the UIs of Jaeger and Zipkin are unexpectedly basic and don’t support wildcard searches, so one still needs to tee the console output to a file to quickly search for all the called APIs.
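For a quick overview without the UIs, a few lines of Python over that log file list the called endpoints and how often each one shows up; a small sketch that assumes the console-exporter output captured in output.log above:

import re
from collections import Counter

# count how often each URL appears in the console-exported spans
with open("output.log") as f:
    urls = re.findall(r'"http\.url": "([^"]+)"', f.read())

for url, count in Counter(urls).most_common():
    print(f"{count:3d}  {url}")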
Summary
For richer telemetry with custom metadata, the calls to OpenAI (and others) would need to be instrumented manually. While Langchain comes with Callbacks that provide access to the API call results, at the time of writing LlamaIndex does not have a similar mechanism for OpenAI embeddings.
Automatic instrumentation is an easy way to inspect the calls made by Python applications. The early-stage libraries built in the LLM context tend to have lots of defaults that magically make calls to OpenAI, call lots of 3rd party APIs, and externalize their vector data. Even supposedly open-source applications meant for self-hosting that proxy to OpenAI still rely on other 3rd party services. OpenTelemetry provides an easy way to verify external calls in a sandbox environment.
If you’re interested in getting more out of OpenTelemetry, check out the follow-up post, which goes into the details of manual instrumentation with OpenTelemetry for Langchain and LlamaIndex.
References
- OpenTelemetry Python Automatic Instrumentation
- OpenTelemetry Exporter Configuration Options (environment variables)
- Building AI Chatbot with ChatGPT API and LlamaIndex: Building a Chatbot
- Dependencies section for poetry:
[tool.poetry.dependencies]
python = "^3.11"
openai = "^0.27.2"
llama-index = "^0.4.32"
gradio = "^3.22.1"
opentelemetry-instrumentation-requests = "^0.38b0"
opentelemetry-distro = "^0.38b0"
opentelemetry-exporter-otlp = "^1.17.0"