Friendli Container Part 2: Monitoring with Grafana
Friendli Container Series: Part 2 of 2
In the second part of our two-part series on learning how to use the Friendli Container, we will learn how to monitor important metrics such as throughput and latency through Prometheus and our customizable Grafana templates, which can be downloaded from our GitHub repository: Friendli Container GitHub Repository. Friendli Container is designed to make the deployment of custom generative AI models simpler, faster, and cheaper. Monitoring and maintaining containers with Grafana helps ensure smooth operations, making it well-suited for production-scale environments.
The basics of Friendli Container, along with a general explanation of containers, are covered in our previous post: Friendli Container Part 1: Efficiently Serving LLMs On-Premise. If you’re already familiar with the general container setup, skip ahead to the section “Get Started with Friendli Container x Grafana”.
Technology used
- Friendli Container: Chat Completions API
- Prometheus
- Grafana Dashboard (with templates)
To effectively monitor and optimize performance, you can integrate Grafana, an open-source analytics and monitoring platform, with Prometheus to observe the performance of Friendli Containers. Friendli Container exports internal metrics in Prometheus text format, and we provide Grafana Dashboard templates that offer enhanced observability, such as the example shown above.
The dashboard visualizes metrics like ‘Requests Throughput’, ‘Latency’, ‘P90 TTFT (Time to First Token)’, ‘Friendli TCache Hit Ratio’, and more from a Friendli Container instance. Friendli TCache optimizes LLM inference by caching frequently used computational results, reducing redundant GPU processing. A higher TCache hit ratio means less GPU work per request, which helps keep P90 TTFT low even under varying load conditions.
A Quick Setup
Execute the terminal commands below after acquiring the necessary values as environment variables (e.g. Friendli Personal Access Token) to efficiently run your generative AI model of choice on your GPUs. In this tutorial, we use the Llama 3.1 8B Instruct model to handle the chat completion inference requests.
Refer to the previous blog “Friendli Container Part 1: Efficiently Serving LLMs On-Premise” for detailed instructions on setting up the VM environment.
export FRIENDLI_EMAIL="{YOUR FULL ACCOUNT EMAIL ADDRESS}"
export FRIENDLI_PAT="{YOUR PERSONAL ACCESS TOKEN e.g. flp_XXX}"
printf '%s' "$FRIENDLI_PAT" | docker login registry.friendli.ai -u "$FRIENDLI_EMAIL" --password-stdin
docker pull registry.friendli.ai/trial:latest
export FRIENDLI_CONTAINER_SECRET="{YOUR FRIENDLI CONTAINER SECRET e.g. flc_XXX}"
export HF_TOKEN="{YOUR HUGGING FACE TOKEN e.g. hf_XXX}"
export HF_MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION="{YOUR GPU DEVICE NUMBER e.g. device0}"
After pulling the Docker image, you can use the docker images command to list all of the pulled images and the docker image inspect $FRIENDLI_CONTAINER_IMAGE command to view detailed JSON output for the registry.friendli.ai/trial image. You can use the env command to list all of your exported environment variables.
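Before launching anything, it can help to confirm that every variable exported above is actually set. The following is a minimal sketch and not part of the Friendli tooling: missing_env is a hypothetical helper name, and the variable names you pass to it are simply the ones exported in this section.

```shell
# Hypothetical helper (not part of the Friendli tooling): print the names of
# the given variables that are unset or empty, and return non-zero if any
# of them are missing.
missing_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space, then print the (possibly empty) list.
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
}
```

A possible use before starting the containers:
missing_env FRIENDLI_CONTAINER_SECRET HF_TOKEN HF_MODEL_NAME FRIENDLI_CONTAINER_IMAGE GPU_ENUMERATION || echo "export the variables listed above first"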
By default, the container listens for inference requests on TCP port 8000, and the Grafana service is available on TCP port 3000. You can optionally change these ports using the following environment variables. For example, to use TCP port 8001 for inference requests and port 3001 for Grafana, execute the commands below.
export FRIENDLI_PORT="8001"
export FRIENDLI_GRAFANA_PORT="3001"
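If you change the ports, any hard-coded URL in later commands has to change with them. As a small sketch, the helpers below build the request and dashboard URLs from the same two variables and fall back to the documented defaults when they are unset. The function names friendli_chat_url and grafana_url are hypothetical, and the sketch assumes the services are reached on 127.0.0.1 as in the rest of this tutorial.

```shell
# Hypothetical helpers: derive URLs from the port variables, falling back
# to the documented defaults (8000 and 3000) when they are unset or empty.
friendli_chat_url() {
  printf 'http://127.0.0.1:%s/v1/chat/completions\n' "${FRIENDLI_PORT:-8000}"
}

grafana_url() {
  printf 'http://127.0.0.1:%s/d/friendli-engine\n' "${FRIENDLI_GRAFANA_PORT:-3000}"
}
```

With these in place, a request can be sent with curl -X POST "$(friendli_chat_url)" followed by the usual headers and body, so it follows whatever port was exported.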
Lastly, execute the docker compose up -d command from our GitHub repository to launch a Friendli Container along with two more containers: one from a Grafana image (grafana/grafana) and one from a Prometheus image (prom/prometheus).
git clone https://github.com/friendliai/container-resource
cd container-resource/quickstart/docker-compose
docker compose up -d
Try the docker ps command to see a list of all of your running containers. You can also execute the docker compose down command in the container-resource/quickstart/docker-compose directory to stop and remove all of the running containers.
Send Chat Completion Inference Requests
Once the Friendli Container has launched successfully, you can send inference requests to the Llama 3.1 8B Instruct model right away. For instance, you can query the LLM with the question “If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?” by executing the command below.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?"}]}'
Chat completion inference result:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "A classic lateral thinking puzzle!\n\nThe answer is not simply \"30 hours\" because the number of shirts doesn't directly impact the drying time. The drying time remains the same, 5 hours.\n\nThink about it: if you have 5 shirts, it takes 5 hours to dry them. If you have 10 shirts, it will still take approximately 5 hours to dry them. And if you have 30 shirts, it will still take approximately 5 hours to dry them.\n\nSo, the answer is still 5 hours to dry 30 shirts.",
        "role": "assistant"
      }
    }
  ],
  "created": 1724389731,
  "usage": {
    "completion_tokens": 114,
    "prompt_tokens": 38,
    "total_tokens": 152
  }
}
Get Started with Friendli Container x Grafana
Have you ever wanted to monitor the performance of your generative AI models in real-time? Imagine having the power to visualize and analyze your models’ inference metrics, all in one place. With Friendli Container x Grafana, that’s exactly what you can do! The enhanced observability helps you quickly identify bottlenecks, optimize performance, and ensure smooth, efficient operations.
Grafana is an open-source analytics and monitoring platform that visualizes LLM inference metrics by connecting to data sources like Prometheus. The docker compose command we ran earlier launched a Grafana container for monitoring the Friendli Container and a Prometheus container that is configured to scrape metrics from the Friendli Container processes.
Observe your Friendli Container with Grafana by opening http://127.0.0.1:$FRIENDLI_GRAFANA_PORT/d/friendli-engine (port 3000 by default) in your browser and logging in with username admin and password admin. After the initial login, you can update the password and then access the dashboards showing useful engine metrics, such as throughput and latency.
If you cannot open a browser directly on the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine. You may also want to use the -l login_name or -p port options to connect to the GPU machine using SSH.
# Change these variables to match your environment.
export GPU_MACHINE_ADDRESS="{ADDRESS OF THE GPU MACHINE}"
LOCAL_GRAFANA_PORT=3123
FRIENDLI_GRAFANA_PORT=3000
ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:127.0.0.1:$FRIENDLI_GRAFANA_PORT"
Afterwards, open http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine (for our example above, the URL would be http://127.0.0.1:3123/d/friendli-engine) in your browser and log in to view the dashboard.
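The port mapping in the ssh command above is easy to get backwards, so here is a small sketch that spells it out. The helper names tunnel_spec and tunneled_dashboard_url are hypothetical; the only facts assumed are the ones already stated in this section (Grafana on port 3000 of the GPU machine, and a local port of your choosing).

```shell
# Hypothetical helper: build the argument for ssh -L in the form
# LOCAL_PORT:HOST:REMOTE_PORT. The host 127.0.0.1 is resolved on the
# GPU machine, not on your PC.
tunnel_spec() {
  local_port="$1"
  remote_port="${2:-3000}"   # Grafana's default port in this tutorial
  printf '%s:127.0.0.1:%s\n' "$local_port" "$remote_port"
}

# Hypothetical helper: the URL to open locally once the tunnel is up.
tunneled_dashboard_url() {
  printf 'http://127.0.0.1:%s/d/friendli-engine\n' "$1"
}
```

For example, ssh -N -L "$(tunnel_spec 3123 3000)" "$GPU_MACHINE_ADDRESS" keeps the tunnel open without starting a remote shell (that is what the standard -N option does), and tunneled_dashboard_url 3123 prints the address to open in your browser.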
Monitor Different Metrics Using the Grafana Dashboard
While Friendli Container is handling inference requests, the Grafana dashboard provides a comprehensive view of the performance metrics. By default, metrics are served at http://localhost:8281/metrics. You can configure the port number using the command line option --metrics-port. Our supported metrics are categorized into four groups: counters, gauges, histograms, and quantiles.
Counters: Cumulative metrics that are often used with the Prometheus rate() function to calculate throughput.
friendli_requests_total
friendli_responses_total
friendli_items_total
friendli_failure_by_cancel
friendli_failure_by_timeout
friendli_failure_by_nan_error
friendli_failure_by_reject
Gauges: Dynamic numerical values that go up and down and represent the current value.
friendli_current_requests
friendli_current_items
friendli_current_assigned_items
friendli_current_waiting_items
Histograms: Distributions, tracked over time, of the following three variables.
Friendli TCache hit ratio
The length of input tokens
The length of output tokens
Quantiles: The current p50 (median), p90, and p99 values of the following three variables.
Request completion latency (in nanoseconds)
Time to first token (TTFT) (in nanoseconds)
Request queueing delay (in nanoseconds)
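To make the groups above more concrete, here is a small sketch that reads Prometheus text-format output directly, derives a per-second rate from two counter readings, and converts a nanosecond quantile to milliseconds. The helper names metric_value, counter_rate, and ns_to_ms are hypothetical. The parsing assumes label values contain no "}" character, and the plain division is only a simplified stand-in for Prometheus's rate(), which also handles counter resets and extrapolation.

```shell
# Hypothetical helper: sum all samples of one metric from Prometheus
# text-format input on stdin. Returns non-zero if the metric is absent.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP and TYPE comment lines
    {
      line = $0
      i = index(line, "{"); j = index(line, "}")
      if (i > 0 && j > i)                # drop the label set, if present
        line = substr(line, 1, i - 1) substr(line, j + 1)
      n = split(line, f, " ")
      if (n >= 2 && f[1] == name) { sum += f[2]; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}

# Hypothetical helper: per-second rate between two counter readings taken
# "seconds" apart (no handling of counter resets).
counter_rate() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# Hypothetical helper: convert nanoseconds to milliseconds.
ns_to_ms() {
  awk -v ns="$1" 'BEGIN { printf "%.1f\n", ns / 1000000 }'
}
```

Assuming the default metrics endpoint, a reading could be taken with curl -s http://localhost:8281/metrics | metric_value friendli_requests_total. Two readings of 100 and 134 taken 30 seconds apart would give counter_rate 100 134 30, which prints 1.13.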
Run the code below in your terminal to repeatedly send inference requests to the Friendli Container and observe the LLM inference performance through the Grafana Dashboard:
while :; do curl -X POST http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What makes a good leader?"}], "max_tokens": 30, "stream": false}'; sleep 0.5; done
The image below shows the inference metrics for Llama 3.1 8B Instruct served by the Friendli Container. The overall throughput of 1.13 requests per second (req/s) reflects the light, steady traffic generated by the loop above, while the P90 latency of 375 milliseconds means that 90% of requests complete within that time. The P90 Time to First Token (TTFT) of 12.6 milliseconds shows that the engine starts generating responses almost instantly.
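As a rough cross-check of the throughput figure, note that the loop sends one request at a time and sleeps 0.5 seconds between requests, so each iteration costs about one request latency plus the sleep interval. The sketch below uses the P90 latency as a stand-in for the typical per-request latency, which is an assumption (P90 is at least as large as the median), and estimate_rps is a hypothetical helper name.

```shell
# Hypothetical helper: estimate requests per second for a sequential loop
# in which each iteration takes one request latency plus a fixed sleep.
estimate_rps() {
  latency_s="$1"   # assumed per-request latency, in seconds
  sleep_s="$2"     # pause between requests, in seconds
  awk -v l="$latency_s" -v s="$sleep_s" 'BEGIN { printf "%.2f\n", 1 / (l + s) }'
}
```

Running estimate_rps 0.375 0.5 prints 1.14, which is close to the 1.13 req/s reported on the dashboard. The small gap is unsurprising given that the estimate ignores curl startup time and treats every request as taking exactly the P90 latency.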
Explore our blog post “The LLM Serving Engine Showdown: Friendli Engine Outshines” for an in-depth comparison of P90 TTFT performance across various LLM inference engines, including vLLM.
Grafana Templates for Friendli Container
One of the appeals of Grafana is the ability to customize dashboards to suit your specific needs. Whether you’re tracking a space mission or managing container instances, Grafana allows you to design dashboards that deliver the insights you require. This flexibility lets you visualize data in the ways that are most meaningful for monitoring Friendli Container instances.
A simple way to create new dashboards is by importing JSON files into Grafana. For instance, the friendli-engine-dashboard-per-instance.json file allows you to set up a dashboard that monitors multiple Friendli Container instances. You can download our Grafana templates as JSON files from the Grafana Templates for Friendli Container section of the Friendli Container GitHub Repository. After downloading the template, go to the 'Import dashboard' page in Grafana and upload the JSON file as shown below.
Conclusion
In summary, integrating Grafana with Friendli Container facilitates a comprehensive, real-time monitoring system, which is crucial for maintaining the optimal performance of generative AI models. The Grafana dashboards imported through our templates display visualizations of critical performance indicators, such as requests throughput, latency distributions, and cache hit ratios. By leveraging these observability features, you can fine-tune your generative AI deployments for maximum reliability and scalability, making them well-suited for production-level workloads.