Comparing Throughput Performance of Running Local LLMs and VLMs on Different Systems
As a data engineer, I am fascinated by testing generative AI models and installing/running them locally. Large Language Models (LLMs) and Vision-Language Models (VLMs) are the most interesting ones. OpenAI provides the ChatGPT website and mobile apps, and Microsoft builds Windows 11 Copilot for us to use. However, we cannot control which data are sent over the internet and stored in their databases. Their systems are not open source and are, in effect, mysterious black boxes.
Some generous companies (like Meta and Mistral AI) and individuals have open-sourced their models, and active communities have built tools layer by layer so that we can easily run LLMs and VLMs on our home computers. A Raspberry Pi 5 with 8GB RAM had already been tested in this article (Running Local LLMs and VLMs on the Raspberry Pi). It is a credit-card-sized Single Board Computer (SBC). I would like to find cheaper machines, solutions, or VMs and test their token (or character) generation performance, so that I, or even the general public, get good value for the money. The things to consider are text output speed, text output quality, and cost.
Evaluating the quality of generated content is done by other research parties. For example, Mistral AI reports that mistral-7b outperforms llama2-13b in knowledge, reasoning, and comprehension: https://mistral.ai/news/announcing-mistral-7b/. That is why I include mistral and llama2 in my LLM tests.
Ollama currently runs on macOS, Linux, and WSL2 on Windows. Memory and CPU usage are not easy to control under WSL2, so I excluded WSL2 from the tests. Multiple LLM and VLM models can be downloaded within its ecosystem, which is why I use Ollama as the test bed for benchmarking different AI models on multiple systems. The installation is very simple. In the terminal, run the following:
curl https://ollama.ai/install.sh | sh
I have built a tool to test the throughput in tokens/sec generated by Ollama LLMs on different systems. The code (ollama-benchmark) is written in Python 3 and is open-sourced under the MIT license. If you feel that more features should be added or there are bugs to be fixed, please let me know. Text output quality is not easy to measure, so I focus on text output speed in this experiment (the higher the tokens/s, the better).
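The core measurement can be sketched in a few lines of Python. Ollama's REST API returns `eval_count` (tokens generated) and `eval_duration` (time in nanoseconds) with each non-streaming response, so tokens/sec is a simple ratio. The snippet below is a minimal illustration of the idea, not the actual ollama-benchmark code, and it assumes an Ollama server listening on the default localhost:11434:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's token count and nanosecond duration into tokens/sec."""
    return eval_count / eval_duration_ns * 1e9

def benchmark_prompt(model: str, prompt: str,
                     host: str = "http://localhost:11434") -> float:
    """Send one prompt to a local Ollama server and return the generation rate."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # eval_count = tokens generated; eval_duration = generation time in ns
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

With the server running and a model pulled, `benchmark_prompt("mistral:7b", "Why is the sky blue?")` returns the measured tokens/sec for that single prompt.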
Tech Specification of Machines or VMs used for testing
- Raspberry Pi 5, quad-core 64-bit Arm CPU, 8GB RAM (Ubuntu 23.10 64-bit OS)
- Ubuntu 23.10 64-bit OS with a 4-core processor and 8GB RAM via VMware Player 17.5, installed on a Windows 11 laptop host
- Ubuntu 23.10 64-bit OS with an 8-core processor and 16GB RAM via VMware Player 17.5, installed on a Windows 11 desktop host
- Apple Mac mini (Apple M1 chip, macOS Sonoma 14.2.1), 8-core CPU with 4 performance cores and 4 efficiency cores, 8-core GPU, 16GB RAM
- NVIDIA T4 GPU (Ubuntu 23.10 64-bit OS), 8 vCPUs, 16GB RAM
To make the comparison consistent, the Raspberry Pi 5 was also installed with Ubuntu 23.10 64-bit OS. The OS installation steps can be followed in the video below.
The llama2 model page on the Ollama website mentions the following.
Memory requirements
- 7B parameter models generally require at least 8GB of RAM
- 13B parameter models generally require at least 16GB of RAM
Models that we are going to test
- mistral:7b (LLM)
- llama2:7b (LLM), llama2:13b (LLM)
- llava:7b, llava:13b (image-to-text, Q&A on images) (VLM)
Given the memory constraints above, these are the models whose performance I want to test on the different machines.
Sample prompts are stored in benchmark.yml:
version: 1.0
modeltypes:
  - type: instruct
    models:
      - model: mistral:7b
        prompts:
          - prompt: Write a step-by-step guide on how to bake a chocolate cake from scratch.
            keywords: cooking, recipe
          - prompt: Develop a python function that solves the following problem, sudoku game
            keywords: python, sudoku
          - prompt: Create a dialogue between two characters that discusses economic crisis
            keywords: dialogue
          - prompt: In a forest, there are brave lions living there. Please continue the story.
            keywords: sentence completion
          - prompt: I'd like to book a flight for 4 to Seattle in U.S.
            keywords: flight booking
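Once parsed (for example with PyYAML's `yaml.safe_load`), that file becomes nested lists and dictionaries, and each benchmark round is just a walk over every (model, prompt) pair. A minimal sketch, with the parsed structure abbreviated to two prompts for brevity:

```python
# Parsed form of benchmark.yml, abbreviated to two prompts for illustration;
# this is the shape yaml.safe_load() would return for the full file.
config = {
    "version": 1.0,
    "modeltypes": [
        {
            "type": "instruct",
            "models": [
                {
                    "model": "mistral:7b",
                    "prompts": [
                        {"prompt": "Write a step-by-step guide on how to bake "
                                   "a chocolate cake from scratch.",
                         "keywords": "cooking, recipe"},
                        {"prompt": "I'd like to book a flight for 4 to Seattle in U.S.",
                         "keywords": "flight booking"},
                    ],
                }
            ],
        }
    ],
}

def iter_prompts(cfg):
    """Yield (model, prompt) pairs for every prompt in the config."""
    for mtype in cfg["modeltypes"]:
        for entry in mtype["models"]:
            for item in entry["prompts"]:
                yield entry["model"], item["prompt"]

for model, prompt in iter_prompts(config):
    print(model, "->", prompt)
```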
In each round, the 5 different prompts are used to evaluate the output tokens/s, and the average of the 5 numbers is recorded. I ran the Raspberry Pi 5 first; here is the recorded video.
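The per-round number being recorded is simply the mean of the five per-prompt rates; a trivial sketch, using hypothetical measurements:

```python
def average_rate(rates: list[float]) -> float:
    """Mean tokens/sec across the prompts of one benchmark round."""
    return sum(rates) / len(rates)

# five hypothetical per-prompt measurements from one round
round_rates = [2.1, 1.9, 2.0, 2.2, 1.8]
print(f"{average_rate(round_rates):.2f} tokens/s")  # → 2.00 tokens/s
```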
Summary of the benchmark of tokens/s for different models on different systems
Thoughts on AI model (LLMs & VLMs) inference throughput performance results
- As we can see from the videos above, on GPU-equipped machines the computation mostly utilizes the GPU cores and GPU VRAM.
- To run inference faster, choose powerful GPUs.
- Suppose comfortable interaction between a human and an AI model happens at a throughput of about 7 tokens/sec (an example is in Video 5); 13 tokens/sec is too fast for most people to follow, as shown in Video 6.
- A future OS shipping with AI support (Copilot) is going to require a minimum of 16GB RAM, so that the AI output is meaningful/trustworthy, not too fast, and not too slow. This is also consistent with Microsoft's announcement: (Microsoft sets 16GB default for RAM for AI PCs — machines will also need 40 TOPS of AI compute: Report).
Conclusion
Running LLMs locally not only enhances data security and privacy but also opens up a world of possibilities for professionals, developers, and enthusiasts. Based on this throughput benchmark, I would not use the Raspberry Pi 5 as an LLM inference machine, because it is too slow. I would say that running LLMs and VLMs on an Apple Mac mini M1 (16GB RAM) is good enough. If you want a more powerful machine to run LLM inference faster, go for renting cloud VMs with GPUs.
Disclaimer: I have no affiliation with Ollama or Raspberry Pi or Apple or Google. All views and opinions are my own and do not represent any organization.