Local LLM eval tokens/sec comparison between llama.cpp and llamafile on Raspberry Pi 5 8GB model
Results first:
With the newest Raspberry Pi OS (released 2024-03-15), LLMs run much faster than on Ubuntu 23.10. This was tested with both llama.cpp and llamafile.
On the same Raspberry Pi OS, llamafile (5.75 tokens/sec) runs slightly faster than llama.cpp (4.77 tokens/sec) with the TinyLlama Q8_0 GGUF model.
OS preparation
For Ubuntu 23.10 via Raspberry Pi Imager, here is what I chose.
For Raspberry Pi OS, here is what I chose.
Stop any screen recording before running the LLMs, because recording consumes hardware resources. Just take a screenshot after the throughput results (eval tokens/sec) appear; it will make the numbers look better.
llama.cpp
Model file: TinyLlama-GGUF Q8_0 (move it into the models folder)
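If you don't have the model yet, it can be fetched straight from Hugging Face; a minimal sketch, assuming TheBloke's TinyLlama-1.1B-Chat-v1.0-GGUF repository still hosts the Q8_0 file under this path.
# download the Q8_0 GGUF into llama.cpp's models folder (URL is an assumption, verify on Hugging Face)
wget -P models https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf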
The build and run command looks like this.
make -j && ./main -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
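If you only want the throughput number, llama.cpp also ships a llama-bench example; a minimal sketch, assuming it is built by the same make invocation in your checkout.
# runs the standard prompt-processing and text-generation benchmarks and prints tokens/sec
./llama-bench -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf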
llamafile
Llamafile: TinyLlama-GGUF Q8_0 (just download it)
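The llamafile itself can also be downloaded from the command line; a minimal sketch, and the URL is an assumption, so check the llamafile README or Hugging Face for the current link.
# download the single-file executable (URL is an assumption)
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile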
The execution commands look like this.
chmod u+x TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile --temp 0.7 -p 'Building a website can be done in 10 simple steps:\nStep 1:'
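To keep the comparison closer to the llama.cpp run above, you can cap the generation length as well; a sketch, assuming llamafile passes through llama.cpp's -n and -e flags, since it embeds the same CLI.
# same prompt, but limit generation to 400 tokens like the llama.cpp run
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile --temp 0.7 -n 400 -e -p 'Building a website can be done in 10 simple steps:\nStep 1:'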
The neofetch output of Raspberry Pi OS
On Ubuntu 23.10, the throughput eval rate was only around 1 to 1.5 tokens/sec, partly because the screen was being recorded while the LLM was running.
Conclusion
With suitable OS support, llamafile can run slightly faster than llama.cpp. Raspberry Pi OS now enables the Vulkan (V3DV) driver by default (https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV), so hopefully the community can leverage the Raspberry Pi 5's GPU to run even faster. Let's wait and watch the news.
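For the adventurous, llama.cpp already has an experimental Vulkan backend that could, in principle, target the Pi 5's V3DV driver; a hedged sketch, assuming the LLAMA_VULKAN make option from the llama.cpp README of that period (newer versions renamed it GGML_VULKAN) and that the Vulkan development packages are available.
# install Vulkan development packages on Raspberry Pi OS (package names are an assumption)
sudo apt install libvulkan-dev vulkan-tools
# rebuild llama.cpp with the Vulkan backend enabled
make clean && make -j LLAMA_VULKAN=1
# run with layers offloaded to the GPU
./main -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -ngl 99 -n 400 -e -p "Building a website can be done in 10 simple steps:\nStep 1:"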