Local LLM eval tokens/sec comparison between llama.cpp and llamafile on Raspberry Pi 5 8GB model
Results first:
With the newest Raspberry Pi OS (released 2024-03-15), LLMs run much faster than on Ubuntu 23.10. This was tested with both llama.cpp and llamafile.
On the same Raspberry Pi OS, llamafile (5.75 tokens/sec) runs slightly faster than llama.cpp (4.77 tokens/sec) with the TinyLlama Q8_0 GGUF model.
OS preparation
For Ubuntu 23.10 via Raspberry Pi Imager, here is what I chose.
For Raspberry Pi OS, here is what I chose.
Stop any screen recording before running the LLMs, because recording consumes hardware resources. Just take a screenshot after the throughput results (eval tokens/sec) appear; it will make the numbers look better.
llama.cpp
Model file: TinyLlama-GGUF Q8_0 (move it into the models folder)
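If you don't have the model yet, it can be fetched straight from Hugging Face; a minimal sketch, assuming TheBloke's TinyLlama-1.1B-Chat-v1.0-GGUF repository still hosts the Q8_0 file under this path.
# download the Q8_0 GGUF into llama.cpp's models folder (URL is an assumption, verify on Hugging Face)
wget -P models https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf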
The build and run command looks like this.
make -j && ./main -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
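If you only want the throughput number, llama.cpp also ships a llama-bench example; a minimal sketch, assuming it is built by the same make invocation in your checkout.
# runs the standard prompt-processing and text-generation benchmarks and prints tokens/sec
./llama-bench -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf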
llamafile
Llamafile: TinyLlama-GGUF Q8_0 (just download it)
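The llamafile itself can also be downloaded from the command line; a minimal sketch, and the URL is an assumption, so check the llamafile README or Hugging Face for the current link.
# download the single-file executable (URL is an assumption)
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile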
The execution commands look like this.
chmod u+x TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile --temp 0.7 -p 'Building a website can be done in 10 simple steps:\nStep 1:'
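To keep the comparison closer to the llama.cpp run above, you can cap the generation length as well; a sketch, assuming llamafile passes through llama.cpp's -n and -e flags, since it embeds the same CLI.
# same prompt, but limit generation to 400 tokens like the llama.cpp run
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile --temp 0.7 -n 400 -e -p 'Building a website can be done in 10 simple steps:\nStep 1:'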
The neofetch output of Raspberry Pi OS
On Ubuntu 23.10, the throughput eval rate was only around 1 to 1.5 tokens/sec, partly because the screen was being recorded while the LLM was running.
Conclusion
With suitable OS support, llamafile can run slightly faster than llama.cpp. Raspberry Pi OS now enables the Vulkan (V3DV) driver by default (https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV), so hopefully the community can leverage the Raspberry Pi 5's GPU to run even faster. Let's wait and watch the news.
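For the adventurous, llama.cpp already has an experimental Vulkan backend that could, in principle, target the Pi 5's V3DV driver; a hedged sketch, assuming the LLAMA_VULKAN make option from the llama.cpp README of that period (newer versions renamed it GGML_VULKAN) and that the Vulkan development packages are available.
# install Vulkan development packages on Raspberry Pi OS (package names are an assumption)
sudo apt install libvulkan-dev vulkan-tools
# rebuild llama.cpp with the Vulkan backend enabled
make clean && make -j LLAMA_VULKAN=1
# run with layers offloaded to the GPU
./main -m models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -ngl 99 -n 400 -e -p "Building a website can be done in 10 simple steps:\nStep 1:"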