Run llama-2 models with Docker on an Orange Pi 5B using llama.cpp

Arko Basu
4 min read · Apr 14, 2024


In this article we will go through the steps involved in getting llama-2 models running on an inexpensive ARM-based SBC like the Orange Pi.

We are going to leverage llama.cpp, Georgi Gerganov's amazing project for running LLM inference with minimal setup on a wide variety of hardware.

Note: I am using a self-compiled Linux kernel on my Orange Pi 5B. The default OrangePi Ubuntu Server Jammy images come with Docker pre-loaded, but they don't work with Ceph FS storage (with RWO access modes) because of missing kernel modules. I eventually want to deploy this on Kubernetes, so I had to recompile the kernel with RBD and Ceph support and then install Docker to get everything set up. I will not cover a Kubernetes-based deployment in this article (maybe in a future one). The sole reason for testing the Docker approach was that it ports easily to Kubernetes with Ceph for persistent distributed storage, someday. So if you are grabbing an off-the-shelf OPi5B, you don't really need to worry about Ceph or RBD kernel modules, or about installing Docker.

Showing OS details
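If you built your own kernel like I did and want to confirm the Ceph and RBD bits actually made it in, a quick check (assuming they were built as loadable modules) looks like this:

# Load the modules and confirm they are present (only needed on a custom kernel)
$ sudo modprobe rbd
$ sudo modprobe ceph
$ lsmod | grep -E 'rbd|ceph'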

Goal

To run llama-2 models using Docker containers on commodity hardware of either architecture (amd64 and arm64) with 4-bit quantization. In this article's case: on an Orange Pi 5B (OPi5B). More specifically, the models llama-2-7b/llama-2-7b-chat and llama-2-13b/llama-2-13b-chat.

For demonstration purposes in this article I will use the llama-2-13b-chat model.

Memory requirements for running llama-2 models with 4-bit quantization. Lifted from documentation

An OPi5B has enough memory to run both 7b-chat and 13b/13b-chat 4-bit quantized models. So let’s get started.
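If you want to double-check that your particular board has the headroom before committing to the 13b model, the standard tools are enough:

# Total and available memory on the board
$ free -h

# CPU and architecture details (an OPi5B should report aarch64)
$ lscpu | grep -E 'Architecture|CPU\(s\)'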

1. Install and/or validate Docker:
# Install Docker
$ sudo apt-get update
$ sudo apt-get upgrade
$ curl -fsSL test.docker.com -o get-docker.sh && sh get-docker.sh

# Give current user permissions to run docker without sudo
$ sudo usermod -aG docker $USER

# Activate changes to current shell without relogging
$ newgrp docker

# Validate docker
$ docker image ls
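If you want a stronger check than listing images, running the hello-world container confirms the daemon can actually pull and run containers on this host; none of this is specific to the OPi5B:

# Confirm the daemon can pull and run a container
$ docker run --rm hello-world

# Confirm the host architecture (should report aarch64 on an OPi5B)
$ uname -m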

2. Accept the license on the Meta website. This grants you a license to use the models, including commercially, subject to Meta's terms. Once you accept, Meta will send you an email with a link to download the models and their weights.
Note: These download links are short-lived and only valid for 24 hours, so make sure you download the models within that time frame. Otherwise you will have to accept the terms again.

3. Download the models using the script provided by Meta:

# Clone the Meta repository
$ git clone https://github.com/meta-llama/llama.git
$ cd llama

# Run the download script. It will prompt you for the link you received in the email.
# Download all the models you want in one go.
$ ./download.sh
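The download script verifies checksums as it goes, but if you ever want to re-verify a model directory later (assuming the checklist.chk file is sitting next to the weights, as it is for the official downloads), you can do it manually:

# Re-verify the 13b-chat weights against the shipped checksums
$ cd llama-2-13b-chat
$ md5sum -c checklist.chk
$ cd ..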

4. Convert the desired Llama models to GGUF format (the file format llama.cpp uses):

# Replace /path/to/llama with the root directory of the repo you cloned in the previous step.
# It contains the sub-folders with each model's weights and tokenizer
$ docker run -v /path/to/llama:/models ghcr.io/ggerganov/llama.cpp:full --convert "/models/llama-2-13b-chat" --outtype f16
Image of Conversion step
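The conversion writes an f16 GGUF file next to the original weights. Before moving on, it's worth confirming it exists; at roughly 2 bytes per parameter, expect it to be on the order of 26 GB for the 13b model:

# The converted f16 file sits next to the original weights
$ ls -lh /path/to/llama/llama-2-13b-chat/ggml-model-f16.gguf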

5. Quantize the converted model to 4 bits (the trailing 2 in the command selects the Q4_0 quantization type):

# Replace /path/to/llama with the root directory of the repo you cloned earlier.
# The model sub-folders now contain the converted GGUF files
$ docker run -v /path/to/llama/:/models ghcr.io/ggerganov/llama.cpp:full --quantize "/models/llama-2-13b-chat/ggml-model-f16.gguf" "/models/llama-2-13b-chat/ggml-model-q4_0.bin" 2
Image showing Quantization step
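A quick sanity check on the quantization is to compare file sizes: the Q4_0 file should come out at roughly a quarter of the f16 file (4-bit vs 16-bit weights, plus a small per-block overhead):

# Compare the f16 and 4-bit quantized model sizes
$ ls -lh /path/to/llama/llama-2-13b-chat/ggml-model-f16.gguf \
         /path/to/llama/llama-2-13b-chat/ggml-model-q4_0.bin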

6. Run the 4-bit quantized model:

# Replace /path/to/llama with the root directory of the repo you cloned earlier.
# The model sub-folders now contain the quantized binaries
$ docker run -v /path/to/llama:/models --entrypoint '/app/main' ghcr.io/ggerganov/llama.cpp:full -m /models/llama-2-13b-chat/ggml-model-q4_0.bin -n 500 -p "You are a technical blogger with years of experience writing some of the best tech blogs on the planet. Write me an engaging SEO optimized title in less than 61 characters about the following topic: Running llama2 models with 4 bit quantization using llama.cpp on Orange Pi like inexpensive arm devices"
Images for Inference Run
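The container passes its arguments straight through to llama.cpp's main binary, so you can experiment with the usual knobs. For example, pinning the thread count to the OPi5B's eight cores and raising the context size; the values below are illustrative rather than tuned:

# Same image, with an explicit thread count and context size
$ docker run -v /path/to/llama:/models --entrypoint '/app/main' ghcr.io/ggerganov/llama.cpp:full \
    -m /models/llama-2-13b-chat/ggml-model-q4_0.bin \
    -t 8 -c 2048 -n 256 \
    -p "Explain what 4-bit quantization does to a language model in two sentences."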

Congratulations! We now have a containerized llama-2 model running inference on an inexpensive ARM device. Sure, the token generation rates are not great, but you don't have to shell out a lot of money to cloud providers for virtual compute just to test a llama-2 model.
