Run Llama on your Raspberry Pi 5 without using Ollama

Wessel Braakman
10 min read · Jan 19, 2024


So I have been tinkering with my Raspberry Pi 5 (8GB) since I got it in December. I found many guides for installing an LLM on it, but kept running into issues that I could not easily get past. A lot of this had to do with the source computer on which I was supposed to retrieve/build/quantize the LLM, and some of it had to do with me not being able to install everything I needed on my RPi5 without running into issues.

That is why I am writing this guide, in which I pinpoint where I got stuck and describe how I worked around it.

So this is by no means a guide that I have completely figured out on my own (I am not an expert on any of these topics), but more a guide that should help out in case someone gets stuck.

I mainly used THIS GUIDE by Marek Żelichowski that I found on LinkedIn, and have added the steps I needed to get it working on my own devices. I know he states at the bottom of his blog that it is the result of input from several people/sources, but credit should be given where it is due, and his guide was one of the few that worked for me and did not require Ollama.

I have tried to give credit where credit was due (links to external posts and sites), but if any credits are missing please let me know and I will happily add them. Now on with it!

What do we need?

  • Source PC with either Windows or a Linux distribution to retrieve and quantize the LLM(s)
  • 8GB Raspberry Pi 5 to run the LLM on
  • A memory card with at least 32GB containing a pre-installed OS such as Raspbian (I personally use Ubuntu 23.04 on my RPi for this)
  • A USB stick with at least 22GB of space available to transport the LLM from your source PC to your RPi

Preparing the source PC

We start on our (Linux-based) source PC. Since I don’t own a Linux PC and did not feel like bothering with dual booting or reinstalling my PC with a Linux distro, I decided to use WSL. This is a built-in Microsoft feature through which we can run Linux distros directly from our Windows environment. For those who are already working on a Linux PC, the following steps may not be necessary.

To install WSL:

Note that I used THIS GUIDE for setting up WSL on my Windows PC.

First, run PowerShell as an Administrator. We do this by right-clicking PowerShell and selecting the “Run as administrator” option. You may need to enter your admin password to do this, depending on how security on your machine is configured.

In the PowerShell window that opens, run the following command:

> wsl --install

This will install Ubuntu as the default distro to work with. Alternatively, you can list the Linux distros that can be installed directly through WSL by using the command

> wsl --list --online

Pick the distro you wish to install and add it to your next install command. I used the default, but if you wanted to run Ubuntu 22.04, for example, you would use the following command

> wsl --install -d Ubuntu-22.04

After the installation is done, you can verify that it is installed with this command

> wsl -l -v

It is strongly recommended to reboot your PC after installing WSL. After the reboot, your Linux distro can be found in your start menu.

Opening one of these will open a terminal window where you can run commands against your Ubuntu installation. We will be using these commands to download and build the Llama project, and to download and quantize a model that can run on the Raspberry Pi.

Configuring our Linux source PC

Now that we have our Ubuntu installation up and running, or if we skipped the above steps because we were already running a Linux distro, it is time to start installing the dependencies we need to run the steps below.

First of all, we want to make sure our system is up to date

> sudo apt update

Then, we want to install Git, which is what we will use to clone/download the Llama.cpp project into our environment

> sudo apt install git

Finally, we need to install the Python packages we need to build the project and quantize the LLM

> python3 -m pip install torch numpy sentencepiece

Note that at this step, I got stuck a LOT of times. My system kept complaining about a few things, which I will explain now:

Pip could not be found. To get around this, I needed to install pip as a separate package through apt install

> sudo apt install python3-pip

It kept bugging me about x509 errors, “self-signed certificate in certificate chain”. Without going into too much detail, this error means there is something wrong with my network configuration and I cannot get through. I have tried a bunch of workarounds to get past this, but never got it to work. The issue may have been that I was trying to do this from behind a proxy or firewall, which made it near impossible to get the configuration right. I ended up switching to another system that did not have these restrictions, and voila, all worked like a charm.

Most importantly, when running “python3 -m pip install” commands, I got an error stating that I was in an “externally managed environment” and had to use some kind of virtual environment (venv) to get things working. Well, I tried using that and it didn’t work for me, so my workaround was to move the marker file for the so-called “externally managed environment” out of the way. Taken from THIS POST on Stack Overflow, a comment on one of the answers:

> sudo mv /usr/lib/python3.11/EXTERNALLY-MANAGED /usr/lib/python3.11/EXTERNALLY-MANAGED.old
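Alternatively, if you would rather not move system files around, the virtual environment (venv) route that the error message suggests looks roughly like this. It did not work in my situation, but it may in yours; the folder name ~/llama-venv is just an example:

> sudo apt install python3-venv
> python3 -m venv ~/llama-venv
> source ~/llama-venv/bin/activate
> python3 -m pip install torch numpy sentencepiece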

Now that that is all out of the way, you should have no further issues installing the dependencies through Pip.

Lastly, we need to install G++ and Build Essential

> sudo apt install g++ build-essential

Downloading and building the Llama project

For downloading the Llama project into our workspace, we will use the “git clone” command.

> git clone https://github.com/ggerganov/llama.cpp

After downloading is finished, we will enter the folder we just downloaded

> cd llama.cpp

Now, we will “make” the project to build the necessary files to run our models

> make
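If your source PC has multiple CPU cores, you can speed the build up by letting make use all of them (this assumes the default make-based build of llama.cpp):

> make -j$(nproc)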

While this is running (or after it is done), we can download one of the models that we want to run on our RPi5. The models can be downloaded from any source (such as Hugging Face), but the tutorial I followed used a magnet link that can be used with a torrent client. I used qBittorrent as a client, which can be installed via

> sudo apt install qbittorrent

After it is installed, you can use the following command to open the application

> qbittorrent

This will open the GUI of the torrent client. Now you can add a magnet link by clicking on the “link” icon, and paste in the following magnet link

magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA

Since we want to try and run our RPi5 with the 7B model (the other models are a lot larger and require a lot more capacity), we want to select only the 7B folder, as well as the files below it (tokenizer.checklist.chk and tokenizer.model)

Once the files have finished downloading, copy them to the “llama.cpp/models” folder. One can do this through the command line in a terminal, or by opening the file explorer GUI. The following command will open the file explorer in the current folder. You can then locate the 7B folder and tokenizer files, and copy them to the llama.cpp/models folder.

> gio open .
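If you prefer copying from the terminal instead, something along these lines should work. The source path is an assumption; adjust it to wherever your torrent client saved the files, and run the commands from inside the llama.cpp folder:

> cp -r ~/Downloads/LLaMA/7B ./models/
> cp ~/Downloads/LLaMA/tokenizer* ./models/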

Now before we can start quantizing our model to make it possible to run it on our RPi5, we need to edit a JSON file. This file contains a value that will otherwise make converting the model crash. It is not entirely clear why this value is there in the first place, as the fix is merely adjusting a number, after which everything runs smoothly.

In the llama.cpp/models/7B folder, locate the file “params.json” and open it. I used VIM to edit this file.

> vi llama.cpp/models/7B/params.json

After opening it with VIM, you press “i” to enter insert (edit) mode. Edit the value “vocab_size” from -1 to 32000. Once you are done editing, press the “ESC” key to quit editing mode. To save your changes, type a colon “:”, followed by “wq”. Press the “Enter” key to proceed, and your file has been updated!
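If you prefer a one-liner over VIM, a sed command along these lines should do the same edit (this assumes the file contains the exact text "vocab_size": -1, spacing included):

> sed -i 's/"vocab_size": -1/"vocab_size": 32000/' llama.cpp/models/7B/params.json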

Now we need to convert the model to GGML FP16 format. This may take a while, depending on your PC performance. This is also the reason we have a source PC running Linux, since this is the part that will not work on a Raspberry Pi. To convert the model, we use built-in functionality. Note that we run this from the llama.cpp folder (since our command starts with “models/”).

> python3 convert.py models/7B

After the conversion is done, we need to quantize the model. This means that this large model can run more efficiently on devices with less performance. From the tutorial I used as a basis:
“This basically means that all the neural network weights used in the model will be changed from float16 to int8 which will make it much easier for the not-so-powerful machines to handle. I strongly recommend (but it’s not needed) to read more about it here.”

To do the quantization, we again use existing functionality

> ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
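As a rough sanity check: 7 billion parameters at 2 bytes each puts the f16 file at around 13GB, while the 4-bit q4_0 file should come out somewhere around 4GB (exact sizes vary a bit per llama.cpp version). You can compare the two with:

> ls -lh ./models/7B/ggml-model-*.gguf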

After quantization, we can check whether the model is working by running the following command from the llama.cpp folder

> ./examples/chat.sh

I immediately got an error stating the model could not be found.

Luckily there was a clear root cause. The chat.sh script expected to find a “llama-7b” folder, whereas I had the downloaded “7B” folder. Renaming the folder solved this problem.

> mv models/7B models/llama-7b

Now I got the LLM to work, and I got a chatbot called Bob I could ask questions!
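As a side note: the renaming is only needed because chat.sh has the llama-7b path hard-coded. You can also point the main binary at the quantized model yourself; the exact flags may differ a bit between llama.cpp versions:

> ./main -m ./models/llama-7b/ggml-model-q4_0.gguf -n 256 --color -i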

Now that we know all is working, the next step is to copy the “llama-7b” folder from the “models” folder onto a USB stick.

Installing the LLM on your Raspberry Pi

Now, boot the RPi5 if you haven’t already, and open a terminal (CTRL+ALT+T). We need to run a few of the same commands as we did on our source PC since we need the llama framework on our Pi as well. First make sure your system is up to date, and make sure Git is installed to clone the project.

> sudo apt update
> sudo apt install git

Next, we clone the llama.cpp project, as we did on our source PC

> git clone https://github.com/ggerganov/llama.cpp

We install the same modules as we installed on our source PC. Keep in mind the workarounds that were needed there, as we may also need them on our RPi5

> python3 -m pip install torch numpy sentencepiece

Now we make sure to have G++ and Build Essential installed

> sudo apt install g++ build-essential

Go into the llama.cpp folder and make (build) the llama project

> cd llama.cpp
> make

Next, move the content from your external drive to the /models/ folder in your llama.cpp project. I of course had issues because my external hard drive could not be detected or mounted properly. Thankfully, with the help of the internet, I got through.

From THIS POST on stackoverflow, I got the following commands

> sudo fdisk -l

This command lists the available disks. You should find your external harddrive or USB stick here

> sudo mount [your external drive] /mnt

For example, if your external drive was found at /dev/sdxn, then the command would be

> sudo mount /dev/sdxn /mnt

After mounting is successful, you can find the contents of your external drive in the /mnt folder. If you want to copy the files visually rather than via the command line, you can open the /mnt folder in the file explorer by navigating to it and running

> gio open .
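To copy from the terminal instead, something like this should do it (assuming the folder on your stick is called llama-7b and llama.cpp sits in your home folder; add sudo if you run into permission errors):

> cp -r /mnt/llama-7b ~/llama.cpp/models/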

All we have to do now is to run the model exactly how we did on our source PC. From inside the llama.cpp folder, run the following command

> ./examples/chat.sh

There you go! You are now running your very own (offline) AI chatbot on your Raspberry Pi 5!

Note that Llama does not run super smoothly on the RPi5, but it is still very cool that such a big model can run at all on such a small device. I’m also not entirely convinced by the “knowledge” this model has, judging by the answers it gave me. What is cool, though, is that we can ask questions and it generates answers for us, which is basically what generative AI is all about.

Again, I am by no means an expert in any of the fields that are touched by this blog/tutorial. All initial credits go to Marek Żelichowski and the work he has done on his blog/tutorial. My attempt was to clarify some of the steps that got me in trouble when following some of the tutorials that can currently be found online.
