How To Deploy LLaMA 3 Locally With Llama.cpp on Your ARM Macs
Step-by-Step Guide to Deploy LLaMA Locally on Your Mac
When ARM-based Macs first came out, using a Mac for machine learning seemed as unrealistic as using it for gaming. But now you can deploy and even fine-tune LLMs on your Mac. With the recent release of LLaMA 3, running a genuinely capable model locally is finally practical.
Deploying a language model locally comes with many benefits, like better privacy and lower costs, but it only pays off if the model responds quickly; you don't want to wait several minutes for a reply. Luckily, llama.cpp can use the Mac's Metal GPU, so your model runs much faster on your Mac.
There are many guides on deploying LLaMA 2, like the great video by Alex Ziskind, but deploying LLaMA 3 is a bit different, so this article walks you through the updated steps. I use the Meta-Llama-3-8B-Instruct model as the example, but deploying other LLaMA 3 models is similar. If your Mac is powerful enough (and has the disk space), feel free to use the 70B version (the exact token generation speed depends on your Mac, but the 70B version is generally about three times slower than the 8B version)!
Requirements
- Python 3.11: It is recommended to use Conda for this.
- Git LFS: Required for downloading the model files. Install it by running git lfs install.
- Hugging Face Account and LLaMA 3 access: You need access to LLaMA 3 via Hugging Face. You can request access, and it usually gets approved within a few hours (faster than Meta's ITSD, I'd bet).
- Hugging Face CLI: For downloading the model.
- Xcode Command Line Tools: Necessary for building llama.cpp. Install them by running xcode-select --install in your terminal.
- llama.cpp: Clone it from the GitHub repo.
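Before starting, you can sanity-check the command-line prerequisites with a short script along these lines (the tool names match the list above; the xcode-select check is macOS-only):

```shell
# Check that the command-line prerequisites are on PATH.
for tool in python3 git git-lfs; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
  fi
done

# Xcode Command Line Tools (macOS only).
if xcode-select -p >/dev/null 2>&1; then
  echo "ok: Xcode Command Line Tools"
else
  echo "missing: Xcode Command Line Tools"
fi
```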
Steps to Deploy LLaMA 3 Locally
1. Conda (optional)
Conda helps manage packages and dependencies. You can download it from Anaconda. This makes sure your environment is isolated and won’t mess with other projects.
After installing Conda, run these commands in your terminal. They set up and activate an environment named llama3 for running LLaMA 3:
conda create --name llama3 python=3.11
conda activate llama3
2. Install Hugging Face CLI
The Hugging Face CLI lets you download and manage models. Install it with:
pip3 install -U "huggingface_hub[cli]"
huggingface-cli login
(If this command doesn't work for you, you can install the Hugging Face CLI using brew install huggingface-cli instead.)
Log in using your Hugging Face token, which you can create under Settings → Access Tokens on the Hugging Face website.
3. Download the Hugging Face Model
Clone the LLaMA 3 model repository (make sure you have been granted access to LLaMA 3 first). The download can take a while depending on your network speed.
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
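If you'd rather not pull the full git history, the Hugging Face CLI can also fetch the model files directly via its download subcommand (the target folder name here is just an example):

```shell
# Download the model files without cloning the git repo.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir Meta-Llama-3-8B-Instruct
```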
4. Clone and Build llama.cpp
Clone the llama.cpp repo and build it by running the following commands. This is what will let you run the LLaMA model on your Mac:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
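On Apple Silicon, recent llama.cpp builds enable Metal by default, so a plain make is enough. You can speed the build up by parallelizing it across your CPU cores (hw.ncpu is the macOS sysctl key for the core count):

```shell
# Build llama.cpp using all CPU cores (macOS).
make -j"$(sysctl -n hw.ncpu)"
```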
5. Install Dependencies
After building llama.cpp, you need to install some Python packages required for it to work with the model. Install them with (make sure you are in the llama.cpp folder):
python3 -m pip install -r requirements.txt
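Among other things, requirements.txt pulls in the gguf and torch packages that the conversion script needs (package names assumed from llama.cpp's requirements files at the time of writing). A quick way to confirm the install succeeded is to import them:

```shell
# Fails loudly if the conversion dependencies are missing.
python3 -c "import gguf, torch; print('conversion dependencies OK')"
```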
6. Convert the Hugging Face Model
You need to convert the Hugging Face model to a format that llama.cpp can use. Run this command; it creates a file named ggml-model-f16.gguf in the folder where you downloaded the LLaMA 3 model (make sure you run the command in the llama.cpp folder):
python3 convert-hf-to-gguf.py /your-folder/Meta-Llama-3-8B-Instruct
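GGUF files start with the four ASCII bytes "GGUF", so a quick way to sanity-check the conversion output is to look at the file header (replace /your-folder with your actual path, as above):

```shell
# The first four bytes of a valid GGUF file spell "GGUF".
head -c 4 /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf
echo
```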
7. Quantize the Model
Quantizing the model makes it smaller and faster by reducing the precision of the weights (in our case, it shrinks the model from 16GB to around 4GB). This step creates a ggml-model-Q4_0.gguf file in the LLaMA 3 model folder (make sure you run the command in the llama.cpp folder):
./quantize /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf q4_0
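The size drop is easy to sanity-check: f16 stores 2 bytes per weight, while Q4_0 packs weights into 32-element blocks with one fp16 scale each, which works out to roughly 4.5 bits per weight. A back-of-the-envelope estimate for a model with about 8 billion weights:

```shell
# Rough size estimate: ~8B weights at 2 bytes each vs ~4.5 bits each.
PARAMS=8000000000
F16_GB=$(awk "BEGIN { printf \"%.1f\", $PARAMS * 2 / 1e9 }")
Q4_GB=$(awk "BEGIN { printf \"%.1f\", $PARAMS * 4.5 / 8 / 1e9 }")
echo "f16: ${F16_GB} GB  Q4_0: ${Q4_GB} GB"
```

(The real files are a bit different because some tensors are kept at higher precision.)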
8. Run the Model
You are now ready to start a conversation with the locally deployed LLaMA 3 model! Run this command to start a simple conversation with your model (make sure you run the command in the llama.cpp folder):
./main -m /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-Q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
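In that command, -m points at the model file, -n caps the number of generated tokens, -i enables interactive mode, and -r "User:" hands control back to you whenever the model emits that string. If you'd rather talk to the model from a browser, the repo also ships an HTTP server binary built by the same make invocation (at the time of writing it listens on port 8080 by default; -c sets the context size):

```shell
# Start llama.cpp's built-in HTTP server, then open http://localhost:8080
./server -m /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-Q4_0.gguf -c 2048
```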
For more detailed commands and options, refer to the llama.cpp documentation.
Now, you can enjoy the power of LLaMA 3 running locally on your ARM Mac!