How To Deploy LLaMA 3 Locally With Llama.cpp on Your ARM Macs
Step-by-Step Guide to Deploy LLaMA Locally on Your Mac
When ARM-based Macs first came out, using a Mac for machine learning seemed as unrealistic as using it for gaming. But now you can deploy and even fine-tune LLMs on your Mac. With the recent release of LLaMA 3, running a genuinely capable model locally is finally practical.
Deploying a language model locally comes with many benefits, like better privacy and lower costs, but it only pays off if the model responds quickly; you don't want to wait several minutes for a reply. Luckily, llama.cpp can use the Mac's Metal GPU, so your model runs much faster on your Mac.
There are many guides on deploying LLaMA 2, like the great video by Alex Ziskind, but deploying LLaMA 3 is a bit different, so this article walks you through the updated steps. I use the Meta-Llama-3-8B-Instruct model as the example, but deploying other LLaMA 3 models is similar. If your Mac is powerful enough (and has the disk space), feel free to use the 70B version (the exact token generation speed depends on your Mac, but the 70B version is generally about three times slower than the 8B version)!
Requirements
- Python 3.11: It is recommended to use Conda for this.
- Git LFS: Required for downloading the model files. Install it by running git lfs install.
- Hugging Face Account and LLaMA 3 access: You need access to LLaMA 3 via Hugging Face. You can request access, and it usually gets approved within a few hours (faster than Meta's ITSD, I'd bet).
- Hugging Face CLI: For downloading the model.
- Xcode Command Line Tools: Necessary for building llama.cpp. Install them by running xcode-select --install in your terminal.
- llama.cpp: Clone it from the GitHub repo.
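Before starting, you can sanity-check the command-line prerequisites with a short script along these lines (the tool names match the list above; the xcode-select check is macOS-only):

```shell
# Check that the command-line prerequisites are on PATH.
for tool in python3 git git-lfs; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
  fi
done

# Xcode Command Line Tools (macOS only).
if xcode-select -p >/dev/null 2>&1; then
  echo "ok: Xcode Command Line Tools"
else
  echo "missing: Xcode Command Line Tools"
fi
```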
Steps to Deploy LLaMA 3 Locally
1. Conda (optional)
Conda helps manage packages and dependencies. You can download it from Anaconda. This makes sure your environment is isolated and won’t mess with other projects.
After installing Conda, run these commands in your terminal. They set up and activate an environment named llama3 for running LLaMA 3:
conda create --name llama3 python=3.11
conda activate llama3
2. Install Hugging Face CLI
The Hugging Face CLI lets you download and manage models. Install it with:
pip3 install -U "huggingface_hub[cli]"
huggingface-cli login
(If this command doesn't work for you, you can install the Hugging Face CLI using brew install huggingface-cli instead.)
Log in using your Hugging Face token, which you can create under Settings → Access Tokens on the Hugging Face website.
3. Download the Hugging Face Model
Clone the LLaMA 3 model repository (make sure you have been granted access to LLaMA 3 first). The download can take a while depending on your network speed.
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
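If you'd rather not pull the full git history, the Hugging Face CLI can also fetch the model files directly via its download subcommand (the target folder name here is just an example):

```shell
# Download the model files without cloning the git repo.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --local-dir Meta-Llama-3-8B-Instruct
```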
4. Clone and Build llama.cpp
Clone the llama.cpp repo and build it by running the following commands. This is what will let you run the LLaMA model on your Mac:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
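On Apple Silicon, recent llama.cpp builds enable Metal by default, so a plain make is enough. You can speed the build up by parallelizing it across your CPU cores (hw.ncpu is the macOS sysctl key for the core count):

```shell
# Build llama.cpp using all CPU cores (macOS).
make -j"$(sysctl -n hw.ncpu)"
```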
5. Install Dependencies
After building llama.cpp, you need to install some Python packages required for it to work with the model. Install them with (make sure you are in the llama.cpp folder):
python3 -m pip install -r requirements.txt
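Among other things, requirements.txt pulls in the gguf and torch packages that the conversion script needs (package names assumed from llama.cpp's requirements files at the time of writing). A quick way to confirm the install succeeded is to import them:

```shell
# Fails loudly if the conversion dependencies are missing.
python3 -c "import gguf, torch; print('conversion dependencies OK')"
```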
6. Convert the Hugging Face Model
You need to convert the Hugging Face model to a format that llama.cpp can use. Run this command; it creates a file named ggml-model-f16.gguf in the folder where you downloaded the LLaMA 3 model (make sure you run the command in the llama.cpp folder):
python3 convert-hf-to-gguf.py /your-folder/Meta-Llama-3-8B-Instruct
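GGUF files start with the four ASCII bytes "GGUF", so a quick way to sanity-check the conversion output is to look at the file header (replace /your-folder with your actual path, as above):

```shell
# The first four bytes of a valid GGUF file spell "GGUF".
head -c 4 /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf
echo
```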
7. Quantize the Model
Quantizing the model makes it smaller and faster by reducing the precision of the weights (in our case, it shrinks the model from 16GB to around 4GB). This step creates a ggml-model-Q4_0.gguf file in the LLaMA 3 model folder (make sure you run the command in the llama.cpp folder):
./quantize /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf q4_0
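The size drop is easy to sanity-check: f16 stores 2 bytes per weight, while Q4_0 packs weights into 32-element blocks with one fp16 scale each, which works out to roughly 4.5 bits per weight. A back-of-the-envelope estimate for a model with about 8 billion weights:

```shell
# Rough size estimate: ~8B weights at 2 bytes each vs ~4.5 bits each.
PARAMS=8000000000
F16_GB=$(awk "BEGIN { printf \"%.1f\", $PARAMS * 2 / 1e9 }")
Q4_GB=$(awk "BEGIN { printf \"%.1f\", $PARAMS * 4.5 / 8 / 1e9 }")
echo "f16: ${F16_GB} GB  Q4_0: ${Q4_GB} GB"
```

(The real files are a bit different because some tensors are kept at higher precision.)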
8. Run the Model
You are now ready to start a conversation with the locally deployed LLaMA 3 model! Run this command to start a simple conversation with your model (make sure you run the command in the llama.cpp folder):
./main -m /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-Q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
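In that command, -m points at the model file, -n caps the number of generated tokens, -i enables interactive mode, and -r "User:" hands control back to you whenever the model emits that string. If you'd rather talk to the model from a browser, the repo also ships an HTTP server binary built by the same make invocation (at the time of writing it listens on port 8080 by default; -c sets the context size):

```shell
# Start llama.cpp's built-in HTTP server, then open http://localhost:8080
./server -m /your-folder/Meta-Llama-3-8B-Instruct/ggml-model-Q4_0.gguf -c 2048
```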
For more detailed commands and options, refer to the llama.cpp documentation.
Now, you can enjoy the power of LLaMA 3 running locally on your ARM Mac!