Llama 2 for Mac M1

Anthony Sun
4 min read · Aug 13, 2023


Getting Llama 2 working on Mac M1 with llama.cpp and python binding

Courtesy of Adobe Stock (https://stock.adobe.com/au/search?k=llama+love)

I have been curious about LLMs ever since Facebook released Llama 2, and I have been trying to get it working on my Mac. First, I attempted to use the Hugging Face model meta-llama/Llama-2-7b-chat-hf. As usual, the setup process seemed straightforward, but the model simply refused to run, apart from maxing out my memory. It was extremely slow (or hung entirely); xFormers (https://github.com/facebookresearch/xformers), which it uses, seems to be built for NVIDIA GPUs. (If anyone can figure out how to get it working, please let me know.)

Through hours of trial and error and mindless googling, I finally managed to get it working. So I'd like to document it, hoping it can save someone else some time. Or maybe me, at a later stage.

The method I am using has three steps; I will try to be as brief as possible.

I am merely a documenter of the process; kudos and thanks to all the smart people out there who got this amazing model working.

1. Download the official Facebook model

The GitHub repository for Facebook's Llama 2 is here: https://github.com/facebookresearch/llama

Open terminal and clone the repository:

cd ~/Documents
git clone git@github.com:facebookresearch/llama.git

To use the Facebook model for free (unless you are serving more than 700 million monthly users), you need to request a download link from Facebook. Once you have agreed to the terms, an email with a download link will be sent to you (the first request typically takes a couple of days; subsequent requests take seconds). The link is only valid for 24 hours, but you can always re-request one; to avoid the hassle, just download the models you want within 24 hours.

Downloading the model is fairly easy: just cd to your repo and run download.sh (you may need to chmod +x download.sh first):

cd ~/Documents/llama
./download.sh

You don’t need to download them all; personally I think 7B performs pretty well. Depending on the task, you may also want to download 7B-chat, which is tuned for conversation. I am finding that, for zero-shot prompting, 7B-chat generates better results for me.

2. Use llama.cpp to convert and quantize the downloaded models

The model you have downloaded still needs to be converted and quantized to work. Quantization reduces the precision of the weights to save memory and compute. Models are typically trained in high-precision float32; we can lower that to fewer bits to suit the specs of our MacBooks.
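To build some intuition for what this buys us, here is a toy sketch in Python of block-wise 4-bit quantization. It only illustrates the idea (a shared scale per small block of weights, plus 4-bit integers); it is not llama.cpp's exact Q4_0 layout, which packs two values per byte and stores the scale in fp16.

import numpy as np

def quantize_q4_toy(weights):
    # Toy 4-bit quantization: one float scale per block of 32 weights.
    # Illustrative only; not llama.cpp's real on-disk Q4_0 format.
    blocks = weights.reshape(-1, 32)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4_toy(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4_toy(w)
print("max reconstruction error:", np.abs(w - dequantize_q4_toy(q, s)).max())

At 4 bits per weight plus one scale per block, storage drops to roughly a seventh of float32, which is where the size reduction later in this step comes from.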

The first step is to clone the llama.cpp repository (https://github.com/ggerganov/llama.cpp):

cd ~/Documents
git clone git@github.com:ggerganov/llama.cpp.git

Then install Homebrew and the Xcode command line tools, and make the binary:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

xcode-select --install

cd llama.cpp

LLAMA_METAL=1 make

The next step is to convert the downloaded model to the GGML format (https://ggml.ai/). The following creates a virtual environment, installs the required packages, and converts the Llama 2 7B chat model:

python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt

python convert.py ../llama/llama-2-7b-chat

This will generate another file inside the model directory you downloaded from Facebook. I didn’t specify a name out of laziness; the default filename should look like ggml-model-f32.bin.

To quantize the model (make it smaller), the following quantizes it from F32 down to 4-bit integers (and, surprisingly, it still performs well):

./quantize ../llama/llama-2-7b-chat/ggml-model-f32.bin ../llama/llama-2-7b-chat/ggml-model-f32_q4_0.bin Q4_0

We are almost done, and you can see a significant size reduction, from 14 GB down to around 3 GB.
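If you want to verify the reduction yourself, here is a quick file-size check from Python (the paths assume the same layout as above):

import os

base = os.path.expanduser("~/Documents/llama/llama-2-7b-chat")
for name in ("ggml-model-f32.bin", "ggml-model-f32_q4_0.bin"):
    size_gb = os.path.getsize(os.path.join(base, name)) / 1024**3
    print(f"{name}: {size_gb:.1f} GB")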

3. Use the Python binding via llama-cpp-python

To use it from Python, we can install another helpful package. The installation is the same as for any other package, but make sure you enable Metal:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
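A quick sanity check that the install worked is to import the binding and print its version (recent versions of llama-cpp-python expose a __version__ attribute):

import llama_cpp

# If this prints a version string, the package built and imported fine.
print(llama_cpp.__version__)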

Using the model

To use the model, I have put together some sample code below. The prompt template is something I dug up on Reddit. If someone finds a better one, please let me know. :)

import os
from llama_cpp import Llama

# "~" is not expanded automatically, so expand it ourselves
model_path = os.path.expanduser("~/Documents/llama/llama-2-7b-chat/ggml-model-f32_q4_0.bin")
model = Llama(model_path=model_path,
              n_ctx=2048,        # context window size
              n_gpu_layers=1,    # enable Metal GPU offload
              use_mlock=True)    # lock memory so it is not swapped out


prompt = """
[INST]<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

What is the best way to learn programming?
[/INST]
"""

output = model(prompt=prompt, max_tokens=120, temperature=0.2)
output
{'id': 'cmpl-92ec0995-11e5-4eeb-9311-98d0f23c4885',
'object': 'text_completion',
'created': 1691921343,
'model': './models/ggml-llama-7b-chat-q4_0.bin',
'choices': [{'text': 'Thank you for asking! Learning programming can be an exciting and rewarding journey, and there are several great ways to get started. Here are some recommendations:\n1. Online Courses: Websites such as Codecademy, Coursera, and Udemy offer a wide range of programming courses, from beginner to advanced levels. These courses are often interactive and include practical exercises to help you learn by doing.\n2. Books: If you prefer learning through reading, there are many excellent books on programming available. "Code Complete" by Steve McConnell, "C',
'index': 0,
'logprobs': None,
'finish_reason': 'length'}],
'usage': {'prompt_tokens': 143,
'completion_tokens': 120,
'total_tokens': 263}}

To use LangChain (API reference: https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html):

from langchain.llms import LlamaCpp
llm = LlamaCpp(model_path=model_path)

llm(prompt)
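LlamaCpp accepts most of the same parameters as the raw binding, so you can carry the earlier settings across (a sketch based on the API reference above; check the link for the full parameter list):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path=model_path,  # same expanded path as before
    n_ctx=2048,             # context window size
    n_gpu_layers=1,         # enable Metal GPU offload
    use_mlock=True,         # lock memory so it is not swapped out
    temperature=0.2,
    max_tokens=120,
)
print(llm(prompt))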


Anthony Sun

Data Science Manager in finance who is always learning.