Run an LLM on Apple Silicon Mac using llama.cpp

Explore how to configure and experiment with large language models in your local environment

Peter Stevens
9 min read · Dec 27, 2023

These are directions for quantizing and running open-source large language models (LLMs) entirely on a local computer. The computer I used in this example is a MacBook Pro with an M1 processor and 16 GB of memory. The LLM I used for this example is Mistral 7B (specifically the OpenHermes 2.5 fine-tune downloaded below); I show how to fetch this model and quantize its weights for faster operation and smaller memory requirements. Any Apple Silicon Mac with 16 GB of memory or more should be able to run this model, and there are many other models in the same family available. All the models and code are free and open source. Thanks to Alex Ziskind for a great explanation of this process, and to Georgi Gerganov for the llama.cpp GitHub project.

A quick note before you get started.

There are simpler ways to get LLMs running locally. These include a marvelous program called LM Studio, which lets you fetch and run models using a GUI, and Ollama, a command-line tool for running models. Those are two I’ve used; there are many more. However, I wanted to understand this process in more detail, hence the steps outlined below. If you just want to run LLMs locally, or you aren’t comfortable with terminal commands and building software, using Ollama or LM Studio makes sense. But if you’re up for an adventure, let’s get started!

Requirements

  • Python
  • Local development tools (installed with Xcode)
  • Conda (optional)
  • Fast network connection

The local computer needs Python and the local development tools. This method uses programs built from source, plus a Python script, to configure the model and to run a local server for accessing it. I recommend setting up a Conda environment, although this is optional.

You also want to do this with a fast, reliable network connection: the models are large, and you can expect to download a few tens of GB with the example model. Note that you can also download pre-quantized models, which is the way to go if your network connection is slow. I’ve illustrated a more general process here so you can experiment with different quantizations of different models. Different models and quantizations perform differently for various use cases, and taking care with the completion requests you send to the model can also have large effects on performance. Experimentation encouraged!

Step 1: Set up your environment

Optionally, set up an isolated environment for running the Python utility used to convert the model. You’ll also need to install Git LFS, which lets git clone handle the very large files that make up the model itself.

#Do some environment and tool setup
conda create --name llama.cpp python=3.11
conda activate llama.cpp

#Install Git LFS so git clone can handle very large files, such as the models themselves
brew install git-lfs
git lfs install

Step 2: Download, configure, and test a model

The process shown below uses source code and a Python utility from a GitHub project called llama.cpp; you will build some configuration utilities in a directory on your local computer. Next, you will use these utilities to make the model smaller by converting its format and quantizing its values. (This allows the model to run in less memory than it would ordinarily require.) Then you can optionally run a testing step to verify that the model has been built correctly and to make a quick check of its performance on your particular computer.


#The llama.cpp GitHub project provides source code for managing the model and the library requirements for the Python program
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
#Install the Python libraries needed to convert models to .gguf format; they are listed in this file
pip install -r requirements.txt
#Build the executables you'll use to configure the model you download
make

#Get the model from Hugging Face, cloning it locally as openhermes-7b-v2.5, then move it into the models directory
#Models from this family are at https://huggingface.co/teknium
git clone https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B openhermes-7b-v2.5
mv openhermes-7b-v2.5 models/

#Note that the model download can take some time, depending on your connection speed. Be patient. Checking network usage and the size of the local file can help reassure you that the process is working.

#Convert model to a standard format and quantize it
#Check the sizes of the model files before attempting to test or run them on a 16 GB machine
#I've found that I can run models up to about 10 GB on a computer with 16 GB of RAM
python3 convert.py ./models/openhermes-7b-v2.5 --outfile ./models/openhermes-7b-v2.5/ggml-model-f16.gguf --outtype f16
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf q4_k
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf q8_0

#Note: I suggest testing and running the q4_k quantization of the model first. With more than 16 GB of memory you can run larger models and larger quantizations. The rule of thumb is that the quantized model should be no bigger than about 2/3 of your RAM (see the size-check sketch after the benchmark command below).

#Test the performance of various quantizations and models; useful for comparing raw token-generation speed
#The following tests the q4_k quantization
./batched-bench ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf 4096 0 99 0 2048 128,512 1,2,3,4
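
As a quick sanity check before testing or serving a model, you can compare the sizes of the quantized files against the 2/3-of-RAM rule of thumb. Here is a minimal Python sketch; the directory path matches the commands above, and TOTAL_RAM_GB is a value you set for your own machine:

#Compare each .gguf file's size against roughly 2/3 of your machine's RAM
from pathlib import Path

TOTAL_RAM_GB = 16  #set this to your machine's memory
BUDGET_GB = TOTAL_RAM_GB * 2 / 3

model_dir = Path("./models/openhermes-7b-v2.5")
for gguf in sorted(model_dir.glob("*.gguf")):
    size_gb = gguf.stat().st_size / 1e9
    verdict = "should fit" if size_gb <= BUDGET_GB else "probably too large"
    print(f"{gguf.name}: {size_gb:.1f} GB ({verdict}; budget is about {BUDGET_GB:.1f} GB)")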

Step 3: Get ready to use the model by starting the server

At this point you should have at least one model that you’ve converted and quantized. Now it’s time to use it. You do this by starting a server that allows you to connect to the model in two different ways:

  • access the model you’ve configured as a chat website served locally, with no network required and no data shared with anyone else
  • or write code to access it, which lets you prompt the model more effectively and use it within a larger program.

Go ahead and start the server from your terminal. You can run it either in the foreground or background. Here’s how to do either one.

Run the server in the foreground:

#Start a server to allow access to models from Python code, for example
./server -m models/openhermes-7b-v2.5/ggml-model-q4_k.gguf --port 1234 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 99 -n 512

#if the server is running in the foreground, stop it with Ctrl-C

There should be a line like this in your terminal:

llama server listening at http://0.0.0.0:1234

indicating that the server is running.

Or, run the server in the background:

#or to run it in the background…
./server -m models/openhermes-7b-v2.5/ggml-model-q4_k.gguf --port 1234 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 99 -n 512 &

#if running in the background: ps aux | grep ./server to get the PID; then kill <PID> to stop the server
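
Whichever way you start it, you can confirm the server is reachable with a quick check using Python’s standard library. This assumes the port 1234 used in the commands above; the root URL serves the built-in chat page, so any HTTP response means the server is up:

#Quick connectivity check for the local llama.cpp server on port 1234
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:1234/", timeout=5) as resp:
        print(f"Server is up, HTTP status {resp.status}")
except OSError as exc:
    print(f"Server not reachable: {exc}")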

Step 4: Access the model using the server you just started

You can access the model either as a chat webpage or by writing code against the API that the server implements.

Use your model as a chat server

Entering localhost:1234 in a browser should produce a GUI chat window. Interact with it as you would an online model such as ChatGPT. The web server and the model run entirely locally on your computer. When you do this, the web page in your browser should look something like this:

Screenshot of the chat window

You can start a chat by entering a question in the bottom panel where it says “Say something…” Hit return after entering your query and you’re chatting with your own local model.

Use your model by accessing it with code

You can write code that uses the model by making completion requests. The server understands the OpenAI API protocol for chat.completions requests and produces completions in the same format. The code is the same as code that uses an OpenAI model, except that the model runs entirely locally; it doesn’t access the internet at all. To do this in Python, you will have to install the openai library and import it in your code. openai is a PyPI library that provides the client for the OpenAI API; with it, you can talk to your local model using the same protocol you would use to access OpenAI’s hosted models.

For example:

import os
# import the OpenAI class. The local model server implements the same interface as the remote server hosted by OpenAI, so you can use the same code for both.
from openai import OpenAI

# Instantiate OpenAI to use the local server; no need for an API key with this local server.
# The var client refers to the local server running the local model.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Create a completion using the local server. The var completion will contain the response from the server.
completion = client.chat.completions.create(
    model="local-model",  # this field is currently unused because the server only has one model
    temperature=0.9,      # temperature controls the "creativity" of the model

    # messages is a list of structured text input to the model.
    # the messages in this array create a prompt that the model uses to generate a response.
    messages=[
        {"role": "system", "content": "Always answer in Colloquial English."},
        {"role": "user", "content": "Introduce yourself as a helpful assistant who can help writing song lyrics. Include your name, and favorite music genre."},
    ],
)

# completion is a data structure containing the response from the model.
# Print the first message from it. You can also print the whole completion to see more details, print(completion).
print(completion.choices[0].message)

You will need to install openai to get this to run. This is a “hello world” completion request. The code points to the local server, which, with the example command above, is at localhost on port 1234. By default this server does not require an API key. A completion request (a prompt, if you’re used to something like ChatGPT) done programmatically is structured as a messages array along with some optional parameters; the simple example above uses one system message and one user message.

I ran this code, and the model generated the following text:

ChatCompletionMessage(content="Hey there! I'm a helpful assistant here to assist you with writing song lyrics. My name is Melody, and I'm all about that pop music. Let's create some amazing lyrics together!")

You can experiment with different messages and see what the model produces. Asking it to answer in Pirate English (instead of Colloquial English) can be amusing.

You can write more elaborate code to use the model as part of a useful tool: a local chat application that never shares data, an assistant programmer, or an application for reasoning about a set of information you supply to the model. The possibilities are limited only by your imagination.
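
For example, here is a minimal sketch of a local chat loop that builds on the code above: it keeps the conversation history in the messages list so the model sees earlier turns. The system prompt and temperature are placeholders to adjust to taste, and the server address is the same localhost:1234 used throughout:

#A minimal local chat loop; conversation history accumulates in the messages list
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
messages = [{"role": "system", "content": "You are a concise, helpful assistant."}]

while True:
    user_text = input("You: ").strip()
    if user_text.lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(
        model="local-model",   # unused by the server, as above
        temperature=0.7,
        messages=messages,
    )
    reply = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")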

Notes:

The Python program convert.py converts the model into a compact binary format called GGUF (the successor to the earlier GGML format).

The GGUF format is a compact binary format designed for efficient storage and loading of deep learning models on devices with limited memory and computing resources, such as a MacBook Pro with 16 GB of memory. It supports various data types, including FP16 (half-precision floating point), FP32 (single-precision floating point), and a range of quantized integer types.

Converting a trained deep learning model to the GGUF format with the FP16 data type reduces the model size and can improve inference speed while maintaining accuracy. FP16 provides a good balance between accuracy and efficiency: it requires half the memory and computational resources of FP32 while still preserving most of the accuracy.

However, because this is intended to run on a computer with limited memory, the model will be quantized to further reduce its size. Model quantization is a technique used to reduce the memory footprint and computation requirements of deep learning models. By quantizing the model weights from floating-point values (such as FP16) to lower-precision integers, the model can be stored and processed more efficiently, which can lead to faster inference times and lower memory usage. Note that model quantization can introduce some accuracy loss, as the quantized weights may not exactly match the original floating-point values. In many cases, the accuracy loss is negligible or acceptable, and the benefits of quantization outweigh the drawbacks.
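
To make this concrete, here is a rough, back-of-envelope calculation of how much storage a model of roughly 7 billion parameters needs at different precisions. The parameter count is rounded and the bits-per-weight figures for the quantized formats are approximations that include per-block scale overhead, so real file sizes will differ somewhat:

#Rough model-size estimates for a ~7B-parameter model at different precisions
params = 7e9

for label, bits_per_weight in [("FP32", 32), ("FP16", 16), ("q8_0", 8.5), ("q4_k", 4.5)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    print(f"{label:>5}: ~{size_gb:.1f} GB")

On a 16 GB machine, this is why the FP16 file (around 14 GB) is impractical to serve, while the q4_k file (around 4 GB) fits comfortably within the 2/3-of-RAM rule of thumb.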

The command ./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf q4_k performs model quantization on the ggml-model-f16.gguf file and saves the quantized model to the ggml-model-q4_k.gguf file using the q4_k quantization scheme.

  • ./quantize: runs the quantize program (built earlier with make), a tool for model quantization.
  • ./models/openhermes-7b-v2.5/ggml-model-f16.gguf: specifies the path to the input model file in the FP16 format.
  • ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf: specifies the path to the output model file in the quantized format.
  • q4_k: specifies the quantization scheme to use. The q4_k (“k-quant”) scheme stores weights as roughly 4-bit integer codes in blocks, together with per-block scale information.

The q4_k quantization scheme used in this command is a reasonable choice for quantizing large language models like Mistral 7B. It groups weights into blocks and represents each block with low-bit integer codes plus per-block scaling information, which yields significant memory savings and can speed up inference.
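
To illustrate the general idea (this is not llama.cpp’s actual q4_k implementation, which uses super-blocks and quantized scales and minimums), here is a toy block-wise 4-bit quantizer in Python. It assumes NumPy is installed and exists purely to show how a per-block scale and minimum let 4-bit codes approximate floating-point weights:

#Toy block-wise 4-bit quantization: 4-bit codes plus a per-block scale and minimum
import numpy as np

def quantize_block(block):
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15 if hi > lo else 1.0   # 4 bits -> 16 levels
    q = np.round((block - lo) / scale).astype(np.uint8)  # codes in 0..15
    return q, scale, lo

def dequantize_block(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
weights = rng.normal(size=32).astype(np.float32)  # one block of 32 weights
q, scale, lo = quantize_block(weights)
restored = dequantize_block(q, scale, lo)
print("max absolute error:", float(np.abs(weights - restored).max()))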
