Your own ChatGPT assistant running on your machine.
Aug 20, 2023


How to run Facebook’s LLaMA ‘ChatGPT’ model on your Mac.

First we need to install a couple of packages via Homebrew. If you do not have Homebrew installed on your Mac, you can follow the instructions here:

Now in your terminal, install wget and md5sha1sum via brew

brew install wget
brew install md5sha1sum
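Optionally, you can sanity-check that both tools landed on your PATH before continuing (a quick sketch; assumes a default Homebrew setup):

```shell
# Each command should print the installed path if the install succeeded
command -v wget
command -v md5sum
```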

We are going to use the llama.cpp repo to run the model, so we will need to download two repos, described below: one for running the model, and one containing the model itself.

The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook. So, here we go.

First, clone this repo:
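In case the link doesn’t render, at the time of writing the llama.cpp repo lived at github.com/ggerganov/llama.cpp, so the clone looks like:

```shell
# Clone the llama.cpp inference engine
git clone https://github.com/ggerganov/llama.cpp
```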

Now, to download the Facebook model, go here, and fill in the form:

You will get an email a few seconds later containing two numbered links. Link 1 points to the Facebook repo, and Link 2 is a long URL that we will use in a bit to download the model itself. First of all, follow Link 1 to the Facebook repo (link also here below).

Clone the Facebook repo:
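For reference, the Facebook model repo was hosted at github.com/facebookresearch/llama at the time of writing, and is cloned the same way:

```shell
# Clone the repo containing the model download script
git clone https://github.com/facebookresearch/llama
```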

Now in your terminal, cd into the llama repo that we just cloned, make the download.sh script executable, and run it to download a model:

cd llama
chmod +x download.sh
./download.sh

It will ask for the URL from your email; paste Link 2 into the terminal. (I believe this link expires after 24 hours.)

You will then be asked:

Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:

I recommend starting with 7B-chat or 13B-chat.

Type in one or more models, then press enter. Each model takes about 15–30 minutes to download, depending on your internet connection.
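This is where the md5sha1sum package from earlier pays off: each downloaded model folder includes a checklist.chk file, so you can verify the files arrived intact (a sketch; the folder name assumes the 13B chat model):

```shell
# From inside the cloned llama repo; every line should report OK
cd llama-2-13b-chat
md5sum -c checklist.chk
```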

For the next step, you will need Python installed on your laptop. I used Python version 3.11.1 for this. Now, cd into the llama.cpp repo and run the commands below.

cd llama.cpp
pip install -r requirements.txt
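One step worth calling out here: the ./quantize and ./main binaries used later have to be compiled first. At the time of writing, llama.cpp built with a plain Makefile, so from inside the repo this should be enough (assumes the Xcode command line tools are installed):

```shell
# Compiles llama.cpp's tools, including ./main and ./quantize
make
```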

The following instructions are for the 13B model. If you downloaded the 7B, just switch 13 for 7 wherever you see it.

Now, assuming your download has completed, copy the contents of llama/llama-2-13b-chat/ from the Facebook llama repo to llama.cpp/models/13B/ in the llama.cpp repo.
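As a concrete sketch, assuming both repos were cloned side by side in the same parent directory (adjust the paths to match your layout):

```shell
# Copy the model weights into the folder llama.cpp expects
mkdir -p llama.cpp/models/13B
cp llama/llama-2-13b-chat/* llama.cpp/models/13B/
```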

Now, for the next step, we will need a file called tokenizer.model, which we can get from the Hugging Face website.

You may need to create an account on Hugging Face to do this. So, do that, and then download:

Put the tokenizer.model file in the llama.cpp/models/13B/ folder with the other files that you just copied. Then, from inside the llama.cpp repo, run the following command:

python convert.py models/13B/

There will now be a file in the 13B folder called something like ggml-model-f32.bin
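If you are not sure what the file ended up being called, list the folder from inside the llama.cpp repo:

```shell
# Shows the converted model file and its size
ls -lh models/13B/
```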

Replace the file name in the next commands with whatever that file is called, and run the following commands from inside the llama.cpp repo.

./quantize ./models/13B/ggml-model-f32.bin ./models/13B/ggml-model-q4_0.bin q4_0
./main -m ./models/13B/ggml-model-q4_0.bin -n 128

Now, to use the model interactively, run the following command from inside the llama.cpp repo:

./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

You now have your own assistant running in your terminal.

Thanks for reading.