Build and run llama2 LLM locally

Karan Kakwani
4 min read · Aug 15, 2023


Note: These instructions are tailored for macOS and have been tested on a Mac with an M1 chip.

In this guide, we’ll walk through the step-by-step process of running the llama2 language model (LLM) locally on your machine.

llama2 models are a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The fine-tuned variants, Llama 2-Chat, are optimized for dialogue use cases.

This guide will cover the installation process and the necessary steps to set up and run the model.

Prerequisites

Before we begin, make sure you have the following prerequisites installed on your system:

1. Python: You’ll need Python 3.8 or higher. You can check your Python version by running the following command in your terminal:

python3 --version

Python 3.11 is recommended; it can be installed using the command below:

brew install python@3.11
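
If Homebrew installed it successfully, checking the version of the new interpreter should print 3.11.x:

python3.11 --version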

2. Git: Ensure you have Git installed. If not, you can install it using a package manager like Homebrew:

brew install git

Cloning the llama2 Repository

1. Open your terminal

2. Navigate to the directory where you want to clone the llama2 repository. Let's call this directory llama2.

3. Clone the llama2 repository using the following command:

git clone https://github.com/facebookresearch/llama.git

4. Clone llama.cpp, the C/C++ port of llama:

git clone https://github.com/ggerganov/llama.cpp.git

Now you should have both repositories in your llama2 directory.
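
At this point the layout should look like this:

llama2/
├── llama/      # Meta's llama repository (model code + download script)
└── llama.cpp/  # C/C++ port used to convert, quantize, and run the model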

5. Navigate into the llama.cpp repository and build it by running the make command there:

cd llama.cpp
make
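
The build produces, among other things, the main and quantize binaries in the repository root; we'll use both later. You can check that they are there with:

ls -l main quantize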

Requesting access to Llama Models

1. Go to the link https://ai.meta.com/resources/models-and-libraries/llama-downloads/

2. Enter your details in the form shown below.

[Image: Request access to Llama models form]

3. You’ll receive an email like the one below, with a unique custom URL to download the models.

[Image: Email with the link to download Meta’s models]

4. Navigate to the llama repository in the terminal

cd llama

5. Run the download.sh script to download the models using your custom URL

/bin/bash ./download.sh

6. The script will prompt you for the download URL; paste the custom URL received in the email and then select the models you want to download. For example, if you choose the 7B-chat model, it will be downloaded to ./llama2/llama/llama-2-7b-chat.
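
Once the download finishes, you can sanity-check it by listing the model directory from inside the llama repo. For the 7B-chat model I'd expect to see the weights plus a couple of metadata files (exact contents may vary between releases):

ls ./llama-2-7b-chat
# typically: checklist.chk  consolidated.00.pth  params.json
# tokenizer.model lives one level up, in the llama repo root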

Converting the downloaded model(s)

1. Navigate into the llama.cpp repository

cd llama.cpp

2. Create a Python virtual environment using the command below. I've chosen the name llama2 for the virtual environment.

python3.11 -m venv llama2

3. Activate the virtual environment

source llama2/bin/activate

Once activated, the virtual environment's name will appear in parentheses at the beginning of your command prompt.
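
For example, with the llama2 environment active, the prompt will look something like this (the exact prompt depends on your shell configuration):

(llama2) user@macbook llama.cpp %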

[Image: Activated Python virtual environment]

4. Install the required Python dependencies listed in requirements.txt:

python3 -m pip install -r requirements.txt

5. While still in the llama.cpp directory, run the convert script to convert the downloaded model to the f16 format.
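
The script writes its output into models/7B/, so create that directory first if it doesn't already exist:

mkdir -p models/7B

Then run the conversion: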

python3 convert.py --outfile models/7B/ggml-model-f16.bin --outtype f16 ../../llama2/llama/llama-2-7b-chat --vocab-dir ../../llama2/llama

--outfile specifies the output file path.

--outtype specifies the output type, f16 in this case.

The positional argument is the path to the downloaded model directory.

--vocab-dir specifies the directory containing the tokenizer.model file.

It will create a file, ggml-model-f16.bin, which is about 13.5 GB in size.

6. Next, quantize the model to reduce its size:

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

It will create a quantized model, ggml-model-q4_0.bin, which is about 3.8 GB in size.
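
You can confirm that both the f16 and quantized files were created, and check their sizes, with:

ls -lh ./models/7B/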

7. All set! Now you can run the model and try one of the prompt examples inside the ./prompts folder.

./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

-m specifies the model file.

-n specifies the number of tokens to generate.

--repeat_penalty controls how strongly repeated tokens are penalized; 1.0 disables the penalty.

--color enables colored output so the prompt and your input are distinguishable from the generated text.

-i runs the program in interactive mode.

-r "User:" sets a reverse prompt: when the model emits this marker, generation pauses and control returns to you for the next input.

-f ./prompts/chat-with-bob.txt specifies the path to the file containing the initial prompt (here, chat-with-bob.txt).
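
If you'd rather do a quick non-interactive sanity check first, you can pass a one-off prompt with -p instead of the interactive flags (the prompt text here is just an example):

./main -m ./models/7B/ggml-model-q4_0.bin -p "Explain quantization of a language model in one sentence." -n 128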
