Local LLMs On Apple Silicon

Aaditya Bhat
4 min read · Sep 8, 2023


Unleashing the power of Unified Memory Architecture


Introduction

Large Language Models (LLMs) are no longer just a buzzword. They’ve transformed the way we think about AI, making waves across various industries. While many initially associated LLMs with proprietary models like OpenAI’s ChatGPT and its GPT variants, the landscape has evolved. The open-source community has stepped up, with Meta’s Llama model leading the charge. However, the restrictive license of the initial Llama release limited its potential. Recognizing this, Meta released Llama 2 under a more permissive license, paving the way for broader applications. If you’re keen to dive into the world of LLMs, there’s no better time than now. This article will guide you through setting up and running LLMs locally on Apple Silicon.

The Machine

When it comes to running Large Language Models (LLMs) locally, not all machines are created equal. How fast a model runs, and whether it fits at all, comes down to two factors: memory capacity and raw processing power. This is where Apple Silicon shines. Unlike traditional setups where the CPU and GPU have separate memory pools, M1 and M2 machines use a Unified Memory Architecture: the CPU and GPU share a single pool of memory, so the GPU can work with models that occupy most of the system’s RAM rather than being limited by dedicated VRAM. As Andrej Karpathy aptly puts it, “(Apple Mac Studio) M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today.” If you’re considering investing in a new machine for LLM-based development, the Mac Studio should be at the top of your list.
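
Since memory capacity is usually the binding constraint on which models you can load, it’s worth checking how much unified memory your Mac actually has before picking a model. A quick way to do that from the terminal:

# Report total unified memory in GB (macOS)
echo "$(($(sysctl -n hw.memsize) / 1024 / 1024 / 1024)) GB"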

Local LLM Server Setup

The most efficient way to run open-source models locally is via the llama.cpp project by Georgi Gerganov. This C/C++ project is tailored to running Llama and other open-source models locally. Here's a quick rundown of its features:

  • Plain C/C++ implementation
  • Optimized for Apple Silicon (ARM NEON, Accelerate and Metal)
  • No third-party dependencies
  • Zero memory allocations during runtime

Steps to Set Up llama.cpp:

1. Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

2. Compile llama.cpp by running the following command in your terminal:

make

You will see the following output in the terminal window:

make output
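
On Apple Silicon, recent llama.cpp builds typically enable Metal GPU support by default. If yours does not, or you want to speed up compilation, the Makefile exposed the following options at the time of writing (check the project README for the current flags):

# Build with Metal explicitly enabled, using all available CPU cores
LLAMA_METAL=1 make -j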

3. Choose your model

Hugging Face offers a plethora of open-source LLMs. I personally recommend TheBloke’s repositories. For this guide, let’s use the Llama2 13B Orca 8K 3319 GGUF model. Choose the quantization variant that fits your RAM.

Llama2 13B Orca 8K 3319 GGUF model variants

For a 16GB RAM setup, the openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf model is ideal. Once downloaded, move the model file to llama.cpp/models/.
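
If you prefer to download from the command line, something like the sketch below works. Note that the repository path and file name are assumptions based on TheBloke’s usual naming scheme, so verify both on the model card before running it:

# Download the Q5_K_M quantization straight into llama.cpp/models/
# (verify the exact URL on the Hugging Face model card first)
curl -L -o models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf \
  https://huggingface.co/TheBloke/OpenAssistant-Llama2-13B-Orca-8K-3319-GGUF/resolve/main/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf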

4. Run the Model Locally

Navigate to the llama.cpp directory and execute:

./main -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf -p "Capital of France is" -e -ngl 1 -t 10 -n 2 -c 4096 -s 8 --top_k 1

This should return ‘Capital of France is Paris.’, along with the performance data for the LLM inference.

LLM output

For a deeper dive into the available arguments, run:

./main --help
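
As a quick reference, here is what the flags in the example above do. The descriptions are based on llama.cpp’s help output at the time of writing; exact flag spellings can change between versions, so treat ./main --help as the source of truth:

#   -m       path to the GGUF model file
#   -p       prompt text
#   -e       process escape sequences (\n, \t, ...) in the prompt
#   -ngl     number of model layers to offload to the GPU (Metal)
#   -t       number of CPU threads to use
#   -n       number of tokens to generate
#   -c       context window size, in tokens
#   -s       random seed, for reproducible output
#   --top_k  keep only the k most likely tokens when sampling (1 is effectively greedy)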

5. Web server

Though running the LLM through the CLI is a quick way to test the model, it is less than ideal for building applications on top of it. For that, we can leverage the llama.cpp web server. You can start the web server from the llama.cpp directory by running:

./server -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf  -c 4096

Running this will produce the following output:

Llama.cpp server

You can access the UI by navigating to http://localhost:8080/. Alternatively, you can use the curl command:

curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Capital of France is","n_predict": 2}'
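
The server also accepts additional flags, for example to bind to a specific host and port or to offload layers to the GPU as the CLI does. The flag names below come from the server’s help output at the time of writing; confirm them with ./server --help on your build:

# Example: expose the server on the local network and offload layers to the GPU
./server -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf -c 4096 -ngl 1 --host 0.0.0.0 --port 8080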

Conclusion

With this setup, you’re now equipped to develop LLM applications locally, free from the constraints of external APIs. Dive in and explore the limitless possibilities that LLMs offer. Happy coding!


Aaditya Bhat

Engineer with a passion for exploring the latest developments in ML and AI. Sharing my knowledge and experiences through writing.