Local LLMs On Apple Silicon
Unleashing the power of Unified Memory Architecture
Introduction
Large Language Models (LLMs) are no longer just a buzzword. They’ve transformed the way we think about AI, making waves across various industries. While many initially associated LLMs with renowned models like ChatGPT or OpenAI’s GPT variants, the landscape has evolved. The open-source community has stepped up, with Meta’s Llama model leading the charge. However, the restrictive license of the initial Llama model limited its potential. Recognizing this, Meta released Llama 2 with a more permissive license, paving the way for broader applications. If you’re keen to dive into the world of LLMs, there’s no better time than now. This article will guide you through setting up local LLMs.
The Machine
When it comes to running Large Language Models (LLMs) locally, not all machines are created equal. The efficiency and speed at which these models operate are largely determined by two pivotal factors: memory capacity and raw processing power. Unlike traditional setups where the CPU and GPU have separate memory pools, Apple Silicon machines use a Unified Memory Architecture: the CPU and GPU share a single memory space, so the GPU can work with models that occupy most of the system RAM. As Andrej Karpathy aptly puts it, “(Apple Mac Studio) M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today.” If you’re considering investing in a new machine for LLM-based development, the Mac Studio should be at the top of your list.
Local LLM Server Setup
The most efficient way to run open-source models locally is via the llama.cpp project by Georgi Gerganov. This C/C++ library is tailored to run Llama and other open-source models locally. Here's a quick rundown of its features:
- Plain C/C++ codebase
- Optimized for Apple Silicon
- No third-party dependencies
- Zero memory allocations during runtime
Steps to Setup llama.cpp:
1. Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
2. Compile llama.cpp by running the following command in your terminal:
make
The compiler output will stream to the terminal window; once the build finishes without errors, the binaries will be in the repository root.
3. Choose your model
Hugging Face offers a plethora of open-source LLMs. I personally recommend TheBloke’s repository. For this guide, let’s use the Llama2 13B Orca 8K 3319 GGUF model. Depending on your RAM, choose the appropriate quantization.
For a 16GB RAM setup, the openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf model is ideal. Once downloaded, move the model file to llama.cpp/models/.
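As a rough sanity check before downloading, you can estimate whether a model will fit in RAM: the quantized file size plus the KV cache (two f16 tensors per layer, sized by context length and embedding width) plus some working overhead must stay under your memory budget. The sketch below uses Llama 2 13B's published architecture (40 layers, embedding width 5120) and an assumed ~9 GiB file size for the Q5_K_M quantization; the 1 GiB overhead figure is a guess, and real usage also includes compute buffers, so treat this as a back-of-the-envelope estimate only.

```python
def kv_cache_bytes(n_layers: int, n_embd: int, n_ctx: int, bytes_per_elem: int = 2) -> int:
    # One K and one V tensor per layer, stored as f16 (2 bytes per element) by default.
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem

def fits_in_ram(model_file_gib: float, n_layers: int, n_embd: int, n_ctx: int,
                ram_gib: float, overhead_gib: float = 1.0) -> bool:
    # Crude estimate: model weights + KV cache + assumed fixed overhead.
    kv_gib = kv_cache_bytes(n_layers, n_embd, n_ctx) / 2**30
    return model_file_gib + kv_gib + overhead_gib <= ram_gib

# Llama 2 13B: 40 layers, embedding width 5120, 4096-token context.
print(kv_cache_bytes(40, 5120, 4096) / 2**30)          # ~3.1 GiB of KV cache
print(fits_in_ram(9.0, 40, 5120, 4096, ram_gib=16.0))  # True
```

At a 4096-token context, the 13B model's KV cache alone is about 3 GiB, which is why the Q5_K_M file (roughly 9 GiB) is about as large as you want to go on a 16GB machine.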
4. Run the Model Locally
Navigate to the llama.cpp directory and execute:
./main -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf -p "Capital of France is" -e -ngl 1 -t 10 -n 2 -c 4096 -s 8 --top_k 1
This should return ‘Capital of France is Paris.’, along with the performance data for the LLM inference.
For a deeper dive into the available arguments, run:
./main --help
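If you want to drive the same binary from a script rather than typing the command by hand, you can wrap it with Python's subprocess module. This is a minimal sketch, not part of llama.cpp itself: `build_main_args` simply mirrors the flags used on the command line above, and `run_main` assumes you invoke it from the llama.cpp directory after a successful `make`.

```python
import subprocess

def build_main_args(model_path: str, prompt: str, n_predict: int = 2,
                    n_ctx: int = 4096, n_gpu_layers: int = 1, threads: int = 10) -> list:
    # Mirrors the flags from the CLI invocation above.
    return ["./main", "-m", model_path, "-p", prompt, "-e",
            "-ngl", str(n_gpu_layers), "-t", str(threads),
            "-n", str(n_predict), "-c", str(n_ctx)]

def run_main(model_path: str, prompt: str, **kwargs) -> str:
    # Runs ./main and returns its stdout; raises if the process fails.
    result = subprocess.run(build_main_args(model_path, prompt, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(run_main("models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf",
                   "Capital of France is"))
```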
5. Web server
Though running the LLM through the CLI is a quick way to test the model, it is less than ideal for developing applications on top of LLMs. For that, we can leverage the llama.cpp web server. You can start the web server from the llama.cpp directory by running:
./server -m models/openassistant-llama2-13b-orca-8k-3319.Q5_K_M.gguf -c 4096
Running this loads the model and starts the server, which listens on port 8080 by default.
You can access the UI by navigating to http://localhost:8080/. Alternatively, you can use the curl command:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Capital of France is","n_predict": 2}'
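From an application, the same endpoint can be called with nothing but the Python standard library. The sketch below sends the same JSON body as the curl example and reads the `content` field from the server's JSON response; the network call is guarded so it only fires when a server is actually running at the default address.

```python
import json
import urllib.request

def build_completion_payload(prompt: str, n_predict: int = 2) -> bytes:
    # Same fields as the curl example above.
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")

def complete(prompt: str, n_predict: int = 2,
             url: str = "http://localhost:8080/completion") -> str:
    # POSTs to the llama.cpp server's /completion endpoint and returns the generated text.
    req = urllib.request.Request(url,
                                 data=build_completion_payload(prompt, n_predict),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

if __name__ == "__main__":
    print(complete("Capital of France is"))
```

Because the server speaks plain HTTP and JSON, any language with an HTTP client can build on top of it the same way.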
Conclusion
With this setup, you’re now equipped to develop LLM applications locally, free from the constraints of external APIs. Dive in and explore the limitless possibilities that LLMs offer. Happy coding!