Thoughts on Apple Silicon Performance for Local LLMs

Andreas Kunar
4 min read · Nov 25, 2023


Apple silicon, with its integrated GPUs and large, wide, unified RAM, looks very tempting for AI work, especially when using Georgi Gerganov’s amazing llama.cpp (either directly or as a Python library via llama-cpp-python). But how does it compare to a modern PC with an NVIDIA GPU?
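
For context, here is a minimal llama-cpp-python sketch of what that looks like on a Mac. The model file path is just a placeholder, and it assumes the package was built with Metal support:

```python
# Minimal llama-cpp-python sketch: load a Q4-quantized GGUF model and
# offload all layers to the Apple GPU (Metal is used automatically when
# the package is built with Metal support).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    n_ctx=2048,       # context window
)

out = llm("Q: Why is unified memory nice for local LLMs? A:", max_tokens=64)
print(out["choices"][0]["text"])
```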

Apple AI Performance — Generated with Adobe Firefly

I got myself a 96GB RAM M2 Max Mac Studio for my AI work. A modern PC with a fast CPU, lots of RAM, and an NVIDIA 4090 GPU would have cost me about the same. It would also have huge GPU horsepower and 2.5x the VRAM bandwidth of my M2 Max, but only about a third of the usable VRAM size, and it would probably draw ~10x the power: under full load it is more like a noisy fan heater, and nothing I want next to me on my desk.

With my M2 Max, I get approx. 60 tokens/s for llama-2 7B (Q4-quantized) inference. And because the GPU can use most of my 96GB of unified RAM, I also get approx. 8 tokens/s for llama-2 70B (Q4). I’m quite happy with this for my local, German-language-specific RAG work. But should I have buyer’s remorse?
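
If you want to sanity-check the tokens/s on your own machine, a rough timing sketch (reusing the llm object from the snippet above; exact numbers depend on quantization, context length, and power mode) could look like this:

```python
import time

prompt = "Write a short note on why unified memory helps local LLMs."

start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```

This lumps prompt processing and generation together; for cleaner numbers, llama.cpp ships a llama-bench tool that measures prompt processing (PP) and token generation (TG) separately.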

I found some interesting reasoning about the performance aspects of running local LLMs with llama.cpp, beyond just having enough GPU RAM. First, a tweet by OpenAI’s Andrej Karpathy on how memory bandwidth constrains non-batched LLM inference. And recently, Georgi Gerganov initiated a very interesting discussion about Apple silicon llama.cpp performance measurements:

During the response token generation of single-user LLM inference (“TG” in Georgi’s tables), memory bandwidth mostly constrains performance. The GPU’s many fast ALUs are not fully utilized, because the GPU is mostly busy fetching model weights from memory.
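
A back-of-envelope calculation (with approximate sizes) shows why: per generated token, essentially all quantized weights have to be streamed from memory once, so bandwidth divided by model size gives a rough upper bound for TG speed:

```python
# Rough upper bound: tokens/s <= memory bandwidth / bytes streamed per token.
# In non-batched generation, each token reads (roughly) all weights once.
bandwidth_gb_s = 400    # M2 Max, approx.
weights_gb_7b = 3.9     # llama-2 7B, Q4-quantized, approx.
weights_gb_70b = 39.0   # llama-2 70B, Q4-quantized, approx.

print(f"7B:  ~{bandwidth_gb_s / weights_gb_7b:.0f} tokens/s upper bound")   # ~103
print(f"70B: ~{bandwidth_gb_s / weights_gb_70b:.0f} tokens/s upper bound")  # ~10
# Measured ~60 and ~8 tokens/s are in the same ballpark; the gap comes from
# KV-cache/attention reads, overhead, and not reaching peak bandwidth.
```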

The M2 Max has a 512-bit-wide memory bus yielding up to 400 GB/s, and the M2 Ultra doubles that to 800 GB/s. This is an area where Apple silicon has a significant advantage: its memory bus is up to 4–8x wider than consumer-grade Intel/AMD CPU memory, and because the memory is unified, nothing needs to be copied between CPU and GPU. Yes, it’s still slower than modern GPU VRAM. But GPU VRAM is expensive, power-hungry, and comparatively small. So for non-batched inference, an M2 Ultra’s 800GB/s might almost match the ~1000GB/s of a 4090. Also, models >13B usually don’t even fit in the 4090’s 24GB VRAM (even when quantized). And if you think Apple computers are expensive, you have not looked at the price of workstation-class NVIDIA GPUs …
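
For anyone wondering where those headline numbers come from: peak bandwidth is roughly bus width (in bytes) times memory transfer rate. The memory speeds below are approximate public figures, not measurements:

```python
# Peak memory bandwidth ~= (bus width in bits / 8) * transfer rate (GT/s).
# Memory speeds are approximate public figures, not measured values.
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_gt_s: float) -> float:
    return bus_width_bits / 8 * transfer_rate_gt_s

print(peak_bandwidth_gb_s(512, 6.4))   # M2 Max, LPDDR5-6400        -> ~410 GB/s
print(peak_bandwidth_gb_s(1024, 6.4))  # M2 Ultra                   -> ~819 GB/s
print(peak_bandwidth_gb_s(128, 4.8))   # dual-channel DDR5-4800 PC  -> ~77 GB/s
print(peak_bandwidth_gb_s(384, 21.0))  # RTX 4090, GDDR6X           -> ~1008 GB/s
```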

However, during processing of the LLM prompt, and also during LLM training (“PP” in Georgi’s tables), the LLM can process batches of tokens. This utilizes the caches and the GPU much more fully. So here the number and speed of GPU cores matter more, and the much faster modern NVIDIA GPUs really shine. Apple silicon might still have a small advantage from its larger RAM, though.
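
A crude way to see this split on your own machine, reusing the llm object and the time import from the sketches above: feed a long prompt and generate only one token, so almost all time goes to batched prompt processing:

```python
# Crude prompt-processing (PP) measurement: long prompt, only 1 new token,
# so almost all time is spent evaluating the prompt in batches.
long_prompt = "unified memory " * 300  # a few hundred tokens

start = time.time()
out = llm(long_prompt, max_tokens=1)
elapsed = time.time() - start

pp_tokens = out["usage"]["prompt_tokens"]
print(f"{pp_tokens} prompt tokens in {elapsed:.1f}s -> {pp_tokens / elapsed:.0f} tokens/s")
```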

OK, no buyer’s remorse yet for me on pure performance!

These are, to me, the strong downsides of using Apple silicon for AI (or any open-source work):

  • Quite weak security, because Apple does not support GPUs (or the ANE) in VMs or containers. When you deal with open-source code, Python packages, and the like, it is almost impossible to secure the software supply chain against malware risks. And if you cannot strictly confine that code to a secured container or VM, you carry significant security exposure. I’m still working on somehow addressing this, maybe as material for my next article.
    The competition: NVIDIA’s CUDA can run virtualized, e.g., on Intel/AMD Linux. Microsoft is working on fully virtualizing some NVIDIA GPUs with some Intel CPUs in their next Hyper-V. Linux, Windows, and even macOS support nested hypervisors on Intel/AMD CPUs; Apple silicon doesn’t. Apple’s mandatory Virtualization Framework seems mostly to blame for this; the silicon should technically support it from the M2 onwards. Parallels, Docker, and others all have to use it (Parallels had its own, better virtualizer for Intel Macs). There might be hope for future macOS versions, but I won’t hold my breath.
  • NVIDIA’s CUDA is the de facto standard for a lot of AI work. Apple’s Metal is supported by llama.cpp, PyTorch (via its MPS backend), and others, but CUDA still dominates.
    Not that bad for me overall, since I mostly use llama.cpp, but worth mentioning.

I hope this helps you a little in deciding which Mac is best for your AI work, or whether a Mac is the right choice at all. This is my first medium.com post, so please have mercy on my non-native-speaker writing ;-).

