Benchmarking Apple’s MLX vs. llama.cpp
It might be a bit unfair to compare the performance of Apple’s new MLX framework (used from Python) to llama.cpp (written in C/C++ and using Metal). But I think it is valuable to get an indication of how much performance one might give up in exchange for the flexibility of doing LLMs in the new framework. Especially since MLX (as of version 0.0.6) now also supports quantization, which makes it very interesting for running LLMs locally on Apple silicon.
TLDR: current MLX looks OK for LLM prompt-processing (about 15% slower) and token-generation (about 25% slower), and its RAM usage is good. It is (still?) lagging for quantized token-generation (roughly half the speed I expected, based on llama.cpp’s behavior). Also, MLX’s model loading is much slower.
This is a quick-and-dirty first test, done in an evening to give you some ideas, not an extensive study. More work needs to be done, e.g. using Mixtral 8x7B instead of the small Llama-2 7B.
My setup — I benchmarked locally on my machine using:
- M2 Max Mac Studio, 96GB RAM
- llama.cpp with Llama-2-7B in fp16 and Q4_0 quantization. Download the original weights from the Hugging Face repository and generate the fp16 GGUF file from them. For the Q4_0 quantization, TheBloke provides ready-made GGUF files (but please observe the Llama-2 license!).
- mlx-examples, with installation instructions in its llm/llama folder. You need to either convert the Llama-2-7B weights from the Hugging Face files of the previous step, or download them as described in the README.md. llama.py is used to test the models.
- I tweaked llama.py into llama_bench.py to print the timings in a format comparable to llama.cpp’s output (a minimal sketch of the idea follows this list). This is a quick-and-dirty hack to get some results, not professional coding. Just copy the llama_bench file into your llm/llama folder from above and run it there.
- The default prompt is “ hello” repeated 511 times (512 tokens together with the BOS token). The response is generated for 128 tokens at temperature 0.0.
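For reference, here is a minimal sketch of the timing idea behind llama_bench.py, not the actual script: it forces MLX’s lazy evaluation with mx.eval, treats the arrival of the first generated token as the end of prompt processing, and prints token/s the way llama.cpp does. The names model.generate, tokenizer.bos_id and tokenizer.encode are assumptions based on how mlx-examples’ llama.py is structured; adapt them to whatever the current code exposes.

```python
# llama_bench.py -- minimal sketch of the timing idea, not the full script.
# Assumes an mlx-examples-style llama.py in the same folder, i.e. a `model`
# with a token-by-token generator and a sentencepiece `tokenizer`; the exact
# names (model.generate, tokenizer.bos_id, tokenizer.encode) are assumptions.
import time

import mlx.core as mx


def benchmark(model, tokenizer, n_gen=128, temp=0.0):
    # The artificial 512-token prompt: BOS + 511 x " hello".
    prompt_tokens = mx.array(
        [tokenizer.bos_id()] + tokenizer.encode(" hello" * 511)
    )

    tic = time.perf_counter()
    tokens = []
    for i, token in enumerate(model.generate(prompt_tokens, temp)):
        mx.eval(token)  # force MLX's lazy evaluation so the timing is real
        if i == 0:
            # The first token arriving means the whole prompt was processed.
            prompt_time = time.perf_counter() - tic
            tic = time.perf_counter()
        tokens.append(token)
        if len(tokens) >= n_gen:
            break
    gen_time = time.perf_counter() - tic

    # Print the throughput in a llama.cpp-like format.
    print(f"prompt eval: {prompt_tokens.size / prompt_time:7.2f} token/s")
    print(f"       eval: {(len(tokens) - 1) / gen_time:7.2f} token/s")
```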
My results for llama-2–7B fp16:
- llama.cpp: model-load ~2.8s, prompt-processing ~772 token/s, token-generation ~23 token/s, ~16 GB RAM used (pure model: 12.55 GB)
- MLX: model-load ~4s, prompt-processing ~652 token/s, token-generation ~19 token/s, ~16 GB RAM used
My results for llama-2–7B 4-Bit quantized:
- llama.cpp: model-load ~0.9s, prompt-processing ~685 token/s, token-generation ~61 token/s, ~6 GB RAM used (pure model: 3.56 GB)
- MLX: model-load ~1.7s, prompt-processing ~438 token/s, token-generation ~31 token/s, ~6 GB RAM used
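If you prefer relative numbers, a couple of divisions over the measurements above give MLX’s throughput as a fraction of llama.cpp’s (nothing framework-specific here, just the figures from the two result lists):

```python
# MLX throughput relative to llama.cpp, using the measurements listed above.
results = {
    # name: (llama.cpp token/s, MLX token/s)
    "fp16 prompt-processing": (772, 652),
    "fp16 token-generation": (23, 19),
    "Q4 prompt-processing": (685, 438),
    "Q4 token-generation": (61, 31),
}

for name, (llama_cpp_tps, mlx_tps) in results.items():
    print(f"{name:24s} MLX at {mlx_tps / llama_cpp_tps:4.0%} of llama.cpp")
```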
Note: prompt-processing token/s results vary a lot with prompt length and batch size, so detailed comparisons are complicated. I used the same prompt length and token-generation length as llama.cpp in their benchmark results for Apple silicon. The artificially large 512-token prompt is there to test the GPU limits; token-generation is mostly memory-bandwidth constrained.
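The bandwidth argument can be sanity-checked with back-of-the-envelope math: if every generated token has to stream the full set of weights from memory once, token-generation is capped at memory bandwidth divided by model size. Assuming the M2 Max’s advertised ~400 GB/s (sustained bandwidth is lower in practice) and the model sizes from above:

```python
# Rough upper bound for token-generation: every token reads all weights once.
# Assumes ~400 GB/s M2 Max memory bandwidth (advertised peak, not sustained).
BANDWIDTH_GB_S = 400

for name, model_gb in [("fp16", 12.55), ("Q4_0", 3.56)]:
    print(f"{name}: at most ~{BANDWIDTH_GB_S / model_gb:.0f} token/s")
```

Both frameworks stay under these ceilings; the fp16 numbers sit reasonably close to them, while MLX’s quantized token-generation is the furthest away, which is consistent with the TLDR above.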
I hope this provides you with some food for thought. MLX’s performance really surprised me; I would have expected it to be much slower.
P.S.: Updated this article on Dec 26 with entirely new benchmarking numbers, in order to better compare it to the llama.cpp Apple silicon performance results.
