--

vLLM uses the PagedAttention mechanism, which significantly improves LLM throughput and latency. With bigger models in particular, we have seen latency and throughput problems unless something like PagedAttention (as used in vLLM) is in play. Since you are on the llama.cpp version, have you run into any latency or throughput issues? If so, give the vLLM library a try!
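For reference, here is a minimal sketch of vLLM's offline generation API; the model name and sampling settings below are just placeholders, so swap in whichever checkpoint you are currently serving with llama.cpp:

```python
from vllm import LLM, SamplingParams

# Example prompt and sampling settings (placeholders -- adjust to your workload).
prompts = ["Explain paged attention in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# vLLM manages the KV cache in fixed-size pages (PagedAttention), which is
# what drives the throughput gains on larger models and bigger batches.
# "meta-llama/Llama-2-7b-hf" is only an example model id.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Batching several prompts into a single `generate` call is where the paged KV cache pays off most, so it is worth benchmarking against your current llama.cpp setup with a realistic batch size.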

--