[Rust] Serving Llama 3 Quantized (GGUF) on GPU with Candle-RS and Tide
Serving Quantized Llama 3 Model on a GPU on Windows 11
I want to deploy a local, quantized version of Llama 3 that can reliably complete a set of batch tasks on my Windows 11 machine. This quantized model was tested on Windows 11 inside a Docker container (nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04) with GPU support.
I also attempted to load the full, unquantized model onto my GPU, but I did not have enough GPU memory. Maybe in a future article, I can share that code for Gemma.
This example uses QuantFactory/Meta-Llama-3-8B-Instruct-GGUF, which has 60k+ downloads and 250+ likes on HuggingFace. I decided to use the 8-bit instruct model. The script loads the model from a Docker volume directory that contains 1) the GGUF file and 2) the Meta-Llama-3-8B tokenizer. All of the optional structures that allow the script to download and run different models were removed, and the CLI arguments were replaced with hardcoded values. This improved readability and made debugging easier when combining examples together.
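To make that hardcoded setup concrete, here is a minimal sketch of what the loading step can look like with candle. The paths, the GGUF filename, and the `load_model` helper are my own placeholders for the Docker volume layout described above, and I'm assuming a recent `candle-transformers` where `ModelWeights::from_gguf` takes a device argument.

```rust
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;
use tokenizers::Tokenizer;

fn load_model() -> anyhow::Result<(ModelWeights, Tokenizer, Device)> {
    // Hardcoded paths instead of CLI arguments (assumed volume layout).
    let gguf_path = "/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf";
    let tokenizer_path = "/models/tokenizer.json";

    // Use the first CUDA device; this is what the CUDA 12.1 container exposes.
    let device = Device::new_cuda(0)?;

    // Read the GGUF metadata and tensors, then build the quantized Llama weights.
    let mut file = std::fs::File::open(gguf_path)?;
    let content = gguf_file::Content::read(&mut file)?;
    let model = ModelWeights::from_gguf(content, &mut file, &device)?;

    // Llama 3 tokenizer (tokenizer.json from the Meta-Llama-3-8B repo).
    let tokenizer = Tokenizer::from_file(tokenizer_path).map_err(anyhow::Error::msg)?;

    Ok((model, tokenizer, device))
}
```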
Finally, the full example/repository for how I served the model with a Tide HTTP server is shared in the last section. Note that I'm still quite a noob with HTTP servers, in general.
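As a rough sketch of the shape the Tide server takes (not the exact code from the repository), the model lives in shared state behind a mutex and a single POST route runs generation. The `State` struct, the `/generate` route, and the `GenerateRequest` fields are illustrative; this assumes `tide`, `async-std` (with the `attributes` feature), and `serde` as dependencies.

```rust
use serde::Deserialize;
use std::sync::{Arc, Mutex};

// Stand-in for the loaded (ModelWeights, Tokenizer, Device) bundle.
struct ModelBundle;

#[derive(Clone)]
struct State {
    // Mutex because generation mutates the model's KV cache.
    model: Arc<Mutex<ModelBundle>>,
}

#[derive(Deserialize)]
struct GenerateRequest {
    prompt: String,
}

async fn generate(mut req: tide::Request<State>) -> tide::Result {
    let body: GenerateRequest = req.body_json().await?;
    let _model = req.state().model.lock().unwrap();
    // ... tokenize the prompt and run the candle forward/sampling loop here ...
    Ok(format!("echo: {}", body.prompt).into())
}

#[async_std::main]
async fn main() -> tide::Result<()> {
    let state = State { model: Arc::new(Mutex::new(ModelBundle)) };
    let mut app = tide::with_state(state);
    app.at("/generate").post(generate);
    app.listen("0.0.0.0:8080").await?;
    Ok(())
}
```

Wrapping the whole model in one mutex serializes requests, which is fine for the batch-task use case described above; a work queue or channel would be the next step if concurrent requests mattered.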