How Much Stress Can Your Server Handle When Self-Hosting LLMs?
Do you need more GPUs or a modern GPU? How do you make infrastructure decisions?
How does it feel when a group of users suddenly starts using an app that, until now, only you and your dev team have used?
That’s the million-dollar question of moving from prototype to production.
As far as LLMs are concerned, there are dozens of tweaks you can make to run your app within budget and at an acceptable quality. For instance, you can choose a quantized model for lower memory usage, or fine-tune a tiny model to beat the performance of giant LLMs.
You can even tweak your infrastructure to achieve better outcomes. For example, you may want to double the number of GPUs you use or choose the latest-generation GPU.
But how can you say that Option A performs better than Options B and C?
This is an important question to ask ourselves at the earliest stages of going into production. Every option has a cost: either infrastructure spend or a degraded end-user experience.
The solution to this crucial question isn’t new: load testing has long been standard practice for software releases.
In this post, I’ll discuss how to quickly perform a load test with the free Postman app. We’ll also try to pick the best infrastructure among three options: a single A40 GPU, two A40s, or an upgrade to an L40S GPU.
The Plan: How do we decide on the infrastructure?
Here’s our goal.
We host Llama 3.1 8B for inference and use Ollama to serve our models. However, we don’t know whether the hardware hosting this model is sufficient.
We currently have an A40 GPU with 48 GB of VRAM, 50 GB of RAM, and 9 vCPUs deployed to serve the inference engine. We rent this infrastructure for US$280.80/month.
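Before load testing, it helps to see what a single inference request to this setup looks like. Here’s a minimal sketch of the JSON payload for Ollama’s `/api/generate` endpoint (which listens on port 11434 by default), assuming the `llama3.1:8b` model has already been pulled; the prompt text is just a placeholder.

```python
import json

# Request payload for Ollama's /api/generate endpoint.
# "stream": False asks for a single JSON response instead of a token stream,
# which is easier to assert on in a load-testing tool like Postman.
payload = {
    "model": "llama3.1:8b",          # model tag pulled via `ollama pull llama3.1:8b`
    "prompt": "Why is the sky blue?", # placeholder prompt
    "stream": False,
}

body = json.dumps(payload)
print(body)

# To actually send it against a running Ollama server:
#   curl http://localhost:11434/api/generate -d "$BODY"
```

In Postman, the same body goes into a raw JSON POST request to `http://<host>:11434/api/generate`, which is the request we’ll replicate under load.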