Run inference on any LLM with serverless in 15 minutes

Wing Lian
Jun 4, 2023

Uncovering the efficiency of deploying large language models with Runpod’s serverless infrastructure

GitHub: https://github.com/OpenAccess-AI-Collective/servereless-runpod-ggml
Demo: https://huggingface.co/spaces/openaccess-ai-collective/ggml-runpod-ui
Arena: https://huggingface.co/spaces/openaccess-ai-collective/rlhf-arena

So you’ve built a language model, and you’ve uploaded it to Hugging Face (HF). Now what? Well, today I’m excited to share a practical way to deploy your large language models (LLMs) using serverless workers from Runpod. This method is not only cost-effective but also efficient for general testing workloads. All you need to do is upload your quantized model to HF, create a template & endpoint in Runpod, and you’re ready to start testing your language model.
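Once the endpoint is live, testing it is a single HTTP call. Here's a rough sketch of querying a Runpod serverless endpoint synchronously; the endpoint ID is a placeholder, and the exact `input` schema depends on how the worker's handler is written:

```python
import os
import requests

# Placeholder endpoint ID -- substitute the one Runpod assigns to your endpoint.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

# /runsync blocks until the worker finishes, which is convenient for testing.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "What is a quantized model?"}},
    timeout=120,
)
response.raise_for_status()
print(response.json())
```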

The Magic of Runpod

Runpod is a platform that allows you to run your language models using serverless workers. In simple terms, it’s like having an army of workers ready to execute your AI models whenever you need them. And the best part? You only pay for what you use.
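To make the pay-per-use model concrete, here's a hedged sketch of the asynchronous flow: you submit a job to the queue, and a worker spins up only to process it. The endpoint ID and input schema are again placeholders:

```python
import os
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# /run queues the job and returns immediately; a worker is only billed
# while it is actually processing.
job = requests.post(
    BASE_URL + "/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello, world"}},
).json()

# Poll until a worker has picked up and finished the job.
while True:
    status = requests.get(f"{BASE_URL}/status/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(2)

print(status.get("output"))
```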

The project, serverless-runpod-ggml, is a Docker image that allows you to take trained language models from Hugging Face and create serverless…
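The core of such a worker is just a handler function. Here's a minimal sketch — the model repo, model type, and generation settings below are illustrative placeholders, not the project's actual configuration — that loads a quantized GGML model from Hugging Face with ctransformers and hands it to Runpod's serverless runtime:

```python
import runpod
from ctransformers import AutoModelForCausalLM

# Load the quantized GGML model once at worker start-up; this repo id is a
# hypothetical example, not the project's default.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/open-llama-7B-GGML",
    model_type="llama",
)

def handler(job):
    """Run one generation request; the input schema here is illustrative."""
    prompt = job["input"]["prompt"]
    text = llm(prompt, max_new_tokens=256, temperature=0.7)
    return {"text": text}

# Register the handler with Runpod's serverless worker loop.
runpod.serverless.start({"handler": handler})
```

Loading the model at import time rather than inside the handler means the download cost is paid once per cold start, and warm requests pay only for generation.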
