Deploy Serverless Generative AI on AWS Lambda with OpenLLaMa

Sean Bailey
5 min read · May 17


We are witnessing a revolution in the field of artificial intelligence, with an explosion of generative AI capabilities across a multitude of platforms. Recently, the open-source release of a LLaMa-compatible model, trained on the open RedPajama dataset, opened new avenues for applying these generative models.

Efforts to make these models as efficient as possible led to the llama.cpp project. Instead of using expensive and limited GPU capabilities to run these models, we can load them into more accessible CPU and RAM configurations. Many of these quantized models can provide reasonably responsive inferences on as little as 4–6 GB of RAM on a CPU, and even on an Android smartphone.
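The 4–6 GB figure is easy to sanity-check with back-of-envelope arithmetic: a 4-bit quantized 7B-parameter model stores roughly half a byte per weight, plus some working memory for inference. A rough sketch (the flat overhead allowance is my own assumption, not a llama.cpp figure):

```python
def quantized_ram_gb(n_params: float, bits_per_weight: int, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate: weight storage plus a flat allowance for
    KV cache and inference buffers (the allowance is a guess)."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# A 7B-parameter model quantized to 4 bits per weight:
print(round(quantized_ram_gb(7e9, 4), 1))  # roughly 4.5 GB
```

That lands comfortably inside the 4–6 GB range quoted above, which is why these models fit on commodity CPUs.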

This led me to an idea: What if we could create a scalable, serverless LLM Generative AI inference engine? After some experimentation, I discovered that not only was it possible, but it actually worked quite well!

That’s how OpenLLaMa on Lambda was born.

What is OpenLLaMa on Lambda?

OpenLLaMa on Lambda is a project where we deploy a container capable of running llama.cpp converted models onto AWS Lambda. This approach leverages the scalability that Lambda provides, minimizing cost and maximizing compute availability for your project.

Using the provided AWS CDK code, you can create and deploy a Lambda function leveraging your model of choice. This setup comes with a FastAPI frontend accessible from a Lambda URL. The beauty of AWS Lambda lies in its generous free tier — you get 400k GB-s of Lambda Compute each month, which means you can have scalable inference of these Generative AI LLMs at minimal cost.
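That free tier translates into a concrete inference budget. Because Lambda bills memory × duration, you can work out how many seconds of compute 400k GB-s covers at a given memory setting (a simple calculation, not an official AWS estimate):

```python
FREE_TIER_GB_SECONDS = 400_000  # monthly AWS Lambda free tier

def free_seconds_per_month(memory_gb: float) -> float:
    """Seconds of Lambda compute covered by the free tier at a given memory size."""
    return FREE_TIER_GB_SECONDS / memory_gb

# At the 10 GB maximum this project deploys with:
print(free_seconds_per_month(10))  # 40000.0 seconds (~11 hours) per month
```

Dropping the function's memory setting stretches that budget proportionally, at the cost of slower inference.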

Please note that you will need ggml-quantized versions of your model, and your model should ideally be under 6GB. Your inference RAM requirements also cannot exceed 9GB, or your Lambda function will fail.

How Does it Work?

Lambda Docker containers have a hard size limit of 10GB, which leaves plenty of room for these models. The models cannot be stored in the function's invocation directory, but they can live in /opt. By pre-baking the models into the /opt directory, you can ship the entire package inside your function image without needing extra storage.
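A minimal sketch of how the handler side of this can work: discover the pre-baked model under /opt at runtime, and defer actually loading it until the first prompt arrives. The file-naming convention and lazy-load helper here are illustrative, not the project's exact code:

```python
import os
from typing import Optional

MODEL_DIR = "/opt"  # models are baked into the image here, outside the task directory

def find_model(model_dir: str = MODEL_DIR, suffix: str = ".bin") -> Optional[str]:
    """Return the path of the first ggml model file found, or None."""
    for name in sorted(os.listdir(model_dir)):
        if name.endswith(suffix):
            return os.path.join(model_dir, name)
    return None

_llm = None

def get_llm():
    """Load the model lazily, so cold starts stay cheap until /prompt is hit."""
    global _llm
    if _llm is None:
        from llama_cpp import Llama  # provided by llama-cpp-python in the container
        _llm = Llama(model_path=find_model())
    return _llm
```

Deferring the `Llama(...)` call like this is also what makes the "model doesn't load until /prompt" behavior described below possible.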

Getting Started

To get started, you’ll need Docker installed on your system. You’ll also need to select a GGML quantized model compatible with llama.cpp from Huggingface. Additionally, you need to have the AWS CDK installed on your system, along with an AWS account, proper credentials, etc. Python 3.9+ is required.

Once you have the prerequisites, you can download the OpenLLaMa on Lambda repository and follow the installation instructions for your specific operating system. The installation process will guide you through building the container and deploying your Lambda function.

Using OpenLLaMa on Lambda

FastAPI Web Documentation

Once the deployment is complete, navigate to the URL provided by the CDK output in your browser. You’ll see a simple FastAPI frontend where you can test out the functionality of your model. The model isn’t loaded until you call the /prompt endpoint of your API, so any problems with the model itself won’t surface until that point. This design lets you verify that the Lambda function is working properly before you test the model.

Here’s a quick breakdown of the input values you’ll be working with:

  • text -- This is the text you'd like to prompt the model with. It comes pre-loaded with a question/response text, which you can modify in the llama_cpp_docker/ file.
  • prioroutput -- If you want the model to continue where it left off, provide the previous output of the model here. Just remember to keep the same original text prompt.
  • tokencount -- Defaulted at 120, this value represents the number of tokens the model will generate before returning output. The lower the token count, the faster the response, but also, the less information will be contained in the response. You can tweak this to find the right balance.
  • penalty -- Set at 5.5 by default, this is the repetition penalty: it controls how strongly the model is discouraged from repeating itself in its output.
  • seedval -- Defaulted to 0, this is the seed for your model. If you leave it at 0, a random seed is chosen for each prompt generation.
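Putting the parameters above together, a call to the /prompt endpoint might look like the sketch below. The exact request shape is assumed from the parameter names; check the FastAPI docs page your deployment serves for the real schema:

```python
import json
from urllib import request

def build_payload(text, prioroutput="", tokencount=120, penalty=5.5, seedval=0):
    """Assemble the prompt parameters; defaults mirror the ones described above."""
    return {
        "text": text,
        "prioroutput": prioroutput,
        "tokencount": tokencount,
        "penalty": penalty,
        "seedval": seedval,
    }

def ask(lambda_url, text, **kwargs):
    """POST a prompt to the deployed function URL (hypothetical request shape)."""
    body = json.dumps(build_payload(text, **kwargs)).encode()
    req = request.Request(lambda_url + "/prompt", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (replace with the URL from your CDK output):
# print(ask("https://<your-id>.lambda-url.us-east-1.on.aws", "What is AWS Lambda?"))
```

To continue a long generation, pass the previous response back in as `prioroutput` while keeping `text` unchanged, as described above.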

Future Steps

The Lambda function deploys with the largest memory configuration Lambda supports (10GB). However, feel free to experiment with the models, the function configuration, and the input values to optimize your Lambda consumption. As mentioned earlier, AWS provides 400k GB-s of free Lambda compute each month, which lets you leverage Generative AI capabilities at minimal cost.

You can use CloudWatch to monitor your function’s invocations, duration, and memory usage, and adjust your configuration accordingly.
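For example, here is a small boto3-style sketch that queries the standard AWS/Lambda Duration metric for a function. The function name is hypothetical, and you'll need AWS credentials configured for the final call to work:

```python
from datetime import datetime, timedelta, timezone

def duration_stats_query(function_name: str, hours: int = 24) -> dict:
    """Build get_metric_statistics arguments for a function's Duration metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,          # one datapoint per hour
        "Statistics": ["Average", "Maximum"],
    }

# With credentials configured, pass this to boto3:
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**duration_stats_query("my-openllama-fn"))
```

Average duration multiplied by your memory setting gives you GB-s per invocation, which is the number to watch against the free tier.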

So, there you have it. If you’ve been looking to deploy scalable, serverless LLM Generative AI inference engines, OpenLLaMa on Lambda offers an effective and cost-efficient solution. We hope you’ll find this project as exciting as we do. Enjoy experimenting and building with it!

Acknowledgment: This project wouldn’t be possible without the incredible work done by the teams at Hugging Face, and the developers behind the llama.cpp and llama-cpp-python project. We’re grateful for their contributions to the field of artificial intelligence.
