A Simple Guide to Running the LLaMA Model in a Docker Container

Ahmed Tariq
9 min read · Dec 28, 2023
[Image: Running the LLaMA Model in a Docker Container, generated by DALL-E]

Before we start, I’m assuming that you’re already familiar with the concepts of containerization, large language models, and Python. This article is written with the understanding that you have a solid foundation in these areas, which will allow you to grasp the more advanced topics discussed in the following sections. Now, let’s dive into the details.

What is the LLaMA Model?

The LLaMA Model, which stands for Large Language Model Meta AI, is a substantial computer program trained on an extensive amount of text and code. Consider it as a super reader and writer capable of understanding and generating a wide array of information, such as poems, scripts, musical pieces, emails, and even computer code! Although LLaMA is still under development, it has already mastered many tasks, such as answering your questions, even if they’re peculiar or open-ended. A remarkable feature of LLaMA is its open-source nature, allowing anyone to access and study it. This transparency is crucial as it aids in understanding how these large language models function and in ensuring their responsible usage.

Why should I run the LLaMA Model in a container?

Running the LLaMA Model in a container is like having a portable powerhouse for your AI tasks. Containers are similar to pre-packaged tools and offer easy setup and isolation, keeping LLaMA separate from other programs on your system for safety and stability. If you need more LLaMA power, you can simply spin up another container, making the scalability of your AI workload as easy as adding more boxes. Containers also offer flexibility across environments, allowing you to run your AI models on a local machine, a cloud server, or anywhere in between. They are also resource-efficient, sharing the host’s resources cleverly to remain lightweight, enabling you to run LLaMA without consuming all of your memory or processing power.

So, if you’re seeking a convenient, efficient, and portable way to harness the power of the LLaMA Model, consider containers. It’s like having a personal AI toolkit, always ready to go wherever your creativity takes you.

Now that you have an understanding of the LLaMA Model and the convenience of running it in Docker, we can delve deeper. It’s time to explore the specific LLaMA Model that I’m using for the experiment.

Llama-2-7b-chat-hf model:

This model is a product of Meta AI and is part of the Llama 2 family of large language models. With its 7 billion parameters, it is a generative text model fine-tuned and optimized for dialogue use cases. It has been converted to the Hugging Face Transformers format, ensuring compatibility with the Hugging Face ecosystem. The model is designed to outperform open-source chat models on most benchmarks. In human evaluations for helpfulness and safety, it is on par with popular closed-source models like ChatGPT and PaLM. The Llama-2-7b-chat-hf model is a powerful tool for generating text and is widely used in the AI community.

Furthermore, it’s an excellent example of the advancements in AI, demonstrating the potential of large language models in various applications, from customer service to content creation. Its open-source nature fosters transparency and collaboration in the AI community, contributing to the ongoing evolution of AI technology. The model can be found on the Hugging Face platform. Before accessing it, you must request access from the Meta AI website using the same email address as your Hugging Face account.

Before we start the experiment, it’s important to note the specifications of the machine on which I’ll be running it. Given that the Llama-2-7b-chat-hf model is memory-intensive, it may not run on every system. Let’s ensure your machine has the necessary resources to handle this powerful model.

System specifications:

  • Storage: 100 GB (model size 27 GB, Docker image size approx. 60 GB)
  • OS: Linux (Ubuntu)
  • Processor: Intel Core i5, 11th Gen
  • CPU cores: 12
  • RAM: 16 GB
  • Swap memory: 40 GB
  • Graphics card: GeForce RTX 3050
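
If you want to quickly verify that your own machine has comparable resources before downloading a 27 GB model, a small Python check can help. The sketch below assumes psutil and torch are already installed; the exact numbers you need will depend on how you load the model.

# Quick sanity check of available resources (assumes: pip install psutil torch)
import shutil
import psutil
import torch

mem = psutil.virtual_memory()
swap = psutil.swap_memory()
disk = shutil.disk_usage("/")

print(f"RAM:  {mem.total / 1e9:.1f} GB total, {mem.available / 1e9:.1f} GB available")
print(f"Swap: {swap.total / 1e9:.1f} GB")
print(f"Disk: {disk.free / 1e9:.1f} GB free")
print(f"CUDA GPU available: {torch.cuda.is_available()}")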

Containerize the model:

In this experiment, I’ll be setting up a Flask web server that leverages the Hugging Face Transformers library to generate text. The server will be configured to accept POST requests at an endpoint. Upon receiving a request, the server will ensure that the model and tokenizer are loaded and ready for use. It will then generate a sequence of text based on the prompt provided in the request. The generated text will be sent back as the server’s response. The server will listen on all network interfaces, ready to process incoming requests and generate creative text.

Following are the steps to run the Llama-2-7b-chat-hf model in a Docker container:

  • First, install git-lfs and clone the model repository from Hugging Face:
# install the git-lfs
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

# clone the repo
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Please note that cloning the repository may take a while due to the significant size of the model.
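
Because access to the Llama 2 weights is gated, the clone may prompt you for your Hugging Face username and an access token. If you prefer downloading the weights from Python instead of using git-lfs, the huggingface_hub library offers an alternative. This is a minimal sketch, assuming huggingface_hub is installed and your account has been granted access to the model:

# Alternative to git-lfs: download the model snapshot with huggingface_hub
# (assumes: pip install huggingface_hub, and access to the gated repo has been granted)
from huggingface_hub import login, snapshot_download

# Prompts for your Hugging Face access token (or pass token="hf_...")
login()

# Download the full repository into a local folder
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="Llama-2-7b-chat-hf",
)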

  • Now create the Python file model.py and add the following code.

Code:

from flask import Flask, request, jsonify
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Create a Flask object
app = Flask("Llama server")

# Initialize the model and tokenizer variables
model = None
tokenizer = None

@app.route('/llama', methods=['POST'])
def generate_response():
    global model, tokenizer
    try:
        data = request.get_json()

        # Create the model and tokenizer if they were not previously created
        if model is None or tokenizer is None:
            # Put the location of the Llama 2 7B chat model directory that you
            # downloaded from Hugging Face here (inside the container built below,
            # this will be /app/Llama-2-7b-chat-hf)
            model_dir = "/Path-to-your-model-dir/Llama-2-7b-chat-hf"

            # Create the model and tokenizer
            tokenizer = AutoTokenizer.from_pretrained(model_dir)
            model = AutoModelForCausalLM.from_pretrained(model_dir)

        # Check if the required fields are present in the JSON data
        if 'prompt' in data and 'max_length' in data:
            prompt = data['prompt']
            max_length = int(data['max_length'])

            # Create the text-generation pipeline
            text_gen = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.float16,
                device_map="auto",
            )

            # Run the model
            sequences = text_gen(
                prompt,
                do_sample=True,
                top_k=10,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
                max_length=max_length,
            )

            return jsonify([seq['generated_text'] for seq in sequences])

        else:
            return jsonify({"error": "Missing required parameters"}), 400

    except Exception as e:
        return jsonify({"Error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)

This Python code sets up a web server using Flask that can generate text using the Llama-2-7b-chat-hf model via Hugging Face’s Transformers library. The server is initialized with the name “Llama server”. It also initializes two variables, model and tokenizer, which will later be used to load the LLaMA model and its corresponding tokenizer.

The server has an endpoint, /llama, which accepts POST requests. When a request is received, it first checks whether the model and tokenizer have been loaded. If not, it loads them from the specified directory. Next, it checks whether the request data contains a prompt and max_length. If these are present, it creates a text generation pipeline using the model and tokenizer and then uses this pipeline to generate a sequence of text based on the prompt. The generated text is then returned in the server’s response.

If the request data does not contain the required fields or if an error occurs during processing, the server returns an error message. Finally, the server is set to listen on all network interfaces (0.0.0.0) on port 5000.
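
Once the server is running (either locally with python model.py, or later from inside the container), you can also call the endpoint from Python instead of curl. Below is a minimal sketch using the requests library, with a hypothetical prompt:

# Quick test of the /llama endpoint (assumes: pip install requests, server running on port 5000)
import requests

payload = {
    "prompt": "Write a short poem about Docker containers.",  # example prompt
    "max_length": 100,
}

# No timeout is set because generation can take a long time on CPU
response = requests.post("http://localhost:5000/llama", json=payload)
print(response.status_code)
print(response.json())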

  • Then create a file named Dockerfile to build the image for the model.

Dockerfile


# Use python as base image
FROM python:3.8-slim-buster

# Set the working directory in the container
WORKDIR /app

# Copy the application code and the model directory into the container at /app
COPY ./model.py /app/model.py
COPY ./Llama-2-7b-chat-hf /app/Llama-2-7b-chat-hf

# Install the needed packages
RUN apt-get update && apt-get install -y gcc g++ procps
RUN pip install transformers Flask llama-cpp-python torch tensorflow flax sentencepiece nvidia-pyindex nvidia-tensorrt huggingface_hub accelerate

# Expose port 5000 outside of the container
EXPOSE 5000

# Run model.py when the container launches
CMD ["python", "model.py"]

This Dockerfile is used to create a Docker image for the Python application. It starts from a Python 3.8 base image on Debian Buster. The working directory in the Docker container is set to /app. Following this, the model.py file and the Llama-2-7b-chat-hf directory are copied from the local machine into the /app directory in the Docker container. The Dockerfile then updates the package lists and installs several packages using apt-get and pip. It exposes port 5000 for outside access. Finally, when the Docker container is launched, it executes the model.py script with Python.

  • Now run the following command to create the docker image using the Dockerfile:
# to build the image 
docker build -t llama-2-7b-chat-hf .

# to see the built images
docker images

Please note that building the image may take a while because required dependencies need to be installed.

  • After the image is created, run the following command to start the container:
# to run the container
docker run --name llama-2-7b-chat-hf -p 5000:5000 llama-2-7b-chat-hf

# to see the running containers
docker ps

The command is used to start a Docker container. The --name option assigns the name llama-2-7b-chat-hf to the new container. The -p option maps port 5000 of the host machine to port 5000 of the Docker container, allowing external access to the application running inside the container. The last argument, llama-2-7b-chat-hf, is the name of the Docker image that the container is based on.
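
If you prefer managing containers from Python rather than the command line, the official Docker SDK for Python can start the same container. This is a minimal sketch, assuming the SDK is installed (pip install docker) and the llama-2-7b-chat-hf image has already been built:

# Start the container from Python using the Docker SDK (assumes: pip install docker)
import docker

client = docker.from_env()

container = client.containers.run(
    "llama-2-7b-chat-hf",          # image name (built earlier with docker build)
    name="llama-2-7b-chat-hf",     # container name
    ports={"5000/tcp": 5000},      # map host port 5000 to container port 5000
    detach=True,                   # return immediately instead of blocking
)

print(container.name, container.status)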

  • After the container is up and running, use the following command to make a POST request to the model in the container:
# Curl command to make the POST request
curl -X POST -H "Content-Type: application/json" -d '{"prompt":"I liked “Breaking Bad” and “Band of Brothers”. Do you have any recommendations of other shows I might like?", "max_length":200}' http://localhost:5000/llama

This curl command sends a POST request to the web server running on localhost at port 5000, specifically to the /llama endpoint. The -H "Content-Type: application/json" part of the command specifies that the data being sent to the server is in JSON format. The -d option is followed by the actual data being sent in the request. This data includes a prompt for the text generation model and a max_length that specifies the maximum length of the generated text. The prompt in this case asks for recommendations of other shows for someone who liked “Breaking Bad” and “Band of Brothers”. The server will use this prompt to generate a response, which will be returned to the curl command.

  • After making the above request, I got the following response:
[
"I liked \u201cBreaking Bad\u201d and \u201cBand of Brothers\u201d. Do you have any recommendations of other shows I might like?\nMy favorite TV series of all time is \u201cThe Wire\u201d and I also loved \u201cThe Sopranos \u201d.\nI also loved \u201cDeadwood\u201d and \u201cThe Shield\u201d.\nI\u2019m not a big TV watcher, but I do have a few favorites: \n\u2013 The Wire (one of the best TV shows ever made)\n\u2013 The Sopranos (one of the best TV shows ever made)\n\u2013 Deadwood (one of the best TV shows ever made)\n\u2013 Mad Men (one of the best TV shows ever made)\n\u2013 The Walking Dead (one of the best TV shows ever made)\n \u2013 Game of Thrones (one of the best TV shows ever made)\n\u2013 Westworld (one of the best TV shows ever made)\n\u2013 Stranger Things ("
]

The response from the model is a list of TV show recommendations based on the input prompt. The input prompt was “I liked ‘Breaking Bad’ and ‘Band of Brothers’. Do you have any recommendations of other shows I might like?”.

The model then generates a response that includes its favourite TV series and a list of other highly-rated TV shows. The shows mentioned are “The Wire”, “The Sopranos”, “Deadwood”, “The Shield”, “Mad Men”, “The Walking Dead”, “Game of Thrones”, “Westworld”, and “Stranger Things”. These shows are suggested because they have been critically acclaimed and are often recommended to fans of “Breaking Bad” and “Band of Brothers”.

Please note that the model requires a significant amount of memory to generate the answer. The above answer took approximately 1 hour and 50 minutes to generate and consumed more than 45 GB of memory (RAM plus swap) during execution.
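
One way to cut the memory footprint roughly in half (and reduce the swapping that makes generation so slow) is to load the weights in half precision when the model is first loaded, rather than only at pipeline creation. This is a sketch, not the configuration used above; it assumes the transformers and accelerate versions installed in the image support these options, and the arguments would go into the AutoModelForCausalLM.from_pretrained call in model.py:

# Load the model in half precision to roughly halve its memory footprint
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/app/Llama-2-7b-chat-hf"  # path inside the container

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,   # load weights as float16 instead of float32
    low_cpu_mem_usage=True,      # avoid keeping an extra full copy of the weights while loading
    device_map="auto",           # place layers on the GPU if one is visible, otherwise on the CPU
)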

Conclusion:

In conclusion, this experiment showcased the capabilities of the Llama-2-7b-chat-hf model in a practical application: we created a Flask web server that generates text using this model and interacted with it using curl commands. The model was proficient in generating meaningful and relevant responses based on the prompts provided.

However, it’s crucial to note that due to the large size of the model, it consumes a significant amount of memory and time to generate responses. Despite these requirements, the experiment highlights the potential of large language models like Llama-2-7b-chat-hf in various applications, from customer service to content creation, provided the necessary computational resources are available. Overall, the experiment was a success and offered valuable insights into the workings and capabilities of large language models.

Thanks for reading! 😄

Before you go!

  • Do you know what happens when you click and hold the clap 👏 button?
  • Follow me on Twitter, LinkedIn, and GitHub.
