Effortlessly Serve Llama3 8B on CPU with vLLM: A Step-by-Step Guide
Learn to Deploy Llama3 8B using vLLM and Host it on a Web Server Compatible with OpenAI
Large Language Models (LLMs) like Llama3 8B are pivotal for natural language processing tasks. Serving these models on a CPU with the vLLM inference engine offers an accessible and efficient way to deploy powerful AI tools without specialized hardware such as GPUs. In this guide, I’ll walk you through setting up and serving your LLM with vLLM behind an OpenAI-compatible web server.
Step 1: Clone vLLM Github repository locally
git clone https://github.com/vllm-project/vllm.git
Step 2: Create a Dockerfile in the project root directory
FROM python:3.9-slim
# Set environment variables
ENV VLLM_TARGET_DEVICE=cpu
# Install dependencies
RUN apt-get update -y && \
apt-get install -y gcc-12 g++-12 && \
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
# Upgrade pip and install required Python packages
RUN pip install --upgrade pip && \
pip install wheel packaging ninja "setuptools>=49.4.0" numpy
# Copy the rest of the application into the container
COPY ./vllm ./vllm
WORKDIR ./vllm
# Install dependencies and build
RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
RUN python setup.py install
# Expose webserver on port 8000
EXPOSE 8000
# Run the OpenAI-compatible web server. ENTRYPOINT (rather than CMD) is used
# so that extra arguments passed to `docker run`, such as --model, are
# appended to this command instead of replacing it.
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
Note: the Dockerfile is configured to serve the LLM automatically through an OpenAI-compatible HTTP web server on port 8000.
You’ll end up with the following project structure:
├───vllm # cloned repo
└───Dockerfile
Step 3: Build Docker image
For this step, you need Docker installed.
docker build -t llm-serving:vllm-cpu .
Step 4: Get access to download Hugging Face models
Some models on Hugging Face are gated. To gain access, you must accept the agreement form on the model’s Hugging Face page.
Link: Meta-Llama-3-8B-Instruct model on Hugging Face
Step 5: Run Docker container
The following command launches a container from the Docker image we just built.
docker run --rm --env "HF_TOKEN=<your_huggingface_token>" \
--ipc=host \
-p 8000:8000 \
llm-serving:vllm-cpu \
--model meta-llama/Meta-Llama-3-8B-Instruct
Command arguments explained:
--rm: tells Docker to automatically remove the container when it stops. Useful for container cleanup.
--env "HF_TOKEN=<your_huggingface_token>": replace <your_huggingface_token> with your actual Hugging Face API token.
--ipc=host: sets the IPC (Inter-Process Communication) mode to host. This can improve performance for certain types of applications, especially those involving heavy memory usage and inter-process communication. You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host’s shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
-p <host_port>:<container_port>: maps a port on your local machine to a port inside the container.
--model meta-llama/Meta-Llama-3-8B-Instruct: the Hugging Face model to use for inference.
Step 6: Test the model
Curl command:
curl --location 'http://localhost:8000/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "An apple a day",
"max_tokens": 100,
"temperature": 0.7
}'
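If you prefer calling the endpoint from code, here is a minimal Python sketch using only the standard library. The helper names (build_payload, complete) are my own, and it assumes the server from Step 5 is reachable on localhost:8000:

```python
import json
from urllib import request


def build_payload(prompt: str, max_tokens: int = 100, temperature: float = 0.7) -> dict:
    """Assemble the JSON body for the OpenAI-compatible /v1/completions route."""
    return {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to the running vLLM server and return the generated text."""
    req = request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["text"]


# Example (requires the server to be up):
# print(complete("An apple a day"))
```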