Effortlessly Serve Llama3 8B on CPU with vLLM: A Step-by-Step Guide

Learn to Deploy Llama3 8B using vLLM and Host it on a Web Server Compatible with OpenAI

Yevhen Herasimov
3 min read · Jul 4, 2024

Large Language Models (LLMs) like Llama3 8B are pivotal for natural language processing tasks. Serving these models on a CPU with the vLLM inference engine offers an accessible and efficient way to deploy powerful AI tools without specialized hardware such as GPUs. In this guide, I'll walk you through setting up and serving an LLM with vLLM behind an OpenAI-compatible web server.

Step 1: Clone the vLLM GitHub repository locally

git clone https://github.com/vllm-project/vllm.git

Step 2: Create a Dockerfile in the project root directory

FROM python:3.9-slim

# Set environment variables
ENV VLLM_TARGET_DEVICE=cpu

# Install dependencies
RUN apt-get update -y && \
apt-get install -y gcc-12 g++-12 && \
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

# Upgrade pip and install required Python packages
RUN pip install --upgrade pip && \
pip install wheel packaging ninja "setuptools>=49.4.0" numpy

# Copy the cloned vLLM repository into the container
COPY ./vllm ./vllm

WORKDIR ./vllm

# Install dependencies and build
RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
RUN python setup.py install

# Expose webserver on port 8000
EXPOSE 8000

# Run OpenAI-compatible webserver (ENTRYPOINT, rather than CMD, so that extra
# arguments passed to `docker run`, such as --model, are forwarded to the server)
ENTRYPOINT ["python", "vllm/entrypoints/openai/api_server.py"]

Note: The Dockerfile is configured to automatically serve the LLM through an OpenAI-compatible HTTP web server on port 8000. Any arguments you pass to docker run after the image name are forwarded to that server.

You'll end up with the following project structure:

├───vllm # cloned repo
└───Dockerfile

Step 3: Build the Docker image

For this step you need to have Docker installed.

docker build -t llm-serving:vllm-cpu .

Step 4: Get access to download Hugging Face models

Some models on Hugging Face are gated models. To gain access, you have to accept the agreement form on the model's Hugging Face page. You will also need a Hugging Face access token (created under Settings → Access Tokens in your account) to download the model in the next step.
Link: Llama3-8B-Instruct Model on Hugging Face (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

Gated model access prompt
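
Once access is granted, you can optionally verify that your token can actually reach the gated repository before running the container. Below is a minimal sketch using the huggingface_hub package (an extra dependency, not part of this tutorial's setup), assuming HF_TOKEN is exported in your shell:

import os
from huggingface_hub import HfApi

# Ask the Hugging Face Hub for the model metadata; this call fails with a
# "gated repo" / 403-style error if your token has not been granted access yet.
api = HfApi(token=os.environ["HF_TOKEN"])
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("Access OK:", info.id)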

Step 5: Run the Docker container

The following command launches a container from the Docker image built in Step 3.

docker run --rm --env "HF_TOKEN=<your_huggingface_token>" \
--ipc=host \
-p 8000:8000 \
llm-serving:vllm-cpu \
--model meta-llama/Meta-Llama-3-8B-Instruct

Command arguments explained:

  • --rm: This flag tells Docker to automatically remove the container when it stops. Useful for container cleanup.
  • --env "HF_TOKEN=<your_huggingface_token>": You need to replace <your_huggingface_token> with your actual Hugging Face API token.
  • --ipc=host: This flag sets the IPC (Inter-Process Communication) mode to host. This can improve performance for certain types of applications, especially those involving heavy memory usage and inter-process communication.

You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared memory. vLLM uses PyTorch, which under the hood relies on shared memory to share data between processes, particularly for tensor-parallel inference.

  • -p <host_port>:<container_port>: This argument maps a port on your local machine to a port inside the container.
  • --model meta-llama/Meta-Llama-3-8B-Instruct: the Hugging Face model to serve. Because the image uses an ENTRYPOINT, this argument is forwarded directly to the vLLM server.
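
Keep in mind that the first start is slow: the container has to download roughly 16 GB of model weights and load them into CPU memory before the server begins accepting requests. Here is a small sketch that polls the OpenAI-compatible /v1/models endpoint until the server is ready (assuming the requests package is installed on the host):

import time
import requests

URL = "http://localhost:8000/v1/models"  # OpenAI-compatible endpoint listing served models

while True:
    try:
        response = requests.get(URL, timeout=5)
        if response.ok:
            print("Server is ready:", response.json())
            break
    except requests.exceptions.ConnectionError:
        pass  # server is not accepting connections yet
    time.sleep(10)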

Step 6: Test the model

Curl command:

curl --location 'http://localhost:8000/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "An apple a day",
    "max_tokens": 100,
    "temperature": 0.7
}'
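
Since the server exposes the OpenAI API, you can also call it from the official openai Python client instead of curl. A minimal sketch (assuming pip install openai; the api_key value is just a placeholder because no API key is configured on the server):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="An apple a day",
    max_tokens=100,
    temperature=0.7,
)
print(completion.choices[0].text)

For chat-style prompting of the Instruct model, the server also exposes /v1/chat/completions, which you can reach with client.chat.completions.create and a list of messages.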

Yevhen Herasimov

Machine Learning Engineer specializing in LLMs and NLP