Implementing and Running Llama 3 with Hugging Face’s Transformers Library

Step-by-Step Guide to Run Llama 3 with Hugging Face Transformers

Manuel
4 min read · May 27, 2024

As part of the LLM deployment series, this article focuses on implementing Llama 3 with Hugging Face’s Transformers library. It is one of the most widely used libraries for working with LLMs and offers a rich set of functionality, making it an accessible and customizable tool for running Llama 3 locally.

Although the tutorial uses Llama-3-8B-Instruct, the same steps work for other text-generation models on Hugging Face. Llama-3-8B-Instruct is the 8-billion-parameter model fine-tuned to follow instructions for tasks such as summarization and question answering. Alternatively, you can use Llama-3-8B, the base model trained only on next-token prediction, without instruction fine-tuning.

Hardware Requirements

You need at least 8 GB of GPU memory to follow this tutorial exactly. However, the methods and the library allow for further optimization.

Overview of the Transformers Library

Transformers is a powerful library packed with thousands of pre-trained models ready to tackle tasks across text, vision, and audio domains. Whether you need text classification, question answering, summarization, translation, or text generation in over 100 languages, Transformers has got you covered. It also excels in image tasks like classification, object detection, and segmentation, as well as audio tasks such as speech recognition and audio classification. Plus, it handles combined tasks like table question answering, OCR, video classification, and more.
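As a quick taste of the library, the high-level pipeline API wraps model loading, tokenization, and inference into a single call. Here is a minimal sketch using a generic sentiment-analysis pipeline; the default model it downloads, and the exact score it prints, may vary:

from transformers import pipeline

# Downloads a default sentiment-analysis model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes local LLMs surprisingly approachable."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]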

Why You’ll Love Transformers

  • Easy to Use: Quickly download and use pre-trained models or fine-tune them on your data.
  • Versatile: Supports text, images, audio, and multimodal tasks.
  • Seamless Integration: Works with JAX, PyTorch, and TensorFlow, making it easy to switch between frameworks.
  • High Performance: Provides state-of-the-art models for various tasks with a low barrier to entry.
  • Cost-Effective: Save on compute costs by using shared models.

Points to Consider

  • Not Modular: Not designed for building neural nets from scratch.
  • Specific Training API: Optimized for its models, not for generic machine learning loops.
  • Examples Need Tweaking: Scripts might need adjustments to fit your specific needs.

Transformers makes it simple to train, evaluate, and deploy powerful models with just a few lines of code, offering flexibility and high performance across a range of applications.

Installation

Create a Virtual Environment (Recommended)

First, create a virtual environment for your project. This step is optional if you already have one set up.

1. Navigate to your project directory and create the virtual environment:

python -m venv env_name

2. Activate the environment (on Windows):

env_name\Scripts\activate
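
On Linux or macOS, activate it with:

source env_name/bin/activate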

Download the Model

1. Install the Hugging Face CLI:
pip install -U "huggingface_hub[cli]"

2. Create a Hugging Face account if you don’t have one (https://huggingface.co/) and generate an access token (https://huggingface.co/settings/tokens).

3. Log in to your Hugging Face account:

huggingface-cli login

4. Accept the model’s conditions and privacy policy for Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and wait until you are granted access. Approval usually takes around 15 minutes, but it can take longer.

5. Download the model (specify the path you want):

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct

Install Packages

First, install the necessary packages:

Windows (with CUDA support):

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install accelerate transformers bitsandbytes
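
Linux (the PyTorch wheels on PyPI typically ship with CUDA support, so the extra index URL is usually not needed):

pip install torch
pip install accelerate transformers bitsandbytes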

Implementation

Pipeline

The pipeline method loads the model into memory and sets it up for inference.

  • model_id: Path to the model
  • torch_dtype: Specifies the data type of the weights

Option 1: Using Quantization (~8 GB)

Quantization reduces the hardware requirements by loading the model weights with lower precision. Instead of loading them in 16 bits (float16), they are loaded in 4 bits, significantly reducing memory usage from ~20 GB to ~8 GB.

self.pipeline = transformers.pipeline(
    "text-generation",
    model=self.model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)
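
If you prefer an explicit configuration object, the same 4-bit setting can also be expressed with BitsAndBytesConfig from Transformers. This is a minimal sketch; the bnb_4bit_compute_dtype value is an optional choice, not part of the original code:

from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # optional: run 4-bit matmuls in float16
)

self.pipeline = transformers.pipeline(
    "text-generation",
    model=self.model_id,
    model_kwargs={
        "quantization_config": quant_config,
        "low_cpu_mem_usage": True,
    },
)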

Option 2: Without Quantization (~20 GB)

self.pipeline = transformers.pipeline(
    "text-generation",
    model=self.model_id,
    model_kwargs={"torch_dtype": torch.float16},
)

Terminators

Terminators signal the end of a generated text sequence; they tell the pipeline when to stop generating. For Llama 3, both the tokenizer’s end-of-sequence token and the end-of-turn token <|eot_id|> are used:

self.terminators = [
    self.pipeline.tokenizer.eos_token_id,
    self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

Response

LLMs typically receive a conversation as input. This conversation includes the message history with defined roles:

  • System: Initial instructions for the LLM
  • User: The user’s messages
  • Assistant: The LLM’s responses

def get_response(
    self, query, message_history=[], max_tokens=4096, temperature=0.6, top_p=0.9
):
    user_prompt = message_history + [{"role": "user", "content": query}]
    prompt = self.pipeline.tokenizer.apply_chat_template(
        user_prompt, tokenize=False, add_generation_prompt=True
    )
    outputs = self.pipeline(
        prompt,
        max_new_tokens=max_tokens,
        eos_token_id=self.terminators,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )
    response = outputs[0]["generated_text"][len(prompt):]
    return response, user_prompt + [{"role": "assistant", "content": response}]
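
As a quick illustration of the conversation format, a single turn might look like this (a hypothetical snippet, assuming bot is an instance of the Llama3 class defined below):

history = [{"role": "system", "content": "You are a concise assistant."}]
answer, history = bot.get_response("What is a tokenizer?", history)
print(answer)  # the assistant's reply
# history now also contains the user message and the assistant's reply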

Full Implementation Code

import torch
import transformers

class Llama3:
    def __init__(self, model_path):
        self.model_id = model_path
        # Load the model with 4-bit quantization (~8 GB of GPU memory)
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.model_id,
            model_kwargs={
                "torch_dtype": torch.float16,
                "quantization_config": {"load_in_4bit": True},
                "low_cpu_mem_usage": True,
            },
        )
        self.terminators = [
            self.pipeline.tokenizer.eos_token_id,
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    def get_response(
        self, query, message_history=[], max_tokens=4096, temperature=0.6, top_p=0.9
    ):
        # Append the new user message and format the conversation with the chat template
        user_prompt = message_history + [{"role": "user", "content": query}]
        prompt = self.pipeline.tokenizer.apply_chat_template(
            user_prompt, tokenize=False, add_generation_prompt=True
        )
        outputs = self.pipeline(
            prompt,
            max_new_tokens=max_tokens,
            eos_token_id=self.terminators,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
        )
        # Strip the prompt from the output to keep only the newly generated text
        response = outputs[0]["generated_text"][len(prompt):]
        return response, user_prompt + [{"role": "assistant", "content": response}]

    def chatbot(self, system_instructions=""):
        conversation = [{"role": "system", "content": system_instructions}]
        while True:
            user_input = input("User: ")
            if user_input.lower() in ["exit", "quit"]:
                print("Exiting the chatbot. Goodbye!")
                break
            response, conversation = self.get_response(user_input, conversation)
            print(f"Assistant: {response}")

if __name__ == "__main__":
    bot = Llama3("your-model-path-here")
    bot.chatbot()
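
The chatbot method also accepts system instructions, so you can set the assistant’s behavior at startup. The instruction text below is just an example:

bot.chatbot("You are a helpful assistant that answers concisely.")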

Conclusion

In this article, we created a simple chatbot using Llama 3 and Hugging Face’s Transformers library. This implementation can be integrated into your application. For a more robust setup, consider wrapping it in a server built with Flask or FastAPI, so the application does not have to be restarted each time it crashes.
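
A minimal sketch of such a server with FastAPI might look like the following; the module name, the /generate endpoint, and the request schema are illustrative assumptions, not part of the original implementation:

from fastapi import FastAPI
from pydantic import BaseModel

# Assumes the Llama3 class above is saved in llama3.py
from llama3 import Llama3

app = FastAPI()
bot = Llama3("your-model-path-here")  # the model is loaded once at startup

class Query(BaseModel):
    prompt: str

@app.post("/generate")  # hypothetical endpoint name
def generate(query: Query):
    response, _ = bot.get_response(query.prompt)
    return {"response": response}

Assuming this lives in server.py, you could start it with uvicorn server:app and send prompts over HTTP without reloading the model between requests.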

Stay tuned for a complete tutorial on creating a ChatGPT-like interface with a backend server and Streamlit UI!

