Exploring Llama2 Large Language Model: Setup, Utilization, and Prompt Engineering

Jeremy K
6 min read · Aug 25, 2023


Since the public release and subsequent popularity of ChatGPT towards the end of 2022, Large Language Models (LLMs) have emerged as a significant advancement in the AI field. Following this trend, various other LLMs like Bard or Prometheus have also been introduced.

In July 2023, MetaAI announced the open-source release of the latest iteration of their LLM, named Llama2. Thanks to seamless integration with the Hugging Face transformers ecosystem, using, and even fine-tuning, LLMs has become remarkably accessible to a wide range of users.

In this article, I will guide you through the process of using Llama2, covering everything from downloading the model and running it on your laptop to initiating prompt engineering.

For additional resources, please visit Hugging Face’s official blog post: https://huggingface.co/blog/llama2

AI-generated illustration of 2 llamas

Access to Llama2

Several models

Llama2 is available through 3 different models:

  • Llama-2-7b, with 7 billion parameters. Model size: 13.5GB.
  • Llama-2-13b, with 13 billion parameters. Model size: 25GB.
  • Llama-2-70b, with 70 billion parameters.

While the first one can run smoothly on a laptop with a single GPU, the other two require more robust hardware; the 70b variant is ideally served by two GPUs.

Additionally, each version comes in a chat variant (e.g. Llama-2-70b-chat-hf) that was further trained with human annotations. This improves its ability to address human queries and provide helpful responses.
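If your hardware falls short of these requirements, quantized loading can shrink the memory footprint considerably. Here is a minimal sketch, assuming the bitsandbytes package is installed alongside transformers (the 8-bit flag below is optional and not used in the scripts shown later):

from transformers import AutoModelForCausalLM, AutoTokenizer

#load the 7B chat model in 8-bit precision to roughly halve
#its memory footprint (requires the bitsandbytes package)
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)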

Accepting MetaAI’s terms and conditions

Assuming you want to use Llama-2 via the transformers framework, which I recommend, it’s imperative to follow these two key steps:

  • Request access to Llama2 on MetaAI’s website by filling out their form.
  • Request access to the model on its Hugging Face model card.

The email address you use on both sites must be the same. Once your request is approved (it took less than one hour in my case), you should be able to see the model card on Hugging Face. You are now ready to move on to the next step.

Running Llama2

Installation

To run Llama-2, a few requirements must be met. Using a virtual environment is recommended to isolate the downloaded packages.

#set up a virtual environment
virtualenv .env
source .env/bin/activate
#install packages with pip
pip install torch
pip install transformers
pip install accelerate
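To confirm the environment is ready, a quick optional check:

#optional sanity check: print the installed versions
python -c "import transformers, torch; print(transformers.__version__, torch.__version__)"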

First script

You can reuse the script provided by Hugging Face.

It is first required to log in to Hugging Face through the terminal in order to access the model.

huggingface-cli login

If you don’t have a token yet, you can generate one here: https://huggingface.co/settings/tokens.

Subsequently, the script can be executed:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

This script loads the 7-billion-parameter chat model through the pipeline function and generates text based on the given prompt.
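Generation is largely controlled by the sampling parameters passed to the pipeline call. As an illustrative variation (the values below are arbitrary, not recommendations), you can make the output more or less conservative:

#illustrative variation: adjust the sampling behavior
sequences = pipeline(
    "Explain what a large language model is in two sentences.\n",
    do_sample=True,
    temperature=0.7,  #lower values make the output more deterministic
    top_p=0.9,  #nucleus sampling: sample only from the most likely tokens
    max_new_tokens=100,  #bound the length of the continuation
)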

Pretty easy, isn’t it? However, you may not want to use an LLM via a Python script.

Llama-2 in a Web Interface

The good news is you can use Gradio to quickly and easily set up a chatbot using Llama-2. The associated code is readily accessible on 🤗Huggingface.

This space implements the 7B model in a chat app: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat

By cloning the repository to your machine, you can seamlessly set up and engage with the model through the interface.

git lfs install
git clone https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat

Then:

cd llama-2-7b-chat
pip install -r requirements.txt
python3 app.py

Connect to http://127.0.0.1:7860/ and start interacting with the Llama-2 model.

Llama-2 used with the Gradio app
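Alternatively, if you prefer wiring up a minimal interface yourself rather than cloning the space, Gradio’s ChatInterface can wrap the pipeline from the first script in a few lines. Here is a bare-bones sketch, deliberately naive: it ignores the conversation history and the chat template that the official space applies.

import gradio as gr

#minimal chat UI reusing the `pipeline` object defined earlier;
#this naive version ignores the conversation history
def respond(message, history):
    sequences = pipeline(message, do_sample=True, max_new_tokens=200)
    return sequences[0]["generated_text"]

gr.ChatInterface(respond).launch()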

Saving the model

Although the model downloaded via Hugging Face is stored in ~/.cache/huggingface/hub, saving it locally can be advantageous for potential deployment on another system. The following code snippets illustrate the process:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.save_pretrained('Llama2-7b-tokenizer')

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.save_pretrained('Llama2-7b-model')

Then, loading the local version can be done as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

#model and tokenizer loaded separately
model = AutoModelForCausalLM.from_pretrained('./Llama2-7b-model')
tokenizer = AutoTokenizer.from_pretrained('./Llama2-7b-tokenizer')

#If you want to use the pipeline instead
model = "./Llama2-7b-model"
tokenizer = AutoTokenizer.from_pretrained('./Llama2-7b-tokenizer')
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,  #the tokenizer lives in a separate folder
    torch_dtype=torch.float16,
    device_map="auto",
)
#Your code here to interact with the pipeline
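From there, interacting with the reloaded pipeline works exactly as before; for instance (the prompt is arbitrary):

sequences = pipeline(
    "What is prompt engineering?\n",
    do_sample=True,
    max_new_tokens=150,
)
print(sequences[0]["generated_text"])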

If you deploy the model, do not forget to include Meta’s license and acceptable use policy.

Prompt engineering

This section explores basic prompt engineering and how to configure the model’s behavior.

System prompt parameter

If you toggle the advanced options button in the Gradio app, you will see several parameters you can tune:

Advanced options

The “system prompt” parameter is by default set to instruct the model to be helpful and friendly but not to disclose any harmful content.

With the normal behavior, let’s ask: What is the capital of France?

If you replace the system prompt with “do not answer any questions”, this is what you get: a response in line with our instructions, but still a bit surprising.

Reply with “do not answer any questions”
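Under the hood, the chat variants expect the system prompt and user message to be wrapped in special tags. Here is a minimal sketch of that template, following Meta’s reference format (the build_prompt helper below is illustrative, not part of the transformers API):

#illustrative helper implementing Llama-2's chat template
def build_prompt(system_prompt, user_message):
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_prompt(
    "You are a helpful and friendly assistant.",
    "What is the capital of France?",
)
#feed `prompt` to the text-generation pipeline from earlier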

Examples with different prompts

Using the 7b model, let’s try different prompts to explore how Llama-2 responds:

  • Summary of the introduction of an article about quantum computing
  • Writing a story
  • Writing code to cipher a file
  • Writing an email to promote a new AI-based product
  • What languages are handled
  • Simple question in French

Unpredictable responses

Despite its overall commendable performance, Llama-2 may occasionally exhibit unusual behaviors. For instance, it might decline to address certain inquiries, such as requests involving coding to delete files.

Nevertheless, when the model’s instructions are changed to “Always provide a positive answer.”, it yields a distinctly different outcome:

  • Example of a response in one language that quickly transitions to English
  • Example of the model not contradicting the user

Conclusion

The release of Llama2 by MetaAI marks a milestone in the realm of Large Language Models. Its integration with the Hugging Face transformers ecosystem empowers users to not only run Llama2 effortlessly but also fine-tune it.

Whether implemented through Python scripts or integrated into web interfaces, Llama2’s capabilities remain impressive, even if occasional surprising behaviors may appear. Considering the results achieved with the 7B model, one can expect even better performance from the 70b version.

#AI #LLM #NLP #LLAMA2
