Creating a Virtual Assistant with the Llama-2 7B Chat Model

dmitri yanno mahayana
6 min read · Dec 15, 2023
Llama 2 by Meta AI

I know many of us are already hands-on with OpenAI to build virtual assistants, but did you know Meta has released its own LLM to compete with OpenAI's models?

Llama 2

Llama 2 pretrained models are trained on 2 trillion tokens and have double the context length of Llama 1. Its fine-tuned models have been trained on over 1 million human annotations.

Llama 2 outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.

Benchmarking Llama 2 against other LLMs

Llama 2 has two model types:
1. Llama Chat
2. Llama Code
Both models come in multiple sizes/parameter counts, such as 7B, 13B, and 70B.

Hugging Face (HF)

Hugging Face is more than an emoji: it’s an open source data science and machine learning platform. It acts as a hub for AI experts and enthusiasts — like a GitHub for AI.

Hugging Face evolved over the years to be a place where you can host your own AI models, train them, and collaborate with your team while doing so. It provides the infrastructure to run everything from your first line of code to deploying AI in live apps or services. On top of these features, you can also browse and use models created by other people, search for and use datasets, and test demo projects.

The most exciting part of LLama2 in Hugging Face is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models and achieve comparable performance to ChatGPT according to human evaluations. You can read the paper here.
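
One practical note: Llama 2-Chat was trained with a specific prompt template that uses [INST] and <<SYS>> tags. The raw questions we pass later in this article do work, but wrapping them in the template often produces better-behaved answers. Here is a minimal sketch; the format_prompt helper and its default system message are my own illustration, not part of the original code:

def format_prompt(user_msg, system_msg="You are a helpful assistant."):
    # Llama 2-Chat template: system prompt inside <<SYS>> tags,
    # user message wrapped in [INST] ... [/INST]
    return f"<s>[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"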

In response to community feedback, Meta also publishes the Llama2-7B weights in a Hugging Face Transformers-compatible format, the Llama2-7B-HF variant. Given its strong performance and the community fine-tunes built on top of it, I strongly suggest using it.

Download LLama2–7B-Chat

Navigate to this page to download the model. You may need to clone the repository, which you can do with Git.

Note: If you can't access the page, you need to request access to the model before downloading the files.

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Wait until the process finishes, then we can continue to the next step.
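
If you prefer to skip Git, the huggingface_hub library can download the same files programmatically. A minimal sketch, assuming you have huggingface_hub installed and an access token approved for this gated repo; the local_dir path and the token placeholder are illustrative:

from huggingface_hub import snapshot_download

# Download all model files into a local folder (example path)
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="Llama-2-7b-chat-hf",
    token="hf_...",  # your Hugging Face access token
)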

Packages

We have several dependencies to install, so let's install these packages first.

pip3 install transformers==4.31.0
pip3 install accelerate
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
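
Before moving on, a quick sanity check (my own addition, not in the original article) confirms the installed versions and whether a CUDA GPU is visible:

import torch
import transformers

print("transformers:", transformers.__version__)  # should print 4.31.0
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())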

Once all the packages are installed, we import them in our code.

from transformers import AutoTokenizer
import transformers
import torch

Now that is settled, let's move to the next step.

Build Pipeline

When we use Llama 2 from HF, we need to create a pipeline and point it at the location of our Llama 2 model. This is how we can build the pipeline.

def build_pipeline(model_dir):
    # Load the tokenizer from the local model directory
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_dir,
        torch_dtype=torch.float32,
        # torch_dtype=torch.float16,  # half precision, for CUDA GPUs
        device_map="auto",
    )
    return pipeline, tokenizer

If you have CUDA installed on your computer or server, you can cut memory use roughly in half by changing torch.float32 to torch.float16. If you don't have a GPU, keep the default float32.
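
If you would rather have the code pick the precision on its own, you can select the dtype based on CUDA availability. A small sketch of that idea, mirroring the build_pipeline call above (model_dir is assumed to be defined as before):

# Use half precision when a CUDA GPU is available, full precision otherwise
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipeline = transformers.pipeline(
    "text-generation",
    model=model_dir,
    torch_dtype=dtype,
    device_map="auto",
)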

Consume Pipeline

The HF pipeline returns sequences predicted by Llama 2. After passing our parameters to the pipeline, we iterate over the returned sequences. We can simply change some parameters, for example the question, top_k, max_length, etc., to meet our requirements.

from datetime import datetime

def find_sequence(pipeline, tokenizer, question):
    print("Start Pipeline =", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    sequences = pipeline(
        question,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
    )
    print("End Pipeline =", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))

    print("Start Sequences =", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
    print("End Sequences =", datetime.now().strftime("%d/%m/%Y %H:%M:%S"))

As you can see, we print timestamps to track how long the model takes to generate a result. We can increase performance by leveraging a GPU and adjusting the pipeline parameters.
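
If you would rather see a single elapsed duration than start/end timestamps, time.perf_counter works well. A sketch of the same measurement; this is my variation, not part of the original code:

import time

start = time.perf_counter()
sequences = pipeline(
    question,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
print(f"Generation took {time.perf_counter() - start:.1f} seconds")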

Main Program

We have defined two functions to build and consume the pipeline, so we need a main program that calls both of them with different prompts. Do not forget to change the model folder path according to your location.

if __name__ == "__main__":
    # Model folder path; change this accordingly
    # (a raw string avoids backslash-escape issues on Windows paths)
    model_dir = r"C:\Llama-2-7b-chat-hf"

    # Define pipeline and tokenizer
    pipeline, tokenizer = build_pipeline(model_dir)

    # Ask question
    question = 'You are expert in fashion". Do you have any recommendations of jackets or coats when we are in winter season?'
    find_sequence(pipeline, tokenizer, question)

    # Ask question
    question = 'I liked "Breaking Bad". Do you have any recommendations of other shows I might like?'
    find_sequence(pipeline, tokenizer, question)

Output

If we run the main program, it will produce these results:

C:\Users\dmitr\Py-LangChain-ChatGPT-VirtualAssistance\Scripts\python.exe "D:\00 Project\00 My Project\IdeaProjects\Py-LangChain-ChatGPT-VirtualAssistance\02_VA_Llama2_7b_chat_hf.py" 
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.29s/it]
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.1.1+cpu)
Python 3.11.6 (you have 3.11.4)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details

Start Pipeline = 14/12/2023 05:21:19
End Pipeline = 14/12/2023 05:26:02
Start Sequences = 14/12/2023 05:26:02
Result: You are expert in fashion". Do you have any recommendations of jackets or coats when we are in winter season?

Answer: Of course! As a fashion expert, I can definitely recommend some great jackets and coats for winter. Here are a few options that are both stylish and warm:

1. Down-filled jacket: A classic choice for winter, a down-filled jacket is lightweight, warm, and packable. Look for one with a water-resistant treatment to protect against rain or snow.
2. Parka coat: A parka coat is a great option for winter as it provides both warmth and protection from the elements. Look for one with a waterproof and breathable membrane, such as Gore-Tex or similar technology, and a hood to keep your head and neck warm.
3. Trench coat: While a trench coat may not seem like an obvious choice for winter, it
End Sequences = 14/12/2023 05:26:02

Start Pipeline = 14/12/2023 05:26:02
End Pipeline = 14/12/2023 05:30:36
Start Sequences = 14/12/2023 05:30:36
Result: I liked "Breaking Bad". Do you have any recommendations of other shows I might like?

I'm open to anything, but I do have a few preferences. I like shows that are more on the darker side, with complex characters and interesting storylines. I also appreciate shows that have a strong sense of atmosphere and setting.

Do you have any recommendations?

Answer:

Oh, absolutely! If you enjoyed "Breaking Bad," here are some other shows you might like:

1. "The Sopranos" - This HBO series is a classic crime drama that explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA
End Sequences = 14/12/2023 05:30:36

Process finished with exit code 0

Using my local computer, it takes around five minutes to produce each result, but I think we can still tune this Llama 2 setup by changing some configuration.
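
For instance, if latency matters more to you than answer length, cheaper generation settings help, especially on CPU. A hedged sketch; the parameter values below are illustrative, not from the original code:

# Cheaper settings: shorter output and greedy decoding
sequences = pipeline(
    question,
    do_sample=False,  # greedy decoding, no sampling overhead
    max_length=100,   # fewer tokens to generate
    eos_token_id=tokenizer.eos_token_id,
)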

Summary

Llama2 from Hugging Face is an advanced language model designed to handle complex language processing tasks. It builds upon the successes of its predecessors, leveraging sophisticated algorithms and a vast dataset to understand and generate human-like text. Llama2 excels in tasks such as answering questions, summarizing information, and generating coherent and contextually relevant text. It’s particularly noteworthy for its improved understanding of context, ability to handle nuanced queries, and its efficient use of computational resources compared to earlier models. This makes it a valuable tool in a wide range of applications, from conversational AI to content creation and data analysis.

GitHub Repository

https://github.com/dmitrimahayana/Py-LangChain-ChatGPT-VirtualAssistance

If you like this article, you can subscribe or follow me.
Thank you
