The Mistral-7B-v0.1 Large Language Model (LLM) has 7 billion parameters, as stated on its Hugging Face model card. In full precision, however, the model is too demanding to load on the free version of Colab, whose memory and GPU limits it quickly exhausts.
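To see why quantization matters, here is a rough, weights-only memory estimate (the ~7.2B parameter count is approximate, and activations plus CUDA overhead come on top of these figures):

# Back-of-the-envelope estimate of the weight memory at different precisions.
N_PARAMS = 7.2e9  # approximate Mistral-7B parameter count

for dtype_name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = N_PARAMS * bytes_per_param / 1024**3
    print(f"{dtype_name:>9}: ~{gib:.1f} GiB of weights")

# fp16 alone needs ~13.4 GiB, leaving almost no headroom on a free-tier T4 (~15 GiB);
# 4-bit needs only ~3.4 GiB.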
TL;DR
Thanks to the bitsandbytes integration in transformers, models can now be loaded and run in 4-bit precision. This extends to a wide range of Hugging Face models across different modalities. The 4-bit method was introduced in the QLoRA paper by Dettmers et al., building on the earlier LLM.int8 work, and it is what lets us load Mistral on the free version of Google Colab. GitHub link below :-).
Install the dependencies
!pip install -q -U langchain transformers bitsandbytes accelerate
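Before loading anything, it is worth confirming that the Colab runtime actually exposes a GPU. A quick check using standard torch calls:

import torch

# Make sure a GPU runtime is selected (Runtime -> Change runtime type in Colab).
assert torch.cuda.is_available(), "No GPU detected; switch to a GPU runtime."
print(torch.cuda.get_device_name(0))
print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GiB")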
Import Libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain import HuggingFacePipeline, LLMChain, PromptTemplate
Define Quantization config
The quantization configuration loads the model weights in 4-bit precision while keeping computation in torch.float16. The "nf4" (4-bit NormalFloat) data type is designed for normally distributed weights, and double quantization additionally quantizes the quantization constants themselves, saving roughly a further 0.4 bits per parameter. Together these settings sharply reduce the memory footprint with little loss in generation quality.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
Load the 4-bit model and tokenizer
model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
This loads the Mistral-7B-Instruct-v0.1 checkpoint for causal language modeling, applying the quantization configuration so the weights are stored in 4-bit precision, and lets accelerate place the layers automatically via device_map="auto". The matching tokenizer is loaded alongside it.
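Optionally, you can check how much memory the quantized model actually occupies; get_memory_footprint is a standard transformers helper, and the exact figure will vary a little:

# get_memory_footprint() reports the size of the loaded weights and buffers in bytes.
print(f"Model memory footprint: {model_4bit.get_memory_footprint() / 1024**3:.2f} GiB")
# Expect a figure on the order of 4 GiB, versus roughly 14 GiB for the fp16 model.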
Create the Hugging Face pipeline
The code wraps the 4-bit model and its tokenizer in a transformers text-generation pipeline with caching and automatic device mapping, and sets the generation parameters: a maximum length of 2,500 tokens, sampling with top-k of 5, and a single returned sequence. The pipeline is then wrapped in a LangChain HuggingFacePipeline so it can be used as an LLM inside chains.
pipeline_inst = pipeline(
    "text-generation",
    model=model_4bit,
    tokenizer=tokenizer,
    use_cache=True,                       # reuse the KV cache between generation steps
    device_map="auto",
    max_length=2500,                      # prompt + generated tokens
    do_sample=True,                       # sample instead of greedy decoding
    top_k=5,                              # restrict sampling to the 5 most likely tokens
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # Mistral defines no pad token, so EOS is reused
)
llm = HuggingFacePipeline(pipeline=pipeline_inst)
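Before wiring the wrapped pipeline into a chain, a quick smoke test is worthwhile. A minimal sketch, calling the LLM directly with a prompt in Mistral's [INST] instruction format:

# Hypothetical smoke-test prompt, formatted with Mistral's [INST] instruction tags.
print(llm("<s>[INST] Explain 4-bit quantization in one sentence. [/INST]"))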
Define a template and a helper function
The generate_response helper combines the prompt template with a LangChain LLMChain. The template contains a single {question} placeholder, which the chain fills in before running the prompt through the Mistral pipeline, and the function returns the generated text.
template = """<s>[INST] You are an respectful and helpful assistant, respond always be precise, assertive and politely answer in few words conversational english.
Answer the question below from context below :
{question} [/INST] </s>
"""
def generate_response(question):
    # The template only uses {question}, so declare that as the sole input variable.
    prompt = PromptTemplate(template=template, input_variables=["question"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    response = llm_chain.run({"question": question})
    return response
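Note that LLMChain is deprecated in newer LangChain releases. If you are on a recent version, an equivalent sketch uses runnable (pipe) composition instead; the helper name here is just for illustration, and the import paths may differ on newer versions:

def generate_response_lcel(question):
    # Compose the prompt and the wrapped pipeline as runnables ("prompt | llm").
    prompt = PromptTemplate.from_template(template)
    chain = prompt | llm
    return chain.invoke({"question": question})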
Test your Model
generate_response("Name one president of america?")
#OUTPUT
'\nOne president of the United States is George Washington.'
Thank you for your attention. The full notebook is available at the GitHub link provided below.