Instruction Fine-Tuning Gemma-2B on Medical Reasoning and Converting the Finetuned Model into GGUF Format Using Llama.cpp

Plaban Nayak · The AI Forum · Mar 10, 2024

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is inspired by Gemini, and the name reflects the Latin gemma, meaning “precious stone.” Alongside the model weights, Google has also released tools to support developer innovation, foster collaboration, and guide responsible use of Gemma models.

Prerequisite: GPU. gemma-2b can be finetuned on a T4 (the free Google Colab tier), while gemma-7b requires an A100 GPU.

Here I have performed instruction finetuning on gemma-2b-it using the mamachang/medical-reasoning dataset. The implementation was done on Google Colab using a V100 GPU.

Most language models are too big to be fine-tuned on consumer hardware. For instance, fine-tuning a 65-billion-parameter model would require more than 780 GB of GPU memory, the equivalent of ten A100 80 GB GPUs.
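Where does the 780 GB figure come from? A common accounting (the one used in the QLoRA paper) charges 2 bytes per parameter for fp16 weights, 2 for gradients, and 8 for the Adam optimizer states:

params = 65e9
bytes_per_param = 2 + 2 + 8  # fp16 weights + fp16 gradients + fp32 Adam moments (m and v)
print(params * bytes_per_param / 1e9)  # 780.0 GB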

Now, with parameter-efficient techniques like LoRA and QLoRA, it has become much easier to fine-tune models on consumer hardware.

LoRA adds a tiny number of trainable parameters (adapters) to each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights, which significantly reduces the memory footprint.
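To make this concrete, here is a minimal, illustrative sketch of a LoRA-wrapped linear layer (not PEFT's actual implementation): the frozen weight W is augmented with a low-rank update (alpha/r)·B·A, and only A and B receive gradients.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank update: Wx + (alpha/r) * B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling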

QLoRA goes three steps further by introducing 4-bit quantization, double quantization, and the exploitation of NVIDIA unified memory for paging.

  • 4-bit NormalFloat (NF4) quantization: ensures an equal number of values in each quantization bin, which avoids computational issues and errors for outlier values.
  • Double quantization: quantizes the quantization constants themselves for additional memory savings (quantified in the sketch after this list).
  • Paging with unified memory: relies on the NVIDIA unified memory feature to automatically handle page transfers between CPU and GPU.
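The double-quantization saving is easy to quantify. With the block sizes reported in the QLoRA paper (one fp32 quantization constant per 64 weights, and those constants re-quantized to 8 bits in blocks of 256), the per-parameter overhead drops from 0.5 bits to roughly 0.127 bits:

block1, block2 = 64, 256                      # block sizes from the QLoRA paper
plain = 32 / block1                           # one fp32 constant per 64 weights: 0.5 bits/param
double = 8 / block1 + 32 / (block1 * block2)  # 8-bit constants, themselves quantized
print(plain, round(double, 3))                # 0.5 vs 0.127 bits/param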

Technology Stack Used:

Basic steps involved in fine-tuning:

  1. Load the base model.
  2. Train the base model.
  3. Save the LoRA adapter.
  4. Reload the base model at half/full precision.
  5. Merge the LoRA weights with the base model.
  6. Save the merged model and push it to the Hugging Face Hub.

Code Implementation

Install required dependencies.

!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

Set up the HF token (read from Colab secrets).

import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

Import required dependencies, set the quantization config, and load the model and tokenizer.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Set the quantization config (maps directly to the QLoRA ideas above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
#
# Load the model and tokenizer
model_id = "google/gemma-2b-it"
#
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)
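As an optional sanity check (not part of the original flow), the tokenizer's built-in chat template can be used to confirm the <start_of_turn>/<end_of_turn> format that the training prompts below imitate:

messages = [{"role": "user", "content": "Hello"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Should print something like:
# <bos><start_of_turn>user
# Hello<end_of_turn>
# <start_of_turn>model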

Load the training dataset.

from datasets import load_dataset
#
dataset = load_dataset("mamachang/medical-reasoning")
dataset

####################################################
DatasetDict({
train: Dataset({
features: ['input', 'instruction', 'output'],
num_rows: 3702
})
})
dataset["train"][0]
###################################################################
{'input': "Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'},",
'instruction': 'Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.',
'output': '<analysis>\n\nThis is a clinical vignette describing an 8-year-old boy with acute lymphoblastic leukemia who recently started chemotherapy and now presents with nausea, vomiting, decreased urination, bilateral pedal edema, and other vital sign changes. \n\nThe question asks which serum studies and urinalysis findings would help confirm the diagnosis. Based on the clinical history, the main diagnostic consideration is tumor lysis syndrome, which can occur after starting chemotherapy in a patient with a high tumor burden. \n\nTumor lysis syndrome leads to rapid cell breakdown and release of intracellular contents into the bloodstream. This results in hyperuricemia, hyperkalemia, hyperphosphatemia and acute kidney injury. The urine may show urate crystals. \n\nSo the correct answer should include these key lab abnormalities of tumor lysis syndrome.\n</analysis>\n<answer>\nC: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine\n</answer>'}

Convert the HF dataset to a pandas DataFrame for a quick look.

df = dataset["train"].to_pandas()
df.head(10)

Create the training prompt in Gemma's chat format (<start_of_turn> ... <end_of_turn>).

def generate_prompt(data_point):
    """Generate input text based on a prompt, task instruction, (context info.), and answer.

    :param data_point: dict: a single dataset record
    :return: str: formatted prompt
    """
    # Generate prompt
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
                  'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    # Without
    else:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset["train"]]
dataset = dataset["train"].add_column("prompt", text_column)
dataset
########################################################################
Dataset({
features: ['input', 'instruction', 'output', 'prompt'],
num_rows: 3702
})
dataset[0]['prompt']
#######################################################################
'prompt': "<start_of_turn>user Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}, <end_of_turn>\n<start_of_turn>model<analysis>\n\nThis is a clinical vignette describing an 8-year-old boy with acute lymphoblastic leukemia who recently started chemotherapy and now presents with nausea, vomiting, decreased urination, bilateral pedal edema, and other vital sign changes. \n\nThe question asks which serum studies and urinalysis findings would help confirm the diagnosis. Based on the clinical history, the main diagnostic consideration is tumor lysis syndrome, which can occur after starting chemotherapy in a patient with a high tumor burden. \n\nTumor lysis syndrome leads to rapid cell breakdown and release of intracellular contents into the bloodstream. This results in hyperuricemia, hyperkalemia, hyperphosphatemia and acute kidney injury. The urine may show urate crystals. \n\nSo the correct answer should include these key lab abnormalities of tumor lysis syndrome.\n</analysis>\n<answer>\nC: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine\n</answer> <end_of_turn>"

Shuffle the dataset and tokenize the prompts.

dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
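Since the trainer below caps sequences at max_seq_length=2500, it is worth checking (a quick sketch, not in the original notebook) how long the tokenized prompts actually are, so nothing important gets truncated:

lengths = sorted(len(ids) for ids in dataset["input_ids"])
print("max:", lengths[-1], "| 95th percentile:", lengths[int(0.95 * len(lengths))])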

Train-Test Split

dataset = dataset.train_test_split(test_size=0.1)
train_data = dataset["train"]
test_data = dataset["test"]
print(train_data)
print(test_data)

#########################################################################
Dataset({
features: ['input', 'instruction', 'output', 'prompt', 'input_ids', 'attention_mask'],
num_rows: 3331
})

Dataset({
features: ['input', 'instruction', 'output', 'prompt', 'input_ids', 'attention_mask'],
num_rows: 371
})

Load a PeftModel and specify that we will use low-rank adapters (LoRA), via the get_peft_model utility function and the prepare_model_for_kbit_training method from PEFT.

from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
#
print(model)

#

# Note: `modules` is produced by the find_all_linear_names helper defined below;
# run that cell first so target_modules is populated.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
####################################################################
GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=256000, bias=False)
)

Retrieve the target modules (run this before building the LoraConfig above).

import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
#
modules = find_all_linear_names(model)
print(modules)
##############################################################################
['down_proj', 'k_proj', 'o_proj', 'gate_proj', 'q_proj', 'v_proj', 'up_proj']

List the number of trainable parameters.

trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

###############################################################################
Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
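The 78,446,592 figure can be verified by hand from the printed architecture: with r=64, each adapted projection adds r·in + out·r parameters, and there are 18 decoder layers.

r = 64
shapes = {  # (in_features, out_features) read off the model printout above
    "q_proj": (2048, 2048), "k_proj": (2048, 256), "v_proj": (2048, 256),
    "o_proj": (2048, 2048), "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384), "down_proj": (16384, 2048),
}
per_layer = sum(r * i + o * r for i, o in shapes.values())
print(per_layer * 18)  # 78446592, matching the reported count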

Initiate the training.

import transformers
from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    max_seq_length=2500,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=0.03,  # note: warmup_steps expects an integer; a fraction like this is usually meant for warmup_ratio
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
#
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

################################################################################
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[100/100 05:02, Epoch 0/1]
Step Training Loss
1 3.435700
2 3.578300
3 2.680100
4 2.418400
5 2.281500
6 2.156300
7 2.042900
8 1.811600
9 1.575100
10 1.622300
11 1.775600
12 1.574400
13 1.429200
14 1.413500
15 1.459100
16 1.413700
17 1.643500
18 1.393100
19 1.655600
20 1.427500
21 1.416000
22 1.413800
23 1.355800
24 1.369500
25 1.378500
26 1.272900
27 1.397200
28 1.340600
29 1.246100
30 1.420800
31 1.267500
32 1.399900
33 1.383600
34 1.245400
35 1.384200
36 1.342100
37 1.339900
38 1.235100
39 1.330200
40 1.355300
41 1.259200
42 1.281900
43 1.253000
44 1.323700
45 1.299300
46 1.242600
47 1.097000
48 1.502300
49 1.350000
50 1.385400
51 1.343200
52 1.296500
53 1.278000
54 1.327200
55 1.279600
56 1.409300
57 1.221200
58 1.384700
59 1.110500
60 1.173100
61 1.224300
62 1.327900
63 1.395600
64 1.119300
65 1.230300
66 1.224300
67 1.136700
68 1.247000
69 1.267700
70 1.164700
71 1.112800
72 1.108900
73 1.399600
74 1.368600
75 1.181800
76 1.224000
77 1.193600
78 1.219400
79 1.360000
80 1.185200
81 1.209000
82 1.180600
83 1.307800
84 1.241100
85 1.345200
86 1.175000
87 1.190100
88 1.172300
89 1.265700
90 1.265800
91 1.196400
92 1.350200
93 1.189200
94 1.176500
95 1.215200
96 1.240200
97 1.184000
98 1.221600
99 1.158300
100 1.310200
TrainOutput(global_step=100, training_loss=1.407856843471527, metrics={'train_runtime': 306.785, 'train_samples_per_second': 1.304, 'train_steps_per_second': 0.326, 'total_flos': 2311407065296896.0, 'train_loss': 1.407856843471527, 'epoch': 0.12})
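The reported epoch of 0.12 follows directly from the training arguments: 100 optimizer steps with a per-device batch of 1 and 4 gradient-accumulation steps consume 400 of the 3,331 training examples.

steps, batch, accum, n_train = 100, 1, 4, 3331
print(steps * batch * accum / n_train)  # ≈ 0.12 of one epoch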

Log in to your HF account, save the adapter, merge it into the base model, and push the result to the Hub.

from huggingface_hub import notebook_login
notebook_login()

new_model = "gemma-medical_qa-Finetune"
#
trainer.model.save_pretrained(new_model)
#
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
# save_adapter=True, save_config=True
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
#
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

Test the finetuned model.

Helper function to generate the response.

def get_completion(query: str, model, tokenizer) -> str:
    device = "cuda:0"

    prompt_template = """
<start_of_turn>user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{query}
<end_of_turn>\n<start_of_turn>model

"""
    prompt = prompt_template.format(query=query)

    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encodeds.to(device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # decoded = tokenizer.batch_decode(generated_ids)
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return decoded
#
query = """\n\n Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}"""

result = get_completion(query=query, model=merged_model, tokenizer=tokenizer)
print(result)
#
#########################################################################
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

user
Below is an instruction that describes a task. Write a response that appropriately completes the request.

Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ?
{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}

model

analysing the question stem, we know we are dealing with acute lymphoblastic leukemia in a 8-year-old boy who received a 1st dose of chemotherapy 5 days ago. So the answer key should provide information about the differential diagnosis of lymphoblastic leukemia and specific testing needed at this stage of diagnosis.

From the choices, choices A and D are the more pertinent. Choice A is not the correct serum finding. Choice D is an abnormal serum finding that would support an underlying acute lymphoblastic leukemia diagnosis. Choice E gives an abnormal urine finding along with specific laboratory labs.

The correct answer would either be choice D, suggesting testing to confirm leukemia infection, or choice E, relating to a specific symptom associated with acute lymphoblastic leukemia, such as urinary tract symptoms.
print(f"Model Answer : \n {result.split('model')[-1]}")

############################################################################
Model Answer :

analysing the question stem, we know we are dealing with acute lymphoblastic leukemia in a 8-year-old boy who received a 1st dose of chemotherapy 5 days ago. So the answer key should provide information about the differential diagnosis of lymphoblastic leukemia and specific testing needed at this stage of diagnosis.

From the choices, choices A and D are the more pertinent. Choice A is not the correct serum finding. Choice D is an abnormal serum finding that would support an underlying acute lymphoblastic leukemia diagnosis. Choice E gives an abnormal urine finding along with specific laboratory labs.

The correct answer would either be choice D, suggesting testing to confirm leukemia infection, or choice E, relating to a specific symptom associated with acute lymphoblastic leukemia, such as urinary tract symptoms.
query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:\nUrea 50 mg/dL\nCreatinine 1.4 mg/dL\nProtein Negative\nRBC Numerous\nThe patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis?? \n{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}"""
result = get_completion(query=query, model=merged_model, tokenizer=tokenizer)
print(f"Model Answer : \n {result.split('model')[-1]}")

#############################################################################
Model Answer :

Analysis:

This is a clinical vignette describing a 34-year-old man with symptoms of abdominal discomfort and blood in the urine. He has a history of hypertension for which he is prescribed medication. On physical exam, he has enlarged flank masses bilaterally. The bowel sounds are normal. Renal function tests show normal urea and creatine levels. The renal ultrasound and CT scans are abnormal.

The key findings in the question stem are:
- Hypertension for 2 years
- 34-year-old
- Right-sided kidney tumors with normal bowel sounds
- Enlarged kidneys and multiple cystic cysts on renal ultrasound and CT scan

According to renal sonography, CT scan findings, and normal bowel sounds, the diagnosis is complex cystic nephropathy. The question asks about the most likely presentation, which is polycystic disease given the multiple cysts and right-sided kidney lesions.

Based on these findings, the most likely diagnosis is Autosomal Dominant Polycystic Kidney Disease (ADPKD), which is characterized by large right-sided kidneys and urinary bladder cystic masses. The other choices can be ruled out with less probability.

<analysis>

This is a question about renal cysts in an adult male patient with hypertension. The key findings are:
- 34-year-old male
- History of 2 years of hypertension
- Enlarged and hypercodense right kidney on ultrasound
- Multiple cysts on CT scan with well-defined walls

According to the image, the cysts are likely polycystic in nature, as they are located in the right kidney. The right kidney is enlarged, which may also indicate polycystic disease. The hypercodense cysts on CT scan further support the diagnosis. ADPKD is an autosomal dominant condition in which individuals with the genotype ADPKDIV have multiple cysts in their right kidneys. The other choices can be ruled out.
</analysis>
<answer>
A: Autosomal dominant polycystic kidney disease (ADPKD)
</answer>

acherous note: If the cysts were multiple in another location, such as the left kidney, this disease would possibly not be included

Inference with the finetuned Gemma model, loaded back from the Hub.

from peft import LoraConfig, PeftModel, AutoPeftModelForCausalLM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Set the LoRA configurations
peft_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
#
peft_model_id = "Plaban81/gemma-medical_qa-Finetune"
config = LoraConfig.from_pretrained(peft_model_id)  # from_pretrained is a classmethod; call it on the class
#
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto",
)
ptokenizer = AutoTokenizer.from_pretrained(peft_model_id)
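Note that the snippet above only loads the base model named in the adapter config; the LoRA weights are never attached. If the Hub repo contains an adapter (adapter_config.json plus adapter weights) rather than fully merged weights, a minimal fix would be:

# Hypothetical fix, assuming the repo holds a LoRA adapter rather than merged weights
model = PeftModel.from_pretrained(model, peft_model_id)
# (If the repo instead holds the merged model, load it directly with
#  AutoModelForCausalLM.from_pretrained(peft_model_id, ...).)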
Reuse the get_completion helper defined above.
query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:\nUrea 50 mg/dL\nCreatinine 1.4 mg/dL\nProtein Negative\nRBC Numerous\nThe patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis?? \n{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}"""
result = get_completion(query=query, model=model, tokenizer=ptokenizer)
print(f"Model Answer : \n {result.split('model')[-1]}")
print(result)

###########################################################################
user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:
Urea 50 mg/dL
Creatinine 1.4 mg/dL
Protein Negative
RBC Numerous
The patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis??
{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}

model

<Answer:A> The most likely diagnosis is **'Autosomal dominant polycystic kidney disease (ADPKD)'.**

<Analysis>:
In ADPKD, an abnormal gene mutation is responsible for the excessive growth of fluid-filled cysts in the kidneys. These cysts can be detected through various imaging techniques, including ultrasound, CT scan, and MRI. The presence of multiple renal cysts and enlarged kidneys is characteristic of ADPKD.

Convert to GGUF format.

import locale

def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

# Work around a Colab locale issue that breaks subprocess output decoding
locale.getpreferredencoding = getpreferredencoding

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements/requirements-convert-hf-to-gguf.txt

Download the finetuned model from the Hub and set up paths.

from huggingface_hub import snapshot_download

model_name = "Plaban81/gemma-medical_qa-Finetune"
methods = ['q4_k_m']
base_model = "./original_model/"
quantized_path = "./quantized_model/"
#
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)
original_model = quantized_path + '/FP16.gguf'  # path where the FP16 GGUF export will land

Create a new folder.

!mkdir ./quantized_model/

Convert the HF safetensors to GGUF (FP16).

!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf


#################################################################################
Loading model: original_model
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Setting special token type bos to 2
gguf: Setting special token type eos to 1
gguf: Setting special token type unk to 3
gguf: Setting special token type pad to 1
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to True
gguf: Setting chat_template to {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Exporting model to 'quantized_model/FP16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_gate.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_up.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.attn_k.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_output.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_q.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_v.weight, n_dims = 2, torch.float16 --> float32
... (analogous "torch.float16 --> float32" conversion lines for blk.1 through blk.17 omitted) ...
gguf: loading model part 'model-00002-of-00002.safetensors'
blk.17.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.17.ffn_down.weight, n_dims = 2, torch.float16 --> float32
blk.17.ffn_norm.weight, n_dims = 1, torch.float16 --> float32
output_norm.weight, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'quantized_model/FP16.gguf'

Quantize the FP16 GGUF model to the 4-bit Q4_K_M format.

import os

for m in methods:
    qtype = f"{quantized_path}/{m.upper()}.gguf"
    os.system("./llama.cpp/quantize " + quantized_path + "/FP16.gguf " + qtype + " " + m)

#
# Smoke-test the quantized model interactively (-n 90 tokens, -r "User:" as the reverse prompt, -f loads the example chat prompt)
! ./llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt

##########################################################################
Log start
main: build = 2355 (e04e04f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709783565
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 164 tensors from ./quantized_model/Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = original_model
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 2048
llama_model_loader: - kv 4: gemma.block_count u32 = 18
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 16384
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 8
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 1
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 37 tensors
llama_model_loader: - type q4_K: 108 tensors
llama_model_loader: - type q6_K: 19 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 18
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 2.51 B
llm_load_print_meta: model size = 1.51 GiB (5.18 BPW)
llm_load_print_meta: general.name = original_model
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 1 '<eos>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.06 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/19 layers to GPU
llm_load_tensors: CPU buffer size = 1548.98 MiB
........................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 6.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 504.00 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 90, n_keep = 1


== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:How are you ?
Bob: I am doing well, thank you. And how may I assist you today?



llama_print_timings: load time = 414.63 ms
llama_print_timings: sample time = 5.44 ms / 19 runs ( 0.29 ms per token, 3490.72 tokens per second)
llama_print_timings: prompt eval time = 799.78 ms / 100 tokens ( 8.00 ms per token, 125.04 tokens per second)
llama_print_timings: eval time = 842.40 ms / 18 runs ( 46.80 ms per token, 21.37 tokens per second)
llama_print_timings: total time = 16391.73 ms / 118 tokens

Log in to your HF account again to upload the GGUF file.

from huggingface_hub import notebook_login
notebook_login()

Set up the required model path and create the HF repo.

from huggingface_hub import HfApi, HfFolder, create_repo, upload_file
model_path = "./quantized_model/Q4_K_M.gguf" # Your model's local path
repo_name = "gemma-medical_qa-GGUF" # Desired HF Hub repository name
repo_url = create_repo(repo_name, private=False)
#

Upload the quantized model to the Hugging Face Hub.

api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",
    repo_id="Plaban81/gemma-medical_qa-GGUF",
    repo_type="model",
)


##############################################################
Q4_K_M.gguf: 100%
1.63G/1.63G [01:15<00:00, 22.4MB/s]
CommitInfo(commit_url='https://huggingface.co/Plaban81/gemma-medical_qa-GGUF/commit/811ba25102252c4ab1a5739ad5cc9d06a55a9b82', commit_message='Upload Q4_K_M.gguf with huggingface_hub', commit_description='', oid='811ba25102252c4ab1a5739ad5cc9d06a55a9b82', pr_url=None, pr_revision=None, pr_num=None)

Download the quantized model for inference

!wget "https://huggingface.co/Plaban81/gemma-medical_qa-GGUF/resolve/main/Q4_K_M.gguf"

Install llama-cpp-python with GPU (cuBLAS) support.

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Use the GGUF model for inference with llama-cpp-python.

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
    model_path="/content/Q4_K_M.gguf",  # Download the model file first
    n_ctx=32768,  # The max sequence length to use; note this exceeds Gemma's 8192 training context, and longer sequences require much more memory
    n_threads=1,  # The number of CPU threads to use, tailor to your system and the resulting performance
    n_gpu_layers=-1,  # The number of layers to offload to GPU, if you have GPU acceleration available
)

############################################################################
llama_model_loader: loaded meta data with 24 key-value pairs and 164 tensors from /content/Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
... (metadata keys/values and hyperparameter printout identical to the llama.cpp load log shown earlier) ...
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: CPU buffer size = 410.16 MiB
llm_load_tensors: CUDA0 buffer size = 1548.98 MiB
........................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 576.00 MiB
llama_new_context_with_model: KV self size = 576.00 MiB, K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 69.26 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 592.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph splits (measure): 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}", 'tokenizer.ggml.add_eos_token': 'true', 'tokenizer.ggml.padding_token_id': '1', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.eos_token_id': '1', 'tokenizer.ggml.bos_token_id': '2', 'general.architecture': 'gemma', 'gemma.feed_forward_length': '16384', 'tokenizer.ggml.add_bos_token': 'true', 'gemma.attention.head_count': '8', 'general.name': 'original_model', 'gemma.context_length': '8192', 'gemma.embedding_length': '2048', 'gemma.block_count': '18', 'gemma.attention.head_count_kv': '1', 'gemma.attention.key_length': '256', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'gemma.attention.layer_norm_rms_epsilon': '0.000001', 'gemma.attention.value_length': '256', 'general.file_type': '15'}
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>
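
For context, the logs above were emitted while loading the quantized model with llama-cpp-python. A minimal sketch of that loading step is shown below; the model path, context size, and GPU-offload setting are assumptions read off the logs above, not new code from this article:

from llama_cpp import Llama
#
# Sketch: load the quantized GGUF produced earlier. model_path, n_ctx and
# n_gpu_layers are assumed from the llm_load_tensors / llama_new_context logs.
llm = Llama(
    model_path="/content/Q4_K_M.gguf",
    n_ctx=32768,      # context window reported by llama_new_context_with_model
    n_gpu_layers=-1,  # offload all 19 layers to the GPU, as logged above
)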

Query 1

query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}"""
output = llm(
    prompt=query,
    max_tokens=512,  # generate up to 512 tokens
)
output
######################################################################
{'id': 'cmpl-a53def0a-3c2d-4d09-b5d0-97f09b6fb7d6',
'object': 'text_completion',
'created': 1709784839,
'model': '/content/Q4_K_M.gguf',
'choices': [{'text': " <start_of_turn>\n<end_of_turn>model<analysis>\n\nThis is a question about diagnosing acute lymphoblastic leukemia (ALL) in an 8-year-old boy who has acute lymphoblastic leukemia (ALL). The key information is that the boy had acute lymphoblastic leukemia before starting chemotherapy 5 days ago. The question asks for which serum study and urine findings will help confirm the diagnosis.\n\nThe choice of tests is important because ALL is a diagnosis that can be missed in children. The key findings in the question stem are hyperkalemia, hyperphosphatemia, hypocalcemia, and hyperuricemia. \n\nA serum study like an ESR, electrolytes, and calcium would help confirm the diagnosis of ALL by confirming the presence of leukemic cells. A urine study like urinalysis would help confirm the diagnosis by detecting increased levels of urinary leukocyte casts or hemoglobin indicating hemoglobinuria. \n\nChoice B (hyperuricemia, hyperkalemia, hyperphosphatemia, heme) refers to acute leukemias and would not help differentiate ALL from other lymphoblastic leukemia diagnoses. Choice C (hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals) would help confirm the diagnosis but does not provide information about leukemic cells themselves. \n\nChoice D (hyperuricemia, hyperkalemia, hyperphosphatemia, urinary monoclonal spike) refers to acute leukemias and would help confirm the diagnosis. However, it does not provide information about leukemic cells themselves. Choice E (hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals) refers to acute leukemias and would help confirm the diagnosis. However, it does not provide information about leukemic cells themselves. \n</analysis>\n<answer>\nD: Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike\n</answer> <end_of_turn>\n roble's analysis is correct. The key findings in the question stem are hyperkalemia, hyperphosphatemia, hypocalcemia, and hyperuricemia. These findings are consistent with acute lymphoblastic leukemia. A urine study like urinalysis would help confirm the diagnosis of acute lymphoblastic leukemia. \n</analysis> <end_of_turn>\nHere are the other choices: \nA: Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM) - this would help confirm the diagnosis of acute lymphoblastic leukemia. \nB: Hyperkalemia, hyperphosphatemia, hypocalcemia",
'index': 0,
'logprobs': None,
'finish_reason': 'length'}],
'usage': {'prompt_tokens': 304,
'completion_tokens': 512,
'total_tokens': 816}}

Query 2

query = """\n\n Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}"""
output = llm(
    prompt=query,
    max_tokens=512,  # generate up to 512 tokens
)

output
############################################################################
{'id': 'cmpl-0c9f6a09-42d1-4367-bb3d-d532d87c4dc0',
'object': 'text_completion',
'created': 1709785074,
'model': '/content/Q4_K_M.gguf',
'choices': [{'text': ' <end_of_turn>\n<end_of_turn>model<analysis>\n\nThis is a question about diagnosing a patient with acute lymphoblastic leukemia (ALL) based on a history and physical examination findings. The key findings are:\n- 8-year-old boy\n- Acute lymphoblastic leukemia diagnosis\n- 1st dose of chemotherapy 5 days ago\n- Leukoctane count 60,000/mm3\n- Vital signs include tachycardia, edema, and hyperuricemia\n\nThe differential diagnosis includes:\n- Uricosuria due to hyperuricemia and elevated creatinine kinase (CK)\n- Uric acid crystals in the urine due to hyperuricemia and elevated creatine kinase (CK)\n- Malic aciduria due to hyperuricemia and elevated creatinine kinase (CK)\n\nThe key tests are:\n- Serum studies should include hyperkalemia, hyperphosphatemia, hypocalcemia, and elevated CK. \n- Urine studies should include a positive heme test for heme.\n\nBased on these tests, the most likely diagnosis is uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. Uric acid crystals in the urine will confirm the diagnosis.\n</analysis>\n<answer>\nE: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals\n</answer> <end_of_turn>\n Reasoning:\nThe question asks for additional tests to confirm the diagnosis of uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The tests requested best confirm uricosuria due to hyperuricemia and elevated CK. The positive heme test and urine oxalate crystals confirm the diagnosis.\n</analysis> <start_of_turn>\n<answer>\nE: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals\n</answer> <end_of_turn>\n Reasoning:\nThe question asks for additional tests to confirm the diagnosis of uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The tests requested best confirm uricosuria due to hyperuricemia and elevated CK. The positive heme test and urine oxalate crystals confirm the diagnosis.',
'index': 0,
'logprobs': None,
'finish_reason': 'stop'}],
'usage': {'prompt_tokens': 306,
'completion_tokens': 449,
'total_tokens': 755}}

Extracting the answer

print(output["choices"][0]["text"].split("<end_of_turn>\n<end_of_turn>model")[-1])

######################################################################
<analysis>

This is a question about diagnosing a patient with acute lymphoblastic leukemia (ALL) based on a history and physical examination findings. The key findings are:
- 8-year-old boy
- Acute lymphoblastic leukemia diagnosis
- 1st dose of chemotherapy 5 days ago
- Leukoctane count 60,000/mm3
- Vital signs include tachycardia, edema, and hyperuricemia

The differential diagnosis includes:
- Uricosuria due to hyperuricemia and elevated creatinine kinase (CK)
- Uric acid crystals in the urine due to hyperuricemia and elevated creatine kinase (CK)
- Malic aciduria due to hyperuricemia and elevated creatinine kinase (CK)

The key tests are:
- Serum studies should include hyperkalemia, hyperphosphatemia, hypocalcemia, and elevated CK.
- Urine studies should include a positive heme test for heme.

Based on these tests, the most likely diagnosis is uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. Uric acid crystals in the urine will confirm the diagnosis.
</analysis>
<answer>
E: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals
</answer> <end_of_turn>
Reasoning:
The question asks for additional tests to confirm the diagnosis of uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The tests requested best confirm uricosuria due to hyperuricemia and elevated CK. The positive heme test and urine oxalate crystals confirm the diagnosis.
</analysis> <start_of_turn>
<answer>
E: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals
</answer> <end_of_turn>
Reasoning:
The question asks for additional tests to confirm the diagnosis of uricosuria caused by hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The tests requested best confirm uricosuria due to hyperuricemia and elevated CK. The positive heme test and urine oxalate crystals confirm the diagnosis.
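
The split above works for this particular completion, but it is brittle because the model does not always emit the same sequence of turn markers. A more defensive sketch (assuming only the <answer></answer> tags that the fine-tuning prompt asks for) pulls the answer out with a regular expression:

import re
#
# Grab the first <answer>...</answer> span from the completion; fall back
# to the raw text if the tags are missing.
text = output["choices"][0]["text"]
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
answer = match.group(1).strip() if match else text.strip()
print(answer)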

Conclusion

Here we have instruction fine-tuned the Gemma-2b-it model on a medical reasoning task. Post fine-tuning, we quantized the fine-tuned model to GGUF format (Q4_K_M) using llama.cpp. We then downloaded the quantized model and used it for inference via the llama-cpp-python package.

References:

AIAnytime/GGUF-Quantization-of-any-LLM (GitHub): https://github.com/AIAnytime/GGUF-Quantization-of-any-LLM

Connect with me.
