Running ORCA LLM on a CPU: A Bad Idea, but a Viable Option

When CPUs Are Your Only Resource and Time Is All You Have

bedy kharisma
Data And Beyond


Yes, ORCA has beaten ChatGPT, and they are not joking; read the paper from Microsoft Research here. See the comparison in the figure below:

Orca (13B params) outperforms a wide range of foundation models, including OpenAI ChatGPT, as evaluated by GPT-4 on the Vicuna evaluation set.
Explanation tuning with Orca (13B params) bridges the gap with OpenAI foundation models like Text-da-Vinci-003, with a 5-point gap against ChatGPT (further reduced with optimized system messages) across a wide range of professional and academic exams, including GRE, GMAT, LSAT, and SAT from the AGIEval benchmark [1], in zero-shot settings (without any exemplars or CoT).

Building on the approach described in that paper, Pankaj M developed a customized OpenLLaMa-7B model by training it on carefully curated datasets. These datasets were created by combining instructions and inputs from the WizardLM, Alpaca, and Dolly-V2 datasets, while incorporating the dataset construction techniques described in the Orca research paper. Notably, he used all 15 system instructions provided in the Orca research paper to generate these custom datasets, which differs from the conventional instruction-tuning methods employed in the original datasets.

Unfortunately, the example given on his Hugging Face model page is rather vague to decipher. A successful reproduction has been made and is explained beautifully in the following video:

While the video runs ORCA LLM on a GPU, what if you are eager to try out ORCA LLM but all you have at your disposal is a CPU? Running it on a CPU is a bad idea, but it is still possible, and this article might be just for you:

Code Explanation:

Let’s walk through the code and understand each part:

!pip install -q torch
!pip install -q langflow
!pip install -q accelerate
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
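These commands install PyTorch, accelerate, and the latest transformers from source (langflow is installed in the original notebook but is not actually imported by this script). The last line pulls the CUDA 11.8 build of PyTorch; on a CPU-only machine you can skip it and, as an optional alternative sketch, install the lighter CPU-only wheels instead:

!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu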


import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hugging Face model_path
model_path = 'psmathur/orca_mini_7b'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float32, offload_folder="offload"
)

I ran this code on a laptop with an Intel Core(TM) i3 10th-gen CPU @ 2.1 GHz and 8 GB of RAM. Please note that to run this script you will need at least 30 GB of disk space; the model itself is at least 24 GB.
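Before the from_pretrained call above triggers the multi-gigabyte download, it is worth confirming you actually have the headroom. Here is a minimal sketch using only the Python standard library; the 30 GB threshold simply mirrors the figure mentioned above and is an assumption, not a hard requirement:

import shutil

# Free space on the drive that will hold the Hugging Face cache and the offload folder
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < 30:
    print("Warning: less than 30 GB free; the model download may not fit.")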

In this section, we import the necessary libraries: torch for PyTorch, plus LlamaForCausalLM and LlamaTokenizer from the Hugging Face transformers library. We specify the model path for ORCA LLM and initialize the tokenizer and model with the from_pretrained method. We set torch_dtype to torch.float32, a CPU-friendly data type, and the offload_folder parameter specifies the folder where model weights can be offloaded to disk.
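If the default load is too heavy for 8 GB of RAM, one possible variant (a sketch, not what the original script uses, and dependent on your transformers and accelerate versions) is to let transformers load the checkpoint more lazily via accelerate and spill layers to the offload folder:

model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,    # avoid building a second full copy of the weights in RAM
    device_map="auto",         # let accelerate place layers and offload the rest to disk
    offload_folder="offload",  # where offloaded layers are written
)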

# Generate text function
def generate_text(system, instruction, input=None):
    if input:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    else:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"

    tokens = tokenizer.encode(prompt)
    tokens = torch.LongTensor(tokens).unsqueeze(0)

    instance = {'input_ids': tokens, 'top_p': 1.0, 'temperature': 0.7, 'generate_len': 1024, 'top_k': 50}

    length = len(tokens[0])
    with torch.no_grad():
        rest = model.generate(
            input_ids=tokens,
            max_length=length + instance['generate_len'],
            use_cache=True,
            do_sample=True,
            top_p=instance['top_p'],
            temperature=instance['temperature'],
            top_k=instance['top_k']
        )
    output = rest[0][length:]
    string = tokenizer.decode(output, skip_special_tokens=True)
    return f'[!] Response: {string}'

Next, we define the generate_text function, which takes a system prompt, a user instruction, and an optional input, and generates a text response with ORCA LLM. The function builds an Orca-style prompt, encodes it with the tokenizer, converts the token list to a PyTorch tensor, and sets up the sampling parameters (temperature, top-p, top-k, and a maximum generation length). The model then generates a continuation, and only the newly generated tokens after the prompt are decoded back into human-readable text.
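As a quick smoke test before wiring up the interactive loop, the function can be called directly once the model is loaded; the prompt below is just an illustrative example, not part of the original script:

# One-off call; on a CPU this can take several minutes for a long answer
example_system = 'You are a helpful assistant.'
print(generate_text(example_system, "Explain what a tokenizer does in one sentence."))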

system = 'You are an AI assistant that follows instructions extremely well. Help as much as you can.'

def main():
    print("Ready!")
    while True:
        user_input = input("User: ")
        if user_input == "exit":
            break
        res = generate_text(system, user_input)
        print("Orca:", res)

if __name__ == "__main__":
    main()

Finally, we define the main function, which serves as the entry point of the program. It prompts the user for input, generates a response using the generate_text function, and prints the result. The loop continues until the user enters "exit".
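Because every answer takes a while on a CPU, it can be useful to see how long each generation actually takes. A minimal variation of the loop with wall-clock timing (my own addition, not part of the original script) could look like this:

import time

def main():
    print("Ready!")
    while True:
        user_input = input("User: ")
        if user_input == "exit":
            break
        start = time.time()
        res = generate_text(system, user_input)
        print("Orca:", res)
        print(f"[generated in {time.time() - start:.1f} s]")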

Conclusion:

In conclusion, running ORCA LLM on a CPU is indeed much slower than a GPU-accelerated setup. However, when a CPU is the only available resource and time is abundant, it remains a viable option for exploring ORCA LLM’s capabilities. While the response times are far from snappy, this approach can still produce meaningful results. So, if you find yourself in a similar situation, give it a try and unlock the potential of ORCA LLM with the resources you have!

A snapshot of how it tortures my CPU is presented below:

Complete Code:

Here’s the complete code for running ORCA LLM on CPU:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hugging Face model_path
model_path = 'psmathur/orca_mini_7b'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float32, offload_folder="offload"
)

# Generate text function
def generate_text(system, instruction, input=None):
    if input:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    else:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"

    tokens = tokenizer.encode(prompt)
    tokens = torch.LongTensor(tokens).unsqueeze(0)

    instance = {'input_ids': tokens, 'top_p': 1.0, 'temperature': 0.7, 'generate_len': 1024, 'top_k': 50}

    length = len(tokens[0])
    with torch.no_grad():
        rest = model.generate(
            input_ids=tokens,
            max_length=length + instance['generate_len'],
            use_cache=True,
            do_sample=True,
            top_p=instance['top_p'],
            temperature=instance['temperature'],
            top_k=instance['top_k']
        )
    output = rest[0][length:]
    string = tokenizer.decode(output, skip_special_tokens=True)
    return f'[!] Response: {string}'

system = 'You are an AI assistant that follows instructions extremely well. Help as much as you can.'

def main():
    print("Ready!")
    while True:
        user_input = input("User: ")
        if user_input == "exit":
            break
        res = generate_text(system, user_input)
        print("Orca:", res)

if __name__ == "__main__":
    main()

Jupyter Notebook result
