Photo by Brett Jordan on Unsplash

GenAI Essentials (Part 1)

Large Language Models with Camel 5B and Open LLaMa 3B v2

Benjamin Consolvo
4 min read · Oct 15, 2023


Large language models (LLMs) have taken the world by storm this past year with chatbots, code generation and debugging, retrieval-augmented generation, instruction following, and many more applications. In this article, I demonstrate LLM inference on the latest Intel Data Center GPU Max 1100 using two models:

  1. Camel 5B: Derived from the Palmyra-Base architecture, Camel 5B is a 5-billion-parameter LLM trained on 70K instruction-response records. It differentiates itself from other LLMs by its ability to follow complex instructions and generate contextually accurate responses.
  2. Open LLaMA 3B v2: This is an open-source, 3-billion-parameter reproduction of Meta’s LLaMA model, trained on a variety of open data. Its biggest advantage is that, unlike Meta’s original LLaMA release, it is permissively licensed for broader consumption.

Note that these particular models were not fine-tuned for chat, so your mileage may vary in terms of their responses.
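
If you want to experiment with these models outside of the notebook, loading them follows the standard Hugging Face pattern. Below is a minimal sketch; the model IDs are assumed to be the public Writer/camel-5b-hf and openlm-research/open_llama_3b_v2 checkpoints, and the notebook itself may load them differently.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public Hugging Face model IDs; pick the one you want to try
model_id = "Writer/camel-5b-hf"  # or "openlm-research/open_llama_3b_v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16 helps performance on the Intel GPU
)
model.eval()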

I used the Intel Data Center GPU Max 1100 (56 Xe cores, 48 GB memory, 300 Watts TDP) for my inference tests. From the command line, I can verify that I do indeed have these GPUs:

clinfo -l

The output from this command shows that my system has four Intel GPUs:

Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
+-- Device #1: Intel(R) Data Center GPU Max 1100
+-- Device #2: Intel(R) Data Center GPU Max 1100
`-- Device #3: Intel(R) Data Center GPU Max 1100

Similar to the nvidia-smi utility, you can run xpu-smi to get GPU usage statistics:

xpu-smi dump -d 0 -m 0,5,18

The command reports GPU utilization (metric 0), GPU memory utilization (metric 5), and GPU memory used (metric 18) for device 0 every second:

getpwuid error: Success
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
13:34:51.000, 0, 0.02, 0.05, 28.75
13:34:52.000, 0, 0.00, 0.05, 28.75
13:34:53.000, 0, 0.00, 0.05, 28.75
13:34:54.000, 0, 0.00, 0.05, 28.75

A Jupyter notebook is hosted on the Intel Developer Cloud so you can run the LLM examples yourself. It gives you the option of using either of the models listed above. Here’s how to get started:

  1. Register for an Intel Developer Cloud account as a Standard user.
  2. Log into your account and head over to the Training and Workshops section.
  3. Click the Gen AI Essentials “Launch” option below “Simple LLM Inference: Playing with Language Models” (Figure 2).
Figure 2: Try out the GenAI Essentials Training on the Intel Developer Cloud. Image by author.

Figure 3 shows the user interface within the LLM notebook. You have the option of selecting a model, interacting with or without context, and changing the generation parameters (a sketch of how they map onto a generation call follows the list):

  • Temperature: Scales the sampling (softmax) distribution to control randomness. Higher values increase randomness, while lower values make the generation more deterministic.
  • Top P: The cumulative probability threshold for nucleus sampling; only the smallest set of tokens whose probabilities sum to at least this value is considered. This controls the trade-off between diversity and predictability.
  • Top K: The number of highest probability vocabulary tokens to keep for top-k filtering.
  • Num Beams: The number of beams for beam search. This controls the breadth of the search.
  • Rep Penalty: The penalty applied to tokens that have already been generated, discouraging repetition.
Figure 3. A mini user interface within the Jupyter notebook provides a text prompt and in-line responses.
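
The notebook handles these parameters for you, but roughly speaking they map onto the keyword arguments of a Hugging Face generate() call. The sketch below is illustrative and continues from the loading sketch above; the prompt and parameter values are placeholders, not the notebook’s defaults.

# Illustrative mapping of the UI parameters onto model.generate() arguments
prompt = "What can large language models be used for?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,          # sampling must be enabled for temperature/top_p/top_k to apply
    temperature=0.7,         # Temperature
    top_p=0.9,               # Top P (nucleus sampling)
    top_k=50,                # Top K
    num_beams=1,             # Num Beams
    repetition_penalty=1.1,  # Rep Penalty
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))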

The Intel Extension for PyTorch was used to speed up inference on the Intel GPU. Two of the key functions are

ipex.optimize_transformers(self.model, dtype=self.torch_dtype)

and

ipex.optimize(self.model, dtype=self.torch_dtype)

where self.model is the loaded LLM model and self.torch_dtype is the data type, which should be torch.bfloat16 to boost performance on the Intel GPU.
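
Putting it together, here is a minimal sketch of how this might look on the Intel GPU (the "xpu" device), continuing from the earlier loading sketch. Which of the two calls applies depends on the installed version of the extension; the notebook’s actual code may differ.

import torch
import intel_extension_for_pytorch as ipex

model = model.to("xpu")        # move the loaded model to the Intel GPU
torch_dtype = torch.bfloat16   # recommended data type on this GPU

# Newer ipex releases expose optimize_transformers(); fall back to optimize() otherwise
if hasattr(ipex, "optimize_transformers"):
    model = ipex.optimize_transformers(model, dtype=torch_dtype)
else:
    model = ipex.optimize(model, dtype=torch_dtype)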

I was able to generate responses with these models within seconds once they were loaded into memory. As I mentioned above, these models are not fine-tuned for chat, so your mileage may vary in terms of their responses.

You can reach me on the Intel DevHub Discord server here (user name bconsolvo), LinkedIn here, or Twitter here. Thank you for reading. Happy coding!

Disclaimer for Using Large Language Models

Please be aware that while LLMs like Camel 5B and Open LLaMA 3B v2 are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It’s advisable to review the generated text carefully and consider the context and application in which you are using these models.

Usage of these models must also adhere to their licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.
