Generated from the prompt “zephyr bird in a future tech world with blue hues” by Stable Diffusion XL (https://huggingface.co/spaces/google/sdxl)

GenAI Essentials

Inference with Falcon-7B and Zephyr-7B on the Intel Developer Cloud

Benjamin Consolvo
5 min read · Dec 4, 2023


It is no secret that open-source large language models (LLMs) like Falcon-7B and Zephyr-7B make it possible to build your own conversational AI system that runs efficiently on smaller hardware platforms, while remaining accessible to the broader AI developer community. The “7B” tag indicates that these are 7-billion-parameter models, which sit at the smaller end of LLMs when compared to 13-billion and 70-billion+ parameter models. Right now, GPT-4 outperforms these smaller models, but I believe that in 2024 we will see the gap close between small open-source models and large closed-source models. In this article, I briefly introduce the two 7B models and show how to get started with them on the latest Intel GPU.

Falcon-7B

Falcon-7B was built by the Technology Innovation Institute and is a raw pretrained model. On its own, it is not directly suitable for chat, but after fine-tuning on your own dataset it becomes a highly capable LLM for chat, text classification, question answering, and other text generation tasks. It is an open-source model released under the Apache License (v2.0), making it publicly accessible for academic and commercial use. It was trained on a dataset called RefinedWeb for English and French. RefinedWeb is a whopping 2.8TB, web-only dataset consisting of 968M individual web pages, filtered from a large-scale CommonCrawl dataset. The same data was also used to train Falcon-40B, the more powerful “big brother” of Falcon-7B. For more information on the training data, check out the documentation on Hugging Face.
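
If you just want to see the raw model generate text, here is a minimal sketch using the transformers pipeline API; the checkpoint tiiuae/falcon-7b is the public Hugging Face model ID, and the prompt and sampling parameters are only illustrative, not the notebook's exact code.

import torch
from transformers import pipeline

# Load the raw pretrained Falcon-7B checkpoint for plain text generation.
# device_map="auto" requires the accelerate package to place the model.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# The base model simply continues the prompt; it is not instruction-tuned for chat.
result = generator(
    "Large language models are",
    max_new_tokens=50,
    do_sample=True,
    top_k=10,
)
print(result[0]["generated_text"])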

Zephyr-7B

Zephyr-7B β, a model fine-tuned from Mistral-7B, was built by Hugging Face and distributed under the MIT License. Unlike Falcon-7B, Zephyr-7B is already fine-tuned for chat and is ready out of the box for a chat application. Zephyr-7B was fine-tuned using the UltraChat and the UltraFeedback datasets, which are described in more detail below.
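
To see what “ready out of the box” means in practice, here is a minimal sketch of a single chat turn with Zephyr-7B β through the transformers pipeline; HuggingFaceH4/zephyr-7b-beta is the public checkpoint ID, and the messages and sampling parameters are just examples.

import torch
from transformers import pipeline

# Zephyr-7B beta is already instruction-tuned, so it can be prompted as a chat model.
chat = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a friendly, concise assistant."},
    {"role": "user", "content": "In one sentence, what is a 7-billion-parameter model?"},
]

# apply_chat_template formats the conversation with Zephyr's expected special tokens.
prompt = chat.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = chat(prompt, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
print(output[0]["generated_text"])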

UltraChat

The UltraChat dataset is a synthetic collection of dialogues generated by ChatGPT, e.g.:

How do you think innovation differs between large and small companies?

As an AI language model, I don’t have personal views or opinions — but I can provide information based on my training data. Innovation can differ between large and small companies in several ways: Resources: Large companies often have more resources available to invest in research and development, experimentation, and prototyping. On the other hand, small companies may have limited resources and must prioritize innovation efforts more carefully. …

This is only a snippet of the dialogue, as each training example is quite lengthy. There are 774K such dialogue examples spanning a wide variety of topics. As pointed out in the dataset description, the dialogue includes “rewriting, continuation, summarization, and inference.”
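
If you want to browse these dialogues yourself, here is a minimal sketch using the Hugging Face datasets library; I am assuming the stingning/ultrachat dataset ID (the original UltraChat release) and a train split, and streaming so that the full corpus is not downloaded up front.

from datasets import load_dataset

# Stream the UltraChat dialogues instead of downloading the whole corpus.
# Dataset ID and split name are assumptions; check the dataset card before relying on them.
ultrachat = load_dataset("stingning/ultrachat", split="train", streaming=True)

# Peek at the first dialogue to see the multi-turn structure.
first_dialogue = next(iter(ultrachat))
print(first_dialogue)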

UltraFeedback

The UltraFeedback dataset is a collection of 64K prompts, each paired with responses from a wide range of models including GPT-3.5 Turbo, MPT-30B-Chat, Alpaca 7B, Pythia-12B, StarChat, and others. Four different responses are generated for each prompt, which means that there are a total of 256K samples. GPT-4 is then used to annotate the collected samples.
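
As with UltraChat above, you can inspect UltraFeedback with a few lines of Python; the openbmb/UltraFeedback dataset ID is my assumption based on the public release, so double-check the dataset card.

from datasets import load_dataset

# Each record holds a prompt plus several model completions and their GPT-4 annotations.
ultrafeedback = load_dataset("openbmb/UltraFeedback", split="train", streaming=True)

# Print the field names of the first record to see how prompts and responses are stored.
sample = next(iter(ultrafeedback))
print(sample.keys())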

Getting Started on the Intel Developer Cloud

You can get started for free with a Jupyter Notebook hosted on the Intel Developer Cloud, so you can run the LLM examples yourself using the latest Intel AI hardware together with Intel-optimized AI software. I have added these two models to the existing Simple LLM Inference notebook so that you can try them right away. Just click the Launch button under “Simple LLM Inference: Playing with Language Models” on the home page to open the Jupyter Notebook:

Launching the LLM Inference Jupyter Notebook on the Intel Developer Cloud home page (image by author)

Notes on the Code

All of the required Python frameworks come preinstalled on the Intel Developer Cloud, including transformers, torch, and intel_extension_for_pytorch. The Zephyr-7B and Falcon-7B models are loaded with the usual transformers framework:

from transformers import AutoModelForCausalLM, AutoTokenizer

Here is where the actual tokenizer and model are instantiated:

# Load the tokenizer, caching model files in the shared data directory.
self.tokenizer = AutoTokenizer.from_pretrained(
    model_id_or_path,
    trust_remote_code=True,
    cache_dir="/home/common/data/Big_Data/GenAI/",
)

# Load the model in bfloat16, move it to the target device, and switch to eval mode.
self.model = (
    AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        cache_dir="/home/common/data/Big_Data/GenAI/",
    )
    .to(self.device)
    .eval()
)

To get the most out of the latest Intel Data Center GPU Max 1100, both PyTorch and the Intel Extension for PyTorch come preinstalled in the conda environment pytorch-gpu that is loaded with the notebook. You can visit the GitHub links provided to install these on your own instances, if needed.

The two key functions that are used with Intel Extension for PyTorch are:

ipex.optimize_transformers(self.model, dtype = torch.bfloat16)

and

ipex.optimize(self.model, dtype = torch.bfloat16)

where self.model is the loaded LLM and torch.bfloat16 is the lower-precision data type used to boost performance on the Intel GPU. The nice thing about this extension is that very little code modification is needed when coming from another platform: switching the device to xpu and making these small additions are really all you should need, as the sketch below illustrates.
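
Putting the pieces together, here is a minimal end-to-end sketch of how the xpu device and the Intel Extension for PyTorch calls fit around a loaded model. The model ID and prompt are placeholders, and the availability of optimize_transformers depends on your ipex version, so treat this as illustrative rather than as the notebook's exact code.

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "xpu"  # the Intel GPU device string, analogous to "cuda" on other platforms
model_id = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; Falcon-7B works the same way

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    .to(device)
    .eval()
)

# Apply the transformer-specific optimizations when available; otherwise fall back
# to the generic optimize() path mentioned above.
try:
    model = ipex.optimize_transformers(model, dtype=torch.bfloat16)
except AttributeError:
    model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("What is the Intel Data Center GPU Max 1100?", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))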

Summary

Falcon-7B and Zephyr-7B are smaller LLMs when compared to their much larger model equivalents (e.g., Falcon-180B), but deliver performant and efficient inference. Falcon-7B is an example of a model that can be fine-tuned for many text tasks including chat, text classification, and question answering. Zephyr-7B was already fine-tuned from another model called Mistral-7B, and it works great out of the box for chat. Both of these models can be used on the Intel Developer Cloud with the provided sample Jupyter Notebook by clicking on “Simple LLM Inference: Playing with Language Models” on the home page after registering. You are welcome to try these models as well as bring your own models from Hugging Face. I look forward to hearing about your experience with these models on the Intel Developer Cloud.

You can reach me on the Intel DevHub Discord server here, LinkedIn here, or Twitter here. Thank you for reading.

Disclaimer for Using Large Language Models

Please be aware that while LLMs like Falcon-7B and Zephyr-7B are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It’s advisable to carefully review the generated text and consider the context and application in which you are using these models.

Usage of these models must also adhere to their licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.


Benjamin Consolvo
Intel Analytics Software

AI Software Engineering Manager at Intel. I like to write on topics in AI to help other developers along their coding journey.