Empower Applications with Optimized LLMs: Performance, Cost, and Beyond
Impact of using optimized Databricks’ Dolly 2.0 on 4th Gen Intel® Xeon® Scalable processors
Authors: Ezequiel Lanza, Imtiaz Sajwani
Integrating Large Language Models (LLMs) into apps is a well-established trend, as companies progressively acknowledge their worth. However, determining the most suitable model for specific cases and identifying the optimal balance between performance and cost-effectiveness can pose a challenge.
Examples abound, such as OpenAI* ChatGPT*, Bing*’s GPT4*, Claude 2*, Bard/PaLM2*, Dolly*, and Meta* Llama v2, with options ranging from paid APIs to open source alternatives. However, the pursuit goes beyond just finding the most accurate model. Other factors, such as throughput and latency, also have a significant impact on the overall cost of the solution.
In this post, we’ll highlight the significance of adopting optimized LLMs. Specifically, we’ll delve into the latest open source addition, Dolly 3B, and how Intel® Extension for Transformers* can help with the optimization. We’ll then look at the performance results achieved by the optimized Dolly model on a platform based on 4th Gen Intel® Xeon® Scalable processors, which have built-in accelerators like Intel® Advanced Matrix Extensions (Intel® AMX).
How can it be optimized?
Dolly, like other LLMs, is based on the Transformer architecture, which is known to require substantial computational resources because of its large number of parameters. However, LLMs with fewer parameters can offer similar performance for customer service-focused companies, while their training and computational requirements are less intensive.
There are multiple ways to optimize a model (read this post to see how a BERT model was optimized), and in this case, we’ll be using Intel® Extension for Transformers*. This toolkit improves the performance of Transformer-based models. Running on Intel platforms, especially 4th Gen Intel® Xeon® Scalable processors, Habana® Gaudi®2, and Habana® Greco™, can provide additional advantages. The toolkit offers a full pipeline for a Transformer-based model, with a particular focus on the LLM architecture, as illustrated in Figure 1. For example, users can fine-tune models within Intel® Extension for Transformers* using techniques including LoRA, P-tuning, RLHF, and more. Intel® Extension for Transformers* also offers advanced quantization techniques like SmoothQuant, GPTQ (4-bit), and 4-bit C++ inference.
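To give a feel for what quantization does, here is a minimal sketch of plain symmetric per-tensor INT8 quantization in pure Python. It is not SmoothQuant, GPTQ, or the toolkit's implementation; the function names and sample weights are made up for illustration. It only shows the basic idea those techniques build on: storing weights as small integers plus a scale factor, at the cost of a small rounding error.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)      # integer codes that replace the float weights
print(max_error)  # worst-case rounding error, bounded by scale / 2
```

Each weight now needs 8 bits instead of 16 or 32, which reduces memory traffic and lets integer matrix hardware do the work. The more advanced methods named above mainly differ in how they choose scales so that accuracy is preserved.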
How to measure LLM performance
Like any workload, generative AI needs to be measured, and it has its own specific metrics.
When you interact with an LLM, whether through a chatbot or a Q&A system, the model generates text based on an initial prompt. To grasp the mechanics behind that, you can think of LLMs as word weavers. They craft sentences sequentially (one word at a time), much like weaving by adding one thread after another. The selection of each new word relies on the preceding words, ensuring a coherent and logical progression of the phrase. This process involves two main components, as shown in Figure 2: the model (Dolly in this case) and a decoder (Beam Search) that helps generate the next word based on the words the model has already generated (you can refer to this post to see the details). In most LLM applications, the model’s performance is assessed through token latency.
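The decoding step can be sketched with a toy example. The snippet below is a minimal beam search in pure Python, assuming a hypothetical `toy_next_token_probs` function that stands in for the model; the vocabulary and probabilities are invented and have nothing to do with Dolly. A real decoder works on model logits and handles batching, length penalties, and end-of-sequence rules.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"

# Hypothetical stand-in for a language model: next-token probabilities
# given the tokens generated so far.
TOY_TABLE: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def toy_next_token_probs(tokens: Tuple[str, ...]) -> Dict[str, float]:
    # Unknown prefixes simply end the sequence.
    return TOY_TABLE.get(tokens, {EOS: 1.0})


def beam_search(
    next_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_new_tokens: int,
) -> List[str]:
    """Keep the `beam_width` most probable partial sequences at every step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]  # (tokens, log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == EOS:
                candidates.append((tokens, score))  # finished sequence, keep as is
                continue
            for token, prob in next_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return list(beams[0][0])


greedy = beam_search(toy_next_token_probs, beam_width=1, max_new_tokens=2)
wider = beam_search(toy_next_token_probs, beam_width=2, max_new_tokens=2)
print(greedy)  # ['the', 'cat']  (probability 0.30)
print(wider)   # ['a', 'cat']    (probability 0.36)
```

With a beam width of 1 the search is greedy and commits to "the" because it is the most likely first word. A wider beam keeps "a" alive long enough to discover that "a cat" is the more probable sentence overall, which is why beam search is often preferred despite the extra computation per token.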
Typically, there are two aspects of latency measurement: “1st token” and “2nd token.” The 1st token latency is the time the LLM takes to generate the initial token after receiving a user prompt. Once the 1st token is generated, the time taken to provide the 2nd token is termed the “2nd token latency.” Evaluating both latencies plays a crucial role in establishing a positive user experience, maintaining the flow of conversation, enabling interactivity, and enhancing the effectiveness and engagement of language models across a wide array of applications.
Another significant metric to consider is throughput, measured in tokens per second, which indicates the rate at which tokens can be generated.
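These metrics can be collected with a simple timing harness around any token stream. The sketch below is an illustration only: `fake_token_stream` is a hypothetical stand-in for a streaming LLM (it just sleeps), and the "next token" figure is computed as the average gap between tokens after the first, which is an assumption about how the 2nd token latency is typically reported.

```python
import time
from typing import Dict, Iterable, Iterator


def fake_token_stream(num_tokens: int, first_delay: float, next_delay: float) -> Iterator[str]:
    """Hypothetical streaming model: the first token is slower (prompt processing)."""
    for i in range(num_tokens):
        time.sleep(first_delay if i == 0 else next_delay)
        yield f"tok{i}"


def measure_generation(stream: Iterable[str]) -> Dict[str, float]:
    """Time a token stream and derive latency and throughput figures."""
    start = time.perf_counter()
    # Record the arrival time of every token as the stream is consumed.
    arrivals = [time.perf_counter() for _ in stream]
    if not arrivals:
        raise ValueError("the stream produced no tokens")
    count = len(arrivals)
    total = arrivals[-1] - start
    next_latency = (arrivals[-1] - arrivals[0]) / (count - 1) if count > 1 else 0.0
    return {
        "tokens": float(count),
        "first_token_latency_s": arrivals[0] - start,
        "next_token_latency_s": next_latency,
        "throughput_tokens_per_s": count / total,
    }


metrics = measure_generation(fake_token_stream(5, first_delay=0.05, next_delay=0.01))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

To measure a real model, replace `fake_token_stream` with whatever iterator your serving stack exposes for streamed tokens. The harness itself does not depend on any particular framework.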
These values can provide insight into the model’s performance. An equally important factor is the hardware on which the model is executed. Beyond optimizing the model itself, applications can also leverage the hardware acceleration built into 4th Gen Intel® Xeon® Scalable processors.
To test the performance of this scenario (Figure 3), we ran assessments on GCP C3 instances, which come with integrated AI acceleration (Intel® Advanced Matrix Extensions (Intel® AMX)). This integration aims to make the adoption of LLMs more accessible, meeting the throughput (tokens per second) demands of various use cases. The optimized Intel software ecosystem makes a C3 instance an ideal environment for LangSmith and LangChain prompt applications and/or for fine-tuning the LLM for required tasks.
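Before benchmarking, it can be useful to confirm that an instance actually exposes AMX. On Linux, the kernel lists CPU features in `/proc/cpuinfo`, and AMX shows up as the flags `amx_tile`, `amx_bf16`, and `amx_int8`. The helper below is a small sketch that parses that text; the function name and the sample string are illustrative, and it assumes a kernel recent enough to report these flags.

```python
from typing import Set

AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}


def amx_flags_from_cpuinfo(cpuinfo_text: str) -> Set[str]:
    """Return the AMX-related flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # The first "flags" line is enough: every logical CPU repeats it.
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            return flags & AMX_FLAGS
    return set()


# Shortened, made-up sample of what a 4th Gen Xeon might report.
sample = "processor\t: 0\nflags\t\t: fpu vme avx512f amx_bf16 amx_tile amx_int8\n"
found = amx_flags_from_cpuinfo(sample)
print(sorted(found))  # ['amx_bf16', 'amx_int8', 'amx_tile']
```

On a real machine you would pass in the file contents, for example `amx_flags_from_cpuinfo(open("/proc/cpuinfo").read())`. An empty result means either the CPU lacks AMX or the kernel or hypervisor does not expose it.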
c3-highmem-44: (Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz), Architecture: x86_64, Microarchitecture: SPR, Family: 6, Model: 143, Stepping: 8, Base Frequency: 2.0GHz, Maximum Frequency: 2000MHz, CPUs: 44, On-line CPU list: 0–43, Cores per socket: 22, Sockets: 1, NUMA Nodes: 1, NUMA CPU list: 0–43, Kernel: 5.15.0-1031-gcp, Microcode: 0xffffffff, OS: Ubuntu 22.04.2 LTS, L1d Cache: 1 MiB, L1d Cache: 705 KiB, L2 Cache: 44 MiB, L3 Cache: 105 MiB, Memory channels: 8, Intel Turbo Boost: Disabled, Max C-state: 9, Installed memory: 352GB (22x16GB RAM), Buffers: 40432 kB, Cached: 778280 kB, Hugepagesize: 2048 kB, Transparent Huge Pages: madvise, Automatic NUMA Balancing: Disabled, Network: Compute Engine Virtual Ethernet [gVNIC], NIC Speed: 32000Mb/s,
OneDNN: v3.1, ITREX: v1.1, PyTorch: v2.0, IPEX: NA, Run Method: python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model_path <path to bf16 engine model> --model <model name>, CPU utilization: ~99% (max QPS use case), test by Intel on
Using software optimization tools with generative AI workloads, such as language models, is important for enhancing performance. These tools streamline code, resulting in faster execution, reduced latency, and more efficient resource utilization. They help applications scale to handle increased workloads without proportional hardware upgrades, while improving stability and reliability. Optimization tailors AI performance to specific needs and adapts to evolving hardware. Where hardware acceleration is available, optimization helps the software make full use of it. Overall, these tools are essential for achieving optimal responsiveness, efficiency, and adaptability in generative AI applications.
Call to action
- You can also use Intel® Developer Cloud, where developers can test their software examples and models from anywhere in the world before moving into production. Access the latest Intel® Xeon® processors, Intel® Data Center GPUs, Intel® FPGAs, and software. Go to Intel® Developer Cloud to learn more and sign up.
- Take a moment to visit the Intel® Extension for Transformers* website and test its capabilities. It’s an excellent opportunity to see how it can optimize your models effectively. Check it out today and share your feedback.
The authors thank Hanwen Chang and Jun Lin for their contributions.
About the authors
Ezequiel Lanza is an open source evangelist on Intel’s Open Ecosystem Team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza.
Imtiaz Sajwani is a Cloud AI/ML Software Architect at Intel with expertise in Generative AI/ML and embedded systems, and design experience in Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and platform software solutions.
Notices & Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.