Harnessing the Power of AI Processing: A Test Drive with the Gaudi2 Processor

Tan Tiong Kai
HTX S&S COE
Feb 22, 2024

The growth of AI development has been exponential over the past 10 years. The capabilities of AI, particularly in the domains of image and speech recognition, language understanding, and reading comprehension, have grown by leaps and bounds, even surpassing human performance in some areas (shown in the graph below).

[Image credit: Our World in Data, Artificial Intelligence by Charlie Giattino, Edouard Mathieu, Veronika Samborska and Max Roser]

These developments have been made possible by the increased complexity and capacity of machine learning and deep learning models. More complex AI models require a huge number of parameters to be trained. Training these models involves performing numerous mathematical operations, such as matrix multiplications and derivative calculations, which are computationally intensive. Although the roots of neural networks reach back to the 1980s, and deep learning frameworks were already being explored in the 2000s, the lack of computational power was one of the key hindrances to AI performance. The evolution of computer processing capabilities, particularly the Graphics Processing Unit (GPU), has since paved the way for AI's exponential growth.

The Evolution of Specialised Processors: From Graphics to AI

The Graphics Processing Unit (GPU), initially developed for graphics and image processing, was the game-changer in the development of AI. Its ability to perform and process multiple computations simultaneously has been a cornerstone in the acceleration of AI tasks. The continuous improvement and development of these specialised processors not only addressed the hurdles of complex computations but also propelled the field of Deep Learning forward.

In the fast-changing landscape of AI, state-of-the-art models boast colossal sizes, often containing billions of parameters. To harness the capabilities of these models, organisations must deploy hardware that can keep up with the computational intensity they demand. Among the plethora of accelerators available, the dominant players are the GPU, Tensor Processing Unit (TPU), and Habana Processing Unit (HPU), developed by Nvidia, Google, and Intel/Habana, respectively.

Our team at the Sensemaking & Surveillance Centre of Expertise (S&S CoE) adopts a hands-on and open approach in exploring new technologies. Whilst Nvidia's GPUs currently dominate the hardware acceleration market, the team wanted to conduct an independent evaluation of the AI processor market and its capabilities. This led to our acquisition of the latest Gaudi2 from Intel/Habana, so that we could conduct our own benchmarking experiments. The Gaudi2 is Habana's second-generation deep learning processor, purported to increase training performance and make it easy for users to scale out training capacity.

Compared to the Gaudi2, which has 8 cores, each with 100GB of HPU RAM, the hardware that we currently use includes:

1. Nvidia RTX Quadro 8000 (50GB GPU RAM)

2. AWS EC2 P3 instance (1x Tesla V100, 16GB GPU RAM)

We had 2 main objectives in this benchmarking experiment — (a) to determine the ease of use of Gaudi2; and (b) to benchmark its performance against what we have today. We will also discuss Gaudi2’s potential in harnessing the power of large models — a recent development that has garnered a lot of hype.

A. Benchmarking: Ease of Use

In this exercise, quantifying ease of use centred on the number of modifications required for project execution. This includes tasks such as setting up the project environment, installing the necessary packages, and executing code. The ideal scenario would be a seamless transition from a CUDA environment to the Gaudi2 without significant code alterations.

The benchmarking repertoire included both in-house code and open-source projects of interest sourced from GitHub. The focus on open-source code aligns with the thinking behind S&S CoE's work, where our projects involve researching and implementing novel ideas and models derived from the broader AI community.

To evaluate Gaudi2’s ease of use, we focused on 2 factors:

1. Availability of documentation across the whole process, from setting up to running our models.

2. The frequency and ease of debugging.

Availability of documentation

We found that Habana does provide comprehensive documentation [1] for using the Gaudi2. Users are first required to install the SynapseAI Software Stack before installing the natively supported frameworks (TensorFlow and PyTorch), which can be found here. Clear documentation facilitated the setup process, from installing the SynapseAI Software Stack to configuring the driver. Following the Bare Metal Fresh OS installation guide proved to be straightforward, allowing the team to establish the Gaudi2 environment with ease. Subsequent PyTorch environments were set up using the provided documentation, and we got our models working easily and quickly.
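For reference, the basic pattern for running PyTorch on the Gaudi2 follows the documentation: import the Habana PyTorch bridge, move the model and tensors to the "hpu" device, and call mark_step() to flush the lazily accumulated graph. The sketch below uses a toy model and random data standing in for a real workload:

```python
import torch
from torch import nn

# Habana's PyTorch bridge registers the "hpu" device with PyTorch.
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model and synthetic data in place of a real workload.
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for _ in range(10):
    x = torch.randn(32, 128).to(device)
    y = torch.randint(0, 10, (32,)).to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    htcore.mark_step()   # flush the accumulated graph (lazy mode)
    optimizer.step()
    htcore.mark_step()
```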

Habana's developer site also provides a list of models that have been tried and tested by the Habana team. The list covers a comprehensive set of models, ranging from Large Language Models (LLMs) to diffusion models.

However, as we experimented with different projects, using open-source code or newer models, we ran into issues that could not be resolved simply by referring to the documentation. These are covered in the next section.

Frequency and Ease of Debugging

1. Issue of porting over from GPU/CUDA to Gaudi2

Facilitating the transition from GPU to HPU is still an ongoing effort by the Habana team, and there are multiple challenges in achieving complete seamlessness. While simple and mature models can be ported directly from the CUDA environment to the Gaudi2, complications arose with code that is heavily coupled to GPU/CUDA libraries. The integration of libraries specifically designed for GPU/CUDA poses a significant hurdle, making it a challenge to ensure a completely seamless transition from GPU to HPU. (Sample screenshot below)

Example of one of the GPU/CUDA Errors. [Image credit: HTX S&S CoE]

Case in point: bitsandbytes. The bitsandbytes library serves as a lightweight wrapper around custom CUDA functions, specifically designed for 8-bit optimizers and matrix multiplication. When executing a project that relied on this library, we encountered significant difficulties running it on the Gaudi2 processor. The inability to seamlessly integrate bitsandbytes with the Gaudi2 underscores the intricate nature of transitioning certain CUDA-centric codebases to this new processing environment. It further emphasizes the need to address compatibility issues if open-source and existing libraries are to be used on the Gaudi2.
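To illustrate the kind of change involved, the sketch below shows one general pattern for keeping code portable: prefer the HPU when the Habana bridge is installed, and treat CUDA-only dependencies such as bitsandbytes' 8-bit optimizers as optional. This is an illustrative pattern, not the exact code from our projects:

```python
import importlib.util
import torch

# Pick the best available accelerator: HPU if the Habana bridge is installed,
# otherwise CUDA, otherwise CPU.
if importlib.util.find_spec("habana_frameworks") is not None:
    import habana_frameworks.torch.core as htcore  # noqa: F401 (registers "hpu")
    device = torch.device("hpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# bitsandbytes ships CUDA kernels only, so its 8-bit optimizer path has to be
# optional rather than assumed.
try:
    import bitsandbytes as bnb
    use_8bit = device.type == "cuda"
except ImportError:
    use_8bit = False

model = torch.nn.Linear(16, 4).to(device)
optimizer = (bnb.optim.Adam8bit(model.parameters(), lr=1e-3) if use_8bit
             else torch.optim.Adam(model.parameters(), lr=1e-3))
```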

2. Version conflicts from libraries

Within Habana's documentation, PyTorch and related libraries are always pinned to fixed versions. Users therefore have limited flexibility in selecting versions, which can lead to conflicts with other libraries. The fixed Python version (3.10) on the Gaudi2 adds further compatibility challenges for projects requiring older Python versions. To address these issues, the team explored an empirical workaround, which involved removing all version requirements during library installation and letting the package manager resolve the versions on its own.
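As a sketch of that workaround, a small (hypothetical) helper can strip exact version pins from a requirements file before installation, leaving pip free to resolve versions that coexist with the pre-installed Habana build:

```python
import re
from pathlib import Path

# Strip version specifiers (==, >=, <=, ~=, !=) from each requirement so that
# pip can resolve versions compatible with the pre-installed Habana packages.
src = Path("requirements.txt")
unpinned = []
for line in src.read_text().splitlines():
    spec = line.strip()
    if not spec or spec.startswith("#"):
        unpinned.append(line)            # keep blanks and comments as-is
        continue
    unpinned.append(re.split(r"[<>=~!]=?", spec, maxsplit=1)[0].strip())

Path("requirements-unpinned.txt").write_text("\n".join(unpinned) + "\n")
```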

Whilst this approach offered a stop-gap measure, its trial-and-error nature introduced uncertainty and repeated experimentation, leaving room for improvement in future iterations of the Gaudi2 environment.

3. Unable to perform complex number operations

Another issue surfaced when attempting complex number operations: the Gaudi2 does not support complex datatypes (screenshot below). These datatypes are integral to calculations involving denoising, such as image and audio enhancement tasks. This limitation highlights the importance of understanding the capacities and constraints of the hardware needed to support different aspects of AI development.

Gaudi2 error message on complex datatype [Image credit: S&S CoE]
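One possible workaround, sketched below under the assumption that the complex-valued step (for example, an FFT in a denoising pipeline) can stay on the CPU, is to convert the result into paired real tensors before moving it to the accelerator:

```python
import torch

x = torch.randn(4, 256)

# Complex-valued step (e.g. an FFT) stays on the CPU, since the HPU rejects
# complex dtypes.
spectrum = torch.fft.rfft(x)              # complex64 output
real_imag = torch.view_as_real(spectrum)  # (..., 2) float32 tensor

# The real-valued representation can then be moved to the accelerator.
# device = torch.device("hpu")            # requires the Habana PyTorch bridge
device = torch.device("cpu")              # fallback so the sketch runs anywhere
real_imag = real_imag.to(device)
```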

B. Benchmarking: Time

For this experiment, the metric for performance benchmarking is the time each model takes to train to completion. The Gaudi2 is expected not only to train faster, but also to yield highly accurate models.

The team opted not to perform any optimisation or parallel processing across the Gaudi2's cores. Running the code as-is was a deliberate strategy to gauge the processor's performance under standard conditions.

While the models mentioned in this article provide a snapshot of the experiment, the team also tested with a more extensive array of models. The reported results capture the conclusive findings from this comprehensive exploration.
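For context, the timings reported below come from plain, unmodified training loops. A minimal sketch of that kind of wall-clock timing harness is shown here, with a toy model and synthetic data standing in for the real workloads (on the HPU, the loop additionally needs the mark_step() calls shown earlier):

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def time_training(model, loader, device, epochs):
    """Wall-clock time of a plain training loop run as-is on the given device."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    start = time.perf_counter()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    return time.perf_counter() - start

# Toy stand-in for the real models and datasets used in the benchmarks.
loader = DataLoader(TensorDataset(torch.randn(512, 3 * 32 * 32),
                                  torch.randint(0, 10, (512,))), batch_size=64)
model = nn.Linear(3 * 32 * 32, 10)
print(f"{time_training(model, loader, torch.device('cpu'), epochs=3):.2f}s")
```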

The tasks and models employed in this evaluation are as follows:

1. Image Classification: EfficientNet and Vision Transformer [2,3]

2. Image Classification and Generation: Diffusion model (UNET with conditional DDPM) [4]

3. Image Enhancement: ZeroDCE++ [5]

Image Classification

For our benchmarking purposes, we experimented with the EfficientNet and Vision Transformer (ViT) models. EfficientNet is a family of neural network architectures designed for efficient scaling in computer vision tasks. ViT, on the other hand, applies the transformer architecture to images, allowing it to capture long-range dependencies and context for vision tasks.

The specific EfficientNet model that we tested is efficientnetv2_s, which has 21.5M parameters. We ran this model on the Quadro 8000 (GPU), V100 (GPU), and Gaudi2 (HPU) for 100 epochs. The results are shown below:

Performance of the hardware accelerators on the EfficientNet model [Image credits: HTX S&S COE]

In this comparison, the V100 ran marginally faster than the HPU, with the Quadro 8000 running the slowest.

Next, we compared the performance of the accelerators with the ViT models. We used 2 ViT models for this phase of the experiment: a smaller model with 22.2M parameters, run for 100 epochs to benchmark our Quadro 8000, and a larger model with 86.9M parameters, run for 20 epochs to benchmark the V100 and HPU.
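The exact checkpoints are not identified here beyond their parameter counts; as a rough reference, those sizes are in line with common ViT-Small and ViT-Base class models, and a candidate model's size can be checked with a quick parameter count (the torchvision variants below are used purely for illustration, not necessarily the ones we trained):

```python
import torch
from torchvision.models import efficientnet_v2_s, vit_b_16

def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Illustrative size check: torchvision's EfficientNetV2-S (~21.5M parameters)
# and ViT-B/16 (~86.6M parameters) are in the same range as the models
# reported in the tables.
print(f"EfficientNetV2-S: {count_params(efficientnet_v2_s(weights=None)) / 1e6:.1f}M")
print(f"ViT-B/16:         {count_params(vit_b_16(weights=None)) / 1e6:.1f}M")
```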

Comparing the HPU with our RTX Quadro 8000 GPU, the HPU trained 4x faster than the RTX Quadro 8000 (see table below). However, this is expected, since the server-grade HPU is likely to outperform the RTX Quadro 8000, which is a GPU for a desktop workstation.

Performance of the hardware accelerators on the ViT (22.2M parameters) model [Image credits: HTX S&S COE]

In comparing the HPU with the V100 GPU, we can truly appreciate the speed and performance of the HPU. With the larger ViT model, the Gaudi2 HPU completed training at more than 2x the speed of the V100 GPU (see table below).

Performance of the hardware accelerators on the ViT (86.9M parameters) model [Image credits: HTX S&S COE]

Image Classification and Generation

For this exercise, we used a Denoising Diffusion Probabilistic Model (DDPM). The DDPM is a generative model used in machine learning to generate realistic samples, particularly for high-dimensional data like images. It employs a diffusion process, transforming a simple distribution through denoising steps to model the target distribution. During training, the parameters of the denoising network are learned by minimising the divergence between generated samples and the true data distribution. DDPMs excel at producing high-fidelity, diverse samples, making them valuable for tasks like image synthesis and data augmentation. They offer an alternative to traditional generative models, capturing complex dependencies in the data through iterative denoising. The DDPM we trained had a total of 22.6M parameters and was trained for 100 epochs. The results are shown below:

Performance of the hardware accelerators on the DDPM model [Image credits: HTX S&S COE]
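For readers unfamiliar with what is being timed here, a minimal sketch of the DDPM training objective described above is as follows: noise a clean image according to a fixed schedule, then train a network to predict that noise with an MSE loss. A tiny stand-in network replaces the 22.6M-parameter conditional U-Net, and timestep conditioning is omitted for brevity:

```python
import torch
from torch import nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # fixed noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Tiny stand-in for the conditional U-Net (timestep conditioning omitted).
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

x0 = torch.randn(8, 1, 28, 28)                 # batch of "clean" images
t = torch.randint(0, T, (x0.size(0),))         # random timestep per sample
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process

loss = nn.functional.mse_loss(model(x_t), noise)       # predict the injected noise
loss.backward()
optimizer.step()
```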

Using this model, we can see that the HPU ran about 20% slower than the V100, but it was still almost 5x faster than our RTX Quadro 8000 GPU.

Image Enhancement

The Zero-Reference Deep Curve Estimation (Zero-DCE) model is an image enhancement model that improves image quality without relying on reference images. Unlike traditional methods, Zero-DCE formulates enhancement as the estimation of image-specific light-enhancement curves, which are applied pixel-wise and iteratively to adjust the dynamic range of the input image. Using a lightweight convolutional neural network (CNN), Zero-DCE learns to estimate these curves during training without the need for explicit reference images. It has been applied to tasks like low-light image enhancement, addressing visibility challenges under difficult lighting conditions.
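The core of that enhancement step is a simple quadratic curve, LE(x) = x + alpha * x * (1 - x), applied iteratively with per-pixel curve parameters predicted by the CNN. The sketch below illustrates this step only; the ~10.5K-parameter ZeroDCE++ curve-estimation network itself is not reproduced, and ZeroDCE++ shares one curve-parameter map across iterations (the original Zero-DCE predicts a separate map per iteration):

```python
import torch

def enhance(image: torch.Tensor, alpha: torch.Tensor, iterations: int = 8) -> torch.Tensor:
    """Apply the light-enhancement curve LE(x) = x + alpha * x * (1 - x) iteratively."""
    x = image
    for _ in range(iterations):
        x = x + alpha * x * (1.0 - x)
    return x

low_light = torch.rand(1, 3, 256, 256)        # toy low-light input in [0, 1]
alpha = torch.rand(1, 3, 256, 256) - 0.5      # stand-in for the CNN's predicted curve map
print(enhance(low_light, alpha).shape)        # torch.Size([1, 3, 256, 256])
```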

This model differs from the previous 4, as it was developed using TensorFlow, whereas the previous models used PyTorch. With only 10,569 parameters, the ZeroDCE++ model was run for 500 epochs. The results are shown below:

Performance of the hardware accelerators on the ZeroDCE++ model [Image credits: HTX S&S COE]

Looking at the results, we can once again appreciate the performance of the Gaudi2 HPU. The HPU completed training the fastest, finishing 2x faster than the V100.

The table & graph below show a summary of the results based on the training runs that we conducted with the 3 hardware accelerators on the various tasks.

Performance of the Gaudi2 when training with the models we selected [Image credit: HTX S&S CoE]
Summary of the time taken for training [Image credits: HTX S&S COE]

The results of our tests were satisfactory. Without any optimization, the Gaudi2 was able to almost match the V100, and with some models, such as our image enhancement model and the larger vision transformer, it even outperformed the V100, showing the potential of the Gaudi2's capabilities.

We have also shared our findings with the Intel/Habana team, including the issues we faced and the performance of the Gaudi2. They are actively looking into our feedback and making improvements, so we expect these issues to be resolved quickly.

Training with Large Models

In recent years, the landscape of AI has been reshaped by the emergence of groundbreaking models, particularly in the realms of natural language processing and multimodal tasks. Two categories that stand out are Large Language Models (LLMs) and Large Multimodal Models (LMMs) [6].

In the ever-evolving field of natural language processing, Large Language Models have become the torchbearers of innovation. At the forefront of this revolution is GPT-3, or Generative Pre-trained Transformer 3, a monumental creation by OpenAI. Boasting a staggering 175 billion parameters, GPT-3 has redefined what’s achievable in language understanding and generation.

These LLMs are built on transformer architectures, which enable them to capture intricate patterns and nuances in language. Pre-trained on humongous amounts of text data, they demonstrate unparalleled prowess in a myriad of tasks. From language generation and translation to summarization and question-answering, these models exhibit a depth of linguistic understanding that was once considered elusive.

Today, Large Multimodal Models have taken centre stage, seamlessly integrating multiple data modalities, including text, images, and potentially audio. Imagine a model that not only processes textual descriptions but also interprets images with finesse. This is the essence of multimodal marvels — versatile giants that can comprehend and generate content across diverse domains.

These models shine in tasks like image captioning and visual question answering. By simultaneously handling different types of data, they pave the way for more intricate applications, unravelling the potential of AI to understand and generate contextually relevant outputs in a multimodal landscape.

We have been keeping up with the development and capabilities of these large models, with plans to integrate them into our arsenal of computer vision tools. However, these large models require high computational resources for finetuning or training. Fortunately, the Gaudi2 now has the capability to finetune LLMs/LMMs! The comprehensive list of models available for training can be found here.

Training with Visual Text Dual Encoders

A visual-text dual encoder is a type of neural network architecture designed to process both images and text simultaneously. It involves two encoders, one for images and one for text, which encode the respective inputs into a shared space where they can be compared.

OpenAI’s CLIP (Contrastive Language-Image Pre-training) is an example of a visual-text dual encoder model. CLIP is trained to understand images and text in a unified manner, allowing it to perform tasks like image classification or generating textual descriptions for images. By learning a joint representation space for images and text, CLIP enables versatile cross-modal tasks, where the model can relate concepts in images to those in text and vice versa. This makes CLIP a powerful tool for various applications, including natural language understanding and computer vision tasks.
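Schematically, a dual encoder of this kind projects images and text into a shared embedding space and trains with a symmetric contrastive loss that pulls matching image-text pairs together. The sketch below uses tiny stand-in encoders in place of the real vision and text backbones:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Tiny stand-in encoders in place of real vision/text backbones (e.g. a ViT
# and a BERT-style model), each projecting into a shared 256-d embedding space.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
text_encoder = nn.Sequential(nn.Flatten(), nn.Linear(77 * 64, 256))  # assumes pre-embedded tokens
logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, as in CLIP

images = torch.randn(8, 3, 224, 224)
texts = torch.randn(8, 77, 64)

img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(texts), dim=-1)

logits = logit_scale.exp() * img_emb @ txt_emb.t()   # pairwise image-text similarities
labels = torch.arange(len(images))                   # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```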

Using the Gaudi2 HPU, we successfully finetuned a Visual Text Dual Encoder on the COCO 2017 dataset, using this guide. Instead of installing our PyTorch framework manually, we built an environment using the Docker images that Habana provides. By running our code in Docker containers, we encountered fewer errors and were able to set up our project environment smoothly.

While we have not performed in-depth analysis and benchmarking on HPU’s ability to finetune large models, this progress shows great promise for Gaudi2. In a market dominated by a few major players, it is refreshing to have other alternatives that we can leverage to meet our computational needs. This will allow the hardware accelerator market to improve and remain competitive.

Conclusion

The prospects and capabilities of the Gaudi2 present an exciting trajectory for AI processing. It is evident that the Habana team has been working hard to provide users with an alternative hardware accelerator that can compete toe to toe with the others. In terms of documentation, Habana has come through with accessible information and instructions for setting up and running models. In the area of debugging, there remain problems with the Gaudi2, such as library version conflicts and runtime errors.

Whilst the process of running established and mature models is relatively straightforward, issues arise if models are new or if project files are heavily coupled to Nvidia's GPUs and CUDA.

Nevertheless, a pivotal capability has emerged: the Gaudi2's proficiency in fine-tuning, running, and potentially training large models with billions of parameters. As the field of AI shifts toward larger and more complex models, accelerators like the Gaudi2 will be imperative for performing calculations at a scale commensurate with these models' magnitude.

This crucial capability remains an area that is actively monitored and experimented with by the S&S CoE team. We will continue with experiments that focus on large model training, inference, and fine-tuning, so as to gain a comprehensive understanding of the Gaudi2’s capabilities in handling such demanding workloads.

As more organizations adopt the Gaudi2 and a larger user community emerges, the Gaudi2 will continue to undergo refinement and evolution, and we are hopeful that it will eventually mature into a more comprehensive product.

In conclusion, the Gaudi2 represents a significant step towards the future of AI processing. While challenges persist, they are integral to the iterative process of refinement and improvement. The commitment to experimentation, optimization, and adaptation remains paramount in harnessing the true potential of cutting-edge AI processors like the Gaudi2.

Finally, our team at S&S CoE, would like to give our heartfelt appreciation to the Intel/Habana team for allowing us to experiment with the Gaudi2 and providing support and assistance throughout this experimentation process. The Intel/Habana team actively sought our feedback and comments, and offered us technical advice and support whenever necessary.

What are we doing in HTX S&S COE?

If you've been following our articles, you will notice that we thrive on exploring different possibilities and approaches to tackling a problem. Our work involves rigorous research and experimentation to validate our hypotheses and develop viable solutions to real-world problems. Our expertise lies in the integration and processing of data from different sensory devices, including visual, acoustic, LiDAR, Wi-Fi, and sonar sensors. Our primary focus is on applying advanced computer vision techniques and machine learning algorithms to extract meaningful insights and valuable information from a diverse range of data sources.

If you want to stay updated on our projects in different AI and sensor engineering fields, consider subscribing to our Medium channel. Likewise, feel free to reach out to me at TAN_Tiong_Kai@htx.gov.sg if you want to discuss ideas related to benchmarking experiments.

References

  1. Habana Labs, "Intel Gaudi V1.14 Documentation," 2024. [Online]. Available: https://docs.habana.ai/en/latest/
  2. M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv preprint arXiv:1905.11946, 2019.
  3. "Vision Transformer (ViT)," [Online]. Available: https://huggingface.co/docs/transformers/model_doc/vit
  4. J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models," arXiv preprint arXiv:2006.11239, 2020.
  5. "Zero-Reference Deep Curve Estimation (Zero-DCE) for Low-Light Image Enhancement," [Online]. Available: https://li-chongyi.github.io/Proj_Zero-DCE.html
  6. C. Huyen, "Multimodality and Large Multimodal Models (LMMs)," 2023. [Online]. Available: https://huyenchip.com/2023/10/10/multimodal.html#clip
