Demystifying XLA: Unlocking the Power of Accelerated Linear Algebra.

muhammed ashraf
7 min read · Dec 21, 2023


Understanding XLA: Making Accelerated Linear Algebra Simple.

Photo by Joël de Vriend on Unsplash

The field of Machine Learning (ML) has witnessed remarkable advancements in recent years, revolutionizing industries and transforming the way we interact with technology. As ML models become increasingly complex and datasets grow exponentially, there arises an inherent need for speed and efficiency. The success of machine learning applications hinges on the ability to process vast amounts of data swiftly and derive meaningful insights in real-time.

That’s where XLA (Accelerated Linear Algebra) comes in. XLA optimizes the mathematical operations used in ML, enabling faster execution and improved performance for ML models. In this article, we demystify XLA, exploring how it works and how it improves the speed and performance of ML models.

What Is XLA?

XLA (Accelerated Linear Algebra) is a domain-specific compiler developed by Google. It focuses on optimizing and accelerating numerical computations, particularly linear algebra operations, on various hardware architectures.

Put simply, it is an open-source machine learning (ML) compiler for GPUs, CPUs, and ML accelerators.

A compiler, in general, is just a computer program that translates code written in one programming language (the source language) into another language (the target language). The target language is usually lower-level than the source language, as shown below.

Figure 1.1: A Demonstration of High-level to Low-level Transformation by Compilers.

XLA does not compile traditional programming languages like C++ or Java, however. Instead, it operates on a specialized language called HLO (High-Level Optimizer) that is designed for expressing and optimizing the mathematical operations in machine learning. XLA takes HLO code as input, performs optimizations, and generates optimized low-level machine code for execution on different hardware architectures.

HLO: The Language Behind XLA’s Magic!

HLO (High-Level Optimizer) is a specialized language used by the XLA (Accelerated Linear Algebra) compiler. It is designed specifically for expressing and optimizing high-level mathematical operations, particularly those involved in machine learning models.

Figure 2.1: HLO code example

Figure 2.1 shows an example of HLO (High-Level Optimizer) code. It illustrates a straightforward mathematical computation: adding two numbers, squaring the result, and multiplying it by a constant value.
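If you want to see HLO for yourself, here is a minimal sketch of the same computation using TensorFlow's experimental compiler-IR inspection API (the exact textual output varies across TensorFlow/XLA versions):

```python
import tensorflow as tf

# The computation from Figure 2.1: add two numbers,
# square the result, and scale it by a constant.
@tf.function(jit_compile=True)
def f(x, y):
    s = x + y
    return (s * s) * 3.0

x = tf.constant(1.0)
y = tf.constant(2.0)

# Print the HLO that XLA receives for this function.
# Note: experimental_get_compiler_ir is, as the name says, experimental.
print(f.experimental_get_compiler_ir(x, y)(stage="hlo"))
```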

Okay, but where does this HLO come from?

HLO is derived from the computational graph representation commonly used in frameworks like TensorFlow.

So HLO describes the execution graph, but what exactly is this graph and where does it come from?

  • In TensorFlow, for example, computational graphs are used to represent the flow of operations in a machine learning model. Each node in the graph represents an operation, and the edges represent the data flow between these operations; the sketch below Figure 2.2 shows how to inspect such a graph.
Figure 2.2: Mapping TensorFlow Code to Execution Graph
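As a quick illustration (a minimal sketch; the shapes and ops here are arbitrary), tf.function traces Python code into exactly such a graph, and you can walk its nodes and edges:

```python
import tensorflow as tf

@tf.function
def model(x, w, b):
    # A tiny "model": one matmul, a bias add, and a ReLU.
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.zeros([1, 4])
w = tf.zeros([4, 2])
b = tf.zeros([2])

# Tracing produces a graph; each node is an operation and each
# input tensor is an incoming edge from another node.
graph = model.get_concrete_function(x, w, b).graph
for op in graph.get_operations():
    print(op.type, "<-", [inp.name for inp in op.inputs])
```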

The graph is a visual representation of the code. We start with the written code, convert it into a graph, and then use a language called HLO to represent that graph, as shown in Figure 2.3.

Figure 2.3: The path from machine learning code to XLA's input

Now that we have XLA's input, let's explore the role and significance of XLA in this context. XLA plays a crucial role in optimizing and accelerating the execution of numerical computations.

Optimizing Performance: XLA’s Function

After taking in the HLO code, XLA performs optimizations on it. These optimizations can be categorized into two types: target-independent and target-dependent optimizations.

Target-independent optimizations refer to the optimizations that can be applied regardless of the specific hardware target. These optimizations focus on improving the overall efficiency and performance of the HLO code. One common target-independent optimization is operation fusion.

Operation Fusion

Operation fusion is like combining several small tasks into a single big task to make things faster and more efficient. Imagine you have a list of things to do, such as adding numbers, multiplying them, and then subtracting the result from another number. Instead of performing each of these tasks separately, operation fusion lets you do them all together as one big task. By combining them, you save time and reduce the need to store intermediate results. It's like doing multiple things at once, making your calculations faster and more efficient.

Figure 3.1: Stack of Math Operations

Let’s imagine we have a stack of math operations. In Figure 3.1, we see three operations: adding x to b, multiplying the result by a, and then multiplying the final result by x. Each operation takes the output of the previous one and performs a different mathematical calculation.

Operation fusion allows us to simplify this stack of operations. Instead of doing each step individually, we can combine them into a single operation as shown in Figure 3.2. It’s like compressing the steps to make things faster.

Figure 3.2: The Fused Operation — Simplified Calculation of h

So, if we fuse these operations together, we can calculate the final result in one go. It’s like doing all the math in our heads without writing down intermediate results. This fusion makes our calculations more efficient and saves time.
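You can observe this fusion directly. Here is a minimal sketch, assuming TensorFlow with XLA support: it compares the HLO handed to XLA with the HLO after XLA's optimization passes for the computation from Figure 3.1, h = ((x + b) * a) * x. In the optimized dump, the separate add and multiply instructions typically appear collapsed into a single fusion instruction, though the exact output depends on your version:

```python
import tensorflow as tf

# The stack of operations from Figure 3.1: h = ((x + b) * a) * x.
@tf.function(jit_compile=True)
def h(x, a, b):
    return ((x + b) * a) * x

x = tf.constant([1.0, 2.0, 3.0])
a = tf.constant(2.0)
b = tf.constant(0.5)

ir = h.experimental_get_compiler_ir(x, a, b)

# HLO as handed to XLA: separate add/multiply instructions.
print(ir(stage="hlo"))
# HLO after XLA's passes: the elementwise ops are fused together.
print(ir(stage="optimized_hlo"))
```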

Figure 3.3: Conv-BN Operation Fusion — Combining Convolution and Batch Normalization

Another example of operation fusion in deep learning is the fusion of the convolutional operation (conv) with the batch normalization operation (batchnorm). This combines both operations into a single optimized operation often called "conv-batchnorm fusion," as shown in Figure 3.3.

Figure 3.4: Conv-BN Operation Fusion — Combining Convolution and Normalization Equations

Conv-BN operation fusion combines the equations representing the convolution operation and the normalization operation into a single equation. This simplifies the computation by eliminating the need to perform these operations separately, as shown in Figure 3.4.
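For the curious, here is a minimal NumPy sketch of that algebra, the standard inference-time conv-BN fold. The function and variable names are my own, and eps is the usual numerical-stability constant; the point is that the batch-norm scale and shift get absorbed into the convolution's weights and bias, leaving one operation at runtime:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batchnorm(conv(x)) into a single convolution.

    batchnorm(y) = gamma * (y - mean) / sqrt(var + eps) + beta,
    so scaling the conv weights and shifting its bias gives the
    same result with one operation instead of two.

    w: conv weights, shape (out_channels, ...); b: bias, shape (out_channels,)
    gamma, beta, mean, var: per-channel batch-norm parameters.
    """
    scale = gamma / np.sqrt(var + eps)                      # per-channel scale
    w_fused = w * scale.reshape(-1, *([1] * (w.ndim - 1)))  # scale each filter
    b_fused = (b - mean) * scale + beta                     # fold into the bias
    return w_fused, b_fused
```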

Target-dependent optimizations

Following the target-independent phase, XLA sends the HLO computation to the backend. The backend performs further analysis and optimization at the HLO level, taking specific target information and requirements into account. For example, the XLA GPU backend excels at operator fusion customized for the GPU programming model, effectively dividing computations into streams.
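There is no public API for steering these backend passes directly, but you can inspect their result. A hedged sketch, assuming TensorFlow with XLA and (ideally) a visible GPU; the dumped HLO reflects whichever backend the computation is placed on:

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def f(x):
    return tf.nn.relu(x * 2.0 + 1.0)

x = tf.random.normal([1024])

# Place the computation on the GPU if one is available, so the
# XLA GPU backend runs its target-dependent passes.
device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
with tf.device(device):
    print(f.experimental_get_compiler_ir(x)(stage="optimized_hlo"))
```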

Photo by Ihor Malytskyi on Unsplash

All Roads Lead to LLVM

LLVM is a compiler infrastructure that efficiently converts high-level programming languages into machine code. It does this by using a versatile intermediate representation (IR), which acts as a platform-independent version of the code. The IR allows optimizations and analyses to be performed on the code before the final executable or library is produced.

Figure 4.1: Following the LLVM Path, Paving the Way to Efficient and Optimized Code

In the upcoming series of XLA articles, we will delve into LLVM in more detail. However, for now, it’s important to view LLVM as both an optimizer and a machine code generator.

Figure 4.2: From XLA HLO to LLVM IR to Machine Code, Conversion and Optimization

Once XLA has applied the necessary optimizations to the HLO representation, the HLO is converted into LLVM Intermediate Representation (IR). The LLVM IR generated from the XLA HLO serves as an intermediate form that LLVM understands, enabling LLVM to perform further optimizations on the code. Eventually, LLVM uses the optimized IR to generate machine code specific to the target hardware, as shown above.
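You can watch this handoff on disk. XLA accepts debugging flags through the XLA_FLAGS environment variable; a minimal sketch, assuming TensorFlow with XLA (the file names in the dump directory vary by version, but the dump typically contains the HLO modules as text and, for the CPU/GPU backends, LLVM IR files before and after LLVM's own passes):

```python
import os

# Must be set before TensorFlow initializes XLA.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import tensorflow as tf

@tf.function(jit_compile=True)
def f(x):
    return x * x + 1.0

f(tf.constant(3.0))

# The directory now holds the compiler's intermediate artifacts,
# e.g. HLO module dumps and (backend-dependent) *.ll LLVM IR files.
print(sorted(os.listdir("/tmp/xla_dump")))
```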

Putting It All Together

The process of converting code into machine code is a captivating journey that encompasses various optimizations and transformations. Now, let’s bring all the steps together and explore the broader perspective of XLA’s architecture, as depicted in Figure 5.1.

Figure 5.1: Architecture of the accelerated linear algebra (XLA) compiler.
  • HLO Generation: The framework's computational graph is first lowered to HLO, the representation XLA takes as input.
  • Target-Independent Optimization: Once in HLO form, target-independent optimizations are applied to enhance overall efficiency. Techniques like operation fusion and loop optimizations streamline computations and eliminate redundancies.
  • Target-Dependent Optimization: Next, target-dependent optimizations come into play, tailored to the specific hardware target. These leverage knowledge about the target architecture, optimizing the code to exploit hardware features, memory layouts, and parallelism.
  • LLVM Transformation: After both optimization phases, the HLO code is lowered into the LLVM framework. LLVM acts as a powerful intermediary, providing tools and optimization passes that further improve the code's efficiency before machine code is emitted.
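Finally, a minimal end-to-end sketch, assuming TensorFlow with XLA support: the same function runs once as a plain traced graph and once with jit_compile=True, which triggers the whole pipeline above (graph capture, HLO generation, both optimization phases, and LLVM code generation). The measured speedup is purely illustrative; it depends heavily on the ops, shapes, and hardware:

```python
import time
import tensorflow as tf

def step(x, w):
    # A few fusable matmul and elementwise operations.
    y = tf.matmul(x, w)
    return tf.nn.relu(y * 0.5 + 1.0)

x = tf.random.normal([512, 512])
w = tf.random.normal([512, 512])

graph_fn = tf.function(step)                   # traced graph only
xla_fn = tf.function(step, jit_compile=True)   # graph + full XLA pipeline

# Warm up so tracing/compilation time is excluded from the timing.
graph_fn(x, w); xla_fn(x, w)

for name, fn in [("graph", graph_fn), ("xla", xla_fn)]:
    start = time.perf_counter()
    for _ in range(100):
        result = fn(x, w)
    _ = result.numpy()  # block until execution finishes
    print(name, round(time.perf_counter() - start, 4), "s")
```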

Conclusion

XLA’s architecture encompasses a series of steps, from applying independent and dependent optimizations, utilizing the LLVM framework, to generating efficient machine code. This big picture view highlights the role of XLA in the code-to-machine transformation, enabling developers to unlock the full potential of their code and achieve optimal performance across various hardware targets.
