Java and LLMs: are we there yet?
Performance analysis of LLM inference on the Java platform with Panama, Vector API, and TornadoVM.
Introduction
Large language models (LLMs) have revolutionised natural language processing (NLP) by enabling machines to generate human-quality text, translate languages, and answer questions informatively. However, deploying LLMs in real-world applications can be challenging due to their computational demands.
Computations in LLMs are dominated by vector and matrix multiplications which, depending on their dimensionality, can be bound by the memory bandwidth on most hardware.
This often leads to memory bottlenecks, which pose a significant obstacle across a spectrum of hardware configurations. Recently, a number of frameworks have emerged that are dedicated to optimizing LLM inference, with a specific focus on harnessing Graphics Processing Units (GPUs). Notable among them are NVIDIA TensorRT-LLM, vLLM, OpenLLM, and Ray Serve, all of which are designed to streamline LLM computations on GPU architectures.
In this article, we assess the performance landscape of LLM inference on the Java platform to establish a baseline for future performance-oriented developments on the Java Virtual Machine.
Given that many widely used approaches rely on native implementations (e.g., llama2.cpp, llama2.c, llama2 everywhere), this article places particular emphasis on features like the Vector API and Project Panama. These features not only enhance parallelism but also introduce capabilities for off-heap data types. Finally, we demonstrate how TornadoVM can be combined with these technologies to enable GPU acceleration directly from the Java application.
LLAMA inference in Java
A growing number of LLAMA inference implementations are written in Java (e.g., llama2.java, jlama, java-llama.cpp, llama2j).
In this article, we focus on llama2.java because it uses Java Streams, Project Panama, and the Vector API to push performance, from high-level code, close to that of native implementations.
In general, Llama2-like transformer-based LLM architectures are memory-bound, mainly because they perform numerous matrix-vector computations on small data sizes, typically in the hundreds. Until recently, the primary optimization options for this workload were either to use the Parallel Stream API or to rely on the auto-vectorization offered by the C2 or Graal compilers.
With the introduction of the Vector API, Java developers can now write highly efficient vectorized code, which is an ideal fit for LLM applications. As showcased in the table below, across models ranging from 15M to 110M parameters, the Java implementation that combines Parallel Streams, Panama, and the Vector API can reach up to 90% of the performance of the multithreaded llama2.c version.
In addition, we observe that as the model becomes larger the performance gap between the C and Java versions decreases since the problem becomes memory-bound for large data sizes.
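To make the pre-Vector-API baseline concrete, here is a minimal sketch of a matrix-vector product parallelized over rows with the Parallel Stream API. It is an illustration rather than the llama2.java source: the class name `StreamMatmul` and the method signature are assumptions, and the weights are assumed to be stored row-major in a flat `float[]`.

```java
import java.util.stream.IntStream;

// Hypothetical helper for illustration; not part of llama2.java.
final class StreamMatmul {

    /** Computes xout = W * x, where W is a d x n matrix stored row-major in w. */
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        // Each output row is independent, so rows can be computed in parallel.
        IntStream.range(0, d).parallel().forEach(i -> {
            float val = 0f;
            int rowStart = i * n;
            // Scalar inner loop; any vectorization here is left to the JIT compiler.
            for (int j = 0; j < n; j++) {
                val += w[rowStart + j] * x[j];
            }
            xout[i] = val;
        });
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2 x 3 matrix, row-major
        float[] x = {1, 0, -1};
        float[] xout = new float[2];
        matmul(xout, x, w, 3, 2);
        System.out.println(xout[0] + " " + xout[1]); // prints "-2.0 -2.0"
    }
}
```

Parallelizing over rows keeps every write to `xout` independent, so no synchronization is needed between worker threads.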
Understanding further the performance boundaries
As we see in the llama2.java version, the forward phase of the LLM algorithm contains seven matmul operations (matrix-vector multiplications) which all have been vectorized with the Vector API.
To see whether further performance improvements are possible, we profiled the llama2.java implementation with the IntelliJ profiler and examined how the execution time is distributed across the application.
After profiling (see the figure beneath), we noticed that the last of the seven matmul operations accounts for 19% of the execution time of the whole forward method. The following code snippet showcases the profiled matmul method using the Vector API and its "hot" areas, which make it a strong candidate for further optimization.
Boosting performance further with TornadoVM
In general, all the matmul operations described above are inherently parallel which makes them ideal candidates for any sort of parallelization such as vectorization or GPU acceleration.
When it comes to GPU acceleration, however, special attention has to be given to input data sizes since data must be copied from the host memory to the GPU’s memory for processing. This data copy adds an overhead to the end-to-end execution time which must be factored into our decision-making regarding “which function should we offload to the GPU”.
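The trade-off above can be written down as a simple break-even check: offloading pays off only if the GPU compute time plus the copy time is lower than the CPU time. The sketch below is a back-of-the-envelope model, not a measurement; the class `OffloadModel`, its method names, and all numbers in `main` are hypothetical placeholders.

```java
// Hypothetical cost model for illustration only; all inputs are assumptions.
final class OffloadModel {

    /** Estimated seconds needed to copy the given number of bytes at the given bandwidth. */
    static double copySeconds(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond;
    }

    /** Returns true if GPU compute time plus copy time is lower than the CPU time. */
    static boolean worthOffloading(double cpuSeconds, double gpuSeconds,
                                   long bytesPerCall, double bytesPerSecond) {
        return gpuSeconds + copySeconds(bytesPerCall, bytesPerSecond) < cpuSeconds;
    }

    public static void main(String[] args) {
        double cpuSeconds = 1.0e-3;  // placeholder: CPU time per call
        double gpuSeconds = 2.0e-4;  // placeholder: GPU kernel time per call
        double bandwidth = 1.0e10;   // placeholder: 10 GB/s host-to-device link

        // Copying 1 MB per call: the copy is cheap, so offloading wins.
        System.out.println(worthOffloading(cpuSeconds, gpuSeconds, 1_000_000L, bandwidth));   // prints "true"
        // Copying 100 MB per call: the copy dominates, so offloading loses.
        System.out.println(worthOffloading(cpuSeconds, gpuSeconds, 100_000_000L, bandwidth)); // prints "false"
    }
}
```

In practice the three inputs would come from profiling on the target machine; the model only captures why the amount of data moved per call matters as much as the kernel speed.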
The profiling results directed us to the last of the seven matmul operations, which we chose to offload to the GPU by using TornadoVM.
TornadoVM addendum
TornadoVM is a plug-in to OpenJDK and GraalVM that allows programmers to run Java programs on heterogeneous hardware, such as GPUs and multi-core CPUs. TornadoVM v1.0 recently introduced significant improvements, including a new API for off-heap object and array allocation using the Panama Memory Segment API, which enables more efficient memory management.
Accelerating llama2.java with TornadoVM
Previously, with just a few clicks directly from the IDE, we identified the last matmul (matrix-vector) operation as the main bottleneck of the application. To see whether it can benefit from GPU acceleration, we offloaded it to the GPU with TornadoVM. The full implementation is available in llama2.tornadovm.java. The following steps capture the key modifications required in the original application:
1. Use the VectorFloat8 API of TornadoVM: we store the read-only weights in the VectorFloat8 data structure.
// Convert FloatBuffer to a primitive float array
this.wclsAsPrimitive = new float[wcls.remaining()];
wcls.get(wclsAsPrimitive);
// Convert the read-only weights used in the last mat-mul to TornadoVM datatypes
// that use MemorySegments
this.weightInFloatArray = FloatArray.fromArray(wclsAsPrimitive);
this.weightInVectorFloat8 = createVectorFloat8Array(weightInFloatArray);
2. Write the matrix-vector operation with TornadoVM. Note that the following is functionally equivalent to the code snippet previously shown for the Vector API:
static void matrixVectorFloat8(float[] xout, VectorFloat8 x, VectorFloat8 w, int n, int d) {
    for (@Parallel int i = 0; i < d; i++) {
        float val = 0f;
        int vectorLaneWidth = x.vectorWidth();
        for (int j = 0; j < n; j += vectorLaneWidth) {
            Float8 xv8 = x.get(j / vectorLaneWidth);
            Float8 wv8 = w.get(i * (n / vectorLaneWidth) + j / vectorLaneWidth);
            val += Float8.dot(wv8, xv8);
        }
        xout[i] = val;
    }
}
3. Initialize a TaskGraph in the pre-processing stage of the forward method:
taskGraph = new TaskGraph("s0")
    .transferToDevice(DataTransferMode.EVERY_EXECUTION, s.xVectorFloat8)
    .transferToDevice(DataTransferMode.FIRST_EXECUTION, w.weightInVectorFloat8)
    .task("t0", MatrixVectorCollection::matrixVectorFloat8, s.logits, s.xVectorFloat8, w.weightInVectorFloat8, dim, transformer.config.vocab_size)
    .transferToHost(DataTransferMode.EVERY_EXECUTION, s.logits);
Note that w.weightInVectorFloat8 stores the weights required for computing the final matrix-vector operation, which accounts for roughly 20% of the execution time of the original application. By setting it to DataTransferMode.FIRST_EXECUTION, we transfer the read-only data to the GPU during the first execution and then reuse it. In this way, we avoid recurrent I/O on read-only data. The input vector s.xVectorFloat8 changes on every invocation, so it is transferred with DataTransferMode.EVERY_EXECUTION.
In the provided output, you can confirm the execution of the llama2-tornadovm inference. To verify that it is using the GPU, enable the "--threadInfo" flag and check the reported information.
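To get a feel for how much host-to-device traffic the FIRST_EXECUTION mode saves, the following sketch counts the bytes copied over a number of task-graph executions. It is plain arithmetic and does not call TornadoVM; the class `TransferVolume` is hypothetical, and the sizes in `main` (dim = 288, vocabulary of 32000, 256 executions) are assumed values chosen for illustration, not figures from the article.

```java
// Hypothetical helper for illustration; it only counts bytes and does not use TornadoVM.
final class TransferVolume {

    /**
     * Total bytes copied to the device over the given number of executions.
     * If weightsOnce is true, the weights are copied only on the first execution;
     * the input vector is copied on every execution in both cases.
     */
    static long bytesToDevice(long weightBytes, long inputBytes, long executions, boolean weightsOnce) {
        long weightCopies = weightsOnce ? Math.min(1L, executions) : executions;
        return weightCopies * weightBytes + executions * inputBytes;
    }

    public static void main(String[] args) {
        long dim = 288;         // assumed model dimension
        long vocabSize = 32000; // assumed vocabulary size
        long executions = 256;  // assumed number of forward passes

        long weightBytes = dim * vocabSize * Float.BYTES; // final matmul weights
        long inputBytes = dim * Float.BYTES;              // input vector x

        System.out.println("weights copied once:       " + bytesToDevice(weightBytes, inputBytes, executions, true));
        System.out.println("weights copied every time: " + bytesToDevice(weightBytes, inputBytes, executions, false));
    }
}
```

Because the weight matrix is far larger than the input vector, copying it once makes the per-execution traffic close to the size of the input and output vectors alone.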
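The indexing in the matrixVectorFloat8 kernel from step 2 can be checked on the CPU without TornadoVM. The sketch below emulates the chunk-of-eight access pattern on plain `float[]` arrays and compares it with a straightforward scalar loop. The class `Float8Layout` and its methods are hypothetical; it assumes that the lane width is 8, that the weights are stored row-major, and that n is a multiple of 8, which the integer divisions in the kernel also rely on.

```java
import java.util.Arrays;

// Hypothetical CPU-only emulation of the chunked indexing; it does not use TornadoVM types.
final class Float8Layout {
    static final int LANES = 8; // assumed lane width of a Float8 element

    /** Scalar reference: xout = W * x with W stored row-major (d rows, n columns). */
    static void matmulScalar(float[] xout, float[] x, float[] w, int n, int d) {
        for (int i = 0; i < d; i++) {
            float val = 0f;
            for (int j = 0; j < n; j++) {
                val += w[i * n + j] * x[j];
            }
            xout[i] = val;
        }
    }

    /** Same product, walking x and w in chunks of eight floats as the kernel does. */
    static void matmulChunked(float[] xout, float[] x, float[] w, int n, int d) {
        if (n % LANES != 0) {
            throw new IllegalArgumentException("n must be a multiple of " + LANES);
        }
        int chunksPerRow = n / LANES;
        for (int i = 0; i < d; i++) {
            float val = 0f;
            for (int j = 0; j < n; j += LANES) {
                int xChunk = j / LANES;                    // chunk index into x
                int wChunk = i * chunksPerRow + j / LANES; // chunk index into w
                float dot = 0f;                            // dot product of one pair of chunks
                for (int k = 0; k < LANES; k++) {
                    dot += w[wChunk * LANES + k] * x[xChunk * LANES + k];
                }
                val += dot;
            }
            xout[i] = val;
        }
    }

    public static void main(String[] args) {
        int n = 16, d = 2;
        float[] x = new float[n];
        float[] w = new float[n * d];
        // Small integer values keep the float sums exact regardless of summation order.
        for (int j = 0; j < n; j++) x[j] = j % 3;
        for (int k = 0; k < n * d; k++) w[k] = k % 5;

        float[] expected = new float[d];
        float[] actual = new float[d];
        matmulScalar(expected, x, w, n, d);
        matmulChunked(actual, x, w, n, d);
        System.out.println(Arrays.equals(expected, actual)); // prints "true"
    }
}
```

With general floating-point inputs the two versions may differ in the last bits, because the chunked version sums in a different order; a tolerance-based comparison is the safer check in that case.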
Performance
Models: For the final performance evaluation, we used the TinyLlama project. The project aims to pre-train a 1.1-billion-parameter Llama model on three trillion tokens. The creators of the model stated that they are utilizing the “precisely identical architecture and tokenizer” employed by Meta in training Llama 2. This enables seamless integration into open-source projects built on Llama through a plug-and-play approach.
System: The experimental platform for the Java application comprises a 13th Gen Intel Core i7-13700 CPU with 24 threads and an NVIDIA GeForce RTX 3070 GPU for accelerated computations, and runs Pop!_OS Linux. The Java Development Kit (JDK) used is OpenJDK 21+35-2513, and TornadoVM v1.0 serves as an extension to enable offloading computational tasks to accelerators such as GPUs.
Speedup over llama2.java:
The following table summarizes the performance comparison between llama2.java and llama2.tornadoVM.java.
Both implementations run with Parallel Streams (24 threads), off-heap data types with Panama, and the Vector API. The only difference is that the dominant matmul operation in the forward phase is offloaded to the GPU in the llama2.tornadoVM.java version, whereas the llama2.java version executes it via the Vector API.
The initial findings show that by offloading only the most computationally intensive matrix-vector operation to the GPU, we generate on average 13% (9%-15%) more tokens per second across all models.
Normalized speedup to native-C implementation running with OpenMP
We also compare the llama2.tornadoVM.java implementation against the native original llama2.c implementation (executed via OpenMP, 24 threads). The table below summarizes the results.
As shown, the llama2.tornadoVM.java implementation performs almost on par with the llama2.c version for large models (reaching up to 98% of its performance), thereby further closing the performance gap between native C and Java.
Looking forward, can Java (or the JVM) do better?
Definitely!
The analysis presented in this article is just the beginning of the performance improvements that we can make on the Java platform with respect to LLMs.
The combination of Parallel Streams, Panama, the Vector API, and TornadoVM enables a wide range of advanced optimizations that can also be applied to LLMs and to AI workloads in general.
Firstly, as the data sizes increase the benefits of using GPU acceleration will become more evident since the data copy overheads are offset by the performance benefits of GPU execution. Work is already being undertaken to execute even larger models with the llama2.tornadoVM.java version via quantization. This may result in more functions becoming worthy of GPU offloading in the long run.
Secondly, underlying optimizations such as shared or unified memory via TornadoVM can boost performance even further, specifically in the presented hybrid model, where different parts of the LLM application execute on the CPU (scalar and vector units) and on GPUs.
Finally, compiler optimizations specifically for AI workloads have the potential to yield further performance improvements, depending on the hardware unit or platform on which a specific workload executes (e.g., TornadoVM’s kernel API, Mojo).
Final words
The Java platform is rapidly evolving and performance has become a first-class citizen. As described in this article, by using existing features of the JVM (Streams, Panama, Vector API) along with TornadoVM, we have the potential to achieve performance close to that of native C, with a positive outlook to even surpass it. In addition, consolidating the compiler technology under Graal enables us to perform a variety of optimizations at different levels of the software stack, targeting different hardware platforms and accelerators in a unified and simplified manner. New projects (e.g., Babylon, Valhalla) will keep pushing the performance boundaries of the Java Virtual Machine, making it an ideal candidate even for AI/ML workloads.
Acknowledgement
Special thanks to Alfonso Peterssen of Oracle for his contributions and assistance in extending the original llama2.java application.