Java on GPGPUs

At the 2015 International Conference on Principles and Practices of Programming on the Java Platform (PPPJ) there was a session on running JVM programs on GPGPU accelerators (including one paper by me). Taking parallel regions out of the JVM and running them on GPGPUs has been a hot research topic for a few years now, thanks to the reported performance improvements. However, running JVM programs on a GPGPU introduces many overheads that can severely offset the performance benefit you might expect from GPGPU offload. At the same time, a lot of research and engineering has gone into the problem in the last five years.

You can generally divide JVM-on-GPGPU frameworks into two camps: frameworks that are implemented inside the JVM, and frameworks that are implemented as libraries that sit on top of it. From inside the JVM many more optimizations and techniques are possible (e.g. inspecting local variables to predict control flow and optimize communication). Sitting on top of the JVM limits optimizations but improves flexibility and supports execution on whatever JVM platform you prefer. The classic examples of JVM-based projects would be Project Sumatra and IBM’s ongoing work. I would argue that the state of the art in the library-based approach is APARAPI (and the projects that build on it). My experience focuses in using the library-based approach so I’ll primarily speak to that. Despite the differences between frameworks, they each follow a similar workflow for actually extracting a parallel computational region from the JVM and executingit on a GPGPU.

Note: I won’t be discussing things like JavaCL, ScalaCL, JCUDA, etc. Frameworks like these are simple wrappers around some GPGPU API. While these frameworks are helpful in accessing GPGPUs from the JVM for developers who are already well-versed in GPGPU programming, I want to focus on higher level models that Java, Scala, etc. programmers would be more comfortable in.

A Simple JVM-GPGPU Framework

Let’s simplify the problem first and generalize it later. Consider a parallel vector addition on primitive arrays inside a Java program:

int[] sum = ...
int[] input1 = ...
int[] input2 = ...
IntStream.range(0, sum.length).parallel().forEach((i) -> {
sum[i] = input1[i] + input2[i];
});

Offloading this computation to a GPGPU requires:

  1. Code generation from the lambda’s bytecode to a GPGPU executable format (possibly through some IR like OpenCL or CUDA).
  2. Analysis of the lambda’s bytecode to understand what values are referenced, what format its data is in, etc.
  3. Data transfer from the JVM to the GPGPU and back for any inputs or outputs referenced from the lambda.
  4. Some management bridge code that orchestrates the whole process, selects a GPGPU to use, waits for completion, etc.

Most frameworks generate OpenCL or CUDA directly from JVM bytecode, though some produce a lower level representation (e.g. NVVM IR). Most modern frameworks do this code generation at runtime which adds overhead, though some (e.g. Rootbeer) are forced to do it at compile-time by the GPGPU programming platform they build on. The best writeup I’ve seen explaining an approach to this code generation in an understandable and reasonably detailed way is this document. We won’t go into details here beyond saying that the input bytecode generally has some constraints placed on it by the code generation framework (e.g. no throwing exceptions), though these constraints are gradually being lifted by research in the field (e.g. this paper on maintaining exception semantics in GPGPU-accelerated regions). From our vector add example, a framework might generate something like the below OpenCL kernel:

// Emulate the this object in Java
typedef struct This_t {
__global int *sum;
__global int *input1;
__global int *input2;
} This;
// OpenCL implementation of the main lambda
static void apply(This *this, int i) {
this->sum[i] = this->input1[i] + this->input2[i];
}
// Wrapper code the handles setup and calls lambda logic
__kernel void run(__global int *sum, __global int *input1,
__global int *input2, int N) {
int tid = get_global_id(0);
int nthreads = get_global_size(0);
  This this;
this.sum = sum;
this.input1 = input1;
this.input2 = input2;
  // Run the lambda's logic on each input
for (int i = tid; i < N; i += nthreads) {
apply(&this, i);
}
}

Frameworks generally combine their code generation and code analysis passes. The most important metadata output from the analysis pass is the data accessed from the parallel region. This metadata usually takes the form of a set of (object reference, field) pairs indicating the field of an object that is referenced from the kernel. Commonly, frameworks restrict these referenced items to be either primitives or arrays of primitives (as in our example).

When only primitives and arrays of primitives are supported, actually making these referenced values visible to a GPGPU kernel is fairly straightforward through the Java Native Interface (JNI). A primitive value can be loaded from the JVM using Java reflection, passed to a native function through JNI, and made available to the GPGPU kernel through an API call (e.g. clSetKernelArg). Arrays of primitives are slightly more complex but involve the same basic steps: Fetch the array reference using reflection, pass it to a JNI function, and then get the backing bytes before transferring them to the GPGPU, e.g.:

void *native_arr = jenv->GetPrimitiveArrayCritical(jarr, NULL);
clEnqueueWriteBuffer(cmd, mem, CL_TRUE, 0, len, native_arr, 0,
NULL, NULL);
jenv->ReleasePrimitiveArrayCritical(jarr, native_arr,
JNI_ABORT);

Copying data back from the GPGPU for primitive arrays is similar but in reverse.

Depending on the complexity of the analysis supported by a specific framework or the semantics of its programming model, it may be able to determine the relation of the array to a kernel (input, output, or both) and optimize these transfers using that information.

Common Restrictions

Because of the challenges of supporting code generation and data serialization to GPGPUs from the JVM for arbitrary Java programs, JVM-GPGPU frameworks commonly restrict the kernels they support in the following ways:

  1. No object references from the kernel, only primitives and arrays of primitives (no access to custom user classes or standard java.util classes)
  2. No user or runtime exceptions thrown
  3. No dynamic memory allocation (i.e. new)
  4. No dynamic method/type resolution that cannot be resolved prior to kernel execution
  5. Programmer explicitly selects to GPGPU or to not GPGPU, up to them to make the right decision for each input and kernel.

Easing These Restrictions

Recent work has looked at how these restrictions can be reduced or eliminated, and the performance tradeoffs involved in that extension.

Tackling references to instances of use-defined classes from accelerated regions requires supporting 1) data serialization from JVM objects to some format consumable by the GPGPU, and 2) code generation that can represent and manipulate that format while calling instance methods on it. There are several papers that look at different approaches. If we again restrict the problem to only arrays of objects with primitive fields, a common approach would be to use a ByteBuffer in the JVM along with reflection to serialize composite objects into a packed representation inside before transferring them to the GPGPU. This packed representation could be typed in the auto-generated GPGPU kernel as a struct so long as care is taken to ensure fields are aligned and bytes ordered correctly. On the other hand, work like that done in JaBEE aims to support more arbitrary JVM objects, including those with nested object references.

Work from 2013 looked at maintaining Java exception semantics for runtime exceptions (including out-of-bounds and divide-by-zero exceptions). This paper took an interesting approach to the problem: By extracting the subset of the bytecode of a kernel that is necessary to produce the same exception behavior as the original kernel, a lightweight and much quicker version of the original parallel region could be executed in the JVM. This lightweight parallel region would guarantee that if the original kernel threw an exception, so would it. Meanwhile, the full computational kernel could be speculatively executed in isolation on a GPGPU. If the subsetted kernel in the JVM threw an exception, the results on the GPGPU would be ignored and program behavior would be unchanged. If no exception was thrown, then it would be possible to achieve performance improvement when 1) the subsetting technique successfully reduced the work performed in the JVM, and 2) the kernel ran faster on the GPGPU as well. Given an original parallel JVM execution time of T, a subsetted kernel execution time of S, and a GPGPU execution time of G, the achieved speedup could be expressed as T/MAX(S, G). Note that this work does not handle user exceptions, though Section VI of the Rootbeer paper briefly describes their techniques for that case.

The only GPGPU framework that supports some kind of dynamic memory allocation out of the box is CUDA, but it is possible to emulate dynamic memory allocation in other frameworks (like OpenCL) with a global heap and an atomically incremented free pointer. These techniques can then be used to enable the use of the “new” keyword in accelerated parallel regions. For idempotent kernels, it is even possible to support a working set size larger than a GPGPU’s physical memory size using an abort-and-retry technique detailed in Section 3.4 of this paper.

Impressively, the JaBEE paper also proposes a technique for supporting dynamic method resolution from the GPGPU by building a lookup table on the device. However, based on the results shown in the paper it would seem this lookup table (and probably the serialization of nested object references too) add significant overheads to the framework.

Auto-selection of accelerators for kernels is a hard but popular problem at the moment. It is both unclear which inputs to a performance model are most important (e.g. kernel features, data characteristics, parallelism, data sizes, etc), but also unclear which techniques or algorithms produce the best models (e.g. linear regression, SVM, genetic algorithms, neural nets, etc). One paper at PPPJ 2015 written by a colleague did a great job of investigating the use of several important bytecode and dataset features as inputs to a performance model trained offline using SVMs. This model could then be used at runtime to do binary classification. The result of that classification would decide if a certain instance of a kernel ran in parallel on the JVM or on the GPGPU. If you are interested in this topic I highly recommend this paper: It includes a comprehensive survey of the current research into scheduling decisions between the JVM and GPGPUs.

Performance Subtleties

Arguably the most attractive feature of OpenCL is to generate code once but run it on multiple platforms (much like the JVM). In particular, OpenCL kernels auto-generated from JVM bytecode can be offloaded to either a GPGPU (as this post has focused on) or a multi-core CPU, usually the same CPU that the host JVM program is running on. This produces some interesting performance results and complicates the decision of where to run a particular instance of a kernel running on a particular dataset.

Speedup relative to Java 8 parallel streams of a load-balancing runtime (HJlib), native OpenCL threads on a multi-core CPU (CPU), and native OpenCL threads on a GPU (GPU).

See the table to the left. Here we have the speedup of a variety of platforms relative to Java Parallel Streams as we vary the dataset size (second column) for a few benchmarks (first column). The third column reports speedups for a load-balancing, multi-threaded JVM runtime. We can expect these results to be similar to the parallel streams results (and you can see this is true for most cases). The fourth column reports speedups for offloading to a multi-core CPU using auto-generated OpenCL kernels, and the fifth column is for a many-core GPU instead. For each (benchmark, dataset) pair the highest performing of these three platforms is highlighted in grey. Note that for all benchmarks there is a clear progression of grey boxes from the top-left to the bottom-right, indicating that for larger datasets GPUs perform better and for smaller datasets it is better to stay in the JVM and avoid the overheads of GPU offload.

More interestingly, there is a sweet spot for PageRank and KMeans where the dataset size is not sufficiently large or does not expose sufficient parallelism to fully utilize the GPU, but is large enough that we gain some performance benefit from moving out of the JVM and into native CPU threads even though we still pay some overhead (albeit less than for the GPU). The presence of this sweet spot for native CPU threads may increase the dimensionality of the scheduling decision in some cases from JVM vs. GPU to JVM vs. CPU vs. GPU.

Continuing Work

There is a tension between how much of the JVM bytecode specification can be supported on accelerators and how much should be supported. The choice to move to accelerators like GPGPUs is usually made primarily for performance, and secondly for energy efficiency. The goal of research like the projects described above should be to investigate how certain portions of the JVM specification are supportable on the accelerator, what compromises must be made on performance and/or energy efficiency, and how those compromises change with application or dataset characteristics. JaBEE did this for indirect object references and dynamic method resolution. Other work did this for exceptions, barrier synchronization, and dynamic memory allocation. Through experimentation the community can converge on a reasonable subset of the JVM bytecode specification that can be supported on accelerators without sacrificing performance/energy efficiency and retaining sufficient JVM semantics to remain useful and intuitive for JVM programmers.

While the state-of-the-art in offloading JVM parallel regions to GPGPUs is far ahead of where it was 5 years ago, there are still many ongoing projects tackling problems in this area. In my own research group alone, current work focuses on 1) GPGPU acceleration of Apache Spark kernels, 2) improved scheduling decisions using offline and online training of various machine learning-based performance models, and 3) using JVM-GPGPU programming models to teach GPGPU programming to university students. If you are interested in any of these topics, stay tuned for future posts or contact me at jmg3@rice.edu.

Related Resources

  1. APARAPI
  2. Rootbeer
  3. Project Sumatra
  4. JaBEE
  5. PPPJ 2015 Proceedings