Java On The GPU - Inside Java Newscast #58

Babylon is OpenJDK's newest big project, aimed at easing execution of Java code on the GPU, which will unlock machine learning and artificial intelligence on the JVM. Here's all about its background, prototypes, and plans.


Welcome everyone to the Inside Java Newscast, where we cover recent developments in the OpenJDK community. I'm Nicolai Parlog, Java Developer Advocate at Oracle, and today we're gonna talk about how we can get a function like "4x * (y - sin(xy))" to be executed on the GPU, your graphics card's processing unit. Or in a distributed cluster. Or as part of a LINQ-like mechanism in Java. Or how Java code can differentiate it. But mostly, the graphics card aspect because executing Java code on the GPU unlocks the whole world of machine learning.

static double f(double x, double y) {
	return 4.0d * x * (y - Math.sin(x * y));
}

Ready? Then let's dive right in!

Sisyphus in Babylon

So what are we really talking about here? Big picture: I want to express some query or computational logic, like the function I mentioned, in my Java code and then pass it to some library, which needs to analyze the expression and check whether it makes sense in its domain. If so, it translates the expression and processes it in its environment, for example by sending it to the GPU for execution.

So the first step is for me to express the computational logic. The most natural way to do that would be to write a Java method but what artifacts can a library get out of that?

  • the Java source code itself
  • its representation as an abstract syntax tree, an AST, during compilation
  • the bytecode generated by the compiler
  • a method handle or java.lang.reflect.Method instance at run time

The problem is that none of these are good artifacts to analyze or translate the logic:

  • to work with Java source code, you need to build half a compiler
  • to get to the AST, you (again) need your own compiler or break into javac internals and deal with the fact that the AST is somewhat idiosyncratic and contains artifacts of the Java language that complicate your work here
  • bytecode is a low-level instruction set that almost always requires uplifting into a less detailed level of abstraction
  • reflection, on the other hand, is too abstract: it knows modules, packages, classes, and methods, but it stops there - it has no understanding of or access to what's going on inside a method
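The reflection limitation is easy to demonstrate. The API tells you everything about a method's signature, but nothing about its body - there is no `getBody()` or equivalent:

```java
import java.lang.reflect.Method;

public class ReflectionLimit {

	static double f(double x, double y) {
		return 4.0d * x * (y - Math.sin(x * y));
	}

	public static void main(String[] args) throws Exception {
		Method m = ReflectionLimit.class
			.getDeclaredMethod("f", double.class, double.class);
		// reflection surfaces the signature ...
		System.out.println(m.getName());       // f
		System.out.println(m.getReturnType()); // double
		// ... but the multiplication, subtraction, and Math.sin
		// call inside f are completely invisible to this API -
		// Method offers no way to inspect the method body
	}
}
```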

Because of these limitations, projects in this space have to do a lot of heavy lifting. Take TornadoVM as an example: To understand and translate Java code to GPU kernels, it uses an entire compiler, the Graal JIT to be precise, to access and further process its intermediate representation of the code. Other such projects resort to expressing the logic in strings or by building data structures that represent computations.

// hypothetical API that allows creation of data structures
// that describe a computation - here: 4x(y-sin(xy))
var fModel = func("f", methodType(double.class, double.class, double.class))
	.body(entry -> {
		var x = entry.parameters().get(0);
		var y = entry.parameters().get(1);

		var r = entry.op(mul(
			entry.op(constant(DOUBLE, 4.0)),
			entry.op(mul(
				x,
				entry.op(add(
					y,
					entry.op(neg(
						entry.op(call(MATH_SIN,
							entry.op(mul(
								x,
								y))))))))))));
		entry.op(_return(r));
	});

These approaches work, but they're far from optimal. Imagine instead a solution that lets you write straight-up Java code and pass it as a lambda or method reference. This solution would then work directly with that lambda to interpret it as a GPU kernel, a partial SQL command, a LINQ-like expression, a mathematical function, or a computational recipe to be executed in a distributed cluster.
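To make that idea concrete, here's a minimal sketch of such a library's surface - `Accelerator` and `offload` are invented names, not any real API. The caller passes a plain lambda; a trivial CPU implementation just invokes it, while a GPU-backed implementation would analyze and translate it instead:

```java
import java.util.function.DoubleBinaryOperator;

// hypothetical library interface that accepts plain Java lambdas
interface Accelerator {
	double[] offload(DoubleBinaryOperator f, double[] xs, double[] ys);
}

// trivial CPU fallback; a GPU-backed implementation would instead
// inspect the lambda's code and translate it into a kernel
class CpuFallback implements Accelerator {
	public double[] offload(DoubleBinaryOperator f, double[] xs, double[] ys) {
		double[] out = new double[xs.length];
		for (int i = 0; i < xs.length; i++)
			out[i] = f.applyAsDouble(xs[i], ys[i]);
		return out;
	}
}

public class LambdaOffload {
	public static void main(String[] args) {
		Accelerator acc = new CpuFallback();
		double[] r = acc.offload(
			(x, y) -> 4.0 * x * (y - Math.sin(x * y)),
			new double[] { 1.0 }, new double[] { 2.0 });
		System.out.println(r[0]);
	}
}
```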

But how could Java support all these concepts and languages? Going one by one and baking them directly into the Java Language Specification would not only be a Sisyphean task due to the sheer number of possible languages and the effort it takes to evolve the Java specification and implementation, it would also lead to a truly Babylonian confusion with various, potentially conflicting code models in the same language.

What if, instead, Java gained a mechanism that allowed libraries to implement the support themselves? Not only would that keep a lot of complexity out of the specification and runtime, it would also unlock the power of Java's ecosystem to provide innovation through competing solutions for common problems as well as niche solutions for niche problems that would never have made it into the spec anyway.

And this is exactly what the brand new Project Babylon sets out to accomplish. Its main thrust is code reflection. Another important exploration is the so-called Heterogeneous Accelerator Toolkit (HAT) for GPU computation. Let's have a look at both.

Code Reflection

In early August Paul Sandoz, Library Architect at Oracle, presented on this topic at the JVM Language Summit. In fact, most of what I've said so far is just a summary of that talk that I highly recommend you check out if you have a deeper interest in this topic. In that talk, he presented the concept of code reflection, which is an enhancement of Java reflection as we know it today.

Programmers would identify an area of source code, maybe by annotating a method or by passing it as a lambda or method reference, for which Java would then build a so-called code model that can be accessed at compile time and at run time. The code model describes the Java program in a symbolic form down to individual variable declarations, method calls, arithmetic operations, etc. It's a detailed description of that program, much like bytecode but instead of being designed and optimized for execution by the JVM, it targets libraries that need to access, analyze, and transform a Java program and has APIs that allow just that.
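What does "symbolic form" mean in practice? Here's a minimal sketch of a code model for f as a tree that a library can walk - all type names are made up for illustration and Babylon's actual code-model API will look different. One possible "processing" is to interpret the model directly; a GPU library would emit kernel source from it instead:

```java
public class CodeModelSketch {

	// symbolic representation of code as a tree of operations
	sealed interface Expr permits Const, Param, Bin, Call {}
	record Const(double value) implements Expr {}
	record Param(int index) implements Expr {}
	record Bin(char op, Expr left, Expr right) implements Expr {}
	record Call(String name, Expr arg) implements Expr {}

	// f(x, y) = 4x * (y - sin(xy)) as a code model
	static final Expr F =
		new Bin('*', new Const(4.0),
			new Bin('*', new Param(0),
				new Bin('-', new Param(1),
					new Call("sin",
						new Bin('*', new Param(0), new Param(1))))));

	// one consumer of the model: a direct interpreter
	static double eval(Expr e, double... args) {
		return switch (e) {
			case Const c -> c.value();
			case Param p -> args[p.index()];
			case Call c -> Math.sin(eval(c.arg(), args)); // only sin here
			case Bin b -> switch (b.op()) {
				case '*' -> eval(b.left(), args) * eval(b.right(), args);
				case '-' -> eval(b.left(), args) - eval(b.right(), args);
				default -> throw new IllegalStateException("unknown op");
			};
		};
	}

	public static void main(String[] args) {
		System.out.println(eval(F, 1.0, 2.0));
	}
}
```

Because the model is just data, other consumers can analyze or transform it - lowering it toward machine-level detail or lifting it toward mathematical structure.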

An interesting aspect that Paul pointed out is that there's no one level of abstraction that suits all use cases. Instead he envisages an interval of abstraction within which the code model can be lowered and lifted. That would allow libraries to implement a wide variety of language integrations based on Java, from math to machine learning to LINQ...

Babylon: "I think we can do better than LINQ"

... ok, maybe even better than LINQ. But, importantly, also GPGPU, general purpose computing on a GPU. Let's have a look!

Heterogeneous Accelerator Toolkit (HAT)

The term GPGPU describes computation on a graphical processing unit that is not intended to produce an image but to perform general computation that would classically be assigned to the CPU. Why would you do that? Well, the GPU in the PC recording this, for example, has over 4,000 cores - if done right, GPGPU can lead to ridiculous speedups. And there are a number of projects enabling Java code to offload computation that way, for example Aparapi and the aforementioned TornadoVM. One of the Aparapi veterans is Oracle's Gary Frost who started putting code reflection into practice to build HAT, the Heterogeneous Accelerator Toolkit - and he presented his results at that same JVMLS.
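The CPU-side baseline for this kind of data-parallel work in plain Java is a parallel stream: apply f to every element of a large array across all CPU cores. A GPGPU runtime aims to run the same per-element function across thousands of GPU cores instead:

```java
import java.util.stream.IntStream;

public class ParallelBaseline {

	static double f(double x, double y) {
		return 4.0d * x * (y - Math.sin(x * y));
	}

	public static void main(String[] args) {
		int n = 1_000_000;
		double[] out = new double[n];
		// data-parallel on the CPU: each index is independent,
		// so the work distributes cleanly across cores
		IntStream.range(0, n).parallel()
			.forEach(i -> out[i] = f(i * 0.001, i * 0.002));
		System.out.println(out[1000]); // f(1.0, 2.0)
	}
}
```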

HAT requires all data that goes to or comes from the GPU to be stored off-heap because that makes things easier and faster, for example by allowing the GPU to allocate data directly. It uses Project Panama's foreign function and memory API to handle off-heap data without a headache. And thanks to FFM's support for mapping complex objects, there's no need to slice them into primitive arrays - HAT can use FFM to map complex objects to, for example, C99 data layouts that the GPU can use directly.
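Here's a small taste of the FFM API this builds on: a C99-style struct `{ float x; float y; }` laid out contiguously off-heap, with no Java object wrapper per element. The point layout is illustrative, not one of HAT's actual data types:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class OffHeapPoints {

	// C99-compatible layout: struct { float x; float y; }
	static final MemoryLayout POINT = MemoryLayout.structLayout(
		ValueLayout.JAVA_FLOAT.withName("x"),
		ValueLayout.JAVA_FLOAT.withName("y"));
	static final VarHandle X = POINT.varHandle(
		MemoryLayout.PathElement.groupElement("x"));
	static final VarHandle Y = POINT.varHandle(
		MemoryLayout.PathElement.groupElement("y"));

	public static void main(String[] args) {
		try (Arena arena = Arena.ofConfined()) {
			// 1024 points in one contiguous off-heap block -
			// exactly the shape a GPU can consume directly
			MemorySegment points = arena.allocate(POINT, 1024);
			MemorySegment p0 = points.asSlice(0, POINT.byteSize());
			X.set(p0, 0L, 1.5f);
			Y.set(p0, 0L, 2.5f);
			System.out.println(X.get(p0, 0L) + " " + Y.get(p0, 0L));
		}
	}
}
```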

As for the code that needs to be executed, HAT relies on code reflection to interpret it. So developers can write somewhat normal Java code - "somewhat normal" because there are still stark differences between the JVM and GPUs, for example only one of them knows what an exception is while the other has a multi-tier memory hierarchy. So on one side there's somewhat normal Java code, and on the other side, after translation through code reflection and the new class-file API, GPU-architecture-specific code.

There's already a prototype that does this and its performance looks pretty good. I'm not in this space, so I can't really judge the details, but for facial recognition, on Gary's laptop no less, the OpenCL code HAT produced performed 10 times faster than the parallel code written in Java, which was already about 7 times faster than the sequential implementation. That's pretty impressive to me but I'm sure there's much, much more to gain once vendors get their hands on this and start creating optimized translations.

Ubi Es & Quo Vadis

So where are we on this project? Shortly after JVMLS in early August, Project Babylon was founded with Paul Sandoz as its lead, so it's still early days. But he's already planning to release the code reflection prototype in the coming weeks and HAT might be bundled with it. While the plan for code reflection is to evolve, stabilize, and eventually release as part of OpenJDK, HAT has a different trajectory.

Instead of baking one GPGPU API into the JDK, HAT is planned to stand as one of many outside libraries that allow Java on the GPU. As part of that, its next steps are to work with GPU vendors to agree on common sets of data types for the API to define. As for the backend-specific implementations, for example for CUDA or OpenCL, the goal is to have the respective vendors implement theirs.

If you want to learn more about all of this, watch the JVMLS presentations by Paul Sandoz about code reflection (slides), Gary Frost about HAT, and while you're there, Juan Fumero's about TornadoVM - all three are super interesting. They're linked in the description or you can check out all JVMLS videos in this playlist. Don't forget to like and share the video, so more Java developers get to see it, and subscribe, so I'll see you again in two weeks. So long...