Enter the parallel universe of Java’s Vector API

Thursday, April 17, 2025, 11:00, by InfoWorld

If there is one thing you can describe as an obsession for both developers and devops, it’s how to improve the performance of applications. Ultimately, better performance leads to lower costs (through reduced utilization of resources) or bigger profits (by delivering an improved service, thus attracting more customers).

Of course, there are many, many ways to improve performance, but one of the more obvious is to “divide and conquer.” Let’s say you have optimized your algorithms and upgraded your hardware, but you’re still not achieving the performance you need. The solution might lie deeper in the stack—at the CPU level—where vector operations can process multiple data points simultaneously. Being able to do more than one thing at a time will often (but not always) reduce how long it takes to complete a task.

The terms concurrent and parallel are often used interchangeably in discussions of performance, so it’s worth explaining the difference between them.

Two tasks are said to execute concurrently if the second task starts executing after the first has started and before the first has finished. There is no requirement that, at any time, both tasks execute simultaneously. This technique has been used for a long time, especially in operating systems. Before the dawn of multi-core, multi-CPU machines, a single execution unit would have to be shared among all processes. To give the illusion that processes were running simultaneously, several would run concurrently and share the execution unit by swapping between them very quickly (this is called a time-sharing operating system).

For tasks to execute in parallel, they must execute simultaneously, not just overlap their execution.

How vector processing works

In line with Moore’s law, chip makers have squeezed ever more transistors into the same space. To turn those transistors into more performance, we now have multi-core processors, which allow concurrent processes to execute in parallel. At a lower level, each CPU core also contains hardware for parallel execution of specific types of tasks, which are referred to as vector operations.

Let’s say you have a set of numbers you want to process by applying the same operation to all of them. For example, all of the values need to be incremented by one. In Java, the typical way to handle this would be to store all your values in an array, create a loop to iterate over the array and add one to each value in the body of the loop. When you run a Java application, frequently used code will be compiled from the bytecodes of the virtual machine instruction set into native instructions. The JVM does this using a just-in-time (JIT) compiler.
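The loop described above can be sketched as follows. This is a minimal illustration rather than code from the article; the class and method names (ScalarIncrement, incrementAll) are made up for the example.

```java
import java.util.Arrays;

// Hypothetical class name, used only for illustration.
final class ScalarIncrement {

    // Adds one to every element, visiting one element per loop iteration.
    static void incrementAll(int[] values) {
        for (int i = 0; i < values.length; i++) {
            values[i] += 1;
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};
        incrementAll(values);
        System.out.println(Arrays.toString(values)); // prints [2, 3, 4, 5, 6]
    }
}
```

Nothing in this source mentions vectors. Any parallelism comes from how the JIT compiler chooses to translate the loop once it becomes hot.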

The JIT is smart enough to understand the underlying processor architecture and will optimize the loop to use vector operations (this is called autovectorization).

Vector processing uses very wide registers to hold more than one value. For example, Intel’s AVX2 instructions make use of 256-bit-wide registers. Java integers are stored in 32 bits, so each vector register can hold eight Java integers (ints). The JIT will generate code to load values from the array in groups of eight. The code can then use one of the AVX2 instructions to tell the CPU to add one to each of these eight values independently (any overflow wraps within its own 32-bit lane, so neighboring values are not corrupted). This is true parallel processing, since all eight values are processed by a single machine instruction. In the ideal case, processing the array takes roughly an eighth of the time it does without autovectorization, although memory bandwidth and leftover elements usually make the real gain smaller.
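As a mental model only, the grouping into eight lanes can be written out in plain Java. This sketch does not use vector instructions itself and is not what the JIT literally emits. The inner loop stands in for what would be one vector add on a 256-bit register. The names (LaneSketch, REGISTER_BITS, incrementInBlocks) are assumptions made for the example.

```java
// Hypothetical class name, used only for illustration.
final class LaneSketch {

    // Assumed register width for the example: 256 bits.
    static final int REGISTER_BITS = 256;

    // 256 bits / 32 bits per int = 8 lanes.
    static final int LANES = REGISTER_BITS / Integer.SIZE;

    static void incrementInBlocks(int[] values) {
        int i = 0;
        // Largest multiple of LANES that fits in the array.
        int upper = values.length - (values.length % LANES);

        // Main loop: each outer iteration models one vector load, add and store.
        for (; i < upper; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                // Each lane is independent; int overflow wraps within the lane.
                values[i + lane] += 1;
            }
        }

        // Tail loop: leftover elements that do not fill a whole register.
        for (; i < values.length; i++) {
            values[i] += 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(LANES); // prints 8
    }
}
```

The tail loop is one reason the speedup is rarely exactly eightfold: an array of 11 ints, for instance, needs one full group of eight plus three scalar steps.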

This all sounds wonderful and means that Java developers can code how they want and let the JIT compiler optimize for them at runtime.

Unfortunately, this is not the whole story…

Autovectorization works well for simple situations like the one just described. However, making the loop even slightly more complicated can quickly defeat the JIT compiler’s ability to improve performance in this way. If we add a simple conditional in the body of the loop to test whether the value should be incremented, the JIT may fall back to a sequential approach and not use vector operations.
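A loop of the kind described here might look like the sketch below. The threshold test is an invented condition chosen for the example, as are the names ConditionalIncrement and incrementIfBelow. Whether a given JIT compiler vectorizes this loop depends on the JDK version and the target CPU, so treat it as an illustration of the pattern rather than a guaranteed failure case.

```java
import java.util.Arrays;

// Hypothetical class name, used only for illustration.
final class ConditionalIncrement {

    // Adds one only to elements that are below the threshold.
    // The branch in the loop body is what can hinder autovectorization.
    static void incrementIfBelow(int[] values, int threshold) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] < threshold) {
                values[i] += 1;
            }
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 5, 9, 12, 3};
        incrementIfBelow(values, 6);
        System.out.println(Arrays.toString(values)); // prints [2, 6, 9, 12, 4]
    }
}
```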

Enter the Java Vector API

One solution to this is to allow Java developers to write code that is explicit about how vector operations should be used. The JIT compiler can translate this directly without the need for autovectorization. This is what the Java Vector API, introduced as an incubator module in JDK 16, is designed to do. Interestingly, this API holds the record for the longest-incubating feature in OpenJDK: JDK 24 carries its ninth iteration. As an aside, this is not because it is in a perpetual state of flux but because it depends on features from a larger project, Valhalla. Once the necessary parts of Valhalla, which will add value types to Java, are available in OpenJDK, the Vector API is expected to move out of incubation and progress toward becoming a final feature.

The Vector API provides a comprehensive set of functionality. First, there are classes to represent each Java primitive numeric type as a vector. A vector species pairs one of these element types with a vector shape (in effect, the register width the CPU supports), which determines how many array elements are loaded into a vector at a time. Vectors can be manipulated using a rich set of operators. There are 103 of them, which cover everything you will realistically need.
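To make this concrete, here is a hedged sketch of a conditional increment written with the Vector API. It is one possible way to use the API, not the article author’s code. The class and method names (VectorIncrement, incrementIfBelow) and the threshold condition are assumptions for the example. Because the API is still an incubator module, the code must be compiled and run with `--add-modules jdk.incubator.vector`.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical class name, used only for illustration.
final class VectorIncrement {

    // The species selects the widest int vector shape the current CPU supports.
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Adds one to every element that is below the threshold.
    static void incrementIfBelow(int[] values, int threshold) {
        int i = 0;
        // Largest multiple of the lane count that fits in the array.
        int upper = SPECIES.loopBound(values.length);

        for (; i < upper; i += SPECIES.length()) {
            // Load one vector's worth of ints from the array.
            IntVector v = IntVector.fromArray(SPECIES, values, i);
            // The mask has one bit per lane: set where the element is below the threshold.
            VectorMask<Integer> below = v.compare(VectorOperators.LT, threshold);
            // Masked add: only lanes selected by the mask are incremented.
            v.add(1, below).intoArray(values, i);
        }

        // Scalar tail loop for elements that do not fill a whole vector.
        for (; i < values.length; i++) {
            if (values[i] < threshold) {
                values[i] += 1;
            }
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 5, 9, 12, 3};
        incrementIfBelow(values, 6);
        System.out.println(Arrays.toString(values)); // prints [2, 6, 9, 12, 4]
    }
}
```

The conditional becomes a mask rather than a branch, so the loop body stays a straight sequence of vector operations. Using SPECIES_PREFERRED instead of a fixed width lets the same source adapt to whatever register size the hardware offers.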

The Vector API provides developers with everything they need to enable the JIT compiler to generate highly optimized code for numerically intensive operations. Since most things result in manipulating numbers (strings are, after all, just sequences of characters encoded to numbers), this can lead to significant performance improvements.

Ideally, the Vector API would not be required; autovectorization would handle this transparently. The good news is that some high-performance JVMs ship with a different JIT compiler. The Falcon JIT compiler (which replaces the OpenJDK C2 JIT) is based on another open-source project, LLVM. Falcon can recognize substantially more cases where vectors can be used, leading to better-performing applications without requiring code changes.

The Falcon JIT compiler is available in the Azul Platform Prime JDK, which is free for development and evaluation. As a TCK (Technology Compatibility Kit)-tested JDK, Azul Platform Prime is a drop-in replacement. Why not try it out with your applications?

Simon Ritter is deputy CTO and Java champion at Azul.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
