High-performance programming with Java streams
Thursday, December 18, 2025, 10:00, by InfoWorld
My recent Java Stream API tutorial introduced Java streams, including how to create your first Java stream and how to build declarative stream pipelines with filtering, mapping, and sorting. I also demonstrated how to combine streams, collectors, and optionals, and I provided examples of functional programming with Java streams. If you are just getting started with Java streams, I recommend starting with the introductory tutorial.
In this tutorial, we go beyond the basics to explore advanced techniques with Java streams. You’ll learn about short-circuiting, parallel execution, virtual threads, and stream gatherers in the Java Stream API. You will also learn how to combine and zip Java streams, and we’ll conclude with a list of best practices for writing efficient, scalable stream code.

Get the code

See the Java Challengers GitHub repository for the example code presented in this article.

Short-circuiting with Java streams

A stream pipeline doesn’t always need to process every element. In some cases, we can use short-circuiting: operations that stop the stream processing as soon as a result is determined, saving time and memory. Here’s a list of common short-circuiting operations:

- findFirst() returns the first match and stops.
- findAny() returns any match (more efficient in parallel).
- anyMatch() / allMatch() / noneMatch() stop the stream once the outcome is known.
- limit(n) defines an intermediate operation that processes only the first n elements.

Here’s an example of short-circuiting operations in a Java stream pipeline:

```java
import java.util.List;

public class ShortCircuitDemo {
    public static void main(String[] args) {
        List<String> names = List.of("Duke", "Tux", "Juggy", "Moby", "Gordon");
        boolean hasLongNames = names.stream()
                .peek(System.out::println)
                .anyMatch(n -> n.length() > 4);
    }
}
```

The output for this pipeline will be:

Duke
Tux
Juggy

After "Juggy", the pipeline stops. That’s because it has served its purpose, so there is no need to evaluate Moby or Gordon. Short-circuiting takes advantage of the laziness of streams to complete work as soon as possible.

Parallel streams: Leveraging multiple cores

By default, streams run sequentially. When every element can be processed independently and the workload is CPU-intensive, switching to a parallel stream can significantly reduce processing time.
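There are two interchangeable ways to opt in to parallel execution: ask a collection for a parallel stream directly with parallelStream(), or flip an existing sequential stream with parallel(). A minimal sketch, reusing the names list from the short-circuiting example:

```java
import java.util.List;

public class ParallelSwitchDemo {
    public static void main(String[] args) {
        List<String> names = List.of("Duke", "Tux", "Juggy", "Moby", "Gordon");

        // Ask the collection for a parallel stream directly
        long viaCollection = names.parallelStream()
                .filter(n -> n.length() > 4)
                .count();

        // Or flip an existing sequential stream to parallel
        long viaParallelCall = names.stream()
                .parallel()
                .filter(n -> n.length() > 4)
                .count();

        System.out.println(viaCollection + " " + viaParallelCall); // 2 2
    }
}
```

Both pipelines count the same two long names (Juggy and Gordon); the only difference is where the parallel switch is flipped.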
Behind the scenes, Java uses the ForkJoinPool to split work across CPU cores and merge the partial results when it’s done:

```java
import java.util.List;

public class ParallelDemo {
    public static void main(String[] args) {
        List<String> names = List.of("Duke", "Juggy", "Moby", "Tux", "Dash");

        System.out.println("=== Sequential Stream ===");
        names.stream()
                .peek(n -> System.out.println(Thread.currentThread().getName() + " -> " + n))
                .filter(n -> n.length() > 4)
                .count();

        System.out.println("\n=== Parallel Stream ===");
        names.parallelStream()
                .peek(n -> System.out.println(Thread.currentThread().getName() + " -> " + n))
                .filter(n -> n.length() > 4)
                .count();
    }
}
```

Here, we compare output from sequential and parallel processing in a typical multi-core run:

=== Sequential Stream ===
main -> Duke
main -> Juggy
main -> Moby
main -> Tux
main -> Dash

=== Parallel Stream ===
ForkJoinPool.commonPool-worker-3 -> Moby
ForkJoinPool.commonPool-worker-1 -> Juggy
main -> Duke
ForkJoinPool.commonPool-worker-5 -> Dash
ForkJoinPool.commonPool-worker-7 -> Tux

Sequential streams run on a single thread (usually main), while parallel streams distribute work across multiple ForkJoinPool worker threads, typically one per CPU core. Use the following to check the number of available cores:

```java
System.out.println(Runtime.getRuntime().availableProcessors());
```

Parallelism produces real performance gains only for CPU-bound, stateless computations on large datasets. For lightweight or I/O-bound operations, the overhead of thread management often outweighs any benefits.
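One related subtlety is encounter order. With a parallel stream, forEach may visit elements in any order, while forEachOrdered preserves the source order at some cost to parallel speedup. A small sketch:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class OrderDemo {
    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5, 6, 7, 8);

        // forEach on a parallel stream makes no ordering guarantee
        List<Integer> unordered = new CopyOnWriteArrayList<>();
        data.parallelStream().forEach(unordered::add);

        // forEachOrdered restores the encounter order, even in parallel
        List<Integer> ordered = new CopyOnWriteArrayList<>();
        data.parallelStream().forEachOrdered(ordered::add);

        System.out.println("forEach:        " + unordered);
        System.out.println("forEachOrdered: " + ordered); // always [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

Collecting terminal operations such as toList() also preserve encounter order, so ordering is only a concern when you consume elements with side effects.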
Sequential versus parallel stream processing

The program below simulates CPU-intensive work for each element and measures execution time with both sequential and parallel streams:

```java
import java.util.*;
import java.util.stream.*;
import java.time.*;

public class ParallelThresholdDemo {
    public static void main(String[] args) {
        List<Integer> sizes = List.of(10_000, 100_000, 1_000_000, 10_000_000);
        for (int size : sizes) {
            List<Integer> data = IntStream.range(0, size).boxed().toList();
            System.out.printf("%nData size: %,d%n", size);
            System.out.printf("Sequential: %d ms%n",
                    time(() -> data.stream().mapToLong(ParallelThresholdDemo::cpuWork).sum()));
            System.out.printf("Parallel: %d ms%n",
                    time(() -> data.parallelStream().mapToLong(ParallelThresholdDemo::cpuWork).sum()));
        }
    }

    static long cpuWork(long n) {
        long r = 0;
        for (int i = 0; i < 200; i++) r += Math.sqrt(n + i);
        return r;
    }

    static long time(Runnable task) {
        Instant start = Instant.now();
        task.run();
        return Duration.between(start, Instant.now()).toMillis();
    }
}
```

Now let’s look at some results. Here’s a snapshot after running both sequential and parallel streams on an Intel Core i9 (13th Gen) processor with Java 25:

Data size      Sequential streams   Parallel streams
10,000         8 ms                 11 ms
100,000        78 ms                41 ms
1,000,000      770 ms               140 ms
10,000,000     7,950 ms             910 ms

At small scales (10,000 elements), the parallel version is slightly slower. This is because splitting, scheduling, and merging threads carries a fixed overhead. However, as the per-element workload grows, that overhead becomes negligible, and parallel processing begins to dominate. Performance thresholds also differ across processors and architectures:

- Intel Core i7/i9 or AMD Ryzen 7/9: Parallelism pays off once you process hundreds of thousands of elements or heavier computations. Coordination costs are higher, so smaller datasets run faster with sequential processing.
- Apple Silicon (M1/M2/M3): Thanks to unified memory and highly efficient thread scheduling, parallel streams often become faster even for mid-size datasets, typically after a few hundred to a few thousand elements, depending on the work per element.

The number of elements isn’t the key variable; what you want to watch is the amount of CPU work per element. If computation is trivial, sequential execution remains faster.

Guidelines for using parallel streams

If each element involves significant math, parsing, or compression, parallel streams can easily deliver five to nine times the processing speed of sequential streams. Keep these guidelines in mind when deciding whether to use parallel streams or stick with sequential processing:

- Cheap per-element work requires tens of thousands of elements before parallelism pays off.
- Benefits appear much sooner for expensive per-element work.
- Use sequential processing for I/O or order-sensitive tasks.
- Pay attention to hardware and workload specs—these will define where parallelism begins to make a difference.

Parallel streams shine when each element is independent, computation is heavy, and there’s enough data to keep all the CPU cores busy. Used deliberately, parallel streams can unlock large performance gains with minimal code changes.

Performance-tuning parallel streams

Parallel streams use the common ForkJoinPool, which, by default, creates enough threads to fully utilize every available CPU core. In most situations, this default configuration performs well and requires no adjustment.
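To see what that default looks like on your machine, you can query the common pool directly. By default its parallelism is one less than the core count, because the thread that triggers the parallel stream also participates in the work (a sketch; the output is machine-dependent):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Worker threads backing parallel streams by default
        int parallelism = ForkJoinPool.commonPool().getParallelism();

        // Typically parallelism == cores - 1 (minimum 1): the calling
        // thread joins in, so all cores are still kept busy
        System.out.println("Cores: " + cores + ", common pool parallelism: " + parallelism);
    }
}
```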
However, for benchmarking or fine-grained performance testing, you can run a parallel stream inside a custom ForkJoinPool:

```java
import java.util.concurrent.*;
import java.util.stream.IntStream;

public class ParallelTuningExample {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(8);
        long result = pool.submit(() ->
                // asLongStream() avoids int overflow when summing this range
                IntStream.range(0, 1_000_000).parallel().asLongStream().sum()
        ).join();
    }
}
```

Using a dedicated ForkJoinPool lets you experiment with different levels of parallelism to measure their impact on performance, without affecting other parts of the application.

Avoid changing the global setting

The global setting

```java
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "8");
```

modifies the behavior of the common pool across the entire JVM, which can lead to unpredictable performance in unrelated code.

Remember: Parallel streams deliver benefits only for CPU-bound, stateless operations where each element can run independently in parallel. For small datasets or I/O-bound work, the overhead of parallelism usually outweighs its benefits. In these cases, sequential streams are faster and simpler.

Streams and virtual threads (Java 21+)

Virtual threads, introduced in Java 21 via Project Loom, have redefined Java concurrency. While parallel streams focus on CPU-bound parallelism, virtual threads are designed for massive I/O concurrency. A virtual thread is a lightweight, user-mode thread that does not block an underlying operating-system thread while waiting. This means you can run thousands—or even millions—of blocking tasks efficiently.
Here’s an example:

```java
import java.util.concurrent.*;
import java.util.stream.IntStream;

public class ThreadPerformanceComparison {
    public static void main(String[] args) throws Exception {
        int tasks = 1000;
        run("Platform Threads (FixedPool)", Executors.newFixedThreadPool(100), tasks);
        run("Virtual Threads (Per Task)", Executors.newVirtualThreadPerTaskExecutor(), tasks);
    }

    static void run(String label, ExecutorService executor, int tasks) throws Exception {
        long start = System.nanoTime();
        var futures = IntStream.range(0, tasks)
                .mapToObj(i -> executor.submit(() -> sleep(500)))
                .toList();
        // Wait for all to complete
        for (var future : futures) {
            future.get();
        }
        System.out.printf("%s finished in %.3f s%n", label,
                (System.nanoTime() - start) / 1_000_000_000.0);
        executor.shutdown();
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Output example: Platform Threads ≈ 5 s, Virtual Threads ≈ 0.6 s.

You probably noticed there are two executors in this example. Here’s how each one works:

- newFixedThreadPool(100): Creates an executor backed by 100 platform threads (real operating system threads). At most, 100 tasks run concurrently, while additional tasks wait in the queue until a thread is available. Each platform thread stays fully blocked during Thread.sleep() or I/O operations, which means those 100 threads can’t do other work until the blocking call completes.
- newVirtualThreadPerTaskExecutor(): Creates one virtual thread per task. Virtual threads are cheap, user-mode threads that don’t tie up an operating system thread when blocked.

An analogy would be a few delivery trucks (platform threads) handling millions of packages (virtual threads). Only a handful of trucks drive at once, but millions of deliveries happen efficiently over time.

In the example, each task simulates blocking I/O with Thread.sleep(500). If we were to run newFixedThreadPool(100):

- Only 100 tasks run concurrently.
- 1,000 tasks ÷ 100 threads = 10 batches × 0.5 s ≈ 5 s total.

If we were to run newVirtualThreadPerTaskExecutor():

- All 1,000 tasks run at once.
- Every task sleeps for 500 ms concurrently.
- Total ≈ 0.5–0.6 s—just the simulated delay, no waiting queue.

Virtual threads drastically reduce overhead by releasing their underlying operating-system threads whenever blocking occurs, allowing vast I/O concurrency with minimal resource cost. Both parallel streams and virtual threads offer performance benefits, but you have to know when to use them. As a rule of thumb:

- Use parallel streams for CPU-bound workloads that benefit from data parallelism.
- Use virtual threads for I/O-bound tasks where many concurrent operations block on external resources.

Stream gatherers (Java 22+)

Before Java 22, streams were great for stateless transformations like filtering or mapping. But when you needed logic that depended on earlier elements—things like sliding windows, running totals, and conditional grouping—you had to abandon streams entirely and write imperative loops with mutable state. That changed with the introduction of stream gatherers.

Before stream gatherers, let’s say we wanted to calculate a moving average over a sliding window of three elements:

```java
import java.util.*;

List<Integer> data = List.of(1, 2, 3, 4, 5, 6);
List<Double> movingAverages = new ArrayList<>();
Deque<Integer> window = new ArrayDeque<>();

for (int value : data) {
    window.add(value);
    if (window.size() > 3) {
        window.removeFirst();
    }
    if (window.size() == 3) { // Only calculate when window is full
        double avg = window.stream().mapToInt(Integer::intValue).average().orElse(0.0);
        movingAverages.add(avg);
    }
}
System.out.println(movingAverages); // [2.0, 3.0, 4.0, 5.0]
```

This approach works, but it breaks the declarative, lazy nature of streams. In this code, we are manually managing state, mixing imperative and functional styles, and we’ve lost composability. Now consider the same example using Stream.gather() and built-in gatherers.
Using stream gatherers lets us perform stateful operations directly inside the stream pipeline while keeping it lazy and readable:

```java
List<Double> movingAverages = Stream.of(1, 2, 3, 4, 5, 6)
        .gather(Gatherers.windowSliding(3))
        .map(window -> window.stream().mapToInt(Integer::intValue).average().orElse(0.0))
        .toList();

System.out.println(movingAverages); // [2.0, 3.0, 4.0, 5.0]
```

As you can see, windowSliding(3) waits until it has three elements, then emits [1,2,3] and slides forward by one: [2,3,4], [3,4,5], [4,5,6]. The gatherer manages this state automatically, so we can express complex data flows cleanly without manual buffering or loops.

Built-in gatherers

The Stream Gatherers API includes the following built-in gatherers:

- windowFixed(n): Used for non-overlapping batches of n elements.
- windowSliding(n): Used to create overlapping windows for moving averages or trend detection.
- scan(seed, acc): Used for running totals or cumulative metrics.
- mapConcurrent(maxConcurrency, mapper): Supports concurrent mapping with controlled parallelism.

Collectors vs. gatherers

In my introduction to Java streams, you learned about collectors, which serve a similar purpose to gatherers but operate differently. Collectors aggregate the entire stream into one result at the end, such as a list or sum, while gatherers operate during stream processing, maintaining context between elements. An easy way to remember the difference between the two features is that collectors finalize data once, whereas gatherers reshape it as it flows.

Example: Running total with stream gatherers

The following example demonstrates the benefits of stream gatherers:

```java
Stream.of(2, 4, 6, 8)
        .gather(Gatherers.scan(() -> 0, Integer::sum))
        .forEach(System.out::println); // 2, 6, 12, 20
```

Each emitted value includes the cumulative sum so far. The stream remains lazy and free of side effects. Like any technology, stream gatherers have their place.
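For contrast with windowSliding() above, here is a quick sketch of windowFixed(n), which emits non-overlapping batches instead of overlapping ones (it needs a JDK where the Gatherers API is available, i.e., Java 24+ or Java 22/23 with preview features enabled):

```java
import java.util.List;
import java.util.stream.Gatherers;
import java.util.stream.Stream;

public class WindowFixedDemo {
    public static void main(String[] args) {
        // Group six elements into non-overlapping batches of two
        List<List<Integer>> batches = Stream.of(1, 2, 3, 4, 5, 6)
                .gather(Gatherers.windowFixed(2))
                .toList();

        System.out.println(batches); // [[1, 2], [3, 4], [5, 6]]
    }
}
```

This batching pattern is handy for chunked database writes or paginated API calls.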
Use stream gatherers when the following conditions are true:

- The application involves sliding or cumulative analytics.
- The application produces metrics or transformations that depend on previous elements.
- The operation includes sequence analysis or pattern recognition.
- The code requires manual state with clean, declarative logic.

Gatherers restore the full expressive power of Java streams for stateful operations while keeping pipelines readable, efficient, and parallel-friendly.

Combining and zipping streams

Sometimes you need to combine data from multiple streams; an example is merging two sequences element by element. While the Stream API doesn’t yet include a built-in zip() method, you can easily implement one:

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.stream.*;

public class StreamZipDemo {
    public static <A, B, C> Stream<C> zip(
            Stream<A> a, Stream<B> b, BiFunction<A, B, C> combiner) {
        Iterator<A> itA = a.iterator();
        Iterator<B> itB = b.iterator();
        Iterable<C> iterable = () -> new Iterator<C>() {
            public boolean hasNext() {
                return itA.hasNext() && itB.hasNext();
            }
            public C next() {
                return combiner.apply(itA.next(), itB.next());
            }
        };
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    // Usage:
    public static void main(String[] args) {
        zip(Stream.of(1, 2, 3),
            Stream.of("Duke", "Juggy", "Moby"),
            (n, s) -> n + " → " + s)
            .forEach(System.out::println);
    }
}
```

The output will be:

1 → Duke
2 → Juggy
3 → Moby

Zipping pairs elements from two streams until one runs out, which is perfect for combining related data sequences.

Pitfalls and best practices with Java streams

We’ll conclude with an overview of pitfalls to avoid when working with streams, and some best practices to enhance stream performance and efficiency.

Pitfalls to avoid when using Java streams

- Overusing streams: Not every loop should be a stream.
- Side effects in map/filter: Retain pure functions.
- Forgetting terminal operations: Remember that streams are lazy.
- Parallel misuse: Helps CPU-bound work but hurts I/O-bound work.
- Reusing consumed streams: One traversal only.
- Collector misuse: Avoid shared mutable state.
- Manual state hacks: Use gatherers instead.

Best practices when using Java streams

To maximize the benefits of Java streams, apply the following best practices:

- Keep pipelines small and readable.
- Prefer primitive streams for numbers.
- Use peek() only for debugging.
- Filter early, before expensive ops.
- Favor built-in gatherers for stateful logic.
- Avoid parallel streams for I/O; use virtual threads instead.
- Use the Java Microbenchmark Harness or profilers to measure performance before optimizing your code.

Conclusion

The advanced Java Stream API techniques in this tutorial will help you unlock expressive, high-performance data processing in modern Java. Short-circuiting saves computation; parallel streams use multiple cores; virtual threads handle massive I/O; and gatherers bring stateful transformations without breaking the declarative style in your Java code. Combine these techniques wisely by testing, measuring, and reasoning about your workload, and your streams will remain concise, scalable, and as smooth as Duke surfing the digital wave!

Now it’s your turn: Take one of the examples in the Java Challengers GitHub repository, tweak it, and run your own benchmarks or experiments. Practice is the real challenge—and that’s how you’ll master modern Java streams.
https://www.infoworld.com/article/4091427/high-performance-programming-with-java-streams.html